sysadvent: Day 22 - Metadata Rich, Rule Based Object Store

By: John Constable (@kript)
Edited by: Jason Yee (@gitbisect)

iRODS? WhyRODS?

I’ve got lots of data where I work. Petabytes of the stuff. You’ve got lots too, maybe even more!

The problem, as you might know (or are just finding out!), is that unstructured data is hard to search, organise, and use to collaborate. As the volume of data grows, its characteristics will change as well, meaning that the organising principles you needed when you started are likely to be different from the ones you will want a few years and petabytes later.

Also, once you’ve got all the data under some kind of management, you’ll want some assurance that the software will be around for a while and that you can expand or build on it yourself if you need or want to.

In this article, I’m going to show you iRODS, the Integrated Rule-Oriented Data System. It’s been around for years, is currently on version 4.2.6, and there is active planning for future releases. Don’t take my word for it: check out their GitHub!

iRODS? I’ll get started!

Before we dive in, I need to define some terms and give you a bit of background.

Zones are the key bit of infrastructure. Each Zone can stand on its own and is an independent, uniquely named group of servers.

There are two types of servers: the Provider, which runs the Zone and connects to the database that holds the catalog, and the Consumer, which talks to the Provider and can serve up additional storage resources. You need at least one Provider, but can start with no Consumers. Both server roles can offer storage resources to be used by the rest of the Zone to store files, known as Objects.

Core competencies

The iRODS project (to which, I should add, I contribute bug reports and the odd bit of documentation, but which I am not employed by and do not claim to represent) likes to talk about the software in terms of four core competencies: data virtualisation, data discovery, workflow automation, and secure collaboration.

Data Virtualisation

Data in iRODS is stored as Collections that mimic UNIX directory structures, but can be spread over multiple filesystems, or even a mix of different filesystems and other back ends such as S3 buckets or Ceph datastores.

Objects can have multiple replicas stored in different locations, on different systems, or even on different storage types. Objects are usually accessed through the command line client, although APIs and other methods such as web interfaces, WebDAV, Cyberduck, and more are available. Storage locations can be queried, but users do not need to understand their architecture in order to access Objects. The location of Objects and the method for retrieving them are managed by iRODS. This provides a consistent experience for interacting with the system.

Data discovery

iRODS provides the usual metadata on Objects and Collections out of the box: filename, size, location of each replica, and so forth. You can add metadata manually or in more automated ways using workflows (Can’t wait? Skip forward a bit.) that can be triggered when adding the file or by other activities.

Once added, the metadata can be listed and searched.

The iRODS catalog itself can also be searched with a SQL-like syntax allowing identification of Objects or Collections with particular properties or locations, both in the catalog and on disk, or to provide information on the usage of the system. You can extend this with your own SQL queries if you want, but a lot is available before you need to.

Workflow Automation

iRODS has a Rule Engine that specifies actions to be taken when data is uploaded, downloaded, or accessed. These rules can be written in the existing rule language (a little baroque in my opinion) or you can write your own in Python.

As well as such tasks as changing ownership, setting checksums, or enforcing policy about where an object is stored, rules can extract data from Objects as they are uploaded and attach them as metadata for later searching.

Want the location of your photos automatically extracted and tagged? You can do that! Want to run an entire genomic pipeline from upload to analysis? You can do that too—although not out of the box, some assembly will be required. And for any BioInformaticians reading this, please tell your IT people first. We can help!

Secure Collaboration

Access Control is managed by ACLs on Objects and Collections, and governed by users and groups. In addition to the assorted password options, there are also Tickets which can be granted for time limited access.

There is also the ability to federate between iRODS Zones, so users of one Zone can be given access to another Zone, each with its own policies, ACLs, auditing, and authentication.

You can encrypt all communication with SSL if you so desire.

iRODS? MyRODS!

Enough talk! Let’s get a server running!

I’ve chosen the simplest setup to start: a Provider server to run our Zone, connecting to a Postgres database on the same server.

Installing the database back end

On our Ubuntu Xenial system (Ubuntu Bionic support, like winter, is coming), we first install Postgres (the other supported databases are MySQL and Oracle), then set up the iRODS user.

This isn’t a tutorial on packaging or databases, so I’ll just point you at the manual.

Now we’re ready to install the application itself.

Installing iRODS as a Provider

RENCI, the developers of iRODS, provide a package repository. Let’s add that, together with its public key:

$ wget -qO - https://packages.irods.org/irods-signing-key.asc | sudo apt-key add -
$ echo "deb [arch=amd64] https://packages.irods.org/apt/ $(lsb_release -sc) main" | sudo tee /etc/apt/sources.list.d/renci-irods.list
$ sudo apt-get update

Install the iRODS Provider server package along with the plugin for Postgres

$ sudo apt-get install irods-server irods-database-plugin-postgres

Next, set up iRODS and point it at our database by running the setup script. Follow the prompts and provide the information as requested; more information is available in the manual.

$ python /var/lib/irods/scripts/setup_irods.py

The setup script will ask for a Zone key, negotiation key and a control key. These are strings (up to 32 characters in length) used for inter and intra-zone security. We touch on federation at the end of this post, but be sure to note your keys and read the federation documentation for more in depth coverage.

Let’s become our irods user and run our first command!

$ sudo su - irods
irods@ubuntu-xenial:~$ ils
/tempZone/home/rods:

So far so good! We have an iRODS server, we’ve connected to it, and we can list the Collection we’re in (our home directory by default). We have not uploaded anything yet, so there’s not much to see. Let’s change that with some other commands!

iRODS? I show you this in action!

irods@ubuntu-xenial:~$ iput my-file.txt
irods@ubuntu-xenial:~$ ils my-file.txt
  /tempZone/home/rods/my-file.txt
irods@ubuntu-xenial:~$ ils -l my-file.txt
  rods           0 demoResc        59 2019-11-17.22:11 & my-file.txt

OK, so we have uploaded a text file and, using the long listing of the ils command, we can see the following fields, from left to right:

The user who created the file
The replica id (for when you have more than one copy of the same file, usually created by rules or special resource types)
The name of the resource the file was uploaded to
The size of the file
The timestamp of when the file was uploaded
Whether the replica is good—the ‘&’ is the moniker for a good replica
The name of the file
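
If you want to consume that long listing from a script, the fields listed above can be pulled apart with a few lines of Python. This is only a sketch: the names LongListing and parse_ils_long are hypothetical, and it assumes the column layout shown in the example output, treating anything other than the ‘&’ marker as "not known to be good".

```python
from typing import NamedTuple


class LongListing(NamedTuple):
    """The seven fields of one `ils -l` data line (hypothetical helper type)."""
    owner: str
    replica_id: int
    resource: str
    size: int
    modified: str
    is_good: bool
    name: str


def parse_ils_long(line: str) -> LongListing:
    # Split off the first five whitespace-separated columns; the remainder
    # holds the optional '&' marker and the object name (which may contain spaces).
    owner, replica, resource, size, modified, rest = line.strip().split(None, 5)
    is_good = rest.startswith("& ")
    name = rest[2:] if is_good else rest
    return LongListing(owner, int(replica), resource, int(size), modified, is_good, name)


entry = parse_ils_long("  rods           0 demoResc        59 2019-11-17.22:11 & my-file.txt")
print(entry.resource, entry.size, entry.is_good, entry.name)
```

For anything beyond a quick script, querying the catalog directly is likely to be more robust than parsing display output.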

There’s more to it than just an object store though!

iRODS? I, checksum!

From cosmic rays to bit flips in memory to RAID controller firmware issues, small changes can occur in files. It’s helpful to have a way to detect this and know if it’s your download that went wrong or if the file on the disk is affected in some way.

Let’s fight entropy by protecting against silent corruption with -K. K?

irods@ubuntu-xenial:~$ iput -K my-file.txt
irods@ubuntu-xenial:~$ ils -L my-file.txt
  rods           0 demoResc       158 2019-11-17.22:19 & my-file.txt
 sha2:Er3LKQ5YiO1+njYKQnBywhaxW/ajzavY9/qD++8znL0= generic /var/lib/irods/Vault/home/rods/my-file.txt

Now we have a new field: a checksum! The -K argument tells iput to checksum the file, then upload and verify it. The upload is only complete when this returns successfully.
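
The sha2: value in that listing is a SHA-256 digest, base64 encoded. As a rough illustration of what is being compared, here is a minimal Python sketch that computes the same style of string for a local file and checks it against a recorded value. The function names irods_sha2 and matches_catalog are hypothetical, and the demo uses a throwaway temporary file rather than anything in the Vault.

```python
import base64
import hashlib
import os
import tempfile


def irods_sha2(path, chunk_size=1 << 20):
    """Return 'sha2:' plus the base64-encoded SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha2:" + base64.b64encode(digest.digest()).decode("ascii")


def matches_catalog(path, catalog_checksum):
    """True if the file on disk still matches the checksum recorded earlier."""
    return irods_sha2(path) == catalog_checksum


# Demo: record a checksum, then change the file behind its back.
with tempfile.NamedTemporaryFile("wb", delete=False) as handle:
    handle.write(b"a definition\n")
    demo_path = handle.name
recorded = irods_sha2(demo_path)
intact_before = matches_catalog(demo_path, recorded)
with open(demo_path, "wb") as handle:
    handle.write(b"not the definition you were looking for\n")
intact_after = matches_catalog(demo_path, recorded)
os.remove(demo_path)
print(recorded, intact_before, intact_after)
```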

Downloading Files: iget

iget is the command to download files, and you can also use it to verify a file as you download it.

Let’s simulate silent corruption by changing a file on disk, here definition.txt (uploaded with -K, just like the example above), and then attempting to download it again.

irods@ubuntu-xenial:/tmp/test$ ils -L definition.txt
  rods           0 demoResc       158 2019-11-17.22:19 & definition.txt
 sha2:Er3LKQ5YiO1+njYKQnBywhaxW/ajzavY9/qD++8znL0= generic /var/lib/irods/Vault/home/rods/definition.txt
irods@ubuntu-xenial:/tmp/test$ echo "not the definition you were looking for" > /var/lib/irods/Vault/home/rods/definition.txt
irods@ubuntu-xenial:/tmp/test$ iget definition.txt
irods@ubuntu-xenial:/tmp/test$ cat definition.txt
not the definition you were looking for
irods@ubuntu-xenial:/tmp/test$ rm definition.txt
irods@ubuntu-xenial:/tmp/test$ iget -K definition.txt
remote addresses: 127.0.1.1 ERROR: rcDataObjGet: checksum mismatch error for ./definition.txt, status = -314000 status = -314000 USER_CHKSUM_MISMATCH
remote addresses: 127.0.1.1 ERROR: getUtil: get error for ./definition.txt status = -314000 USER_CHKSUM_MISMATCH

iRODS? iRule!

iRODS Rules automate data management tasks. You can automate entire workflows, or call rules at many stages of the object lifecycle—each stage is called a Policy Enforcement Point (PEP).

Example checksum rule

In our iput example earlier, we only got a checksum on the file when we uploaded it with the -K flag. We can’t rely on end users remembering to do this, but we still want a checksum every time, as having one is usually beneficial all round.

We’re going to use a built in rule to make this happen on every upload.

First we need to configure the Provider server to load the rule. iRODS configurations that aren’t held in the database are held in JSON files in the /etc/irods directory. The one you will mostly be working with is server_config.json.

The PEP for post processing uploaded files is called acPostProcForPut and we’re going to use an already present function for making checksums.

By convention, the default rules are left in place and any changes are added in a new file that is included before the defaults. This allows undefined behaviour in one file to fall through to the next one until it hits the defaults.

First, we’ll create a rules file to add a rule to the PEP that is called upon completion of an upload (or put) operation.

In the /etc/irods/customrules.re file we’ll add:

acPostProcForPut {
 msiDataObjChksum($objPath, "", *checksumOut);
}

Then we need to tell iRODS to use those rules before the defaults. Each new iRODS connection causes a new agent to be started, which reads the config from the files. So the change is live as soon as it’s made.

In our /etc/irods/server_config.json file we have the rule_engines stanza:

     "rule_engines": [
         {
             "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
             "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
             "plugin_specific_configuration": {
                     "re_data_variable_mapping_set": [
                         "core"
                     ],
                     "re_function_name_mapping_set": [
                         "core"
                     ],
                     "re_rulebase_set": [
                         "core"
                     ],
                     "regexes_for_supported_peps": [
                         "ac[^ ]*",
                         "msi[^ ]*",
                         "[^ ]*pep_[^ ]*_(pre|post|except)"
                     ]
             },
             "shared_memory_instance": "irods_rule_language_rule_engine"
         },

We want to change the re_rulebase_set to include our customrules.re file, listed before core so that it is read first.

Note that the .re extension is left off in the configuration, but the file on disk must have it for the server to find it in the directory.

                 "re_rulebase_set": [
                     "customrules",
                     "core"
                 ],
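
If you would rather script this change than edit the JSON by hand, something along these lines would do it. This is a sketch under assumptions: prepend_rulebase is a hypothetical helper, it assumes re_rulebase_set is a plain list of rulebase names as in the default stanza shown earlier, and it works on an in-memory sample rather than your real /etc/irods/server_config.json (back that file up before touching it).

```python
import json


def prepend_rulebase(config, rulebase):
    """Put rulebase first in each rule engine's re_rulebase_set, if not already listed."""
    for engine in config.get("rule_engines", []):
        specific = engine.get("plugin_specific_configuration", {})
        rulebases = specific.get("re_rulebase_set")
        # Skip engines without a rulebase list (for example, non rule-language plugins).
        if rulebases is not None and rulebase not in rulebases:
            rulebases.insert(0, rulebase)
    return config


sample = json.loads("""
{
    "rule_engines": [
        {
            "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
            "plugin_specific_configuration": {
                "re_rulebase_set": ["core"]
            }
        }
    ]
}
""")
updated = prepend_rulebase(sample, "customrules")
print(json.dumps(updated, indent=4))
```

Calling the helper a second time leaves the list unchanged, so it is safe to run from configuration management.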

Now let’s test that checksum rule. Note that the irods superuser does not have rules applied to tasks, so we’ll use another user, john, to test.

john@ubuntu-xenial:~$ iput my-file.txt
john@ubuntu-xenial:~$ ils -L
/tempZone/home/john:
  john           0 demoResc        29 2019-11-23.21:41 & my-file.txt
 sha2:PPV9kd8elf4mA0OGbrK+I7qENRhTtws2okP2RV2mbMc= generic /var/lib/irods/Vault/home/john/my-file.txt

Notice that we have not used the -K flag here, but iRODS generates the checksum anyway because of the msiDataObjChksum microservice call we added to the post-upload rule acPostProcForPut.

More information about the Rule Engine and the Dynamic Policy Enforcement Points can be found in the manual.

iRODS? I can find my data!

Let’s see how we can apply metadata to the files to find them later.

First, some files. I’m going to upload some books from Project Gutenberg:

Shaving Made Easy: What the Man Who Shaves Ought to Know by Anonymous

Shavings: A Novel by Joseph Crosby Lincoln

#upload the books
irods@ubuntu-xenial:~$ iput ShavingMadeEasy.mobi 
irods@ubuntu-xenial:~$ iput Shavings.mobi

Now we have the files, let’s add some metadata about them

irods@ubuntu-xenial:~$ imeta add -d ShavingMadeEasy.mobi Author Anonymous
irods@ubuntu-xenial:~$ imeta add -d Shavings.mobi Author "Joseph Crosby Lincoln"

So what does it look like now that we’ve set it? (The listing below also shows a Title attribute, which was added the same way.)

irods@ubuntu-xenial:~$ imeta ls -ld ShavingMadeEasy.mobi
AVUs defined for dataObj ShavingMadeEasy.mobi:
attribute: Author
value: Anonymous
units:
time set: 2019-11-27.21:29:53
----
attribute: Title
value: Shaving Made Easy
units:
time set: 2019-11-27.21:30:34
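
Each entry in that listing is an AVU: an attribute, a value, and optional units. If you need those in a script, a small parser is enough for the layout shown above. This is a sketch only; parse_avus is a hypothetical name, and it assumes the exact output format in this example (records separated by a line of four dashes).

```python
from typing import Dict, List


def parse_avus(output: str) -> List[Dict[str, str]]:
    """Turn `imeta ls -ld` output into a list of {attribute, value, units, time set} dicts."""
    avus: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in output.splitlines():
        if line.startswith("AVUs defined for"):
            continue  # header line, not part of any record
        if line.strip() == "----":
            if current:
                avus.append(current)
            current = {}
            continue
        # Split on the first colon only, so timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        avus.append(current)
    return avus


listing = """AVUs defined for dataObj ShavingMadeEasy.mobi:
attribute: Author
value: Anonymous
units:
time set: 2019-11-27.21:29:53
----
attribute: Title
value: Shaving Made Easy
units:
time set: 2019-11-27.21:30:34
"""
avus = parse_avus(listing)
for avu in avus:
    print(avu["attribute"], "=", avu["value"])
```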

Now that we have some metadata, we can search on it. There is a query syntax which allows wildcards, string, and numeric searching. Be aware, the search is case sensitive.

Let’s find all the files that have a metadata field of ‘Author’ whose value starts with ‘A’:

irods@ubuntu-xenial:~$ imeta qu -d Author like A%
collection: /tempZone/home/rods
dataObj: ShavingMadeEasy.mobi

How about searching within the string, in this case for part of an Author’s name?

irods@ubuntu-xenial:~$ imeta qu -d Author like %Crosby%
collection: /tempZone/home/rods
dataObj: Shavings.mobi

Finally, find all the files where the Author metadata has been set

irods@ubuntu-xenial:~$ imeta qu -d Author like %
collection: /tempZone/home/rods
dataObj: ShavingMadeEasy.mobi
----
collection: /tempZone/home/rods
dataObj: Shavings.mobi
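
To make the matching behaviour in those three queries concrete, here is a small Python model of the % wildcard. It illustrates the semantics only, not how iRODS or the database implements them; the function names are hypothetical and only % is handled, because that is the only wildcard used above.

```python
import re


def like_to_regex(pattern):
    """Translate a pattern using '%' (any run of characters) into a compiled regex."""
    parts = []
    for char in pattern:
        parts.append(".*" if char == "%" else re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    # fullmatch: the pattern must cover the whole value, and matching is case sensitive.
    return like_to_regex(pattern).fullmatch(value) is not None


authors = {
    "ShavingMadeEasy.mobi": "Anonymous",
    "Shavings.mobi": "Joseph Crosby Lincoln",
}
results = {
    pattern: [name for name, author in authors.items() if like(author, pattern)]
    for pattern in ("A%", "%Crosby%", "%")
}
for pattern, names in results.items():
    print(pattern, names)
```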

iRODS? I put it where?

I could write an entire article on the iquest command! This powerful command allows you to find files across the entire Zone, no matter which resource they are in, with a SQL-like query language.

For example, how about a one-line command to show you which users have uploaded files, how much, and distributed over which resources?

irods@ubuntu-xenial:~$ iquest "User %-9.9s uses %14.14s bytes in %8.8s files in '%s'" "SELECT USER_NAME, sum(DATA_SIZE),count(DATA_NAME),RESC_NAME"
User john   uses         261 bytes in     9 files in 'demoResc'
User rods   uses         217 bytes in     4 files in 'demoResc'
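
The first argument to iquest is a printf-style format string, with one specifier per selected column. If specifiers like %-9.9s are unfamiliar, Python’s % operator follows the same conventions for them, so you can experiment without a Zone to hand. The rows below are simply the values from the output above, passed as strings.

```python
FORMAT = "User %-9.9s uses %14.14s bytes in %8.8s files in '%s'"

# %-9.9s  : left-justify, pad to at least 9 characters, truncate at 9
# %14.14s : right-justify in a 14-character column, truncate at 14
# %8.8s   : right-justify in an 8-character column, truncate at 8
rows = [
    ("john", "261", "9", "demoResc"),
    ("rods", "217", "4", "demoResc"),
]
lines = [FORMAT % row for row in rows]
for line in lines:
    print(line)

# A name longer than the column is cut off rather than pushing the layout out.
truncated = FORMAT % ("averyveryverylongusername", "1", "1", "demoResc")
print(truncated)
```

Fixed-width columns like these keep the report aligned however long the user names or byte counts get.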

iRODS? I have more to say!

In addition to the above, here are some other things you might want to look into once you have your Zone up and running.

Federation

This is linking two or more Zones together, and allows users from one Zone to be granted access to Objects in another. One way to use this is a ‘hub and spoke’ design, where one Zone is used as a hub for authentication and users then connect on to other Zones—so authentication only needs to be handled in one place and differing policies, security models, and designs can be used on each sub-zone.

Capabilities

An iRODS Capability is a pre-built set of rules and configurations designed around a particular use case. Some examples are:

Automated Ingest Framework - watching a file system and automatically registering new files into iRODS, making them available to rule-based workflows, or simply making them visible for further metadata tagging or retrieval.
Storage Tiering - rule based migration between different resource types
Indexing and Publishing - indexing Collections into external search systems such as Elasticsearch

iRODS? Your RODS!

While quick to set up, iRODS provides powerful and flexible tools for automating your data management. Next time you’re shaving that yak of cataloging your files or S3 buckets, I hope you’ll give it a try!

December 22, 2019
