
In this vignette we’ll look at the use of metadata with the rirods package. This guide is meant to be useful both for users familiar with iRODS who want to understand the R client better, and for R users who are not familiar with iRODS metadata.

Setup

In the background we have already started an iRODS session on the demo server; our home directory “/tempZone/home/rods” is empty, as ils() shows:

ils()
#> This collection does not contain any objects or collections.

For illustration purposes, we’ll create some data objects (i.e. files). First, we simulate a study with a small dataframe and a linear model.

set.seed(1234)
fake_data <- data.frame(x = rnorm(20, mean = 1))
fake_data$y <- fake_data$x * 2 + 3 - rnorm(20, sd = 0.6)
m <- lm(y ~ x, data = fake_data)
m
#> 
#> Call:
#> lm(formula = y ~ x, data = fake_data)
#> 
#> Coefficients:
#> (Intercept)            x  
#>       3.249        2.130

Then we store the dataframe as a csv file and the linear model as an RDS object on iRODS. The csv file must be written locally first and then transferred, but the model can be streamed directly to iRODS.

data_path <- "data.csv"
lm_path <- "analysis/linear_model.rds"
write.csv(fake_data, data_path) # write locally
iput(data_path, data_path) # transfer to iRODS
imkdir("analysis") # create directory
# save directly as rds
isaveRDS(m, lm_path)

If we add metadata=TRUE to the ils() call, we will see that these new data objects have no metadata attached to them.

ils(metadata=TRUE)
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/analysis :
#> list()
#> 
#> /tempZone/home/rods/data.csv :
#> list()
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                  logical_path        type
#>  /tempZone/home/rods/analysis  collection
#>  /tempZone/home/rods/data.csv data_object
ils("analysis", metadata=TRUE)
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/analysis/linear_model.rds :
#> list()
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                                   logical_path        type
#>  /tempZone/home/rods/analysis/linear_model.rds data_object

Metadata in iRODS

In iRODS, metadata is stored as attribute-value-unit triples (AVUs) attached to collections or data objects. To add an AVU with rirods we can use the imeta() function, which takes three main arguments: the path to the collection or data object, its entity type (“data_object”, which is the default, or “collection”), and a list of operations. Each operation must itself be a named list or vector with an operation field, which indicates whether we want to “add” or “remove” an AVU, plus the attribute (name), the value and, optionally, the units.

For example, let’s say we want to include the number of rows of our fake_data as a metadata field “nrow”. We could do something like this:

imeta(data_path, operations = list(
  list(operation = "add", attribute = "nrow", value = as.character(nrow(fake_data)))
  ))
filter_ils(data_path, ils(metadata=TRUE))
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/data.csv :
#>  attribute value units
#>       nrow    20      
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                  logical_path        type
#>  /tempZone/home/rods/data.csv data_object

The same item can have several AVUs with the same attribute name, as long as they differ in value or units. For example, we might want to code the number of rows and columns as a metadata field “size”. Since the old AVU is no longer needed, we can remove it in the same call by providing a “remove” operation.

imeta(data_path, operations = list(
  list(operation = "add", attribute = "size", value = as.character(nrow(fake_data)), units = "rows"),
  list(operation = "add", attribute = "size", value = as.character(length(fake_data)), units = "columns"),
  list(operation = "remove", attribute = "nrow", value = as.character(nrow(fake_data)))
  ))
filter_ils(data_path, ils(metadata=TRUE))
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/data.csv :
#>  attribute value   units
#>       size     2 columns
#>       size    20    rows
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                  logical_path        type
#>  /tempZone/home/rods/data.csv data_object
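
The operation lists used above all share the same fields, so they can be built with a small constructor. The sketch below is plain R and does not contact iRODS; `avu_op()` is a hypothetical helper, not part of rirods, and it assumes (as in the examples of this vignette) that values are passed as character strings and that units may be left out.

```r
# Hypothetical helper (NOT part of rirods): builds one operation for imeta().
avu_op <- function(operation, attribute, value, units = NULL) {
  # Only the two operations described in this vignette are accepted
  operation <- match.arg(operation, c("add", "remove"))
  op <- list(
    operation = operation,
    attribute = attribute,
    value = as.character(value) # the vignette examples pass values as strings
  )
  # Units are optional, so the field is only added when supplied
  if (!is.null(units)) op$units <- units
  op
}

# The same three operations as in the example above
ops <- list(
  avu_op("add", "size", 20, units = "rows"),
  avu_op("add", "size", 2, units = "columns"),
  avu_op("remove", "nrow", 20)
)
str(ops)

# With an active iRODS session, the list could be passed on as in the text:
# imeta("data.csv", operations = ops)
```

A constructor like this fails early on a misspelled operation name instead of sending a malformed request to the server.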

Multiple operations for one item

The operations argument of imeta() can also be a dataframe in which each row describes one operation and the columns hold the same fields as the named lists above. Say, for example, that we have a standard set of metadata fields that we would like to add to the linear model:

lm_meta <- data.frame(
    attribute = c("size", "size", "data_file", "model_type"),
    value = c(as.character(nrow(fake_data)), 1, data_path, "linear regression"),
    units = c("observations", "predictors", "", "")
)
lm_meta
#>    attribute             value        units
#> 1       size                20 observations
#> 2       size                 1   predictors
#> 3  data_file          data.csv             
#> 4 model_type linear regression

We can then add a column with the operation name and pass the whole dataframe to imeta() to attach the metadata to our model data object:

lm_meta$operation <- "add"
imeta(lm_path, operations = lm_meta)
filter_ils("linear_model", ils("analysis", metadata=TRUE))
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/analysis/linear_model.rds :
#>   attribute             value        units
#>   data_file          data.csv             
#>  model_type linear regression             
#>        size                 1   predictors
#>        size                20 observations
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                                   logical_path        type
#>  /tempZone/home/rods/analysis/linear_model.rds data_object
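
One way to picture the dataframe form of the operations argument is that each row plays the role of one operation list. The sketch below, in plain R and without calling iRODS, converts between the two shapes. It only illustrates the correspondence and is not a description of how rirods processes the dataframe internally; the values are copied from the example above.

```r
# The metadata table from the text, with the operation column already added
lm_meta <- data.frame(
  operation = "add",
  attribute = c("size", "size", "data_file", "model_type"),
  value = c("20", "1", "data.csv", "linear regression"),
  units = c("observations", "predictors", "", ""),
  stringsAsFactors = FALSE
)

# Row-wise view: one named list per row, i.e. one operation per row
ops <- lapply(seq_len(nrow(lm_meta)), function(i) as.list(lm_meta[i, ]))
str(ops[[1]])

# And back again: bind the named lists into a dataframe
back <- do.call(rbind, lapply(ops, as.data.frame, stringsAsFactors = FALSE))
back
```

The dataframe form tends to be more convenient when the metadata already lives in a table, while the list form is easier to write by hand for one or two AVUs.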

Working with multiple items

If we want to add metadata to several items, however, we need to run one imeta() call per item, or loop over them with a function such as purrr::pmap():

file_md <- data.frame(
  path = c(data_path, lm_path),
  type = c("dataframe", "lm"),
  responsible = c("abby", "bob")
)
purrr::pmap(file_md, function(path, type, responsible) {
  imeta(path, operations = list(
    list(operation = "add", attribute = "type", value = type),
    list(operation = "add", attribute = "responsible", value = responsible)
  ))
})
ils(metadata=TRUE)
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/analysis :
#> data frame with 0 columns and 0 rows
#> 
#> /tempZone/home/rods/data.csv :
#>    attribute     value   units
#>  responsible      abby        
#>         size         2 columns
#>         size        20    rows
#>         type dataframe        
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                  logical_path        type
#>  /tempZone/home/rods/analysis  collection
#>  /tempZone/home/rods/data.csv data_object
ils("analysis", metadata=TRUE)
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/analysis/linear_model.rds :
#>    attribute             value        units
#>    data_file          data.csv             
#>   model_type linear regression             
#>  responsible               bob             
#>         size                 1   predictors
#>         size                20 observations
#>         type                lm             
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                                   logical_path        type
#>  /tempZone/home/rods/analysis/linear_model.rds data_object
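
The same per-item loop can be prepared in base R, without purrr. The sketch below builds one list of “add” operations per path from a table like file_md; `row_to_ops()` is a hypothetical helper introduced for illustration, and the imeta() call itself is left as a comment because it needs an active iRODS session.

```r
file_md <- data.frame(
  path = c("data.csv", "analysis/linear_model.rds"),
  type = c("dataframe", "lm"),
  responsible = c("abby", "bob"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every column except `path` becomes one "add" operation
row_to_ops <- function(row) {
  attrs <- setdiff(names(row), "path")
  lapply(attrs, function(a) {
    list(operation = "add", attribute = a, value = as.character(row[[a]]))
  })
}

# One list of operations per item, named by its path
ops_by_path <- lapply(seq_len(nrow(file_md)), function(i) row_to_ops(file_md[i, ]))
names(ops_by_path) <- file_md$path
str(ops_by_path[["data.csv"]])

# With an active iRODS session, one imeta() call per item:
# for (p in names(ops_by_path)) imeta(p, operations = ops_by_path[[p]])
```

Because the attribute names are taken from the column names, adding a column to the table adds a metadata field without changing the loop.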

Collections

Adding metadata to a collection follows the same procedure, but we do need to specify the entity type. The reason we did not specify it for data objects is that it’s the default value.

imeta(
  "analysis",
  "collection",
  operations = list(
    list(operation = "add", attribute = "dataset", value = data_path)
  ))
ils(metadata=TRUE)
#> 
#> ========
#> metadata
#> ========
#> /tempZone/home/rods/analysis :
#>  attribute    value units
#>    dataset data.csv      
#> 
#> /tempZone/home/rods/data.csv :
#>    attribute     value   units
#>  responsible      abby        
#>         size         2 columns
#>         size        20    rows
#>         type dataframe        
#> 
#> 
#> ==========
#> iRODS Zone
#> ==========
#>                  logical_path        type
#>  /tempZone/home/rods/analysis  collection
#>  /tempZone/home/rods/data.csv data_object

Querying

We can query our collections and data objects based on their metadata by passing iquery() a GenQuery statement of the form "SELECT COL1, COL2, COLN... (WHERE CONDITION)". In this statement, “COL1, COL2, COLN...” are names of columns in a database, i.e. the properties we want to obtain, and the optional condition after “WHERE” filters the results based on the metadata of collections and data objects.

For example, the query below asks for the name of every data object we have access to, together with the name of its parent collection:

iquery("SELECT COLL_NAME, DATA_NAME")
#>                      COLL_NAME        DATA_NAME
#> 1          /tempZone/home/rods         data.csv
#> 2 /tempZone/home/rods/analysis linear_model.rds

The output is a dataframe with one row per result and one column per piece of information we requested (in this case the name of the collection “COLL_NAME” and the name of the data object “DATA_NAME”). Note how the query goes through all the levels of our file system.

The query below filters collections with a metadata attribute name (“META_COLL_ATTR_NAME”) beginning with “data” and retrieves the names of the collection and its data objects (“COLL_NAME” and “DATA_NAME”) as well as the value of said metadata item (“META_COLL_ATTR_VALUE”).

iquery("SELECT COLL_NAME, DATA_NAME, META_COLL_ATTR_VALUE WHERE META_COLL_ATTR_NAME LIKE 'data%'")
#>                      COLL_NAME        DATA_NAME META_COLL_ATTR_VALUE
#> 1 /tempZone/home/rods/analysis linear_model.rds             data.csv
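
Since a GenQuery statement is just a string, it can be assembled from its parts when queries are generated programmatically. The sketch below uses a hypothetical helper, `build_genquery()`, which is not part of rirods; it assumes that several conditions are joined with AND and does no escaping of quotes inside the conditions.

```r
# Hypothetical helper (NOT part of rirods): assembles a GenQuery string
build_genquery <- function(select, where = NULL) {
  query <- paste("SELECT", paste(select, collapse = ", "))
  if (length(where) > 0) {
    # Assumption: multiple conditions are combined with AND
    query <- paste(query, "WHERE", paste(where, collapse = " AND "))
  }
  query
}

# Reproduces the two statements used so far
q1 <- build_genquery(c("COLL_NAME", "DATA_NAME"))
q2 <- build_genquery(
  c("COLL_NAME", "DATA_NAME", "META_COLL_ATTR_VALUE"),
  where = "META_COLL_ATTR_NAME LIKE 'data%'"
)
q1
q2

# With an active iRODS session:
# iquery(q2)
```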

We could also retrieve other types of information, such as the size of a data object or the creation/modification time of a collection, a data object or their metadata. For instance, the query below filters the data objects that have a metadata attribute “size” (“META_DATA_ATTR_NAME = ‘size’”) and retrieves their actual size in bytes (“DATA_SIZE”) as well as the value and units of the metadata attribute (“META_DATA_ATTR_VALUE” and “META_DATA_ATTR_UNITS”).

iquery("SELECT DATA_NAME, DATA_SIZE, META_DATA_ATTR_VALUE, META_DATA_ATTR_UNITS WHERE META_DATA_ATTR_NAME = 'size'")
#>          DATA_NAME DATA_SIZE META_DATA_ATTR_VALUE META_DATA_ATTR_UNITS
#> 1         data.csv       798                    2              columns
#> 2         data.csv       798                   20                 rows
#> 3 linear_model.rds      3926                    1           predictors
#> 4 linear_model.rds      3926                   20         observations
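
A result like this has one row per AVU, so a data object with several “size” AVUs appears several times. The sketch below regroups such a result into one named vector per data object. It uses a hand-built stand-in for the dataframe, with values copied from the output above; treating the metadata values as character strings that need converting is an assumption about what iquery() returns.

```r
# Stand-in for the query result shown above (no iRODS session needed)
res <- data.frame(
  DATA_NAME = c("data.csv", "data.csv", "linear_model.rds", "linear_model.rds"),
  DATA_SIZE = c(798, 798, 3926, 3926),
  META_DATA_ATTR_VALUE = c("2", "20", "1", "20"), # assumed to be character
  META_DATA_ATTR_UNITS = c("columns", "rows", "predictors", "observations"),
  stringsAsFactors = FALSE
)

# One named numeric vector per data object: names are units, entries are values
sizes <- lapply(split(res, res$DATA_NAME), function(d) {
  setNames(as.numeric(d$META_DATA_ATTR_VALUE), d$META_DATA_ATTR_UNITS)
})
sizes

sizes[["data.csv"]][["rows"]] # 20
```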

Columns ending in “SIZE” are parsed as numbers; similarly, columns ending in “TIME” are returned as datetime objects of class “POSIXct”. As an example, the query below retrieves the parent collection’s name (“COLL_NAME”) and the name (“DATA_NAME”), creation time (“DATA_CREATE_TIME”) and size in bytes (“DATA_SIZE”) of all data objects whose parent collection name ends in “analysis” and that are less than 8000 bytes in size.

iq <- iquery("SELECT COLL_NAME, DATA_NAME, DATA_CREATE_TIME, DATA_SIZE WHERE COLL_NAME LIKE '%analysis' AND DATA_SIZE < '8000'")
iq
#>                      COLL_NAME        DATA_NAME    DATA_CREATE_TIME DATA_SIZE
#> 1 /tempZone/home/rods/analysis linear_model.rds 2023-11-20 17:27:53      3926
class(iq$DATA_CREATE_TIME)
#> [1] "POSIXct" "POSIXt"
class(iq$DATA_SIZE)
#> [1] "numeric"
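
Because these columns come back with proper R classes, they can be used directly in arithmetic, comparisons and date handling. The sketch below works on a hand-built stand-in for the result above, so it runs without an iRODS session; the UTC time zone is an assumption made for the example, not something the output states.

```r
# Stand-in for the iquery() result above, with the classes it reports
iq <- data.frame(
  COLL_NAME = "/tempZone/home/rods/analysis",
  DATA_NAME = "linear_model.rds",
  DATA_CREATE_TIME = as.POSIXct("2023-11-20 17:27:53", tz = "UTC"), # tz assumed
  DATA_SIZE = 3926,
  stringsAsFactors = FALSE
)

# Numeric sizes: unit conversion and filtering on the R side
iq$SIZE_KB <- iq$DATA_SIZE / 1024
small <- iq[iq$DATA_SIZE < 8000, ]

# POSIXct times: extract the date, compare against a cutoff, reformat
iq$CREATE_DATE <- as.Date(iq$DATA_CREATE_TIME)
recent <- iq[iq$DATA_CREATE_TIME >= as.POSIXct("2023-01-01", tz = "UTC"), ]
format(iq$DATA_CREATE_TIME, "%Y-%m")
```

Filtering in the GenQuery statement keeps the transferred result small, while filtering in R is handy for conditions that are awkward to express in the query.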

There are a number of columns that can be used for selection or filtering. The ones that you’ll probably find most useful are shown in the table below:

Attribute           Collection              Data object
Entity level
  id                COLL_ID                 DATA_ID
  name              COLL_NAME               DATA_NAME
  creation time     COLL_CREATE_TIME        DATA_CREATE_TIME
  modification time COLL_MODIFY_TIME        DATA_MODIFY_TIME
  size              -                       DATA_SIZE
Metadata level
  attribute name    META_COLL_ATTR_NAME     META_DATA_ATTR_NAME
  value             META_COLL_ATTR_VALUE    META_DATA_ATTR_VALUE
  units             META_COLL_ATTR_UNITS    META_DATA_ATTR_UNITS
  id                META_COLL_ID            META_DATA_ID
  creation time     META_COLL_CREATE_TIME   META_DATA_CREATE_TIME
  modification time META_COLL_MODIFY_TIME   META_DATA_MODIFY_TIME

A final tip: if you request both the name of the parent collection and the name of the data object, you can concatenate them to obtain the logical path of each data object:

iq$PATH <- file.path(iq$COLL_NAME, iq$DATA_NAME)
iq
#>                      COLL_NAME        DATA_NAME    DATA_CREATE_TIME DATA_SIZE
#> 1 /tempZone/home/rods/analysis linear_model.rds 2023-11-20 17:27:53      3926
#>                                            PATH
#> 1 /tempZone/home/rods/analysis/linear_model.rds

Now you are ready to describe all your data with iRODS metadata and find anything and everything with ils() and iquery().