In this vignette we’ll look at the use of metadata with the rirods package. This guide is meant to be useful both for users familiar with iRODS who want to understand the R client better and for R users who are not familiar with iRODS metadata.
Setup
In the background we have already started an iRODS session on the demo server; our home directory “/tempZone/home/rods” is empty, as ils() shows:
ils()
#> This collection does not contain any objects or collections.
For illustration purposes, we’ll create some data objects (i.e. files). First, we simulate a study with a small dataframe and a linear model.
set.seed(1234)
fake_data <- data.frame(x = rnorm(20, mean = 1))
fake_data$y <- fake_data$x * 2 + 3 - rnorm(20, sd = 0.6)
m <- lm(y ~ x, data = fake_data)
m
#>
#> Call:
#> lm(formula = y ~ x, data = fake_data)
#>
#> Coefficients:
#> (Intercept) x
#> 3.249 2.130
Then we store the dataframe as a csv file and the linear model as an RDS object on iRODS. The csv file must be written locally first, but the linear model can be streamed directly to iRODS.
data_path <- "data.csv"
lm_path <- "analysis/linear_model.rds"
write.csv(fake_data, data_path) # write locally
iput(data_path, data_path) # transfer to iRODS
imkdir("analysis") # create directory
# save directly as rds
isaveRDS(m, lm_path)
If we add metadata=TRUE to the ils() call, we will see that these new data objects have no metadata attached to them.
ils(metadata=TRUE)
#> No metadata
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path
#> /tempZone/home/rods/analysis
#> /tempZone/home/rods/data.csv
ils("analysis", metadata=TRUE)
#> No metadata
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path
#> /tempZone/home/rods/analysis/linear_model.rds
Metadata in iRODS
In iRODS, metadata is registered as attribute-value-unit triples (AVUs) attached to collections or data objects, where the attribute is the name of the metadata field. To add an AVU with rirods we can use the imeta() function, which takes three main arguments: the path to the collection or data object, its entity type (“data_object”, which is the default, or “collection”), and a list of operations. Each operation must be a named list or vector with an operation element, which indicates whether we want to “add” or “remove” an AVU, plus the attribute (name), the value and, optionally, the units.
For example, let’s say we want to include the number of rows of our fake_data as a metadata field “nrow”. We could do something like this:
imeta(data_path, operations = list(
list(operation = "add", attribute = "nrow", value = as.character(nrow(fake_data)))
))
filter_ils(data_path, ils(metadata=TRUE))
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path attribute value units
#> /tempZone/home/rods/data.csv nrow 20
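The filter_ils() helper used above is not defined in this text. A minimal sketch of what such a helper could look like follows, assuming that the object returned by ils() behaves like a data frame with a logical_path column (as the printed output suggests). The name filter_by_path and the hand-built listing are hypothetical and are not part of rirods.

```r
# Hypothetical helper (not part of rirods): keep only the rows of a listing
# whose logical path contains `pattern` as a literal substring.
filter_by_path <- function(pattern, listing) {
  keep <- grepl(pattern, listing$logical_path, fixed = TRUE)
  listing[keep, , drop = FALSE]
}

# Stand-in for an ils() result, built by hand so the sketch runs offline.
listing <- data.frame(
  logical_path = c(
    "/tempZone/home/rods/analysis",
    "/tempZone/home/rods/data.csv"
  ),
  stringsAsFactors = FALSE
)

filtered <- filter_by_path("data.csv", listing)
print(filtered)
```

Matching on a literal substring (fixed = TRUE) avoids surprises with characters such as "." that have a special meaning in regular expressions.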
We can also have several AVUs with the same attribute name and different values or units for the same item. For example, we might want to code the number of rows and columns as a metadata field “size”. Since the old AVU is not necessary any more, we can remove it by providing a “remove” operation.
imeta(data_path, operations = list(
list(operation = "add", attribute = "size", value = as.character(nrow(fake_data)), units = "rows"),
list(operation = "add", attribute = "size", value = as.character(length(fake_data)), units = "columns"),
list(operation = "remove", attribute = "nrow", value = as.character(nrow(fake_data)))
))
filter_ils(data_path, ils(metadata=TRUE))
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path attribute value units
#> /tempZone/home/rods/data.csv size 2 columns
#> size 20
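When many operations have to be written by hand, a small constructor can cut down on repetition. The sketch below is one possible way to do it; avu_op is a hypothetical helper, not a rirods function, and it only builds the named lists described above without contacting a server.

```r
# Hypothetical constructor (not part of rirods) for a single metadata operation.
avu_op <- function(operation = c("add", "remove"), attribute, value, units = NULL) {
  operation <- match.arg(operation)  # only "add" or "remove" are accepted
  op <- list(
    operation = operation,
    attribute = attribute,
    value = as.character(value)  # values are given as character, as in the calls above
  )
  if (!is.null(units)) op$units <- units  # units are optional
  op
}

ops <- list(
  avu_op("add", "size", 20, units = "rows"),
  avu_op("add", "size", 2, units = "columns"),
  avu_op("remove", "nrow", 20)
)
str(ops)

# With an active iRODS session, the list could then be passed on:
# imeta("data.csv", operations = ops)
```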
Multiple operations for one item
Since a dataframe is itself a list (of equal-length columns), the operations argument of imeta() can also be a dataframe. Say, for example, that we have a standard set of metadata fields that we would like to add to the linear model:
lm_meta <- data.frame(
attribute = c("size", "size", "data_file", "model_type"),
value = c(as.character(nrow(fake_data)), 1, data_path, "linear regression"),
units = c("observations", "predictors", "", "")
)
lm_meta
#> attribute value units
#> 1 size 20 observations
#> 2 size 1 predictors
#> 3 data_file data.csv
#> 4 model_type linear regression
We can then add a column with the operation name and pass the dataframe to imeta() to attach the metadata to our model data object:
lm_meta$operation <- "add"
imeta(lm_path, operations = lm_meta)
filter_ils("linear_model", ils("analysis", metadata=TRUE))
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path attribute value
#> /tempZone/home/rods/analysis/linear_model.rds data_file data.csv
#> model_type linear regression
#> size 1
#> size 20
#> units
#>
#>
#>
#>
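To make the correspondence between a dataframe of operations and a list of operations concrete, the sketch below converts one into the other in base R. Whether imeta() performs this conversion internally is an assumption; the point is only that each row carries the same fields as one hand-written operation list. The values are written out literally so that the block runs on its own.

```r
lm_meta <- data.frame(
  attribute = c("size", "size", "data_file", "model_type"),
  value = c("20", "1", "data.csv", "linear regression"),
  units = c("observations", "predictors", "", ""),
  operation = "add",
  stringsAsFactors = FALSE
)

# One named list per row: the same shape as the hand-written operations.
ops <- lapply(seq_len(nrow(lm_meta)), function(i) as.list(lm_meta[i, ]))
str(ops[[1]])
```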
Working with multiple items
If we want to add metadata to several items, however, we need to run one imeta() call per item, or loop over them with a function such as purrr::pwalk():
file_md <- data.frame(
path = c(data_path, lm_path),
type = c("dataframe", "lm"),
responsible = c("abby", "bob")
)
purrr::pwalk(file_md, function(path, type, responsible) {
imeta(path, operations = list(
list(
operation = "add",
attribute = "type",
value = type
),
list(
operation = "add",
attribute = "responsible",
value = responsible
)
))
})
ils(metadata=TRUE)
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path attribute value units
#> /tempZone/home/rods/analysis <NA> <NA> <NA>
#> /tempZone/home/rods/data.csv responsible abby
#> size 2
#> size 20
#> type dataframe
ils("analysis", metadata=TRUE)
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path attribute value
#> /tempZone/home/rods/analysis/linear_model.rds data_file data.csv
#> model_type linear regression
#> responsible bob
#> size 1
#> size 20
#> type lm
#> units
#>
#>
#>
#>
#>
#>
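If purrr is not available, the same per-item loop can be sketched in base R. The version below first builds the operations for each path and leaves the imeta() call as a comment so that it runs without a server. build_ops is a hypothetical helper, not a rirods function, and the paths are written out literally.

```r
file_md <- data.frame(
  path = c("data.csv", "analysis/linear_model.rds"),
  type = c("dataframe", "lm"),
  responsible = c("abby", "bob"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn every column except `path` into an "add" operation.
build_ops <- function(row) {
  attrs <- setdiff(names(row), "path")
  lapply(attrs, function(a) {
    list(operation = "add", attribute = a, value = as.character(row[[a]]))
  })
}

# One list of operations per item, named by the item's path.
ops_by_path <- lapply(seq_len(nrow(file_md)), function(i) build_ops(file_md[i, ]))
names(ops_by_path) <- file_md$path

for (path in names(ops_by_path)) {
  # With an active iRODS session: imeta(path, operations = ops_by_path[[path]])
  message(path, ": ", length(ops_by_path[[path]]), " operations prepared")
}
```

Deriving the attribute names from the column names means that adding a column to file_md adds a metadata field to every item without touching the loop.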
Collections
Adding metadata to a collection follows the same procedure, but we do need to specify the entity type. The reason we did not specify it for data objects is that it’s the default value.
imeta(
"analysis",
operations = list(
list(operation = "add", attribute = "dataset", value = data_path)
))
ils(metadata=TRUE)
#>
#> ==========
#> iRODS Zone
#> ==========
#> logical_path attribute value units
#> /tempZone/home/rods/analysis dataset data.csv
#> /tempZone/home/rods/data.csv dataset data.csv
#> dataset data.csv
#> dataset data.csv
#> dataset data.csv
Querying
We can query our collections and data objects based on their metadata with iquery(), providing a GenQuery statement with the format "SELECT COL1, COL2, COLN... (WHERE CONDITION)". In this statement, “COL1, COL2, COLN…” are names of columns in a database, i.e. the properties we want to obtain, and the optional condition after “WHERE” provides a filter based on the metadata of collections and data objects.
For example, the query below asks for the name of every data object that we have access to, together with the name of its parent collection:
iquery("SELECT COLL_NAME, DATA_NAME")
#> COLL_NAME DATA_NAME
#> 1 /tempZone/home/rods data.csv
#> 2 /tempZone/home/rods/analysis linear_model.rds
#> 3 /tempZone/trash/home/rods 200numbers.rds
The output is a dataframe with one row per result and one column per requested property (in this case the name of the collection “COLL_NAME” and the name of the data object “DATA_NAME”). Note how the query goes through all the levels of our file system.
The query below filters collections with a metadata attribute name (“META_COLL_ATTR_NAME”) beginning with “data” and retrieves the names of the collection and its data objects (“COLL_NAME” and “DATA_NAME”) as well as the value of said metadata item (“META_COLL_ATTR_VALUE”).
iquery("SELECT COLL_NAME, DATA_NAME, META_COLL_ATTR_VALUE WHERE META_COLL_ATTR_NAME LIKE 'data%'")
#> COLL_NAME DATA_NAME META_COLL_ATTR_VALUE
#> 1 /tempZone/home/rods/analysis linear_model.rds data.csv
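Query strings like the ones above can also be assembled programmatically, which helps when the columns or conditions vary. The following is a minimal sketch; build_genquery is a hypothetical helper, not part of rirods, and it only pastes strings together without validating column names or escaping quotes in the conditions.

```r
# Hypothetical helper (not part of rirods): assemble a GenQuery string from
# a vector of column names and an optional vector of conditions.
build_genquery <- function(columns, where = NULL) {
  query <- paste("SELECT", paste(columns, collapse = ", "))
  if (!is.null(where) && length(where) > 0) {
    # Several conditions are joined with AND.
    query <- paste(query, "WHERE", paste(where, collapse = " AND "))
  }
  query
}

q <- build_genquery(
  columns = c("COLL_NAME", "DATA_NAME", "META_COLL_ATTR_VALUE"),
  where = "META_COLL_ATTR_NAME LIKE 'data%'"
)
print(q)

# With an active iRODS session: iquery(q)
```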
We could also retrieve other types of information, such as the size of a data object or the creation/modification time of a collection, a data object or their metadata. For instance, the query below filters the data objects that have a metadata attribute “size” (“META_DATA_ATTR_NAME = ‘size’”) and retrieves their actual size in bytes (“DATA_SIZE”) as well as the value and units of the metadata attribute (“META_DATA_ATTR_VALUE” and “META_DATA_ATTR_UNITS”).
iquery("SELECT DATA_NAME, DATA_SIZE, META_DATA_ATTR_VALUE, META_DATA_ATTR_UNITS WHERE META_DATA_ATTR_NAME = 'size'")
#> DATA_NAME DATA_SIZE META_DATA_ATTR_VALUE META_DATA_ATTR_UNITS
#> 1 data.csv 798 2 columns
#> 2 data.csv 798 20 rows
#> 3 linear_model.rds 3927 1 predictors
#> 4 linear_model.rds 3927 20 observations
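Because each AVU comes back as its own row, a data object with several matching AVUs appears several times in the result. One way to condense such a result into one row per data object is sketched below in base R. The dataframe size_md is a hand-built stand-in that mirrors the output above (the value column is assumed to be character), so the block runs without a server.

```r
# Stand-in for the query result shown above.
size_md <- data.frame(
  DATA_NAME = c("data.csv", "data.csv", "linear_model.rds", "linear_model.rds"),
  META_DATA_ATTR_VALUE = c("2", "20", "1", "20"),
  META_DATA_ATTR_UNITS = c("columns", "rows", "predictors", "observations"),
  stringsAsFactors = FALSE
)

# Combine value and units into a readable label, e.g. "20 rows".
size_md$label <- paste(size_md$META_DATA_ATTR_VALUE, size_md$META_DATA_ATTR_UNITS)

# Collapse the labels so that each data object occupies a single row.
summary_md <- aggregate(label ~ DATA_NAME, data = size_md, FUN = paste, collapse = ", ")
print(summary_md)
```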
Columns ending in “SIZE” are parsed as numbers; in the same way, columns ending in “TIME” are returned with the class “POSIXct”, i.e. as datetime objects. As an example, the query below retrieves the parent collection’s name (“COLL_NAME”) and the name (“DATA_NAME”), creation time (“DATA_CREATE_TIME”) and size in bytes (“DATA_SIZE”) of all data objects whose parent collection name ends in “analysis” and that are smaller than 8000 bytes.
iq <- iquery("SELECT COLL_NAME, DATA_NAME, DATA_CREATE_TIME, DATA_SIZE WHERE COLL_NAME LIKE '%analysis' AND DATA_SIZE < '8000'")
iq
#> COLL_NAME DATA_NAME DATA_CREATE_TIME DATA_SIZE
#> 1 /tempZone/home/rods/analysis linear_model.rds 2024-03-16 19:24:32 3927
class(iq$DATA_CREATE_TIME)
#> [1] "POSIXct" "POSIXt"
class(iq$DATA_SIZE)
#> [1] "numeric"
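Having these columns already typed means they can be used directly in arithmetic and comparisons on the R side. The sketch below illustrates this with iq_demo, a hand-built stand-in for the query result above; the UTC time zone and the derived column names are assumptions made for the example.

```r
# Stand-in for the `iq` result shown above (time zone assumed to be UTC).
iq_demo <- data.frame(
  COLL_NAME = "/tempZone/home/rods/analysis",
  DATA_NAME = "linear_model.rds",
  DATA_CREATE_TIME = as.POSIXct("2024-03-16 19:24:32", tz = "UTC"),
  DATA_SIZE = 3927,
  stringsAsFactors = FALSE
)

# Numeric sizes support arithmetic, e.g. converting bytes to KiB.
iq_demo$SIZE_KIB <- iq_demo$DATA_SIZE / 1024

# POSIXct timestamps support date extraction and comparisons.
iq_demo$CREATED_ON <- as.Date(iq_demo$DATA_CREATE_TIME, tz = "UTC")
cutoff <- as.POSIXct("2024-01-01", tz = "UTC")
recent <- iq_demo[iq_demo$DATA_CREATE_TIME >= cutoff, ]
print(recent)
```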
There are a number of columns that could be used for selection or filtering. The ones that you’ll probably find most useful are shown in the table below:
Attribute | Collection | Data object |
---|---|---|
Entity level | | |
id | COLL_ID | DATA_ID |
name | COLL_NAME | DATA_NAME |
creation time | COLL_CREATE_TIME | DATA_CREATE_TIME |
modification time | COLL_MODIFY_TIME | DATA_MODIFY_TIME |
size | | DATA_SIZE |
Metadata level | | |
attribute name | META_COLL_ATTR_NAME | META_DATA_ATTR_NAME |
value | META_COLL_ATTR_VALUE | META_DATA_ATTR_VALUE |
units | META_COLL_ATTR_UNITS | META_DATA_ATTR_UNITS |
id | META_COLL_ID | META_DATA_ID |
creation time | META_COLL_CREATE_TIME | META_DATA_CREATE_TIME |
modification time | META_COLL_MODIFY_TIME | META_DATA_MODIFY_TIME |
A final tip: if you request both the name of the parent collection and the name of the data object, you can concatenate them to obtain the logical path of each data object:
iq$PATH <- file.path(iq$COLL_NAME, iq$DATA_NAME)
iq
#> COLL_NAME DATA_NAME DATA_CREATE_TIME DATA_SIZE
#> 1 /tempZone/home/rods/analysis linear_model.rds 2024-03-16 19:24:32 3927
#> PATH
#> 1 /tempZone/home/rods/analysis/linear_model.rds
Now you are ready to describe all your data with iRODS metadata and find anything and everything with ils() and iquery().