For maintainers#
This section provides instructions and describes helper functionality for maintainers of the ABCD-J data catalog, and for users with the technical expertise to contribute metadata to the catalog themselves.
The catalog setup#
Note
Coming soon! This section will explain the design of the ABCD-J catalog, how it is put together using different tools, and the requirements for maintaining or contributing to the catalog.
Adding/updating catalog metadata#
Note
Coming soon! This section will explain how to use the provided Python scripts to extract and add new metadata to the ABCD-J catalog, or how to update existing metadata.
Generating a file list#
As mentioned in the user Instructions, the optional files category lists one or more files that form part of the dataset, with recognized properties:
path[POSIX] (required)
size[bytes] (optional)
checksum[md5] (optional)
url (optional)
While such a list of files can be created manually, this becomes tedious and time-consuming for datasets with many files. Scripts can be used instead to generate a full file list for a given dataset automatically. Such scripts need to generate a TSV file with the recognized properties, in the format specified in the user Instructions.
Since datasets can be stored in different formats, hosted in different locations, and accessed in different ways, it is unlikely that a single script can generate a file list for every case. We supply a script, create_tabby_filelist.py, that currently supports two options via arguments:
A folder with files, stored on local or server storage
A DataLad dataset, stored on local or server storage
Note
Customizations to support, e.g., Git repositories or compressed files are also possible.
It is therefore important to first identify whether the script will be run on a DataLad dataset or on a folder with files on a file system.
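To make the folder-with-files case more concrete, the following is a minimal sketch of how such a file list could be produced with the Python standard library alone. It is not the implementation of create_tabby_filelist.py; the function names (md5_of, list_files, write_filelist) are hypothetical, and it assumes that the TSV header row uses the recognized property names exactly as listed above.

```python
import csv
import hashlib
from pathlib import Path

# Assumption: the header row uses the recognized property names verbatim.
COLUMNS = ["path[POSIX]", "size[bytes]", "checksum[md5]", "url"]


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(dataset_dir):
    """Collect one row per regular file found below dataset_dir (recursively)."""
    root = Path(dataset_dir)
    rows = []
    for item in sorted(root.rglob("*")):
        if item.is_file():
            rows.append({
                # Paths are stored relative to the dataset root, POSIX-style.
                "path[POSIX]": item.relative_to(root).as_posix(),
                "size[bytes]": item.stat().st_size,
                "checksum[md5]": md5_of(item),
                # URLs are not known at this stage and are left empty.
                "url": "",
            })
    return rows


def write_filelist(rows, output_file):
    """Write the collected rows to a tab-separated file with a header row."""
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

Note that this sketch lists every file it finds, including hidden ones, so a real script would likely need to exclude folders such as .git.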
Before running the scripts#
First ensure that you have a recent version of Python on your machine.
Then make sure you have the data-catalog code available locally:
git clone https://github.com/abcd-j/data-catalog.git
For running the script on a folder with files, you do not need to install any further requirements.
For running the script on a DataLad dataset, please install the requirements with pip:
cd data-catalog
pip install -r requirements.txt
Running the scripts#
The create_tabby_filelist.py script can then be run as follows:
python3 code/create_tabby_filelist.py --method <glob | tree> --output <path-to-output-directory> <path-to-dataset-location>
where:
<glob | tree> should be the selected method: glob for a folder with files, or tree for a DataLad dataset
<path-to-output-directory> is where the TSV file named files@tby-ds1.tsv will be written to
<path-to-dataset-location> is the location of the dataset (folder with files or DataLad dataset)
This will generate the correct TSV file at location <path-to-output-directory>/files@tby-ds1.tsv, excluding the values for the url property.
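Before adding the generated file to the catalog, it can be useful to sanity-check it. The sketch below is one possible way to do so and is not part of the data-catalog code; check_filelist is a hypothetical helper, and it assumes that the first row of the TSV file is a header containing the recognized property names.

```python
import csv

# Assumption: these are the only column names the catalog recognizes.
RECOGNIZED = {"path[POSIX]", "size[bytes]", "checksum[md5]", "url"}
REQUIRED = "path[POSIX]"


def check_filelist(tsv_path):
    """Return a list of human-readable problems found in a file list TSV."""
    problems = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        # Report columns that are not among the recognized properties.
        for name in header:
            if name not in RECOGNIZED:
                problems.append(f"unrecognized column: {name}")
        if REQUIRED not in header:
            problems.append(f"missing required column {REQUIRED}")
            return problems
        # Every row needs a non-empty value for the required property.
        for row in reader:
            if not (row.get(REQUIRED) or "").strip():
                problems.append(f"line {reader.line_num}: empty {REQUIRED}")
    return problems
```

An empty list means that no problems were detected, for example when calling check_filelist("files@tby-ds1.tsv") on the output directory's file.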
If your dataset has a specific download URL for each file, these URLs can then be added to the TSV file. This can be done manually or with another script. Since URLs vary substantially, there is no general script that would work for all files. However, an example script can be found at add_file_urls.py, which could be customized to suit your own file URL scheme.
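As an illustration of one common URL scheme, the sketch below fills the url column by appending each file's relative path to a single base URL. It does not reproduce add_file_urls.py; the function name add_urls and the base URL https://example.org/my-dataset are hypothetical, and the sketch assumes a header row with a path[POSIX] column.

```python
import csv
from urllib.parse import quote


def add_urls(input_tsv, output_tsv, base_url):
    """Copy a file list TSV, setting url to base_url plus the relative path."""
    with open(input_tsv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    # Add the url column if the input file does not have one yet.
    if "url" not in fieldnames:
        fieldnames.append("url")

    prefix = base_url.rstrip("/") + "/"
    for row in rows:
        # quote() percent-encodes characters such as spaces but keeps "/".
        row["url"] = prefix + quote(row["path[POSIX]"])

    with open(output_tsv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

A call such as add_urls("files@tby-ds1.tsv", "files_with_urls.tsv", "https://example.org/my-dataset") would write a new file and leave the original untouched. Datasets whose URLs do not follow the file path would need a different mapping.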