Add metadata specification #77
| @@ -0,0 +1,297 @@ | |||
| :orphan: | |||
I assume this is a placeholder, but this page should be made far more accessible on the website eventually.
|
|
||
| ## Metadata Organisation Description | ||
|
|
||
| At each level of the project, a metadata file can be included that describes that level: |
Just seen that this comes later, but I think some intro would be better higher up in case people haven't come across yaml files.
Co-authored-by: Adam Tyson <code@adamltyson.com>
My only main comment about the spec is that we should have a standardised way to add:
Hi @adamltyson @niksirbi, I've updated the metadata spec based on our discussion yesterday. It reads much cleaner with the new format and is much easier to explain, so that's a good sign it is an improvement. Some points that came up when I was writing it:

1) How to handle BIDS

BIDS has two types of metadata, the

In some cases this maps OK (e.g. their participant spec). I just left a note on our spec to use PascalCase to refer to these keys, even though they are snake_case in the BIDS spec. This works because instead of tables, we use inheritance to cover multiple subjects.

In other cases this does not map very well. For example, in the ephys spec there are many detailed

The solution for now is to only list the metadata keys from BIDS

2) Indicate we are not FAIR

Related to the above, we should probably explicitly state that this metadata spec is not intended for full FAIR compliance, and that if you want this you should use another spec, e.g. BIDS. As part of the datashuttle review there was the nice suggestion to explicitly state where we are not FAIR. Maybe we can include a page on the NB website covering this, then we can link to it in the metadata spec page as well as mention it in the paper?

3) Removed 'rawdata' section

I removed the

4) Where the page should go

Currently this is an orphan page linked to from the metadata spec. To improve findability we could have two pages, 'Folder Specification' and 'Metadata Specification', both linked from the top bar?

LMK what you think of these points / the updated spec.
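To make the inheritance idea in point 1 concrete, here is a minimal sketch of how metadata from several levels could be combined into one view for a session. It assumes that a key set at a lower level overrides the same key from a higher level, which the spec discussion does not settle. The function name `resolve_metadata` and all keys and values are hypothetical, and plain dictionaries stand in for parsed metadata files.

```python
def resolve_metadata(*levels):
    """Merge metadata dicts ordered from highest level (project) to lowest (session).

    Assumption: a key defined at a lower level overrides the same key
    inherited from a higher level. This rule is illustrative only.
    """
    resolved = {}
    for level in levels:
        # Later (lower-level) dicts overwrite earlier (higher-level) keys.
        resolved.update(level)
    return resolved


# Hypothetical contents of project-, subject- and session-level metadata files.
project_meta = {"Species": "mus musculus", "Institution": "Example Lab"}
subject_meta = {"Sex": "F", "Age": 12}
session_meta = {"Age": 13, "Notes": "second recording day"}

session_view = resolve_metadata(project_meta, subject_meta, session_meta)
print(session_view)
```

With this rule, project-level keys such as `Species` only need to be written once, while a session can still record its own value for a key like `Age`.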
That all sounds good to me.
That looks like the pragmatic choice. We can open an issue about that ephys .tsv case so we don't forget about it (in case it comes up again in the future).
Agreed. We can do this with a 'warning' or 'important' admonition on the home page or near the top of the specification page.
👍🏼
👍🏼 Optionally, we could also add cards on the home page, as entry points to the 'Folder structure' and 'Metadata' sections (and hide the homepage TOC instead).
@JoeZiminski re our offline discussion earlier, this is the comment I was referring to about adding high-level metadata (ignore the AIND bit, that's dealt with here). I think this is valuable, and should be easy to add to the spec, as it's so general.
niksirbi
left a comment
Thanks @JoeZiminski.
This is conceptually much simpler to understand than the previous draft. I like it.
I have left some comments scattered throughout, feel free to address as you see fit.
| ### `behav` | ||
|
|
||
| See the | ||
| [BIDS Behavioural Experiments Specification](https://bids-specification.readthedocs.io/en/stable/modality-specific-files/behavioral-experiments.html). |
There are actually two relevant BIDS specs here. Apart from the established beh datatype you link to, there is a new proposal underway for behavioural audio/video recordings, which may be far more relevant for systems neuro users.
See https://bids.neuroimaging.io/extensions/beps/bep_047.html
| ``` | ||
|
|
||
| (datatype-keys)= | ||
| ## Datatype keys |
I'm slightly worried about the suggested datatype keys listed here per datatype. There is a risk of them getting out of sync with the upstream linked BIDS / BEP specs. The dataset description, subject and session keys should not change much (if at all) so we should keep the suggested sets for those.
But datatypes are more fragile (especially the BEPs ones not yet merged).
How about, for each datatype we begin by linking to the relevant BIDS/BEP pages, as you already do, but then in a snippet we provide just a few example fields for each datatype, making it clear that this is not the full set?
Hmm, this is true. My hesitation with linking to the spec is that it's quite confusing in terms of what are valid metadata keys, as they have the PascalCase keys for their .json (which we copy) but the snake_case keys for their .tsv (which we don't). Rather than explaining this distinction and telling them to only use PascalCase, it seems more natural to provide a list of keys and tell them to find them in the spec for more information on those specific keys.
Maybe we just include keys from merged BEPs (with the exception of electrophysiology, as it's nearly merged) and assume the key names themselves will not change? If they add keys but we miss them, it's not the end of the world.
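For reference, mapping between the two casing conventions is mechanical, so listing PascalCase keys in our spec does not block a later conversion to snake_case. A minimal sketch, assuming keys are simple PascalCase words without runs of capitals; the helper name `pascal_to_snake` and the example key `StrainRrid` are hypothetical:

```python
import re


def pascal_to_snake(key):
    """Convert a PascalCase key to snake_case.

    An underscore is inserted wherever a lowercase letter or digit is
    directly followed by an uppercase letter, then the result is lowercased.
    Runs of capitals (e.g. acronyms) are kept together.
    """
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key).lower()


for key in ["Species", "SamplingFrequency", "StrainRrid"]:
    print(key, "->", pascal_to_snake(key))
```

Keys containing acronyms would need a hand-written mapping, as this rule keeps a run of capitals together as a single word.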
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
This PR adds a metadata specification to NeuroBlueprint.
While the PR contains the proposed spec, the text below reviews the BIDS metadata specification to make sure we are not duplicating effort or missing anything useful. This follows on from #30, in which BIDS, Allen and openMINDS were reviewed. While Allen and openMINDS contain specific parts that may be of use (e.g. metadata fields for particular datatypes not covered by BIDS), in general the BIDS metadata organization approach is the most intuitive and the closest to our existing standard.
BIDS review
- Every project has a mandatory top-level `dataset_description.json` in the project root that contains information about all animals as key-value pairs.
- An optional `participants.tsv` file that contains a high-level overview of all participants in the study.
- `sessions.tsv`, which describes sessions for each subject. Again, this gives a high-level overview of all sessions within a subject.
- These `.tsv` files can be accompanied by additional `.json` descriptive files that describe in more detail what each key is (i.e. metadata on the metadata).
- Acquired or derivatives data files can be accompanied by a sidecar `.json` providing metadata fields specific to that datatype, e.g. for behaviour or microscopy. For example, the BEP for behavior specifies different types of data file (e.g. `_events.tsv`, `_physio.tsv`, `_stim.tsv`) and each may be accompanied by a respective metadata file.
- Inheritance: in BIDS, inheritance works based on the suffix. e.g.
Proposal
The BIDS specification is extremely thorough and is the perfect metadata specification for its aim: to allow full description of a published dataset so that analysis can be completely automated. I think it is also tailored to fully automated data collection pipelines that will write accompanying metadata `.json` files (e.g. as is done in imaging experiments).

The emphasis for NeuroBlueprint is a little different. For the most part, researchers asking about metadata would like to add additional notes or information to their project during acquisition, rather than have a comprehensive metadata standard. As such, the emphasis is on giving people an easy way to add information ad hoc to the project, in a lab-notebook style (i.e. easy for those without coding experience). However, this should be extensible to include more detailed metadata if required.
For the high-level metadata, the plan is to include optional `<>_metadata.yml` files at the `project`, `rawdata`, `sub` and `ses` level. `.json` files are also supported, but `.yml` is preferred as it is more human-readable. This will essentially contain the information that could be put into the BIDS `dataset_description.json`, `participants.tsv` etc., but as key-value pairs in the folder rather than as a single `.tsv`. I think this is preferred for our case, as it is easier to write these ad hoc during acquisition than to maintain a large `.tsv` table. We can construct `.tsv` tables easily from these metadata `.yml` files for BIDS compatibility if required.

For the low-level, we also have `<datatype>_metadata.yml` with required fields relevant to that datatype. This can also be included at higher levels (e.g. an `ephys` entry in the `rawdata_metadata.yml` that applies to the entire project) for an easy project overview. I think this is a good starting place.

Can we just use BIDS?
I think it is worth thinking about whether we could adopt the BIDS spec outright, or with minor changes (e.g. suggest using `.yml` rather than `.json` to keep things human-readable). Some parts we could use:

- `dataset_description.json`, `participants.tsv` and `sessions.tsv` instead of `<project>_metadata.yml` / `rawdata_metadata.yml`, `sub_metadata.yml`, `ses_metadata.yml` respectively.

  I think that for an acquisition-focused spec, it makes sense to have the metadata file for each subject located in the subject folder. Use of a single `.yml` per subject is easier to maintain than writing information into a `.tsv` shared between subjects. That being said, I can see the advantage of just adding a new line to `participants.tsv` when you acquire the data, or a new `session.tsv` when adding a new session. It does make it harder to automate though.

- `.json`) and inheritance based on suffixes

  I don't think we can directly adopt these, because we are less strict on the format of data included in the datatype folders. We also do not have suffixes for inheritance. Therefore having a simpler `<datatype>_metadata.yml` works better for our spec. Because of this simplicity, we will definitely run into cases we can't handle, or researchers who want to write more detailed metadata. In these cases I suggest we point them to the relevant datatype BIDS spec and they can adopt that directly. After all, NeuroBlueprint is supposed to be a stepping stone towards BIDS anyway.

Where the spec will break down (datatype level)
If there are multiple runs within a session, they cannot be covered in one file. There may also be lots of different data types within a folder that might be hard to combine into one metadata file (e.g. the `events`, `stim`, `physio` for behavior). We can have subfields, but this will lead to more complex metadata files. We do not mandate the format of the acquired data within the `datatype` folder, which also makes it difficult to have data-specific metadata.

Maybe we stick with this for now, and each sub-team looks into what metadata fields can be included, and how complicated these files are likely to get?
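As a rough illustration of the claim in the proposal that `.tsv` tables can be constructed from per-subject metadata files, the sketch below builds a `participants.tsv`-style table from one metadata mapping per subject. It is not part of the spec: the function name `build_participants_tsv`, the subject names and the keys are hypothetical, and in-memory dictionaries stand in for parsed `sub_metadata.yml` files (reading real `.yml` files would need a YAML library). Missing values are written as `n/a`, following the BIDS convention for tabular files.

```python
import csv
import io


def build_participants_tsv(subject_metadata):
    """Build a tab-separated table from {subject_name: metadata_dict}.

    Columns are the union of all keys, in first-seen order. Values a
    subject does not define are written as "n/a".
    """
    columns = []
    for meta in subject_metadata.values():
        for key in meta:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["participant_id", *columns])
    for subject, meta in sorted(subject_metadata.items()):
        writer.writerow([subject, *(meta.get(col, "n/a") for col in columns)])
    return buffer.getvalue()


# Hypothetical per-subject metadata, as it might be read from each subject folder.
subjects = {
    "sub-001": {"Species": "mus musculus", "Sex": "F"},
    "sub-002": {"Species": "mus musculus", "Age": 12},
}
tsv_text = build_participants_tsv(subjects)
print(tsv_text)
```

The column names are left in PascalCase here; a BIDS-compatible export would also need to rename them to the snake_case names BIDS uses in its `.tsv` files.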