Add metadata specification #77

Open
JoeZiminski wants to merge 30 commits into main from add_metdata_spec

@JoeZiminski JoeZiminski commented Sep 15, 2025

Member

This PR adds a metadata specification to NeuroBlueprint.

While the PR contains the proposed spec, the text below reviews the BIDS metadata specification to make sure we are not duplicating efforts or missing anything useful. This follows on from #30, in which BIDS, Allen and openMINDS were reviewed. While Allen and openMINDS contain specific parts that may be of use (e.g. metadata fields for particular datatypes not covered by BIDS), in general the BIDS metadata organization approach is the most intuitive and the closest to our existing standard.

BIDS review

  • Every project has a mandatory top-level dataset_description.json in the project root that contains dataset-wide information (applying to all animals) as key-value pairs.

  • An optional participants.tsv file that contains a high-level overview of all participants in the study

image
  • An optional sessions.tsv that describes sessions for each subject. Again, this gives a high-level overview of all sessions within a subject
image
  • These .tsv files can be accompanied by additional .json descriptive files that describe in more detail what each key is (i.e. metadata on the metadata).

  • Acquired or derivative data files can be accompanied by a sidecar .json file providing metadata fields specific to that datatype, e.g. for behaviour or microscopy. For example, the BEP for behavior specifies different types of data file (e.g. _events.tsv, _physio.tsv, _stim.tsv) and each may be accompanied by a respective metadata file.

  • Inheritance: in BIDS, inheritance works based on the suffix, e.g.

image

Proposal

The BIDS specification is extremely thorough and is the perfect metadata specification for its aim: to allow full description of a published dataset so that analysis can be completely automated. I think it is also tailored to fully automated data collection pipelines that write accompanying metadata .json files (e.g. as is done in imaging experiments).

The emphasis for NeuroBlueprint is a little different. For the most part, researchers asking about metadata would like to add additional notes or information to their project during acquisition, rather than have a comprehensive metadata standard. As such, the emphasis is on giving people an easy way to add information ad hoc to the project, in a lab-notebook style (i.e. easy for those without coding experience). However, this should be extensible to include more detailed metadata if required.

For the high-level metadata, the plan is to include optional <>_metadata.yml files at the project, rawdata, sub and ses levels. .json files are also supported, but .yml is preferred as it is more human-readable. These will essentially contain the information that could be put into the BIDS dataset_description.json, participants.tsv etc., but as key-value pairs in the relevant folder rather than as a single .tsv. I think this is preferable in our case, as it is easier to write these ad hoc during acquisition than to maintain a large .tsv table. We can easily construct .tsv tables from these metadata .yml files for BIDS compatibility if required.
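As a rough illustration of that last point, a participants-style table could be assembled from per-subject metadata along the following lines. This is a minimal sketch, not part of the proposed spec: the function name `metadata_to_tsv` and the example keys are assumptions, and the per-subject dictionaries stand in for metadata files that have already been parsed (e.g. with a YAML parser).

```python
import csv
import io


def metadata_to_tsv(subjects: dict[str, dict]) -> str:
    """Build a participants-style .tsv string from per-subject metadata.

    `subjects` maps a subject folder name (e.g. "sub-001") to the key-value
    pairs read from that subject's metadata file (hypothetical layout).
    """
    # Collect every key used by any subject, keeping first-seen order.
    columns: list[str] = []
    for fields in subjects.values():
        for key in fields:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["participant_id", *columns])
    for name, fields in subjects.items():
        # BIDS .tsv files use "n/a" to mark missing values.
        writer.writerow([name, *(fields.get(col, "n/a") for col in columns)])
    return buffer.getvalue()


# Example keys and values are illustrative only.
example = {
    "sub-001": {"Species": "mus musculus", "Sex": "F"},
    "sub-002": {"Species": "mus musculus", "Age": 12},
}
print(metadata_to_tsv(example))
```

Subjects that lack a key simply get `n/a` in that column, so files written ad hoc with different fields can still be combined into one table.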

For the low level, we also have <datatype>_metadata.yml files with required fields relevant to that datatype. These entries can also be included at higher levels (e.g. an ephys entry in the rawdata_metadata.yml that applies to the entire project) for an easy project overview. I think this is a good starting place.

Can we just use BIDS?

I think it is worth thinking about whether we could adopt the BIDS spec outright, or with minor changes (e.g. suggest using .yml rather than .json to keep things human readable). Some parts we could use:

  • dataset_description.json, participants.tsv and sessions.tsv instead of <project>_metadata.yml / rawdata_metadata.yml, sub_metadata.yml, ses_metadata.yml respectively.

I think that for an acquisition-focused spec, it makes sense to have the metadata file for each subject located in the subject folder. Using a single .yml per subject is easier to maintain than writing information into a .tsv shared between subjects. That being said, I can see the advantage of just adding a new line to participants.tsv when you acquire the data, or to sessions.tsv when adding a new session. It does make it harder to automate though.

  • datatype level metadata (sidecar .json) and inheritance based on suffixes

I don't think we can directly adopt these, because we are less strict on the format of data included in the datatype folders. We also do not have suffixes for inheritance. Therefore having a simpler <datatype>_metadata.yml works better for our spec. Because of this simplicity, we will definitely run into cases we can't handle, or researchers who want to write more detailed metadata. In these cases I suggest we point them to the relevant datatype BIDS spec and they can adopt that directly. After all, NeuroBlueprint is supposed to be a stepping stone towards BIDS anyway.

Where the spec will break down (datatype level)

If there are multiple runs within a session, they cannot be covered in one file. There may also be lots of different data-types within a folder, that might be hard to combine into one metadata file (e.g. the events, stim, physio for behavior). We can have subfields, but this will lead to more complex metadata files. We do not mandate the format of the acquired data within the datatype folder, which also makes it difficult to have data-specific metadata.

Maybe we stick with this for now, and each sub-team looks into what metadata fields can be included, and how complicated these files are likely to get?

@JoeZiminski JoeZiminski changed the title from "Add placeholder page." to "Add metadata specification" Sep 15, 2025
@JoeZiminski JoeZiminski marked this pull request as draft September 15, 2025 16:17
@JoeZiminski JoeZiminski marked this pull request as ready for review May 20, 2026 16:32
Comment thread docs/source/metadata.md
@@ -0,0 +1,297 @@
:orphan:

Member

I assume this is a placeholder, but this page should be made far more accessible on the website eventually.

Comment thread docs/source/metadata.md Outdated

## Metadata Organisation Description

At each level of the project, a metadata file can be included that describes that level:

Member

Just seen that this comes later, but I think some intro would be better higher up in case people haven't come across yaml files.

JoeZiminski and others added 2 commits May 21, 2026 13:27
Co-authored-by: Adam Tyson <code@adamltyson.com>
Co-authored-by: Adam Tyson <code@adamltyson.com>
JoeZiminski and others added 3 commits May 21, 2026 13:28
Co-authored-by: Adam Tyson <code@adamltyson.com>
Co-authored-by: Adam Tyson <code@adamltyson.com>
Co-authored-by: Adam Tyson <code@adamltyson.com>
@adamltyson

Member

My only main comment about the spec is that we should have a standardised way to add:

  • General notes (at any level), e.g. "power cut half way through, expt stopped"
  • High level experiment tags, e.g. "electrophysiology", "neuropixels", "decision making". These will be useful for future work to search these datasets
  • High level "abstract", e.g. "this dataset contains combined neuropixels recordings in primary visual cortex alongside video behavioural data as a mouse is shown a series of ..... The aim is to discover the neural basis of ...." This should probably be outside of the other yaml files?
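To make the search use case concrete, here is a minimal sketch of filtering datasets by tag. The key name `Tags`, the function `find_by_tag` and the example project names are all hypothetical; nothing here is fixed by the spec yet.

```python
def find_by_tag(projects: dict[str, dict], tag: str) -> list[str]:
    """Return names of projects whose metadata lists `tag` (case-insensitive)."""
    wanted = tag.strip().lower()
    matches = []
    for name, metadata in projects.items():
        # "Tags" is an assumed key holding a list of free-text strings.
        tags = [str(t).strip().lower() for t in metadata.get("Tags", [])]
        if wanted in tags:
            matches.append(name)
    return matches


# Dictionaries stand in for parsed project-level metadata files.
projects = {
    "project_a": {"Tags": ["electrophysiology", "neuropixels"]},
    "project_b": {"Tags": ["decision making"]},
    "project_c": {},  # no tags recorded
}
print(find_by_tag(projects, "Neuropixels"))
```

A flat list of strings keeps the tags easy to write by hand while still being simple to query later.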

@JoeZiminski

Member Author

Hi @adamltyson @niksirbi I've updated the metadata spec based on our discussion yesterday. It reads much cleaner with the new format and is much easier to explain so that's a good sign it is an improvement. Some points that came up when I was writing it:

1) how to handle BIDS .json vs. .tsv format

BIDS has two types of metadata, the .json format of key-value pairs (PascalCase) which maps nicely to our approach. They also have the .tsv table format (snake_case), where the metadata 'keys' are columns of a table.

In some cases this maps OK (e.g. their participant spec). I just left a note on our spec to use PascalCase to refer to these keys, even though they are snake_case on the BIDS spec. This works because instead of tables, we use inheritance to cover multiple subjects.
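For illustration, the key-name mapping could look like the sketch below. The helper `snake_to_pascal` is hypothetical, and the exact casing the spec uses for any given key (for example whether acronyms stay upper-case) is an assumption here; the column names are taken from the BIDS participants table.

```python
def snake_to_pascal(name: str) -> str:
    """Convert a snake_case column name to a PascalCase key."""
    # Split on underscores and upper-case the first letter of each part,
    # leaving the remaining characters untouched.
    return "".join(part[:1].upper() + part[1:] for part in name.split("_") if part)


for column in ["species", "strain_rrid", "participant_id"]:
    print(column, "->", snake_to_pascal(column))
```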

In other cases this does not map very well. For example, in the ephys spec there are many detailed .tsv metadata files, for example if you have multiple probes or electrodes. Putting these into the YAML would be very messy (e.g. for multiple electrodes you would need an entry for each one, all with duplicated fields).

The solution for now is to only list the metadata keys from BIDS .json and ignore the .tsv cases, unless explicitly stated (e.g. for the subject case). I think if people want such detailed metadata (e.g. for every individual electrode, detailed probe metadata) they should use BIDS. This could be explicitly stated somewhere (maybe in the datatype section at the top).

2) Indicate we are not FAIR

Related to the above, we should probably explicitly state that this metadata spec is not intended for full FAIR compliance and if you want this you should use another spec e.g. BIDS. As part of the datashuttle review there was the nice suggestion to explicitly state where we are not FAIR. Maybe we can include a page on the NB website covering this, then we can link to it in the metadata spec page as well as mention it in the paper?

3) Removed 'rawdata' section

I removed the rawdata entry from the metadata as the previous rawdata_metadata.yaml was really just a place to include datatype entries at a higher folder level. Now to do this people can put metadata.yaml at the rawdata folder level. So now there is no need for an explicit rawdata entry in metadata files.
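One way to picture how a metadata.yaml at the rawdata level could apply to everything beneath it is a merge from the top of the hierarchy downwards. The sketch below assumes that values set deeper in the hierarchy override those inherited from above, which this thread does not spell out; the function name and example keys are hypothetical, and the dictionaries stand in for parsed metadata files.

```python
def resolve_metadata(levels: list[dict]) -> dict:
    """Merge metadata from the top folder level down to the most specific one.

    `levels` is ordered from outermost (e.g. rawdata) to innermost
    (e.g. session). Later entries override earlier ones (assumed rule).
    """
    resolved: dict = {}
    for level in levels:
        # Shallow merge: a key set at a deeper level replaces the inherited one.
        resolved.update(level)
    return resolved


# Example keys and values are illustrative only.
rawdata_level = {"Species": "mus musculus", "Institution": "Example Lab"}
subject_level = {"Sex": "M"}
session_level = {"Institution": "Other Lab"}  # overrides the inherited value

print(resolve_metadata([rawdata_level, subject_level, session_level]))
```

Note that this is a shallow merge, so nested entries (such as a whole datatype block) would be replaced rather than combined; whether that is the desired behaviour is an open design choice.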

4) Where the page should go

Currently this is an orphan page linked to from the metadata spec. To improve findability we could have two pages 'Folder Specification' and 'Metadata Specification' both linked from the top bar?

LMK what you think of these points / the updated spec

@adamltyson

Member

That all sounds good to me.

@niksirbi

niksirbi commented Jun 3, 2026

Member

> The solution for now is to only list the metadata keys from BIDS .json and ignore the .tsv cases, unless explicitly stated (e.g. for the subject case). I think if people want such detailed metadata (e.g. for every individual electrode, detailed probe metadata) they should use BIDS. This could be explicitly stated somewhere (maybe in the datatype section at the top).

That looks like the pragmatic choice. We can open an issue about that ephys .tsv case so we don't forget about it (in case it comes up again in the future).

> Related to the above, we should probably explicitly state that this metadata spec is not intended for full FAIR compliance and if you want this you should use another spec e.g. BIDS. As part of the datashuttle review there was the nice suggestion (openjournals/joss-reviews#9642 (comment)) to explicitly state where we are not FAIR. Maybe we can include a page on the NB website covering this, then we can link to it in the metadata spec page as well as mention it in the paper?

Agreed. We can do this with a 'warning' or 'important' admonition on the home page or near the top of the specification page.

> I removed the rawdata entry from the metadata as the previous rawdata_metadata.yaml was really just a place to include datatype entries at a higher folder level. Now to do this people can put metadata.yaml at the rawdata folder level. So now there is no need for an explicit rawdata entry in metadata files.

👍🏼

> Currently this is an orphan page linked to from the metadata spec. To improve findability we could have two pages 'Folder Specification' and 'Metadata Specification' both linked from the top bar?

👍🏼 Optionally, we could also add cards on the home page, as entry points to the 'Folder structure' and 'Metadata' sections (and hide the homepage TOC instead).

@adamltyson

Member

@JoeZiminski re our offline discussion earlier, this is the comment I was referring to about adding high-level metadata (ignore the AIND bit, that's dealt with here). I think this is valuable, and should be easy to add to the spec, as it's so general.

@niksirbi niksirbi left a comment

Member

Thanks @JoeZiminski.

This is conceptually much simpler to understand than the previous draft. I like it.

I have left some comments scattered throughout, feel free to address as you see fit.

Comment thread docs/source/metadata.md
### `behav`

See the
[BIDS Behavioural Experiments Specification](https://bids-specification.readthedocs.io/en/stable/modality-specific-files/behavioral-experiments.html).

@niksirbi niksirbi Jun 3, 2026

Member

There are actually two relevant BIDS specs here. Apart from the established beh datatype you link to, there is a new proposal underway for Behavioural audio/video recordings, which may be far more relevant for systems neuro users.

See https://bids.neuroimaging.io/extensions/beps/bep_047.html

Member Author

we could skip this for now re: here?

Comment thread docs/source/metadata.md
```

(datatype-keys)=
## Datatype keys

Member

I'm slightly worried about the suggested datatype keys listed here per datatype. There is a risk of them getting out of sync with the upstream linked BIDS / BEP specs. The dataset description, subject and session keys should not change much (if at all) so we should keep the suggested sets for those.

But datatypes are more fragile (especially the BEP ones not yet merged).

How about, for each datatype we begin by linking to the relevant BIDS/BEP pages, as you already do, but then in a snippet we provide just a few example fields for each datatype, making it clear that this is not the full set?

Member Author

hmm this is true, my hesitation with linking to the spec is that it's quite confusing in terms of what are valid metadata keys, as they have the PascalCase keys for their .json (which we copy) but the snake_case keys for the .tsv (which we don't). Rather than explaining this distinction and telling them to only use PascalCase, it seems more natural to provide a list of keys and tell them to find them in the spec for more information on those specific keys.

Maybe we just include keys from merged BEPs (with the exception of electrophysiology as it's nearly merged) and assume the key names themselves will not change? If they add keys but we miss them it's not the end of the world?

JoeZiminski and others added 8 commits June 4, 2026 15:42
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>