|
| 1 | +# Benchmark project structure |
| 2 | + |
| 3 | +This page describes the expected folder structure and file naming conventions for pose estimation benchmark datasets. |
| 4 | + |
| 5 | +:::{note} |
| 6 | +We mark requirements with italicised *keywords* that should be interpreted as described by the [Network Working Group](https://www.ietf.org/rfc/rfc2119.txt). In decreasing order of requirement, these are: *must*, *should*, and *may*. |
| 7 | +::: |
| 8 | + |
| 9 | +## Overview |
| 10 | + |
| 11 | +A benchmark dataset is organised into a `Train` and a `Test` split. Each split contains one or more **projects** (i.e. datasets contributed by different groups). Each project contains one or more **sessions**. A session centres on a single video file (the **session video**), from which **frames** (individually sampled images) and optionally **clips** (short video segments) are extracted. In the `Train` split, frames and clips are accompanied by keypoint annotations. |
| 12 | + |
| 13 | +The current scope is limited to **single-animal pose estimation** from a **single camera view**. Support for multi-camera setups is planned for a future version. |
| 14 | + |
| 15 | +## Folder structure |
| 16 | + |
| 17 | +``` |
| 18 | +. |
| 19 | +├── Train/ |
| 20 | +│ └── <ProjectName>/ |
| 21 | +│ └── sub-<subjectID>_ses-<sessionID>/ |
| 22 | +│ ├── Frames/ |
| 23 | +│ │ ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_frame-<frameID>.png |
| 24 | +│ │ ├── ... |
| 25 | +│ │ └── sub-<subjectID>_ses-<sessionID>_cam-<camID>_framelabels.json |
| 26 | +│ ├── Clips/ (optional) |
| 27 | +│ │ ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>.mp4 |
| 28 | +│ │ ├── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_cliplabels.json |
| 29 | +│ │ └── ... |
| 30 | +│ └── sub-<subjectID>_ses-<sessionID>_cam-<camID>.mp4 |
| 31 | +└── Test/ |
| 32 | + └── <ProjectName>/ |
| 33 | + └── sub-<subjectID>_ses-<sessionID>/ |
| 34 | + ├── Frames/ |
| 35 | + │ └── sub-<subjectID>_ses-<sessionID>_cam-<camID>_frame-<frameID>.png |
| 36 | + ├── Clips/ (optional) |
| 37 | + │ └── sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>.mp4 |
| 38 | + └── sub-<subjectID>_ses-<sessionID>_cam-<camID>.mp4 |
| 39 | +``` |
| 40 | + |
| 41 | +:::{note} |
| 42 | +The `Test` split follows the same structure as `Train`, but label files (`framelabels.json` and `cliplabels.json`) *must* not be included, because the labels are held out for evaluation. |
| 43 | +::: |
| 44 | + |
| 45 | +### Train / Test |
| 46 | + |
| 47 | +* The top level *must* contain a `Train` and a `Test` folder. |
| 48 | +* Each split *must* contain at least one project folder. |
| 49 | + |
| 50 | +### Project |
| 51 | + |
| 52 | +* Each project *must* have exactly one project-level folder within a given split. |
| 53 | +* The project folder name *should* be descriptive and without spaces (e.g. `SWC-plusmaze`, `IBL-headfixed`, `AIND-openfield`). |
| 54 | + |
| 55 | +### Session |
| 56 | + |
| 57 | +* Each session *must* have exactly one session-level folder within a project. |
| 58 | +* Session folder names *must* be formatted as `sub-<subjectID>_ses-<sessionID>`. |
| 59 | +* `<subjectID>` and `<sessionID>` *must* be strictly alphanumeric (i.e. only `A-Z`, `a-z`, `0-9`). |
| 60 | +* A session folder *must* contain exactly one session video file at its root. |
| 61 | +* A session folder *must* contain a `Frames` folder. |
| 62 | +* A session folder *may* contain a `Clips` folder. |
| 63 | + |
| 64 | +:::{admonition} Examples |
| 65 | +:class: tip |
| 66 | + |
| 67 | +* valid: `sub-M708149_ses-20200317`, `sub-001_ses-01` |
| 68 | +* invalid: |
| 69 | + * `mouse-M708149_ses-20200317`: the first key must be `sub`. |
| 70 | + * `sub-M708149_20200317`: missing the `ses` key. |
| 71 | + * `sub-M70_8149_ses-20200317`: underscores are not allowed within values (ambiguous parsing). |
| 72 | + * `sub-M70-8149_ses-2020-03-17`: hyphens are not allowed within values (ambiguous parsing). |
| 73 | +::: |
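The session naming rules above can be checked mechanically. The following is a minimal sketch, assuming a regular expression is an acceptable way to express them; the names `SESSION_DIR_PATTERN` and `is_valid_session_name` are hypothetical and not part of any existing tool.

```python
import re

# `sub-<subjectID>_ses-<sessionID>`, with strictly alphanumeric values.
SESSION_DIR_PATTERN = re.compile(r"sub-[A-Za-z0-9]+_ses-[A-Za-z0-9]+")


def is_valid_session_name(name: str) -> bool:
    """Return True if `name` is a well-formed session folder name."""
    # fullmatch rejects any extra characters before or after the pattern.
    return SESSION_DIR_PATTERN.fullmatch(name) is not None


for candidate in ["sub-M708149_ses-20200317", "sub-M70_8149_ses-20200317"]:
    print(candidate, is_valid_session_name(candidate))
```

Because hyphens and underscores are excluded from the value character class, the ambiguous names listed above fail to match without any special-case handling.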
| 74 | + |
| 75 | +### Session video |
| 76 | + |
| 77 | +* All video files (session videos and clips) *should* be in MP4 format (H.264 codec, yuv420p pixel format). Contributors *should* re-encode their videos to this format before submission (see [SLEAP documentation](https://docs.sleap.ai/latest/help/#usage) for guidance). |
| 78 | +* Session video filenames *must* follow the pattern: `sub-<subjectID>_ses-<sessionID>_cam-<camID>.mp4`. |
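As an illustration of the re-encoding step, the sketch below builds an `ffmpeg` command line that requests H.264 video with the yuv420p pixel format. It assumes `ffmpeg` is installed with `libx264` support; the helper name `build_reencode_command` is hypothetical, and the linked SLEAP guidance may recommend additional options (such as quality settings) that are not shown here.

```python
def build_reencode_command(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argument list that re-encodes `src` to H.264/yuv420p."""
    return [
        "ffmpeg",
        "-i", src,              # input video
        "-c:v", "libx264",      # H.264 video codec
        "-pix_fmt", "yuv420p",  # pixel format named in this specification
        dst,                    # output path, expected to end in .mp4
    ]


command = build_reencode_command(
    "raw_recording.avi",  # hypothetical input file
    "sub-M708149_ses-20200317_cam-topdown.mp4",
)
print(" ".join(command))
```

The list can be passed to `subprocess.run(command, check=True)` to perform the conversion.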
| 79 | + |
| 80 | +### Frames |
| 81 | + |
| 82 | +The `Frames` folder contains individually sampled images and their annotations. |
| 83 | + |
| 84 | +* Frames *must* be extracted from the session video. |
| 85 | +* Frame images *must* be in PNG format. |
| 86 | +* Frame image filenames *must* follow the pattern: `sub-<subjectID>_ses-<sessionID>_cam-<camID>_frame-<frameID>.png`. |
| 87 | +* `<frameID>` *must* be the 0-based index of the frame in the session video. |
| 88 | +* `<frameID>` *must* be zero-padded to a consistent width across all frame files within a session (e.g. `0000`, `1000`). |
| 89 | +* In the `Train` split, a single label file *must* be provided per camera view, named `sub-<subjectID>_ses-<sessionID>_cam-<camID>_framelabels.json`. At present, only one camera view is included, so each `Frames` folder contains exactly one such label file. See [Label format](#label-format) for details. |
| 90 | + |
| 91 | +### Clips |
| 92 | + |
| 93 | +A session *may* include a `Clips` folder containing short video segments and their annotations. |
| 94 | + |
| 95 | +* Clips *must* be extracted from the session video and *must* have the same file format. |
| 96 | +* Clip filenames *must* follow the pattern: `sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>.mp4`. |
| 97 | +* `<frameID>` in the `start` field *must* be the 0-based index of the first frame of the clip in the session video, zero-padded to a consistent width (e.g. `0500`, `1000`). |
| 98 | +* `<nFrames>` in the `dur` field *must* be the duration of the clip in number of frames (e.g. `5`, `30`). |
| 99 | +* In the `Train` split, a single label file *must* be provided per clip, named `sub-<subjectID>_ses-<sessionID>_cam-<camID>_start-<frameID>_dur-<nFrames>_cliplabels.json`. See [Label format](#label-format) for details. |
| 100 | + |
| 101 | +## File naming |
| 102 | + |
| 103 | +All filenames follow a key-value pair convention, similar to the [BIDS standard](https://bids-specification.readthedocs.io/en/stable/02-common-principles.html) and [NeuroBlueprint](https://neuroblueprint.neuroinformatics.dev/latest/specification.html). |
| 104 | + |
| 105 | +* Filenames *must* consist of key-value pairs separated by underscores, with keys and values separated by hyphens. A filename *may* end with an additional suffix (not a key-value pair) before the extension: |
| 106 | + ``` |
| 107 | + <key>-<value>_<key>-<value>.<extension> |
| 108 | + <key>-<value>_<key>-<value>_<suffix>.<extension> |
| 109 | + ``` |
| 110 | + The recognised suffixes are `framelabels` (for frame label files) and `cliplabels` (for clip label files). |
| 111 | +* The following keys are used: |
| 112 | + |
| 113 | +| Key | Description | Examples | |
| 114 | +|---------|------------------------------------------------|-----------------| |
| 115 | +| `sub` | Subject identifier | `sub-001`, `sub-M708149` | |
| 116 | +| `ses` | Session identifier | `ses-02`, `ses-25`, `ses-20200317` | |
| 117 | +| `cam` | Camera identifier | `cam-topdown`, `cam-side2` | |
| 118 | +| `frame` | 0-based frame index in the session video | `frame-0000`, `frame-0500`, `frame-1000` | |
| 119 | +| `start` | 0-based frame index of the first frame of a clip in the session video | `start-0000`, `start-0500`, `start-1000` | |
| 120 | +| `dur` | Clip duration in number of frames | `dur-5`, `dur-30` | |
| 121 | + |
| 122 | +* The keys `sub`, `ses`, and `cam` *must* appear in every filename, in that order. |
| 123 | +* Key values *must* be strictly alphanumeric for `sub`, `ses` and `cam` (i.e. only `A-Z`, `a-z`, `0-9`). |
| 124 | +* Key values *must* be strictly numeric for `frame`, `start` and `dur` (i.e. only `0-9`). |
| 125 | +* Filenames *must* not contain spaces. |
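The naming rules above lend themselves to a small parser. The following is a minimal sketch, assuming the key set and suffixes listed in this section are exhaustive; `parse_filename` and the constants around it are hypothetical names used for illustration only.

```python
import re
from pathlib import PurePath

ALNUM_KEYS = ("sub", "ses", "cam")          # values: A-Z, a-z, 0-9
NUMERIC_KEYS = ("frame", "start", "dur")    # values: 0-9
SUFFIXES = ("framelabels", "cliplabels")


def parse_filename(filename: str) -> dict:
    """Split a filename into key-value pairs, optional suffix and extension.

    Raises ValueError if one of the naming rules is violated.
    """
    if " " in filename:
        raise ValueError("filenames must not contain spaces")
    path = PurePath(filename)
    parts = path.stem.split("_")
    # A trailing element that is not a key-value pair is the optional suffix.
    suffix = parts.pop() if parts[-1] in SUFFIXES else None
    pairs = {}
    for part in parts:
        key, sep, value = part.partition("-")
        if not sep:
            raise ValueError(f"'{part}' is not a key-value pair")
        if key in ALNUM_KEYS:
            valid = re.fullmatch(r"[A-Za-z0-9]+", value)
        elif key in NUMERIC_KEYS:
            valid = re.fullmatch(r"[0-9]+", value)
        else:
            raise ValueError(f"unknown key '{key}'")
        if not valid:
            raise ValueError(f"invalid value '{value}' for key '{key}'")
        if key in pairs:
            raise ValueError(f"duplicate key '{key}'")
        pairs[key] = value
    # `sub`, `ses` and `cam` must come first, in that order.
    if tuple(pairs)[:3] != ALNUM_KEYS:
        raise ValueError("filename must start with sub, ses, cam in that order")
    return {"pairs": pairs, "suffix": suffix, "extension": path.suffix}


print(parse_filename("sub-M708149_ses-20200317_cam-topdown_frame-01000.png"))
```

Values are kept as strings so that zero-padding is preserved; callers can convert `frame`, `start` and `dur` with `int()` when they need the numeric index.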
| 126 | + |
| 127 | +## Label format |
| 128 | + |
| 129 | +* Labels (also referred to as annotations) are only included in the `Train` split, and *must* be stored in the same folder as the corresponding frames or clips. |
| 130 | +* Annotations *must* be stored in [COCO keypoints format](https://cocodataset.org/), with some additional requirements described below. Each label file is a JSON file with `images`, `annotations`, and `categories` arrays. Image, annotation and category `id` values *must* be unique integers within a label file. |
| 131 | + |
| 132 | +:::{note} |
| 133 | +Annotation and category `id` values *should* be 1-indexed. This convention follows sleap-io's [`save_coco`](https://io.sleap.ai/latest/reference/sleap_io/io/coco/) function and avoids conflicts with models that treat category `0` as background. |
| 134 | + |
| 135 | +Image `id` indexing differs between frame and clip labels; see the sections below for details. |
| 136 | +::: |
| 137 | + |
| 138 | +### Frame labels (`framelabels.json`) |
| 139 | + |
| 140 | +* There *must* be one `framelabels.json` per camera view within the `Frames` folder. |
| 141 | +* Each entry in the `images` array *must* have an `id` equal to the integer frame index in the session video (matching the `<frameID>` in the corresponding image filename). |
| 142 | +* Each entry in the `images` array *must* have a `file_name` that matches the full filename (including the `.png` extension) of an existing frame image in the `Frames` folder. |
| 143 | + |
| 144 | +:::{admonition} Example |
| 145 | +:class: tip |
| 146 | + |
| 147 | +For a session with 5 labelled frames sampled from different parts of the video, the `images` array would be: |
| 148 | + |
| 149 | +```json |
| 150 | +[ |
| 151 | + {"id": 1000, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01000.png", "width": 1300, "height": 1028}, |
| 152 | + {"id": 2300, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-02300.png", "width": 1300, "height": 1028}, |
| 153 | + {"id": 3500, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-03500.png", "width": 1300, "height": 1028}, |
| 154 | + {"id": 7200, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-07200.png", "width": 1300, "height": 1028}, |
| 155 | + {"id": 9800, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-09800.png", "width": 1300, "height": 1028} |
| 156 | +] |
| 157 | +``` |
| 158 | + |
| 159 | +Here each `id` is the frame index in the session video (matching the `<frameID>` in the filename), and each `file_name` includes the `.png` extension. |
| 160 | +::: |
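The consistency between `id` and `file_name` can be verified with a short script. This is a minimal sketch, assuming the `images` array has already been loaded from `framelabels.json`; `check_frame_images` is a hypothetical helper, and it does not check that the image files exist on disk.

```python
import re

# Frame image filename, capturing the <frameID> digits.
FRAME_NAME = re.compile(
    r"sub-[A-Za-z0-9]+_ses-[A-Za-z0-9]+_cam-[A-Za-z0-9]+_frame-([0-9]+)\.png"
)


def check_frame_images(images: list[dict]) -> list[str]:
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    seen_ids = set()
    for entry in images:
        match = FRAME_NAME.fullmatch(entry["file_name"])
        if match is None:
            problems.append(f"bad file_name: {entry['file_name']}")
        elif int(match.group(1)) != entry["id"]:
            problems.append(f"id {entry['id']} does not match {entry['file_name']}")
        if entry["id"] in seen_ids:
            problems.append(f"duplicate id: {entry['id']}")
        seen_ids.add(entry["id"])
    return problems


images = [
    {"id": 1000, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01000.png"},
    {"id": 2300, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-02300.png"},
]
print(check_frame_images(images))
```

Converting the captured digits with `int()` means the comparison ignores zero-padding, so `frame-01000` matches `id` 1000.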
| 161 | + |
| 162 | +### Clip labels (`cliplabels.json`) |
| 163 | + |
| 164 | +* There *must* be one `cliplabels.json` per clip. |
| 165 | +* The `images` array *must* contain an entry for every frame in the clip, in consecutive, monotonically increasing order (covering the entire clip duration). |
| 166 | +* Clip labels follow the same COCO keypoints format as frame labels, but with different conventions for image `id` and `file_name` values: |
| 167 | + * Each image `id` *must* be the **0-based index of the frame within the clip** (i.e. `0`, `1`, `2`, ...), not the index in the session video. |
| 168 | + * Each `file_name` *must* follow the same pattern as frame image filenames, but **without the `.png` extension**. The `frame` field in the `file_name` *must* hold the index of that frame in the **session video**. |
| 169 | + |
| 170 | +This means that each entry in the `images` array encodes two pieces of information: the `id` gives the local position within the clip, while the `frame` field in `file_name` gives the global position in the session video. |
| 171 | + |
| 172 | +:::{admonition} Example |
| 173 | +:class: tip |
| 174 | + |
| 175 | +For a clip starting at frame 1000 with a duration of 5 frames, the `images` array would be: |
| 176 | + |
| 177 | +```json |
| 178 | +[ |
| 179 | + {"id": 0, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01000", "width": 1300, "height": 1028}, |
| 180 | + {"id": 1, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01001", "width": 1300, "height": 1028}, |
| 181 | + {"id": 2, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01002", "width": 1300, "height": 1028}, |
| 182 | + {"id": 3, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01003", "width": 1300, "height": 1028}, |
| 183 | + {"id": 4, "file_name": "sub-M708149_ses-20200317_cam-topdown_frame-01004", "width": 1300, "height": 1028} |
| 184 | +] |
| 185 | +``` |
| 186 | + |
| 187 | +Here `id: 0` through `id: 4` are the local clip indices, while `frame-01000` through `frame-01004` in the `file_name` values refer to the original frame positions in the session video. |
| 188 | +::: |
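The relationship between local and global indices can be made explicit by generating the `images` array from a clip's `start` and `dur` values. The following is a minimal sketch; `build_clip_images` and its parameters are hypothetical, and the padding width is assumed to be supplied by the caller so that it matches the rest of the session.

```python
def build_clip_images(prefix, start, n_frames, pad_width, width, height):
    """Build the `images` array for a clip.

    `prefix` is the `sub-..._ses-..._cam-...` part of the filename,
    `start` is the index of the clip's first frame in the session video,
    and `n_frames` is the clip duration in frames.
    """
    return [
        {
            "id": i,  # local, 0-based index within the clip
            # global index in the session video, zero-padded, no extension
            "file_name": f"{prefix}_frame-{start + i:0{pad_width}d}",
            "width": width,
            "height": height,
        }
        for i in range(n_frames)
    ]


images = build_clip_images(
    "sub-M708149_ses-20200317_cam-topdown", start=1000, n_frames=5,
    pad_width=5, width=1300, height=1028,
)
for entry in images:
    print(entry["id"], entry["file_name"])
```

For every entry, the `frame` value equals `start + id`, which is the invariant a validator for `cliplabels.json` could check in the opposite direction.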
| 189 | + |
| 190 | +### Visibility encoding |
| 191 | + |
| 192 | +* Keypoint visibility *must* use ternary encoding: |
| 193 | + * `0`: not labelled |
| 194 | + * `1`: labelled but not visible (occluded) |
| 195 | + * `2`: labelled and visible |
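In the COCO keypoints format, each annotation stores its keypoints as a flat list of `[x, y, v]` triplets, where `v` is the visibility flag. The sketch below shows one way to decode such a list using the ternary encoding above; `decode_keypoints` is a hypothetical helper and the coordinate values are made up for illustration.

```python
VISIBILITY = {
    0: "not labelled",
    1: "labelled but not visible (occluded)",
    2: "labelled and visible",
}


def decode_keypoints(flat):
    """Group a flat COCO `keypoints` list into (x, y, visibility label) tuples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints length must be a multiple of 3")
    decoded = []
    for i in range(0, len(flat), 3):
        x, y, v = flat[i:i + 3]
        if v not in VISIBILITY:
            raise ValueError(f"invalid visibility flag: {v}")
        decoded.append((x, y, VISIBILITY[v]))
    return decoded


# Three illustrative keypoints: visible, occluded, and not labelled.
for point in decode_keypoints([412.0, 230.5, 2, 398.2, 251.0, 1, 0, 0, 0]):
    print(point)
```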
| 196 | + |
| 197 | +## Example |
| 198 | + |
| 199 | +Below is a concrete example project structure (only the `Train` split is shown): |
| 200 | + |
| 201 | +``` |
| 202 | +Train/ |
| 203 | +└── SWC-plusmaze/ |
| 204 | + └── sub-M708149_ses-20200317/ |
| 205 | + ├── Frames/ |
| 206 | + │ ├── sub-M708149_ses-20200317_cam-topdown_frame-01000.png |
| 207 | + │ ├── sub-M708149_ses-20200317_cam-topdown_frame-02300.png |
| 208 | + │ ├── sub-M708149_ses-20200317_cam-topdown_frame-03500.png |
| 209 | + │ ├── sub-M708149_ses-20200317_cam-topdown_frame-07200.png |
| 210 | + │ ├── sub-M708149_ses-20200317_cam-topdown_frame-09800.png |
| 211 | + │ └── sub-M708149_ses-20200317_cam-topdown_framelabels.json |
| 212 | + ├── Clips/ |
| 213 | + │ ├── sub-M708149_ses-20200317_cam-topdown_start-01000_dur-5.mp4 |
| 214 | + │ └── sub-M708149_ses-20200317_cam-topdown_start-01000_dur-5_cliplabels.json |
| 215 | + └── sub-M708149_ses-20200317_cam-topdown.mp4 |
| 216 | +``` |