Case: Dataset

The INRIA Aerial Image Labeling dataset comprises 360 RGB tiles of 5000×5000 px, at a spatial resolution of 30 cm/px, covering 10 cities across the globe.

Configuration

To share your dataset on SharingHub, add the topic sharinghub:dataset to your GitLab repository, under Settings > General.

To make your dataset usable by others, you need to create a README.md file. This file should begin with a YAML section describing your dataset's metadata, followed by a markdown section.

The markdown part of your README must contain all useful information about the dataset: how to use it, in what context, how it was created, and so on.
The YAML section is delimited by a line of three dashes (---) at the top of the file and another at the end of the section. It contains the metadata presented in the [Reference].
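
As an illustration of this layout, here is a minimal Python sketch that separates the YAML section from the markdown part of a README. The function name `split_front_matter` and the sample README text are hypothetical, and the sketch assumes the delimiters sit alone on their lines; it is not how SharingHub itself reads the file.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Split a README into (yaml_metadata, markdown_body).

    Assumes the file starts with a line containing only '---' and that
    the metadata section ends at the next such line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no metadata section found
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            metadata = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return metadata, body
    raise ValueError("metadata section is never closed by '---'")


# Hypothetical README content, for illustration only.
README = """\
---
gsd: 0.3
assets:
- "*.py"
---
# My dataset

How the dataset was created and how to use it.
"""

metadata, body = split_front_matter(README)
print(metadata)  # the raw YAML text, ready for a YAML parser
print(body)      # the markdown documentation
```

The metadata string is returned unparsed, so any YAML library can be applied to it afterwards.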

Structure

The repository tree:

.
├── data
│   ├── test
│   │   ├── gt.dvc
│   │   └── img.dvc
│   ├── train.dvc
│   └── validation.dvc
├── inria_dataset.py
├── README.md
├── requirements.txt
└── samples
    ├── ground_truth.png
    └── image.png

You may notice the ".dvc" extensions: we use DVC to store the data files, so the repository itself only holds small pointer files. Learn more in the tutorial "Dataset with DVC".
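
To make the role of these pointer files concrete, the sketch below lists every ".dvc" file in a repository and the path it stands in for, relying on the DVC convention that `data/train.dvc` tracks `data/train`. The helper name `tracked_targets` is hypothetical, and the temporary directory only imitates the tree shown above with empty files.

```python
import tempfile
from pathlib import Path


def tracked_targets(repo_root: Path) -> dict[Path, Path]:
    """Map each '.dvc' pointer file to the path it stands in for."""
    targets = {}
    for pointer in sorted(repo_root.rglob("*.dvc")):
        if not pointer.is_file():
            continue  # skip DVC's own '.dvc' directory
        relative = pointer.relative_to(repo_root)
        # 'data/train.dvc' stands in for 'data/train'
        targets[relative] = relative.with_suffix("")
    return targets


# Recreate the pointer files of the tree above in a scratch directory.
POINTERS = [
    "data/test/gt.dvc",
    "data/test/img.dvc",
    "data/train.dvc",
    "data/validation.dvc",
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in POINTERS:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    for pointer, target in tracked_targets(root).items():
        print(f"{pointer} -> {target}")
```

In a real clone, the data behind these pointers is fetched with the `dvc pull` command once a DVC remote is configured.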

Metadata

Here's the project metadata:

README.md Metadata

assets:
- "*.zip"
- "*.py"

gsd: 0.3

label:
  type: vector
  properties: null
  description: "Ground truth data for two semantic classes: 'building' and 'not building'"
  classes:
    - name: Other
      classes: [0]
    - name: Building
      classes: [255]
  tasks:
    - Semantic Segmentation

Let’s break down the project's metadata.

assets: defines the files in the repository that we want to share with SharingHub. [Ref]
gsd: a standard STAC property, the ground sample distance in meters (0.3 here, matching the 30cm/px resolution). [Ref]
label: a STAC extension, adapted to the dataset use-case. [Ref]
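
The sketch below shows one way a consumer of this metadata might interpret the three fields. The values are copied from the metadata above; the helper names are hypothetical, and matching the asset patterns with `fnmatchcase` against file names is an assumption made for illustration, not a description of how SharingHub resolves them.

```python
from fnmatch import fnmatchcase

# Values copied from the README metadata above.
METADATA = {
    "assets": ["*.zip", "*.py"],
    "gsd": 0.3,  # ground sample distance, in meters per pixel
    "label": {
        "classes": [
            {"name": "Other", "classes": [0]},
            {"name": "Building", "classes": [255]},
        ],
    },
}

TILE_SIDE_PX = 5000  # tile size from the dataset description
TILE_COUNT = 360     # number of tiles from the dataset description


def tile_side_m(gsd: float, side_px: int) -> float:
    """Ground length of one tile side, in meters."""
    return gsd * side_px


def class_name(pixel_value: int, label: dict) -> str:
    """Look up the class name for a ground-truth pixel value."""
    for entry in label["classes"]:
        if pixel_value in entry["classes"]:
            return entry["name"]
    raise KeyError(f"no class declared for pixel value {pixel_value}")


def is_shared(filename: str, patterns: list[str]) -> bool:
    """Assumed semantics: a file is shared if any pattern matches its name."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


side_m = tile_side_m(METADATA["gsd"], TILE_SIDE_PX)
tile_area_km2 = (side_m / 1000) ** 2
print(f"tile side: {side_m:.0f} m")
print(f"tile area: {tile_area_km2:.2f} km2")
print(f"total area: {tile_area_km2 * TILE_COUNT:.0f} km2")
print(class_name(255, METADATA["label"]))
print(is_shared("inria_dataset.py", METADATA["assets"]))
print(is_shared("README.md", METADATA["assets"]))
```

With a gsd of 0.3, each 5000 px tile side spans 1500 m on the ground, so a tile covers 2.25 km² and the 360 tiles cover 810 km² in total.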