Case: Dataset
The INRIA Aerial Image Labeling dataset comprises 360 RGB tiles of 5000×5000 px at a spatial resolution of 30 cm/px, covering 10 cities across the globe.
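To give a sense of scale, the figures above can be turned into ground distances. The following is a minimal sketch of that arithmetic; the constant and function names are illustrative, and the total assumes the tiles do not overlap.

```python
# Figures taken from the dataset description above.
TILE_SIDE_PX = 5000   # tile width and height, in pixels
GSD_CM_PER_PX = 30    # spatial resolution, in centimetres per pixel
TILE_COUNT = 360


def tile_side_m(side_px: int = TILE_SIDE_PX, gsd_cm: int = GSD_CM_PER_PX) -> float:
    """Ground length of one tile edge, in metres."""
    return side_px * gsd_cm / 100


def total_area_km2(count: int = TILE_COUNT) -> float:
    """Ground area covered by all tiles, assuming they do not overlap."""
    side_km = tile_side_m() / 1000
    return count * side_km ** 2


if __name__ == "__main__":
    print(tile_side_m())      # 1500.0 metres per tile edge
    print(total_area_km2())   # 810.0 square kilometres in total
```

Each tile therefore spans 1.5 km on a side, or 2.25 km² of ground.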
Configuration
To share your dataset on SharingHub, add the topic sharinghub:dataset to your GitLab repository, under Settings > General.
To make your dataset usable by others, you need to create a README.md file. This file should begin with a YAML section describing your dataset's metadata, followed by a Markdown section:
- The Markdown part of your README must contain all useful information about the dataset: how to use it, in what context, how it was created, and so on.
- The YAML section is delimited by three dashes (---) at the top of your file and at the end of the section. It contains the metadata presented in the [Reference].
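The README layout described above can be illustrated with a small sketch that separates the YAML section from the Markdown body. This is not how SharingHub itself reads the file; split_front_matter and EXAMPLE_README are hypothetical names, and the function only splits the text without parsing the YAML.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Return (yaml_text, markdown_body) for a README that starts with a
    '---' delimited section. yaml_text is empty when no such section exists."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_text = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return yaml_text, body
    # Opening delimiter without a closing one: treat everything as Markdown.
    return "", text


# Hypothetical README content, reduced to a single metadata key.
EXAMPLE_README = """---
gsd: 0.3
---
# My dataset

How to use it, and how it was created.
"""

if __name__ == "__main__":
    meta, body = split_front_matter(EXAMPLE_README)
    print(meta)
    print(body)
```

The returned YAML text could then be handed to a YAML parser of your choice to inspect the metadata.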
Structure
The repository tree:
.
├── data
│   ├── test
│   │   ├── gt.dvc
│   │   └── img.dvc
│   ├── train.dvc
│   └── validation.dvc
├── inria_dataset.py
├── README.md
├── requirements.txt
└── samples
    ├── ground_truth.png
    └── image.png
You may notice the ".dvc" extensions: we use DVC to store the data files. Learn more in the tutorial "Dataset with DVC".
Metadata
Here's the project metadata:
README.md Metadata
assets:
  - "*.zip"
  - "*.py"
gsd: 0.3
label:
  type: vector
  properties: null
  description: "Ground truth data for two semantic classes: 'building' and 'not building'"
  classes:
    - name: Other
      classes: [0]
    - name: Building
      classes: [255]
  tasks:
    - Semantic Segmentation
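To get a feel for what the assets patterns in the metadata above select, here is a minimal sketch using Python's standard fnmatch module. It assumes the patterns are shell-style globs applied to file names, which may differ from SharingHub's actual matching rules; select_assets is a hypothetical helper, and data/train.zip is a made-up path used only for illustration.

```python
from fnmatch import fnmatch

# Patterns copied from the metadata above.
ASSET_PATTERNS = ["*.zip", "*.py"]


def select_assets(paths, patterns=ASSET_PATTERNS):
    """Keep the paths whose file name matches at least one pattern.

    Assumption: patterns are matched against the file name only,
    not against the full path.
    """
    selected = []
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        if any(fnmatch(file_name, pattern) for pattern in patterns):
            selected.append(path)
    return selected


# Hypothetical file listing; "data/train.zip" is not in the tree shown earlier.
repo_files = [
    "inria_dataset.py",
    "README.md",
    "requirements.txt",
    "samples/image.png",
    "data/train.zip",
]

if __name__ == "__main__":
    print(select_assets(repo_files))  # ['inria_dataset.py', 'data/train.zip']
```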
Let’s break down the project's metadata.
- assets: defines the files in the repository that we want to share with SharingHub. [Ref]
- gsd: pure STAC property. [Ref]
- label: a STAC extension, adapted to the dataset use case. [Ref]
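As an illustration of how the classes entries under label might be used, the sketch below builds a lookup from values to class names. It assumes that the numbers listed under classes (0 and 255) are the pixel values found in the ground-truth masks; LABEL_CLASSES, build_value_to_name, and name_mask_row are hypothetical names, not part of SharingHub or STAC.

```python
# Class definitions copied from the label metadata above.
LABEL_CLASSES = [
    {"name": "Other", "classes": [0]},
    {"name": "Building", "classes": [255]},
]


def build_value_to_name(label_classes):
    """Map every listed value to the name of the class it belongs to."""
    mapping = {}
    for entry in label_classes:
        for value in entry["classes"]:
            mapping[value] = entry["name"]
    return mapping


VALUE_TO_NAME = build_value_to_name(LABEL_CLASSES)


def name_mask_row(row):
    """Translate one row of mask values into class names.

    Assumption: every value in the row is listed in the metadata;
    an unknown value raises KeyError.
    """
    return [VALUE_TO_NAME[value] for value in row]


if __name__ == "__main__":
    print(name_mask_row([0, 255, 255]))  # ['Other', 'Building', 'Building']
```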