Releasing a dataset
A dataset becomes part of a published OpenSanctions distribution by being added to a collection's children: list.
A merged crawler runs in isolation: it produces data, but that data reaches no published product until the dataset joins a collection. Release a dataset once its crawler has passed review, an initial run, and deduplication against existing data.
How collections work
Collections are datasets that combine the entities of other datasets. They are defined as YAML files in datasets/_collections/, each with a children: list naming its member datasets:
type: collection
title: Politically Exposed Persons Datasets
children:
- ad_consell_general
- al_kuvendi
- am_hetq_officials
# ...
Membership is declared in the collection's children: list, and nowhere else. The dataset's own metadata carries no reverse pointer, so releasing a dataset means editing the collection rather than the dataset.
Collections nest. The default collection lists the topical collections (sanctions, peps, crime, and so on) as its children, and those in turn list the individual datasets. Membership rolls up recursively, so adding a dataset to a topical collection also includes it in default, the broadest public distribution. Edit a topical collection; do not add an ordinary dataset directly to default.
Choosing a collection
Add the dataset to the most appropriate topical collection based on the kind of entities it produces. The quickest way to decide is to find a similar dataset and see which collection it is included in.
| Collection | Holds |
|---|---|
peps |
Parliaments, elected officials, PEP/asset declarations |
sanctions |
Consolidated sanctions lists |
eu_sanctions |
European Union sanctions sources |
crime |
Criminal, investigative, and leaked-records entities |
wanted |
Wanted-persons lists |
debarment |
Debarred companies and individuals |
regulatory |
Regulatory watchlists and investor alerts |
enforcement |
Enforcement actions, litigation, and regulatory notices |
maritime |
Vessels, port-state-control, and other maritime sources |
securities |
Sanctioned or flagged securities issuers |
special_interest |
Special-interest watchlists |
A dataset may belong to more than one collection if it genuinely fits. If the topic is ambiguous, ask the team rather than guessing.
Steps
-
Add the dataset name to the chosen collection's
children:list indatasets/_collections/<COLLECTION>.yml. Replace<COLLECTION>with the collection name, and keep the list alphabetically sorted. -
Bump
coverage.startin the dataset's own YAML to the release date (today). This field records the date the dataset first entered thedefaultcollection, which is usually later than when the crawler was scaffolded. Do not set it to the date the source data begins covering. -
Verify that the dataset is now part of a collection:
The check warns "<name> has no collections" for any non-collection dataset that is not a child of any collection. After releasing, the dataset you added no longer appears in the output. Warnings for other unreleased datasets are expected, so release only the datasets you intend to.
What not to release
- Test or scratch datasets.
_externals/and_analysis/datasets, which are wired into the pipeline differently. Leave these to the maintainers.