Dataset metadata

Excellent dataset metadata is a relatively low-effort way to demonstrate the transparency which underpins OpenSanctions. Write it considering the perspective of data users ranging from startup software developers and business analysts, to investigative journalists and researchers.

Remember to give the context that people from different countries need to make sense of systems they are not entirely familiar with. Share what you learned when figuring out what a source dataset represents.

Use the .yml extension.

Properties:

Basics

  • title - As close as possible to an official title for what this dataset contains. If it is a subset of its source data, try to capture that. e.g. Plural Legislators - if the Plural portal includes committees but the dataset only captures the legislators. Prefix the dataset's name with the country name as Country (and not Country's).
  • entry_point - The file name, optionally followed by the name of the method that the zavod crawl command calls, e.g. crawler.py:crawl_peps. Defaults to crawler.py:crawl, resolved within the dataset directory.
  • prefix - The prefix used by entity id helpers, e.g. gb-coh or ofac - try to make this short but unique across datasets, unless you would like different datasets to intentionally generate overlapping keys. See the entity ID guide for the shape and stability rules that apply to IDs.
  • summary - A short (50-90 char) line that complements the title to identify the source. Add detail the title doesn't already convey rather than repeating it. Shown in search results, so keep it to one clear line.
  • description - One to three paragraphs describing the characteristics of the source and the dataset: what it contains, what it includes or excludes, and the context a reader needs to make sense of it. Write for a compliance or domain-expert reader, not an engineer. Describe the source, not the routine ETL mechanics; don't narrate fetching, parsing, or lookups. Note significant limitations that affect how a reader should trust or interpret the data, such as a dataset maintained manually because the source is a PDF or behind an access block, or coverage that updates irregularly. Skip minor field-level gaps such as missing dates of birth; those belong in the data itself, not the prose.
  • url - The home page or most authoritative place where someone can read about this particular dataset at its source. E.g. if a source publishes 5 different datasets, try to link to the page describing the data actually contained in this dataset.
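
Taken together, the basic properties might look like the sketch below. Every value is hypothetical (the country "Exampleland", the xl-parl prefix and the parliament.example URL are placeholders), and a real metadata file may need keys that this section does not cover.

```yaml
# Hypothetical values throughout; only the keys come from the list above.
title: Exampleland Members of Parliament
entry_point: crawler.py:crawl  # the default, spelled out only for illustration
prefix: xl-parl
summary: Sitting legislators with party affiliation and constituency details
description: |
  Members of the national parliament of Exampleland, as published on the
  parliament's website. The dataset covers sitting members only;
  committees and former members are not included.
url: https://parliament.example/members
```

The summary adds detail instead of repeating the title, and the description talks about the source and its scope, not about how the crawler fetches it.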

Data Coverage

  • coverage
    • frequency - e.g. daily, weekly, monthly, never. This represents how often it is expected that this dataset will be updated. It conveys to users how often to expect updates, and will also be used to generate a crawling schedule unless a specific schedule is defined.
    • start - The date the dataset was first included in the default collection — i.e. the date the crawler was added to OpenSanctions. Use today's date when scaffolding a new crawler. A string in the format YYYY-MM-DD. Do not set this to the date the source data begins covering (e.g. an election date or the start of a parliamentary term).
    • end - The end date of a dataset which covers only a specific period in time, e.g. for a dataset specific to a data dump or parliamentary term. A string in the format YYYY-MM-DD. Future dates imply an expected end to the maintenance and coverage period of the dataset. Past end dates result in the dataset's last_change date being fixed to that date, while its last_exported date remains unchanged.
    • schedule - string - a cron-style schedule defining when and how often a crawler should run, e.g. 30 */6 * * *
    • Data sources that don't receive updates are marked with frequency never and must have their schedule defined explicitly (usually coverage.schedule: @monthly, just to keep consistent with FTM updates). You may want to set disabled: true for sources that are no longer available so that the metadata can still be published without attempting to crawl the source.
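
As a rough sketch of the two situations described above, the fragment below shows a regularly updated source and a source that no longer receives updates. The dates are made up, and placing disabled at the top level of the file (outside coverage) is an assumption.

```yaml
# A regularly updated source (illustrative dates).
coverage:
  frequency: daily
  start: "2024-01-15"       # date the crawler was added, not when the source data begins
  schedule: "30 */6 * * *"  # optional; a specific schedule replaces the one generated from frequency
---
# A source that no longer receives updates.
coverage:
  frequency: never
  start: "2023-06-01"
  end: "2023-12-31"         # past end date: last_change is fixed to this date
  schedule: "@monthly"
disabled: true              # assumed top-level key; only for sources that are gone
```

The dates are quoted so that they stay strings in the YYYY-MM-DD format and are not turned into YAML date values.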

Deployment

  • deploy
    • premium - boolean - whether its compute instance may be evicted, restarting the job. Set to true for jobs running for several hours.

Continuous Integration

  • ci_test - boolean, default true. If true, the crawler is run in CI when its Python or YAML is modified. Set to false for extremely slow crawlers, or those that require credentials, and then take extra care when modifying them.
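
For a crawler that runs for several hours and needs credentials, the deployment and CI settings could be combined as in this minimal sketch. The scenario is hypothetical; the keys are the ones described above.

```yaml
# Hypothetical long-running crawler that requires credentials.
deploy:
  premium: true   # recommended above for jobs running for several hours
ci_test: false    # too slow and credential-dependent to run on every change
```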

Exports

  • exports - An array of strings matching the export formats, e.g. "targets.nested.json". The default is best for most cases.
  • load_statements - Whether the statements should be loaded to a SQL table after the run. Usually false for collections and enrichment targets like company registries, and true for normal datasets and enrichers.

Tags

tags are a controlled vocabulary used to categorize datasets by shared attributes such as legal basis, list type, target country, or sector. They support cross-referencing within specific scopes, such as distinguishing between sanctions, PEPs, and regulatory actions, and enable users to select the most relevant datasets for a given country, sector, or risk category.

Currently, tags cover the following dimensions:

  • list type (e.g. list.sanction, list.pep)
  • issuer and jurisdiction (e.g. issuer.west, juris.eu)
  • target countries (e.g. target.ru, target.us)
  • sectors (e.g. sector.financial, sector.maritime)
  • risk themes (e.g. risk.klepto)

You can find a full overview of available tags here.
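
A tag list for a hypothetical EU sanctions dataset might look like the fragment below. It assumes that tags is a top-level list of strings; the tag names are taken from the examples in this section, and the full overview remains the authority on which values are valid.

```yaml
# Assumes tags is a top-level list of strings.
tags:
  - list.sanction
  - juris.eu
  - target.ru
  - sector.financial
```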

Publisher

  • publisher
    • name - The publisher's official name. If the publisher's primary name is in a non-English language of the originating country, use that language here and put the English form in publisher.name_en.
    • name_en - Their name in English, ideally the official form, otherwise a translation.
    • acronym - Add if there's an official acronym, e.g. check in their domain name, footer, about page.
    • description - This can be one to two paragraphs of text. Use the publisher description field to explain to someone from a country other than the publisher who the publisher is, and why they do what they do.
    • url - The home page of their official website
    • country - The two-letter ISO 3166-1 Alpha-2 country code.
    • official - true if the publisher is an authority overseeing the subject data, generally a government entity releasing their sanctions list or legislator data, otherwise false.
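
The sketch below fills in the publisher block for a publisher whose official name is not in English, using Germany's financial regulator as an illustration. The description text is written for this example, and the lower-case country code is an assumption about the expected casing.

```yaml
publisher:
  name: Bundesanstalt für Finanzdienstleistungsaufsicht
  name_en: Federal Financial Supervisory Authority
  acronym: BaFin
  description: |
    BaFin is Germany's financial regulator. It supervises banks, insurers
    and securities trading, and publishes warnings and enforcement
    measures as part of that role.
  url: https://www.bafin.de
  country: de      # ISO 3166-1 Alpha-2; lower case is an assumption
  official: true   # a government authority overseeing the subject data
```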

Source data

  • data
    • url - The link to a bulk download or API base URL or endpoint - ideally something you can use within the crawler via context.data_url to request the data, and which ideally returns a useful response when followed by dataset users. It's not the end of the world if you make other requests to expand the data available to the crawler.
    • format - A string defining the format of the data at that URL, e.g. JSON, HTML, XML. A Zip file containing thousands of YAML files might be more usefully annotated with YAML than ZIP because it conveys the structural syntax of the data.
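
A minimal sketch of the data block follows. The URL is a placeholder for a bulk download, not a real endpoint.

```yaml
data:
  url: https://data.example/sanctions/export.json  # hypothetical bulk download
  format: JSON
```

Inside the crawler, this URL is what context.data_url refers to.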

Date formatting

  • dates - date formatting used by helpers.apply_date and apply_dates but also accessible via the context for use in helpers.parse_date. See the date parsing guide for usage patterns and worked examples.
    • formats: Array of date format strings for parsing dates into partial ISO dates.
    • months: Map where values like März are translated into keys like "3" so that the result can then be parsed by a format string like %m.
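
As an illustration of how the two settings are meant to work together, the fragment below targets a source that writes dates like "15. März 2024". It assumes that formats and months are nested under dates, that a single string is accepted as a map value, and that the month name is replaced by its key before the format strings are tried; the date parsing guide is the place to confirm these details.

```yaml
dates:
  formats:
    - "%d. %m %Y"   # intended for "15. 3 2024", i.e. after month replacement
    - "%d.%m.%Y"    # intended for "15.03.2024"
  months:
    "3": "März"     # "15. März 2024" would become "15. 3 2024"
```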

HTTP options

GET requests are automatically retried on connection and HTTP errors. Some of this retry behaviour can be configured from the dataset metadata if needed.

  • http
    • user_agent: string, defaults to the value of the FTM_USER_AGENT setting. Set a custom value for the User-Agent header if needed.
    • backoff_factor: float, default 1. Scales the exponential backoff.
    • max_retries: integer, default 3. The maximum number of retries for a request.
    • retry_methods: List of strings, default ['DELETE', 'GET', 'HEAD', 'OPTIONS', 'PUT', 'TRACE']
    • retry_statuses: List of integers of HTTP error codes to retry, default [413, 429, 500, 502, 503, 504].
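
For a source that rate-limits aggressively, the retry options could be tuned along the lines of this sketch. The values are illustrative, not recommendations, and the user agent string is made up.

```yaml
http:
  user_agent: "Mozilla/5.0 (compatible; example-crawler)"  # hypothetical value
  backoff_factor: 2                      # default is 1
  max_retries: 5                         # default is 3
  retry_statuses: [429, 502, 503, 504]   # narrower than the default list
```

Options left out of the block keep the defaults listed above.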

Data assertions

Data assertions are intended to "smoke test" the data. Assertions are checked on export. If assertions aren't met, warnings are emitted, and a failed min assertion is additionally fatal to the export.

Data assertions are checked when running zavod run (and zavod validate is useful when developing a crawler).

Data assertions are useful to communicate our expectations about what's in a dataset. min assertions set a baseline for what should be in the dataset and are fatal to the export if they fail. max assertions emit a warning when the dataset has grown beyond the validity of our earlier baseline (or when something has gone horribly wrong and the crawler has emitted far more than expected).

It's a good idea to add assertions at the start of writing a crawler, and then see whether those expectations are met when the crawler is complete. A good rule of thumb for datasets that change over time is to set minima about 10% below the expected number to allow for normal variation, unless there's a known hard minimum, and maxima at around twice the expected number of entities to leave room to grow.

A basic assertion block can look like this:

assertions:
  min:
    schema_entities:
      Person: 160  # at least 160 Person entities
      Position: 30  # at least 30 Position entities
    entities_with_prop:
      Company:
        taxNumber: 10  # at least 10 Companies have a tax number set
  max:
    schema_entities:
      Person: 400  # at most 400 Person entities
      Position: 80  # at most 80 Position entities
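
To make the rule of thumb concrete, suppose a dataset is expected to contain about 200 Person entities (a made-up figure). Ten percent below that gives a minimum of 180, and twice the expected number gives a maximum of 400:

```yaml
# Expected count: roughly 200 Person entities (hypothetical).
assertions:
  min:
    schema_entities:
      Person: 180  # 200 minus 10% to allow for normal variation
  max:
    schema_entities:
      Person: 400  # 2 x 200 to leave room to grow
```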

Assertion types

schema_entities asserts on the number of entities of a given schema.

country_entities asserts on the number of entities associated with a country in any of its properties. All properties with type country are considered (among them the usual suspects such as country, jurisdiction and citizenship). Countries are given as ISO 3166-1 Alpha-2 country codes.

countries asserts on the number of distinct countries expected to appear in the dataset.

entities_with_prop asserts on the number of entities of a given schema that have a given property set.

property_fill_rate asserts on the proportion of entities of a given schema that have a given property set, expressed as a float between 0 and 1.

assertions:
  min:
    property_fill_rate:
      Person:
        birthDate: 0.7  # at least 70% of Persons have a birth date
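
The country-based assertion types have no example above, so here is a hedged sketch. It assumes that country_entities maps country codes to counts in the same way schema_entities maps schema names to counts, and that countries takes a single number; the thresholds are invented.

```yaml
assertions:
  min:
    country_entities:
      de: 50      # at least 50 entities associated with Germany
      fr: 20      # at least 20 entities associated with France
    countries: 5  # at least 5 distinct countries appear in the dataset
  max:
    countries: 40 # warn if more than 40 distinct countries appear
```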