Developing a crawler
A crawler is a small Python script that imports data from a web source and stores it as entities in a data source. zavod defines a framework for crawlers to retrieve data, parse it, and emit structured data about people or companies into a database.
Please note
Before you contribute a crawler, please consider whether you are willing to remain involved in its maintenance after the code is included in zavod. Maintaining a crawler is an ongoing commitment, and the OpenSanctions team does not automatically assume that responsibility for code contributed by others. See our general inclusion criteria.
- Make sure you have installed zavod and set the required environment variables, specifically `ZAVOD_RESOLVER_PATH` and `ZAVOD_SYNC_POSITIONS`.
- File a GitHub issue to discuss the suggested source.
- Create a YAML metadata description for the new source.
- Create a Python script to fetch and process the data.
- Address any data normalisation issues the framework might report.
Data source metadata
Before programming a crawler script, you need to create a YAML file with some basic metadata to describe the new dataset. That information includes the dataset name (which is normally derived from the YAML file name), information about the source publisher and the source data URL.
The metadata file must also include a reference to the entry point, the Python code that should be executed in order to crawl the source.
Create a new YAML file at the path `datasets/cc/source/cc_source.yml`, replacing `cc` with the relevant two-letter ISO country code and `source` with an acronym or short name for the source, separating name parts with underscores. For regions that span more than one country, a code derived from a standard acronym may be used instead of a country code.
Important
Metadata is essential to making our data useable. We will not merge additional crawlers which don't have metadata, or where the descriptions are cryptic.
Read your metadata as if it's your first time, and ask yourself if other readers will understand the scope and limitations of the dataset. For simple crawlers, writing the metadata can take as much time as writing the code.
The contents of the new metadata file should look something like this brief example; see the full metadata documentation for all the required fields:
```yaml
name: eu_fsf_demo
title: "Financial Sanctions Files (FSF)"
url: https://eeas.europa.eu/
load_db_uri: ${OPENSANCTIONS_DATABASE_URI}
coverage:
  frequency: daily
  start: 2024-03-19
# The description should be extensive, and can use markdown for formatting:
description: >
  As part of the Common Foreign and Security Policy the European Union publishes
  a sanctions list that is implemented by all member states.
# The Python module in the same directory that contains the crawler code:
entry_point: crawler.py
# A prefix will be used to mint entity IDs. Keep it short.
prefix: eu-fsf
# This section provides information about the original publisher of the data,
# often a government authority:
publisher:
  name: European Union External Action Service
  acronym: EEAS
  official: true
  description: The EEAS is the EU's diplomatic service, and carries out the EU's foreign and security policy.
  country: eu
  url: https://eeas.europa.eu/topics/sanctions-policy/8442/consolidated-list-of-sanctions_en
# Information about the data, including a deep link to a downloadable file, if
# one exists.
data:
  url: https://webgate.ec.europa.eu/europeaid/fsd/fsf/public/files/xmlFullSanctionsList_1_1/content
  format: XML
```
Running a dataset crawler
Once that YAML file is stored in the correct folder, you should be able to run command-line operations against the dataset. For example (if your metadata file is named `eu_fsf_demo.yml`):
```
$ zavod crawl datasets/eu/fsf/eu_fsf_demo.yml
....
2023-08-01 12:36:24 [warning ] No backfill bucket configured [zavod.archive]
2023-08-01 12:36:24 [info    ] Running dataset [eu_fsf_demo] dataset=eu_fsf_demo path=/home/you/opensanctions/data/datasets/eu_fsf_demo
2023-08-01 12:36:24 [error   ] Runner failed: Could not load entry point: crawler [eu_fsf_demo] dataset=eu_fsf_demo
```
Don't worry about the backfill bucket warning - that is not needed when developing crawlers. It is used in production to automatically track when data was previously seen and updated.
The `Runner failed: Could not load entry point: crawler` error indicates that zavod looked for our crawler script and couldn't find it. Adding the crawler script is the next step.
Dry run mode
You can switch zavod to dry run mode during crawler development by adding the `-d` (or `--dry-run`) flag on the command line. A dry run will not store any of the emitted data and disables the generation of correct timestamps, which is slow.
```
zavod crawl -d datasets/eu/fsf/eu_fsf_demo.yml
```
Developing a crawler script
In order to actually feed data into the data source, we need to write a crawler script. The script location is specified in the YAML metadata file as `entry_point`. This also means you can reference the same script from multiple data sources, for example when two data sources use the same API with slightly different parameters.
In our example above, we'd create a file at `datasets/eu/fsf/crawler.py` with a crawler skeleton:
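```python
# A minimal skeleton: zavod calls the crawl() function defined in the
# entry_point module and passes in the runtime context.
def crawl(context):
    context.log.info("Hello, World!")
```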
Running the crawler (`zavod crawl datasets/eu/fsf/eu_fsf_demo.yml`) should now produce a log line with the message `Hello, World!`.

You'll notice that the `crawl()` function receives a `Context` object. Think of it as a sort of sidekick: it helps you create, store and document data in your crawler.
Fetching and storing resources
Many crawlers will start off by downloading a source data file, like a CSV table or an XML document. The `context` provides utility methods that let you fetch a file and store it in the crawler's working directory. Files stored to the crawler home directory and exported as resources will later be uploaded and published to the web.
```python
def crawl(context):
    # Fetch the source data URL specified in the metadata to a local path:
    source_path = context.fetch_resource('source.xml', context.dataset.data.url)
    with open(source_path, 'r') as fh:
        print(len(fh.read()))

    # You can also register the file as a resource with the dataset that
    # will be included in the exported metadata index:
    context.export_resource(source_path, title="Source data XML file")
```
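Once the file is on disk, the crawler will typically parse it. The sketch below uses lxml to iterate over records in the downloaded XML; the `sanctionEntity` element name is an assumption made for illustration, not taken from the real schema:

```python
from lxml import etree

def crawl(context):
    source_path = context.fetch_resource('source.xml', context.dataset.data.url)
    context.export_resource(source_path, title="Source data XML file")
    # Parse the downloaded file into an XML document tree:
    doc = etree.parse(str(source_path))
    # Iterate over the records; the element name here is an assumed example:
    for entry in doc.findall('.//sanctionEntity'):
        context.log.info("Record", tag=entry.tag)
```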
Other crawlers might not be as lucky: instead of fetching their source data as a single bulk file, they might need to crawl a large number of web pages to collect the necessary data. For this, access to a pre-configured Python `requests` session object is provided:
```python
from lxml import html

def crawl(context):
    response = context.http.get(context.dataset.data.url)
    # Parse the HTTP response into an lxml DOM:
    doc = html.fromstring(response.text)
    # Query the DOM for specific elements to extract data from:
    for element in doc.findall('.//div[@class="person"]'):
        context.log.info("Element", element=element)
```
Responses from the `context.http` session can also be cached using built-in helper methods:
```python
from lxml import html

def crawl(context):
    # Fetch, cache and parse the HTTP response into an lxml DOM:
    doc = context.fetch_html(context.dataset.data.url, cache_days=7)
    # Query the DOM for specific elements to extract data from:
    for element in doc.findall('.//div[@class="person"]'):
        context.log.info("Element", element=element)
```
Creating and emitting entities
The goal of each crawler is to produce data about persons and other entities of interest. To enable this, the `context` provides a number of helpers that construct and store entities:
```python
def crawl(context):
    # Create an entity object to which other information can be assigned:
    entity = context.make("Person")

    # Each entity needs an ID which is unique within the source dataset and
    # ideally consistent over time. It is often derived from the entity's ID
    # in the source database, or from a string with the same properties.
    # See Patterns below.
    entity.id = context.make_id('Joseph Biden')

    # Assign some property values:
    entity.add('name', 'Joseph Robinette Biden Jr.')
    entity.add('alias', 'Joe Biden')
    entity.add('birthDate', '1942-11-20')

    # Invalid property values ('never' is not a date) will produce a log
    # error:
    entity.add('deathDate', 'never')

    # Store or update the entity in the database:
    context.emit(entity, target=True)
```
The entity object is based on the entity proxy in FollowTheMoney, so we suggest you also check out the FtM documentation on entity construction. Some additional utility methods are added in the `Entity` class in zavod.
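To put the pieces together, here is a sketch of a crawler that combines the HTML example above with entity construction, emitting one `Person` per listing. The `span` classes and the page structure are assumptions made up for illustration:

```python
def crawl(context):
    # Fetch and parse the listing page (cached for a week):
    doc = context.fetch_html(context.dataset.data.url, cache_days=7)
    for element in doc.findall('.//div[@class="person"]'):
        # The "name" and "born" span classes are hypothetical:
        name = element.findtext('.//span[@class="name"]')
        birth_date = element.findtext('.//span[@class="born"]')
        if name is None:
            continue
        entity = context.make("Person")
        # Derive a stable ID from the source attributes:
        entity.id = context.make_id(name, birth_date)
        entity.add("name", name)
        entity.add("birthDate", birth_date)
        context.emit(entity, target=True)
```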
Verifying your output
Now that you're extracting data, it's a good idea to start verifying your output. Start by exporting your crawler's data.
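Assuming the export command follows the same pattern as `crawl` (check `zavod --help` for the exact invocation on your version), that looks something like:

```
zavod export datasets/eu/fsf/eu_fsf_demo.yml
```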
This will log a number of different file types that are exported by default. A nice way to explore the output is to use the JSON command-line utility `jq` together with your favourite text pager, like `less`, to browse and search within the `targets.nested.json` and `statistics.json` outputs, using a command like the following (adjust the path to your local data directory):
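```
# Adjust the dataset path to match your local data directory:
jq . data/datasets/eu_fsf_demo/targets.nested.json | less
```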
Good things to check are:
- The number of entities produced of each type is as expected for your dataset (see the example command after this list)
- Spot checking some specific persons, companies, and relations between them, as relevant to your data
- Any warnings in the crawler output
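For instance, a quick way to tally entities per schema, assuming `targets.nested.json` contains one FollowTheMoney entity per line with a `schema` field:

```
jq -r '.schema' data/datasets/eu_fsf_demo/targets.nested.json | sort | uniq -c
```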
Add your crawler to a collection
Our data is mostly used as part of broader collections of datasets, and less often by accessing a specific dataset directly.
Add your crawler to the most appropriate collection based on the kind of entities it is adding. Look for similar datasets and see which collection they are directly included in.
Broader collections include more specific collections and/or specific crawlers.
Next steps
You may now want to level up your crawler by looking at:
- helpers for common tasks,
- common patterns for building crawlers,
- the PEPs guide if you're crawling Politically Exposed Persons and their Relatives and Close Associates,
- the merge checklist we'll follow before merging a new crawler.
Checklist
When contributing a new data source, or some other change, make sure of the following:
- You've created a metadata YAML file with detailed descriptions and links to the source URL.
- Your code should run after doing a simple `pip install` of the codebase. Include additional dependencies in the `setup.py`. Don't use non-Python dependencies like Headless Chrome or Selenium.
- The output data for your crawler should be Follow The Money objects. If you need more fields added to the ontology, submit a pull request upstream. Don't include left-over data in an improvised way.
- Include verbose logging in your crawler. Make sure that new fields or enum values introduced upstream (e.g. a new country code or sanction program) will cause a warning to be emitted. Warnings are checked regularly to identify when a crawler needs attention. Info and lower-level logs are useful for debugging with the `-v` flag.
- Make sure your Python code is linted and formatted with `black`.
- Make sure your YAML is linted with `yamllint` (see the example commands after this list).
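For example, assuming `black` and `yamllint` are installed in your environment, the lint checks might be run like this:

```
black datasets/eu/fsf/crawler.py
yamllint datasets/eu/fsf/eu_fsf_demo.yml
```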