Common Patterns

The following are some patterns that have proved useful:

Common crawler code structure

Our typical crawler structure, sketched after this list, consists of

  1. a crawl function as the entrypoint, which
    • fetches the data
    • converts it into an iterable of dicts, one per record
    • loops over those records, calling...
  2. a function called once per record, e.g. crawl_item or crawl_person, which
    • unpacks the record dict
    • ensures the necessary cleaning takes place
    • creates one or more entities for the record (LegalEntity, Sanction, Position, relations, etc)
    • emits the created entities
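
A minimal sketch of this shape, assuming a hypothetical JSON endpoint that returns one dict per person (the crawl_person name and the field names are illustrative):

from zavod import Context


def crawl_person(context: Context, row: dict) -> None:
    # Unpack the record, clean it, create entities, and emit them.
    name = row.pop("full_name")
    entity = context.make("Person")
    entity.id = context.make_id(row.pop("id"), name)
    entity.add("name", name)
    context.emit(entity)
    # Warn about fields we haven't handled yet (see "Detect unhandled data" below).
    context.audit_data(row)


def crawl(context: Context) -> None:
    # Fetch the data and turn it into an iterable of dicts, one per record.
    for row in context.fetch_json(context.data_url):
        crawl_person(context, row)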

We have a number of helpers to turn common formats into an iterable of dicts.

We typically import them with from zavod import helpers as h.

When concise-enough to fit on a single line and only used once, we pop and add values on the same line:

entity.add("name", row.pop("full_name"))
entity.add("birthPlace", row.pop("place_of_birth"))

The method entity.add works seamlessly with both a single string and a list of strings. In the long run, however, we want to make the typing of entity.add stricter so that it accepts only a single value at a time. With this in mind, it's generally better to add values individually if they already come that way, rather than forcing them into a list unnecessarily.

for name in h.multi_split(names, SPLITS):
    entity.add("name", name)

Code structuring nitpicks

  • Ruff can sort imports automatically, ensuring consistency across the codebase. The convention is to group standard library imports first, followed by third-party imports, and then project-specific imports, with each group sorted alphabetically.

    Each group should be separated by a blank line for clarity. For example (don't include the comments):

    # Standard library imports
    import os
    import sys
    
    # Third-party imports
    from normality import collapse_spaces, stringify
    
    # Local application imports
    from zavod import helpers as h
    

    The project-specific imports (like from zavod import helpers as h) should appear under all other imports and be separated by a blank line for clarity.

    To enforce this convention, run the following Ruff command:

    ruff check --fix --select I /path/to/crawler.py
    
  • Define and precompile regular expressions as constants at the top of the module.

    REGEX_DETAILS = re.compile(r"your_regex_pattern_here")
    
  • When naming functions for data extraction or processing tasks, it's important to be specific and clear. For example, use crawl_entity() instead of a generic name like process_data().

    def crawl_entity():  # Better than process_data()
        pass
    

    Note

    We typically use the crawl_thing convention (e.g., crawl_person, crawl_row, crawl_index) for functions that lead to entities being emitted (directly or via a nested crawl_ function call).

  • To improve readability and maintainability, break down deeply nested logic into smaller, focused functions.

    for link in main_grid.xpath(".//a/@href"):
        # Break down the handling of different data types into separate functions
        if data_type == "vessel":
            # A separate function to handle vessel data processing
            crawl_vessel(context, link, program)
        elif data_type == "legal_entity":
            # A separate function to handle legal entity data processing
            crawl_legal_entity(context, link, program)
    

    It's nice to handle cases where we can return early first, often by inverting if A: B else: return None into if not A: return None, followed by B. This also reduces the indentation of B.

    # Extract required fields from the row
    name = row.pop("name")
    listing_date = row.pop("listing_date")
    
    # Proceed only if both 'name' and 'listing_date' are available
    if not (name and listing_date):
        return
    
    # Create the entity
    entity = context.make("LegalEntity")
    entity.id = context.make_id(name, listing_date)
    
  • Instead of using urljoin from urllib.parse, leverage .make_links_absolute() for cleaner URL resolution. This ensures all relative URLs are converted to absolute URLs within the crawler.

    # Make all relative links in the document absolute using the data_url as the base
    doc = context.fetch_html(context.data_url)
    doc.make_links_absolute(context.data_url)
    

Addresses

When distinct address fields are available, use h.make_address to compose them into an address entity, adding a country code if possible.

Then use h.copy_address to add the full address to the entity's address property.

address_ent = h.make_address(context, full=addr, city=city, lang="zho")
h.copy_address(entity, address_ent)

Detect unhandled data

If a variable number of fields can be extracted automatically (e.g. from a list or table):

  • Capture all fields provided by the data source in a dict.
  • dict_obj.pop() individual fields when adding them to entities.
  • Log warnings if there are unhandled fields remaining in the dict so that we notice and improve the crawler. The context method context.audit_data() can be used to warn about extra fields in a dict. It takes the ignore argument to explicitly list fields that are unused (see the sketch below).
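
A sketch assuming row holds all fields of one source record and entity has already been created; reg_no and internal_notes are illustrative names:

entity.add("name", row.pop("name"))
entity.add("registrationNumber", row.pop("reg_no"))
# Warn about any remaining fields, ignoring ones we deliberately don't use
context.audit_data(row, ignore=["internal_notes"])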

Logging and crawler feedback

It is good design to be told about issues, instead of having to go look to discover them.

Logs are essential for monitoring progress and debugging, but info-level and lower messages are only seen when we choose to look at a crawler's logs, so we might not notice from them that something is wrong outside of debugging and development. Use the appropriate log level for the purpose.

  • Debug Logs: Enable verbose output for detailed tracking during development. Use zavod --debug to activate debug logs.

    context.log.debug(f"Unique ID {person.id}")
    
  • Info Logs: Monitor the crawler’s progress, especially on large sites.

    context.log.info(f"Processed {page_number} pages")
    
  • Warning Logs: Indicate potential issues that don't stop the crawl but may require attention. These are surfaced to the dev team on the Issues page and checked daily.

    Don't use warnings for things we know we won't fix, e.g. a permanent 404 that we can't do anything about. Do use warnings for things we should take action on, e.g. to notice a new entity type which we haven't mapped to a Schema yet.

    context.log.warning("Unhandled entity type", type=entity_type)
    

Data assertions

Build crawlers with robust assertions to catch missing data during runtime. Instead of manually inspecting logs, implement checks to ensure that expected data is present or that invalid data doesn't slip through:

# Notice if the source starts providing a date of birth (dob) we don't handle yet
assert dob is None, (dob, entity_name)

# Guard against a known-bad position name slipping through
assert position_name != "Socialdemokratiet"

# Ensure the position name is present
assert position_name is not None, entity.id

Generating consistent unique identifiers

Make sure entity IDs are unique within the source. Avoid using only the name of the entity because there might eventually be two persons or two companies with the same name. It is preferable to have to deduplicate two Follow the Money entities for the same real-world entity than to accidentally merge two distinct entities into one.

Good values to use as identifiers are:

  • An ID in the source dataset, e.g. a sanction number, company registration number. These can be turned into a readable ID with the dataset prefix using the context.make_slug function (see the sketch after this list).
  • Some combination of consistent attributes, e.g. a person's name and normalised date of birth in a dataset that holds a relatively small proportion of the population so that duplicates are extremely unlikely. These attributes can be turned into a unique hash describing the entity using the context.make_id function.
  • A combination of identifiers for the entities related by another entity, e.g. an owner and a company, in the form ownership.id = context.make_id(owner.id, "owns", company.id)
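
A short sketch of the first two approaches, assuming name and dob were already popped from the record (the field names are illustrative):

# An ID from the source becomes a readable ID with the dataset prefix
entity.id = context.make_slug(row.pop("registration_number"))

# Or: consistent attributes are hashed into an opaque, unique ID
entity.id = context.make_id(name, dob)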

Note

Remember to make sure distinct sanctions, occupancies, positions, relationships, etc get distinct IDs.

Note

Do not reveal personally-identifying information such as names or ID numbers in IDs, e.g. via context.make_slug (context.make_id hashes its inputs, so it is safe for such values).

Capture text in its original language

Useful fields like the reason someone is sanctioned should be captured regardless of the language it is written in. Don't worry about translating fields where arbitrary text would be written. If the language is known, include the three-letter language code in the lang parameter to Entity.add(), e.g.:

reason = data.pop("expunerea-a-temeiului-de-includere-in-lista-a-operatorului-economic")
sanction.add("reason", reason, lang="rom")

Handling special space characters in strings

Be aware of different types of space characters and how they affect text comparison. For example, a non-breaking space (\xa0) or a zero-width space does not match a normal space character and can affect string comparison or processing.

An editor like VS Code highlights such characters by default, and a hex editor is an effective way to see precisely which values are present in strings that surprise you. Remember that a hex editor shows the data in its encoded form (e.g. UTF-8), while Python strings are sequences of Unicode code points.

To handle these cases, you can use string cleaning methods such as:

  • normality.collapse_spaces
  • normality.remove_unsafe_chars
  • .replace

For example:

import normality

# Replace non-breaking space with regular space
text = text.replace("\xa0", " ")

# When the source data contains messy or excessively repeated whitespace,
# e.g., collapsing whitespace from text extracted from HTML
cleaned_text = normality.collapse_spaces(text)

Use datapatch lookups to clean or map values from external forms to OpenSanctions

See Datapatches

Typical uses (a sketch follows the list) include:

  • Fixing typos in dates
  • Translating column headings to English
  • Mapping source data entity types to FollowTheMoney entity types
  • Mapping relationship descriptions to FollowTheMoney relation entity types
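
As a hedged sketch of the crawler side, assuming the dataset metadata defines a lookup named type that maps source entity-type strings to FollowTheMoney schema names:

entity_type = row.pop("entity_type")
schema = context.lookup_value("type", entity_type)
if schema is None:
    # Warn so we notice new, unmapped types and extend the lookup
    context.log.warning("Unknown entity type", entity_type=entity_type)
    return
entity = context.make(schema)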