Datapatch lookups
Lookups patch broken or inconsistent source values into clean ones. They are declared in the dataset metadata YAML and applied either automatically by zavod or explicitly by crawler code.
The mechanism comes from the datapatch library. Use it whenever the same dirty value or class of values reappears across crawls — listing the fixes in YAML keeps the crawler code free of one-off conditionals and gives reviewers a single place to inspect what has been overridden.
Please note
Avoid using lookups to express something not evident in the data. For example, do not make a date more precise than it is in the data, even if you know your version to be true.
Two ways lookups get invoked
Lookups appear under the lookups: key in the dataset YAML. Each named lookup is invoked in one of two ways:
- Type lookups: any lookup named type.<typename> (e.g. type.country, type.email, type.identifier) is invoked automatically by zavod whenever a property of that FollowTheMoney type is added to an entity. Crawler code does not need to call it.
- Named lookups: every other lookup is invoked explicitly from the crawler, via context.lookup_value (returns the result's value) or context.lookup (returns the full Result object with all attributes).
# Type lookup — runs implicitly:
entity.add("email", row.pop("email"))
# Named lookup — invoked explicitly:
res = context.lookup("relationships", row.pop("link_type"))
Matching: match, contains, regex
Each option in a lookup uses one of three matching modes:
- match: exact string equality, after normalization. A list of strings matches any of them. Use the list form to merge multiple inputs that should produce the same result under a single option (see Result values).
- contains: substring match, after normalization.
- regex: raw Python regular expression. The input is not normalized before the regex runs, so write the pattern against the original string.
A single option may combine modes; if any clause matches, the option matches.
lookups:
  type.country:
    lowercase: true
    options:
      - match: Tazmania
        value: Australia
      - contains: Syrian Arab Republic
        value: Syria
      - regex: "^USSR.*"
        value: SUHH
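The three modes can be modelled in a few lines of plain Python. This is an illustrative sketch of the semantics described above, not datapatch's implementation: option_matches, as_list and patch are hypothetical names, only the lowercase flag is modelled, the use of re.search is an assumption, and the first hit is taken for brevity.

```python
import re


def as_list(item):
    """Accept a single pattern or a list of patterns."""
    if item is None:
        return []
    return item if isinstance(item, list) else [item]


def option_matches(option, value, lowercase=False):
    """Return True if any clause of the option matches the input."""
    fold = str.lower if lowercase else str
    folded = fold(value)
    # match: exact equality after folding; a list matches any member.
    if any(fold(p) == folded for p in as_list(option.get("match"))):
        return True
    # contains: substring test after folding.
    if any(fold(p) in folded for p in as_list(option.get("contains"))):
        return True
    # regex: runs against the original, unfolded input (re.search is assumed).
    return any(re.search(p, value) for p in as_list(option.get("regex")))


OPTIONS = [
    {"match": "Tazmania", "value": "Australia"},
    {"contains": "Syrian Arab Republic", "value": "Syria"},
    {"regex": "^USSR.*", "value": "SUHH"},
]


def patch(value):
    """Return the value of the first matching option, or None."""
    for option in OPTIONS:
        if option_matches(option, value, lowercase=True):
            return option["value"]
    return None
```

Calling patch("tazmania") exercises the match clause, while patch("the Syrian Arab Republic") only succeeds through contains.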
Normalization
Three flags control how the input value and the match/contains patterns are folded before comparison. They can be set on the lookup as defaults and overridden per option:
| Flag | Default | Effect |
|---|---|---|
| normalize | false | Strip diacritics and collapse whitespace. |
| lowercase | false | Lowercase before matching. |
| asciify | true | When normalizing, transliterate non-ASCII to ASCII (Путин → Putin). Set false to keep non-Latin scripts intact. |
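As a rough picture of what the normalize and lowercase flags do to a string, the sketch below uses only the standard library. The fold function is a hypothetical name, and this is not the code datapatch runs; transliteration (the asciify flag) is deliberately left out because it needs a dedicated library.

```python
import unicodedata


def fold(text, normalize=False, lowercase=False):
    """Fold a string roughly the way the flags above describe."""
    if normalize:
        # Decompose characters, then drop the combining marks (diacritics).
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(text.split())
    if lowercase:
        text = text.lower()
    return text
```

Because both the input and the match/contains patterns pass through the same folding, a pattern written with or without diacritics compares equal once normalize is on.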
Matching null inputs
To match a missing value, list null in the option's match, for example match: null.
Disambiguating overlapping options
When two options match the same input with the same priority, datapatch raises LookupException. Use weight: <int> on one option to break the tie — higher weight wins.
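The tie-breaking rule can be sketched as follows. This is a simplified model under stated assumptions: AmbiguousLookup stands in for the real exception, pick is a hypothetical helper, options without a weight are assumed to default to 0, and the option values are placeholders.

```python
class AmbiguousLookup(Exception):
    """Stand-in for the error raised on an unresolved tie."""


def pick(options, value):
    """Choose the matching option with the highest weight; ties are errors."""
    hits = [o for o in options if value in o["match"]]
    if not hits:
        return None
    # Assumed default weight of 0 for options that do not set one.
    top = max(o.get("weight", 0) for o in hits)
    winners = [o for o in hits if o.get("weight", 0) == top]
    if len(winners) > 1:
        raise AmbiguousLookup(f"{len(winners)} options match {value!r}")
    return winners[0]


TIED = [
    {"match": ["n/a", "unknown"], "value": "first"},
    {"match": ["n/a"], "value": "second"},
]
WEIGHTED = [
    {"match": ["n/a", "unknown"], "value": "first"},
    {"match": ["n/a"], "value": "second", "weight": 1},
]
```

With TIED, looking up "n/a" raises; with WEIGHTED, the second option wins because its weight is higher.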
Result values
Every option produces a result. The simplest result is a single replacement string given as value:, as in the match: Tazmania / value: Australia option shown earlier.
A few rules about result values:
- value: X is shorthand for values: [X]. The two are interchangeable; values: exists for the multi-value case.
- value: null drops the input. No property is added to the entity.
- Multiple values fan out into multiple property values:
    - match: district@repkelly.com, mike@repkelly.com
      values:
        - district@repkelly.com
        - mike@repkelly.com
- Consolidate inputs that share a result. When several distinct inputs should produce the same value (or values, or value: null), list them under one option's match:. Inputs that map to different results must remain in separate options.
    # One option, three inputs that all drop:
    - match:
        - 307j@att
        - SL Jones@ballhealth.com
        - na
      value: null
    # Two options with different replacements, which cannot be merged:
    - match: tcolpetzer@mcdonoughga.org.
      value: tcolpetzer@mcdonoughga.org
    - match: sensan buenaventura@capitol.hawaii.gov
      value: sensanbuenaventura@capitol.hawaii.gov
- Arbitrary keys on the option are accessible as attributes on the result. Named lookups use this to attach schema, role, or category information; the next two sections show how.
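The value rules above reduce to a small function. The sketch below is a model of the documented behaviour, not library code; result_values is a hypothetical name, and options are represented as plain dicts mirroring the YAML.

```python
def result_values(option, original):
    """Expand an option into the list of values to add to the entity."""
    if "values" in option:
        # Multiple values fan out into multiple property values.
        return list(option["values"])
    if "value" in option:
        # value: X is shorthand for values: [X]; value: null drops the input.
        return [] if option["value"] is None else [option["value"]]
    # No replacement given: the input is kept as it was.
    return [original]


fan_out = {
    "match": "district@repkelly.com, mike@repkelly.com",
    "values": ["district@repkelly.com", "mike@repkelly.com"],
}
dropped = {"match": ["307j@att", "SL Jones@ballhealth.com", "na"], "value": None}
```

An empty list means no property is added, which is how the drop case shows up to the caller.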
Re-routing to a different property
In a type.* lookup, prop: moves the value to a different property of the same entity. This handles cases where source data labels a value as one thing but it is really another — for example an "email" column that occasionally contains a website URL.
When values is omitted, the original input value is preserved and only the destination property changes. To re-route and rewrite, set both:
type.email:
  options:
    # Pure re-route: original URL preserved, moved to the website property
    - match: www.bloodandhonour.co.uk
      prop: website
    # Re-route with rewrite: fix the typo as well
    - match: www.surena gc.com
      prop: website
      value: www.surenagc.com
If the target property does not exist on the entity's schema, zavod logs Invalid type lookup property re-write and falls back to the original property.
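The re-route and its fallback can be pictured like this. The function is a hypothetical model of the behaviour described above rather than zavod's code: schema_props stands for the set of property names on the entity's schema, and the warning text only echoes the wording quoted in this section.

```python
def reroute(option, original_prop, original_value, schema_props, warn=print):
    """Return the (property, value) pair a type.* option resolves to."""
    prop = option.get("prop", original_prop)
    if prop not in schema_props:
        # Documented fallback: warn and keep the original property.
        warn(f"Invalid type lookup property re-write: {prop}")
        prop = original_prop
    # Without value:, the input is preserved and only the property changes.
    value = option.get("value", original_value)
    return prop, value


PROPS = {"email", "website"}
pure = {"match": "www.bloodandhonour.co.uk", "prop": "website"}
rewrite = {"match": "www.surena gc.com", "prop": "website", "value": "www.surenagc.com"}
```

Passing a list's append method as warn makes the fallback easy to observe in a test.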
Same-type re-routes are the safe default
Re-routing keeps the value's cleaned flag from the original type's processing, so the value is not re-validated against the destination property's type. Re-routes within the same FtM type (e.g. identifier → identifier) are uncontroversial. Cross-type re-routes (e.g. email → website) work but the destination type's validator does not run on the value — only use them when the value is already known to be clean for the destination type.
Curated values bypass smell checks
Any value produced by a type.* lookup is treated as manually reviewed and bypasses three downstream warnings:
- Property value '<value>' is not a valid name. (from rigour.names.is_name)
- Property for <prop> looks too short for an address: <value> (≤ 3 characters)
- HTML/XSS suspicion in property value: <value>
This is useful for short place names like Zug or for legitimate names that fail is_name's heuristics — adding an identity lookup (match: Zug / value: Zug) marks the value as curated and silences the warning.
Mapping to richer concepts
Named lookups become powerful when the result carries more than just a replacement string. Any extra YAML key on the option is accessible as an attribute on the Result object.
lookups:
  relationships:
    lowercase: true
    options:
      - contains:
          - chairman of
          - director of
        schema: Directorship
        start: director
        end: organization
        link: role
      - contains:
          - shareholder of
          - owner of
        schema: Ownership
        start: owner
        end: asset
        link: role
The crawler reads result.schema, result.start, result.end, result.link to assemble the relation:
# Runs inside the crawler's per-row loop.
link_type = row.pop("link_type")
res = context.lookup("relationships", link_type, warn_unmatched=True)
if res is None:
    # No option matched; the warning has been logged, so skip this row.
    continue
rel = context.make(res.schema)
rel.id = context.make_id(rel.schema, entity.id, other_entity.id, link_type)
rel.add(res.start, entity)
rel.add(res.end, other_entity)
rel.add(res.link, link_type)
Pass warn_unmatched=True to log a warning when a value matches no option — this surfaces values that need a new lookup entry rather than silently dropping data.
For lookups where any unmatched value should halt the crawl, set required: true on the lookup itself. A miss then raises LookupException.
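The three ways a miss can be handled (silent None, logged warning, raised exception) are easier to compare side by side. The stand-in below is a sketch only: lookup and LookupMiss are hypothetical substitutes for context.lookup and LookupException, and the result object is approximated with SimpleNamespace.

```python
from types import SimpleNamespace


class LookupMiss(Exception):
    """Stand-in for the exception raised by a required lookup."""


def lookup(options, value, required=False, warn_unmatched=False, warn=print):
    """Return the first option containing a listed phrase, as an object."""
    folded = value.lower()  # models lowercase: true on the lookup
    for option in options:
        if any(phrase in folded for phrase in option["contains"]):
            # Extra keys become attributes, e.g. result.schema.
            return SimpleNamespace(**option)
    if required:
        raise LookupMiss(f"No option matches {value!r}")
    if warn_unmatched:
        warn(f"Unmatched lookup value: {value!r}")
    return None


RELATIONSHIPS = [
    {"contains": ["chairman of", "director of"], "schema": "Directorship",
     "start": "director", "end": "organization", "link": "role"},
    {"contains": ["shareholder of", "owner of"], "schema": "Ownership",
     "start": "owner", "end": "asset", "link": "role"},
]
```

A hit such as lookup(RELATIONSHIPS, "Director of the board") exposes schema, start, end and link as attributes, which is what the crawler snippet above relies on.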
Common runtime warnings and the lookup that fixes them
Several warnings emitted by zavod are best fixed by adding a lookup option. Each row below names the warning, what triggered it, and the lookup recipe.
| Warning | What it means | Fix |
|---|---|---|
| Rejected property value [<prop>]: <value> | The type cleaner could not normalize the value (an invalid date like 2020-02-31, a country string like France / Syria, an unparseable phone number). | Add a type.<type> lookup mapping <value> to a corrected value: (or values: for the multi-country case). Use value: null to drop. |
| Property value '<value>' is not a valid name. | A name property on a LegalEntity failed rigour.names.is_name, usually because the string contains digits, punctuation patterns, or looks like an address. | type.name lookup with a corrected value:, or value: null to drop. The value-came-from-a-lookup check then suppresses the warning automatically. |
| Property for <prop> looks too short for an address: <value> | An address value is three characters or fewer. Often a parsing error; sometimes a real short place name. | If the value is a real place, add an identity lookup (match: Zug / value: Zug) to mark it as curated. Otherwise value: null. |
| HTML/XSS suspicion in property value: <value> | The value contains HTML tags or entity references, usually leftover markup from extraction. | Map the dirty value to its cleaned text via a type.<type> lookup. If the markup is genuinely intended (rare), add silence_warnings: [xss-html-smell] to the option. |
| Property value for <prop> exceeds type length: <value> | The value is longer than the type's max_length. zavod warns but does not truncate. | type.<type> lookup with a shorter value:. |
| Failed to validate <format> identifier: <value> | The value did not validate against a known identifier format (bic, isin, lei, iban, inn, ogrn, npi, uei, qid, uscc, imo). | type.identifier lookup with match: <value> and a corrected value:, or value: null. |
Property name to type lookup
When an issue references a property name, the corresponding type lookup is one of these:
| Property names | Type lookup |
|---|---|
| name, alias, previousName, weakAlias, firstName, lastName, … | type.name |
| address, full | type.address |
| country, jurisdiction, nationality, citizenship | type.country |
| date, startDate, endDate, birthDate, incorporationDate, dissolutionDate | type.date |
| registrationNumber, taxNumber, ogrnCode, innCode, npiCode, leiCode, bicCode, imoNumber, … | type.identifier |
| sourceUrl, website, wikipediaUrl | type.url |
| email | type.email |
| phone | type.phone |
| gender | type.gender |
The full property listing is at followthemoney.tech.
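When triaging many warnings at once, the table can be inverted into a quick property-to-lookup map. The sketch below is a convenience helper with hypothetical names (TYPE_LOOKUPS, lookup_for); it covers only the property names spelled out in the table, so the rows ending in … are incomplete here too.

```python
# Only the property names listed in the table; rows ending in "…" are longer.
TYPE_LOOKUPS = {
    "type.name": ["name", "alias", "previousName", "weakAlias",
                  "firstName", "lastName"],
    "type.address": ["address", "full"],
    "type.country": ["country", "jurisdiction", "nationality", "citizenship"],
    "type.date": ["date", "startDate", "endDate", "birthDate",
                  "incorporationDate", "dissolutionDate"],
    "type.identifier": ["registrationNumber", "taxNumber", "ogrnCode",
                        "innCode", "npiCode", "leiCode", "bicCode",
                        "imoNumber"],
    "type.url": ["sourceUrl", "website", "wikipediaUrl"],
    "type.email": ["email"],
    "type.phone": ["phone"],
    "type.gender": ["gender"],
}

# Invert the table so a property name resolves to its lookup key.
PROP_TO_LOOKUP = {
    prop: key for key, props in TYPE_LOOKUPS.items() for prop in props
}


def lookup_for(prop):
    """Return the type lookup key for a property name, or None if unlisted."""
    return PROP_TO_LOOKUP.get(prop)
```

A None result means the property is not in this abbreviated table and should be checked against the full property listing.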
Reference: configuration keys
Lookup-level keys
Set under each named lookup (e.g. lookups: type.country: …):
| Key | Default | Effect |
|---|---|---|
| options | required | List of options. |
| map | (none) | Dict shorthand of match: value pairs; merged with options. |
| normalize | false | Strip diacritics and collapse whitespace before matching. |
| lowercase | false | Lowercase before matching. |
| asciify | true | Transliterate to ASCII when normalizing. |
| required | false | Raise LookupException when no option matches the input. |
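The map shorthand can be thought of as expanding into ordinary options. The sketch below illustrates that idea with a hypothetical expand_options helper; the merge order (options first, then map entries) is an assumption, and the sample map entry is made up for illustration.

```python
def expand_options(config):
    """Return one flat option list from a lookup's options: and map: keys."""
    options = [dict(option) for option in config.get("options", [])]
    for match, value in config.get("map", {}).items():
        # Each map entry behaves like a single match/value option.
        options.append({"match": match, "value": value})
    return options


lookup_config = {
    "options": [{"match": "Tazmania", "value": "Australia"}],
    "map": {"Syrian Arab Republic": "Syria"},  # illustrative entry
}
```

The shorthand suits long lists of one-to-one replacements; anything that needs contains, regex, weight or extra attributes still has to be written as a full option.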
Option-level keys
Each entry in the options: list:
| Key | Effect |
|---|---|
| match | String, list of strings, or null; exact match after normalization. |
| contains | String or list; substring match after normalization. |
| regex | String or list; raw regex pattern, not normalized. |
| value / values | Replacement value(s). value: null drops the input. |
| prop | (type.* lookups only) Re-route the value to a different property. |
| weight | Integer to disambiguate when multiple options match the same input. |
| normalize / lowercase / asciify | Override the lookup-level setting. |
| silence_warnings | List of warning types to suppress. Currently the only recognized value is xss-html-smell. |
| any other key | Available as an attribute on the Result object (e.g. schema, start, end, link, is_alias, document_schema). |
Recipe: translating column headers
A recurring use of named lookups is mapping non-English source column headers onto English slug keys that the crawler code references. The lookup runs once per header during table parsing.
lookups:
  columns:
    options:
      - match: الإسم الثلاثي
        value: full_name
      - match: تاريخ الولادة ومكانها
        value: dob_place
      - match: العنوان
        value: address
      - match: الجنسية
        value: nationality
In the crawler, look up each header before treating it as a dictionary key:
slug = context.lookup_value("columns", raw_header)
if slug is None:
    context.log.warning("Unknown column header", header=raw_header)
    continue
This pattern keeps the crawler code in English regardless of the source language, and any new header in the source surfaces as an explicit warning rather than silently dropped data.
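Applied to a whole row, the recipe might look like the sketch below. It runs without zavod: lookup_value here is a hypothetical stand-in for context.lookup_value backed by a plain dict, translate_row is an illustrative helper, and the warn callback replaces context.log.warning.

```python
COLUMNS = {
    "الإسم الثلاثي": "full_name",
    "تاريخ الولادة ومكانها": "dob_place",
    "العنوان": "address",
    "الجنسية": "nationality",
}


def lookup_value(name, value):
    """Hypothetical stand-in for context.lookup_value on the columns lookup."""
    return COLUMNS.get(value) if name == "columns" else None


def translate_row(raw_row, warn=print):
    """Re-key a row by English slug, reporting headers with no mapping."""
    row = {}
    for raw_header, cell in raw_row.items():
        slug = lookup_value("columns", raw_header)
        if slug is None:
            # Surface the new header instead of silently dropping the cell.
            warn(f"Unknown column header: {raw_header}")
            continue
        row[slug] = cell
    return row
```

Downstream code can then use row.pop("full_name") and similar English keys, and an unexpected header shows up as a warning on the first crawl after the source changes.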