Data collection priorities
For some data sources, extracting certain attributes of people or companies cleanly and reliably takes more effort than extracting others, and not all attributes are equally valuable to our users.
To avoid going down a rabbit hole or wasting effort, we recommend time-boxing the work on a crawler and taking a best-effort approach according to the following priorities, categorised roughly as Essential, Should, Could and Won't.
Aim for complete coverage: make sure all risk-associated entities (people, companies, vessels, etc.) are included. Accuracy matters just as much, e.g. make sure not to mark someone as a PEP when they are not.
Generally (PEPs and Sanctions crawlers)
Essential (bare minimum)
- Name(s) (see: name cleaning and review framework)
Essential (when available)
- People: Date of birth, place of birth, citizenship or nationality
- Official ID numbers (National ID for people, Registration number/VAT/tax for companies, etc.)
- Other identifiers (see specifics in schemata, e.g. innCode, wikidataId)
- Country of birth, registration country (Company:jurisdiction)
Should
- Companies/Organizations: abbreviation
- Companies/Organizations: Date of registration/creation (often ambiguous)
- start/end dates - useful for determining PEP status duration
- listing and effective dates (sanctions)
- company relationships
- person relationships
- addresses (Except PEPs - see below)
Could
- sourceUrl - only if it is a deep link to the specific company/person, not generic for the data source.
- notes
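One way to keep a time-boxed crawler honest about these tiers is to encode them as data and report what a record is still missing. The following is a minimal sketch, not part of any existing tooling: the tier table, the property names in it, and the helpers `missing_by_tier` and `is_publishable` are all hypothetical and only loosely mirror the lists above.

```python
from typing import Dict, List

# Hypothetical tier table. The property names are illustrative labels based on
# the lists above, not a complete or authoritative schema.
PRIORITIES: Dict[str, List[str]] = {
    "essential": ["name"],
    "essential_when_available": ["birthDate", "birthPlace", "citizenship", "idNumber"],
    "should": ["startDate", "endDate", "address"],
    "could": ["sourceUrl", "notes"],
}


def missing_by_tier(record: Dict[str, object]) -> Dict[str, List[str]]:
    """Return, for each tier, the properties that are absent or empty."""
    return {
        tier: [prop for prop in props if not record.get(prop)]
        for tier, props in PRIORITIES.items()
    }


def is_publishable(record: Dict[str, object]) -> bool:
    """A record lacking the bare-minimum properties is not worth emitting."""
    return not missing_by_tier(record)["essential"]
```

Calling `missing_by_tier` on a sample of parsed records gives a quick view of where remaining effort is best spent: gaps in the essential tiers come first, and gaps in "could" can usually wait.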
Politically-exposed persons
See also: guide for building PEP data crawlers
Must
- country (occasionally multiple apply to one position, e.g. Ambassador of Palestine to Germany)
- position (of a person)
- occupancy (relating a person to the position(s) they hold/held) - focus on current positions before worrying about historical.
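To illustrate how country, position and occupancy relate, and what "current positions first" can look like in practice, here is a small self-contained sketch. The `Occupancy` dataclass and the `current_first` helper are hypothetical and do not reflect the actual schema or crawler API. The sketch also assumes ISO 8601 date strings and treats a missing end date as "still in office", which a real source may not guarantee.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Occupancy:
    """Hypothetical link between a person and a position they hold or held."""
    person: str
    position: str
    countries: List[str]             # a position can carry more than one country
    start_date: Optional[str] = None
    end_date: Optional[str] = None   # assumption: None means no known end


def current_first(occupancies: List[Occupancy], today: str) -> List[Occupancy]:
    """Order occupancies so that positions still held come before past ones."""
    def is_current(occ: Occupancy) -> bool:
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        return occ.end_date is None or occ.end_date >= today

    # sorted() is stable, so the original order is kept within each group.
    return sorted(occupancies, key=lambda occ: not is_current(occ))
```

Processing the result in order means a crawler that runs out of its time box has at least covered the occupancies that matter most.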
Could
- citizenship - safe to assume over country for most elected officials
- biography
- Occupancy:constituency
- Position:subnationalArea
- Occupancy:politicalGroup
Won't - don't extract
- private individual addresses (not needed, and privacy concern)
- phone numbers (not needed, more sensitive than emails)
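The "Won't" items can be enforced mechanically by dropping those fields before any entity is built, so they never reach the output by accident. This is a minimal sketch under assumptions: the field names `address` and `phone` and the helper `drop_wont_fields` are hypothetical and need to be mapped to whatever the source actually calls these columns.

```python
from typing import Dict, Iterable

# Assumed field names, for illustration only.
WONT_EXTRACT_PEP = frozenset({"address", "phone"})


def drop_wont_fields(
    row: Dict[str, str],
    blocked: Iterable[str] = WONT_EXTRACT_PEP,
) -> Dict[str, str]:
    """Return a copy of a source row without the fields we never extract."""
    blocked_set = set(blocked)
    return {key: value for key, value in row.items() if key not in blocked_set}
```

Filtering the raw row, rather than skipping the fields later in the crawler, keeps the privacy decision in one place and leaves the input untouched.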