Developing crawlers for Politically Exposed Persons (PEPs)
If this is your first crawler, you may want to start with a basic crawler by following the tutorial, coming back here when you have one working. You may also want to look at children of the peps collection to see common approaches.
Being classified as a PEP does not imply you have done anything wrong. However, the concept is important because PEPs and members of their families should be the subject of enhanced public scrutiny. This is also mandated by financial crime laws in many countries. Read more about our PEP data.
In addition to capturing general information about PEPs, a PEP crawler must
- Generate a Position entity for each position where a person has the kind of influence causing them to be defined as a PEP.
- Generate Occupancy entities representing the act of each person occupying a position for a period of time.
- Add the
role.pep
topic to each PEP Person entity. - Add the
role.rca
topic to each relative or close associate, as well as the most appropriate entity to represent the relationship, e.g. Family, Associate, or UnknownLink.
Creating Positions
The Position name
property should ideally capture the position and its jurisdiction, but be no more specific than that.
Selecting a position name
Do
- use local preferred terminology
- include the role
- include the organisational body where needed
- include the specific geographic jurisdiction where relevant
- refer to Wikidata EveryPolitician for examples - specifically position Q4164871. Much work has been done on defining positions in understandable and accurate ways here, and we plan on contributing our politician in the near future.
Avoid
- including the legislative term
- including the constituency an elected official represents
- including the country for sub-national representatives
Examples
- Prefer
United States representative
overMember of the House of Representatives
- while it's true that they're a member of the house of representatives, the common generic term is United States representative. - Prefer
Member of the Landtag of Mecklenburg-Vorpommern
overMember of the Landtag of Mecklenburg-Vorpommern, Germany
- the country is already captured as a property of the entity. - Prefer
Member of the Hellenic Parliament
overMember of the 17th Hellenic Parliament (2015-202019)
- there is currently no need to distinguish between different terms of the same position. Occupancies represent distinct periods when a given person holds a position. If the same position occurs twice in time, e.g. it was only possible to beMinister of Electricity
up until 2015 and again from 2023, those can be distinguished sufficiently using the dissolution and inception properties rather than the name.
Use the make_position
helper to generate position entities consistently.
Pro tip
It's perfectly fine to emit the same position over and over for each instance of a person holding that position, if that simplifies your code.
It is often convenient to just create the person, all their positions, and
occupancies in a loop. You don't have to track created positions in your
crawler to avoid duplicates as long as the position id
is consistent for
each distinct position encountered. This will be the case if the values you
pass make_position
are consistent. The
export process will take care of deduplication of entities with consistent
id
s.
Categorising positions
Most sources by their nature comprise entirely of PEPs. On rare occasions a source may contain positions which do not fall within the categories of roles we consider PEPs.
We maintain a database of positions where we can easily categorise positions as PEP or not, as well as their scope and role. This categorisation is used to determine whether a position, its holder(s), and the Occupancy entities relating them, should be emitted based on whether it is a PEP position, and the PEP duration of its scope.
To allow newly discovered positions to be added to the database, and to use the
is_pep
value from the database, call zavod.logic.pep.categorise
with the Position.
If the data source is known to only include PEP positions, or if the crawler only
attempts to create positions known to be PEPs, the is_pep
argument should be True
.
Otherwise it should be None
, denoting that it should be manually categorised
in the database.
Only make occupancies and emit entities for which the returned categorisation.is_pep
is True
.
See example below.
During development, it is normally best to run Zavod with the environment variable
ZAVOD_SYNC_POSITIONS
set to false
, meaning the position topics and is_pep
value supplied to
categorise
will be used locally, rather than any value in the database.
In production, positions will be created in the database if they don't already exist,
which we can later categorise manually. The is_pep
and topics
values
from the database will then be used during crawling and enrichment
respectively.
zavod.logic.pep.categorise(context, position, is_pep=True)
cached
Checks whether this is a PEP position and for any topics needed to make PEP duration decisions.
If the position is not in the database yet, it is added.
Only emit positions where is_pep is true, even if the crawler sets is_pep to true, in case is_pep has been changed to false in the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
context
|
Context
|
|
required |
position
|
Entity
|
The position to be categorised |
required |
is_pep
|
Optional[bool]
|
Initial value for is_pep in the database if it gets added. |
True
|
Source code in zavod/logic/pep.py
zavod.logic.pep.PositionCategorisation
Bases: object
Source code in zavod/logic/pep.py
is_pep: Optional[bool] = is_pep
instance-attribute
Whether the position denotes a politically exposed person or not
topics: List[str] = topics
instance-attribute
The topics linked to the position, as a list
Creating Occupancies
Occupanies represent the fact that a person holds or held a position for a given period of time. If a person holds the same position numerous times, emit an occupancy for each instance.
For most positions, someone holding a position becomes less and less significant over time.
It becomes less important to carry out anti money-laundering checks on people the
more time has passed since they held a position of influence which could enable
money laundering. We therefore only represent people as PEPs if a data source indicates
they hold the position now, or they left the position within the past 5 years.
In these cases the occupancy status should be current
or ended
respectively.
If it is unclear from the data or the data methodology of the source whether
a position is currently held or not, we consider someone a PEP if they have not
passed away, and they entered the position within the past 40 years. In this
case the occupancy status should be unknown
.
Only emit if the person is a PEP
Occupancies and positions should only be emitted for instances where these conditions are met. Persons should only be emitted if at least one occupancy exists to indicate they meet our criteria for being considered a PEP.
The make_occupancy
helper will only return
occupancies if they still meet these conditions, taking the PEP duration
into account.
You can use this to create occupancies, automatically set the correct status
,
and determine whether the occupancy meets our criteria and should be emitted.
Example
# ... looping over people in a province ...
if person_data.pop("death_date", None):
return
person = context.make("Person")
source_id = person_data.pop("id")
person.add("country", "us")
person.add("name", person_data.pop("name"))
# ... more person properties ...
pep_entities = []
for role in person_data.pop("roles"):
position = h.make_position(
context,
f"Member of the {province} Legislature",
country="us",
subnational_area=province
)
categorisation = categorise(context, position, is_pep=True)
if not categorisation.is_pep:
continue
occupancy = h.make_occupancy(
context,
person,
position,
True,
start_date=role.get("start_date", None),
end_date=role.get("end_date", None),
categorisation=categorisation
)
if occupancy:
pep_entities.append(position)
pep_entities.append(occupancy)
if pep_entities:
person.add("topics", "role.pep")
context.emit(person, target=True)
for entity in pep_entities:
context.emit(entity)