
Developing crawlers for Politically Exposed Persons (PEPs)

If this is your first crawler, you may want to start with a basic crawler by following the tutorial, coming back here when you have one working. You may also want to look at children of the peps collection to see common approaches.

Being classified as a PEP does not imply you have done anything wrong. However, the concept is important because PEPs and members of their families should be the subject of enhanced public scrutiny. This is also mandated by financial crime laws in many countries. Read more about our PEP data.

In addition to capturing general information about PEPs, a PEP crawler must

  • Generate a Position entity for each position where a person has the kind of influence causing them to be defined as a PEP.
  • Generate Occupancy entities representing the act of each person occupying a position for a period of time.
  • Add the role.pep topic to each PEP Person entity.
  • Add the role.rca topic to each relative or close associate, as well as the most appropriate entity to represent the relationship, e.g. Family, Associate, or UnknownLink.

Creating Positions

The Position name property should ideally capture the position and its jurisdiction, but be no more specific than that.

Selecting a position name

Do

  • use local preferred terminology
  • include the role
  • include the organisational body where needed
  • include the specific geographic jurisdiction where relevant
  • refer to Wikidata EveryPolitician for examples, specifically position Q4164871. Much work has been done there on defining positions in understandable and accurate ways, and we plan on contributing our politician data in the near future.

Avoid

  • including the legislative term
  • including the constituency an elected official represents
  • including the country for sub-national representatives

Examples

  • Prefer United States representative over Member of the House of Representatives - while it's true that they're a member of the house of representatives, the common generic term is United States representative.
  • Prefer Member of the Landtag of Mecklenburg-Vorpommern over Member of the Landtag of Mecklenburg-Vorpommern, Germany - the country is already captured as a property of the entity.
  • Prefer Member of the Hellenic Parliament over Member of the 17th Hellenic Parliament (2015-2019) - there is currently no need to distinguish between different terms of the same position. Occupancies represent distinct periods when a given person holds a position. If the same position existed in two separate periods, e.g. it was only possible to be Minister of Electricity up until 2015 and again from 2023, those can be distinguished sufficiently using the dissolution and inception properties rather than the name.
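When a source bakes the legislative term into the position title, the name can be normalised before it is passed to make_position. The following is a minimal sketch, not part of zavod: strip_term and both regular expressions are hypothetical, and they assume the term appears as an English ordinal and/or a trailing year range in parentheses.

```python
import re

# Hypothetical patterns: an ordinal such as "17th " and a trailing "(2015-2019)".
_ORDINAL = re.compile(r"\b\d+(?:st|nd|rd|th)\s+")
_YEAR_RANGE = re.compile(r"\s*\(\d{4}\s*[-–]\s*\d{4}\)")


def strip_term(name: str) -> str:
    """Remove legislative-term markers from a position name."""
    name = _YEAR_RANGE.sub("", name)
    name = _ORDINAL.sub("", name)
    # Collapse any whitespace left behind by the removals.
    return " ".join(name.split())


print(strip_term("Member of the 17th Hellenic Parliament (2015-2019)"))
```

A pattern like this would also strip ordinals that are a legitimate part of a body's name, so it is only appropriate where the source is known to use ordinals solely for terms.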

Use the make_position helper to generate position entities consistently.

Pro tip

It's perfectly fine to emit the same position over and over for each instance of a person holding that position, if that simplifies your code.

It is often convenient to just create the person, all their positions, and occupancies in a loop. You don't have to track created positions in your crawler to avoid duplicates as long as the position id is consistent for each distinct position encountered. This will be the case if the values you pass make_position are consistent. The export process will take care of deduplication of entities with consistent ids.
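Why repeated emission is harmless can be illustrated with a small stand-alone sketch. It does not use zavod: position_id is a hypothetical stand-in for whatever id make_position derives from its arguments, the dictionary stands in for deduplication at export time, and the roles are made-up sample data.

```python
import hashlib
from typing import Dict, Optional


def position_id(name: str, country: str, subnational_area: Optional[str] = None) -> str:
    """Hypothetical deterministic id: the same inputs always give the same id."""
    key = "|".join([country, subnational_area or "", name])
    return "pos-" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


# Stand-in for the export step, which keeps one entity per id.
emitted: Dict[str, dict] = {}

# Illustrative sample data only.
roles = [
    ("Alice", "Member of the Hellenic Parliament", "gr"),
    ("Bob", "Member of the Hellenic Parliament", "gr"),
    ("Carol", "Minister of Electricity", "gr"),
]
for person, name, country in roles:
    pid = position_id(name, country)
    # Re-emitting is harmless: the same id simply overwrites the same entry.
    emitted[pid] = {"id": pid, "name": name, "country": country}

print(len(emitted))  # 2 distinct positions from 3 roles
```

The point of the sketch is the design choice: as long as the id depends only on the values describing the position, the crawler needs no bookkeeping of its own.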

Categorising positions

Most sources by their nature consist entirely of PEPs. On rare occasions a source may contain positions which do not fall within the categories of roles we consider PEPs.

We maintain a database of positions where we can easily categorise positions as PEP or not, as well as their scope and role. This categorisation is used to determine whether a position, its holder(s), and the Occupancy entities relating them, should be emitted based on whether it is a PEP position, and the PEP duration of its scope.

Position categorisation UI

To allow newly discovered positions to be added to the database, and to use the is_pep value from the database, call zavod.logic.pep.categorise with the Position. If the data source is known to only include PEP positions, or if the crawler only attempts to create positions known to be PEPs, the is_pep argument should be True. Otherwise it should be None, denoting that it should be manually categorised in the database. Only make occupancies and emit entities for which the returned categorisation.is_pep is True. See example below.

During development, it is normally best to run Zavod with the environment variable ZAVOD_SYNC_POSITIONS set to false, meaning the position topics and is_pep value supplied to categorise will be used locally, rather than any value in the database.

In production, positions will be created in the database if they don't already exist, which we can later categorise manually. The is_pep and topics values from the database will then be used during crawling and enrichment respectively.
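The interaction between the crawler-supplied is_pep, the database, and ZAVOD_SYNC_POSITIONS can be modelled in a few lines. This is a simplified sketch rather than the real implementation: categorise_sketch and the in-memory db dictionary are hypothetical, topics are left out, and it assumes the database is not consulted at all when syncing is off, as described above.

```python
from typing import Dict, Optional


def categorise_sketch(
    db: Dict[str, Optional[bool]],
    position_id: str,
    is_pep: Optional[bool],
    sync: bool,
) -> Optional[bool]:
    """Return the effective is_pep for a position (simplified model)."""
    if not sync:
        # Development mode: use the crawler's value, leave the database alone.
        return is_pep
    if position_id in db:
        # A value already in the database wins over the crawler's value.
        return db[position_id]
    # Unknown position: add it with the crawler's value as the initial value.
    db[position_id] = is_pep
    return is_pep


db: Dict[str, Optional[bool]] = {"pos-1": False}
print(categorise_sketch(db, "pos-1", True, sync=True))   # False: database overrides
print(categorise_sketch(db, "pos-2", None, sync=True))   # None: awaiting manual review
print(categorise_sketch(db, "pos-1", True, sync=False))  # True: local value used
```

The first call shows why a crawler should check the returned value even when it passes is_pep=True: a position may have been recategorised in the database since it was first added.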

zavod.logic.pep.categorise(context, position, is_pep=True) cached

Checks whether this is a PEP position and for any topics needed to make PEP duration decisions.

If the position is not in the database yet, it is added.

Only emit positions where is_pep is true, even if the crawler sets is_pep to true, in case is_pep has been changed to false in the database.

Parameters:

  • context (Context, required)
  • position (Entity, required): The position to be categorised.
  • is_pep (Optional[bool], default True): Initial value for is_pep in the database if it gets added.

Source code in zavod/logic/pep.py
@lru_cache(maxsize=5000)
def categorise(
    context: Context,
    position: Entity,
    is_pep: Optional[bool] = True,
) -> PositionCategorisation:
    """Checks whether this is a PEP position and for any topics needed to make
    PEP duration decisions.

    If the position is not in the database yet, it is added.

    Only emit positions where is_pep is true, even if the crawler sets is_pep
    to true, in case is_pep has been changed to false in the database.

    Args:
      context:
      position: The position to be categorised
      is_pep: Initial value for is_pep in the database if it gets added.
    """
    categorisation = get_categorisation(context, position.id)

    if categorisation is None:
        global NOTIFIED_SYNC_POSITIONS
        if not settings.SYNC_POSITIONS:
            if not NOTIFIED_SYNC_POSITIONS:
                context.log.info(
                    "Syncing positions is disabled - falling back to categorisation provided by crawler, if any."
                )
                NOTIFIED_SYNC_POSITIONS = True
            return PositionCategorisation(topics=position.get("topics"), is_pep=is_pep)

        if not settings.OPENSANCTIONS_API_KEY:
            context.log.error(
                "Setting OPENSANCTIONS_API_KEY is required when ZAVOD_SYNC_POSITIONS is true."
            )

        context.log.info("Adding position not yet in database", entity_id=position.id)
        url = f"{settings.OPENSANCTIONS_API_URL}/positions/"
        headers = {"authorization": settings.OPENSANCTIONS_API_KEY}
        body = {
            "entity_id": position.id,
            "caption": position.caption,
            "countries": position.get("country"),
            "topics": position.get("topics"),
            "dataset": position.dataset.name,
            "is_pep": is_pep,
        }
        res = context.http.post(url, headers=headers, json=body)
        res.raise_for_status()
        data = res.json()
        categorisation = PositionCategorisation(
            topics=data.get("topics", []),
            is_pep=data.get("is_pep"),
        )

    if categorisation.is_pep is None:
        context.log.debug(
            (
                f'Position {position.get("country")} {position.get("name")}'
                " not yet categorised as PEP or not."
            )
        )

    return categorisation

zavod.logic.pep.PositionCategorisation

Bases: object

Source code in zavod/logic/pep.py
class PositionCategorisation(object):
    is_pep: Optional[bool]
    """Whether the position denotes a politically exposed person or not"""
    topics: List[str]
    """The topics linked to the position, as a list"""

    __slots__ = ["topics", "is_pep"]

    def __init__(self, topics: List[str], is_pep: Optional[bool]):
        self.topics = topics
        self.is_pep = is_pep

is_pep: Optional[bool] = is_pep instance-attribute

Whether the position denotes a politically exposed person or not

topics: List[str] = topics instance-attribute

The topics linked to the position, as a list

Creating Occupancies

Occupancies represent the fact that a person holds or held a position for a given period of time. If a person holds the same position numerous times, emit an occupancy for each instance.

For most positions, someone holding a position becomes less and less significant over time. It becomes less important to carry out anti-money-laundering checks on people the more time has passed since they held a position of influence which could enable money laundering. We therefore only represent people as PEPs if a data source indicates they hold the position now, or they left the position within the past 5 years. In these cases the occupancy status should be current or ended respectively.

If it is unclear from the data or the data methodology of the source whether a position is currently held or not, we consider someone a PEP if they have not passed away, and they entered the position within the past 40 years. In this case the occupancy status should be unknown.
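The two rules above can be written down as a small decision function. This is a sketch of the rules as stated here, not the make_occupancy implementation: occupancy_status and its parameters are hypothetical, years are approximated as 365 days, and it assumes that an occupancy without a start date cannot satisfy the 40-year rule. The real helper may treat edge cases differently and may apply other durations depending on the position's topics.

```python
from datetime import date, timedelta
from typing import Optional

AFTER_OFFICE = timedelta(days=5 * 365)  # roughly 5 years after leaving office
MAX_UNKNOWN = timedelta(days=40 * 365)  # roughly 40 years since taking office


def occupancy_status(
    today: date,
    start: Optional[date],
    end: Optional[date],
    no_end_implies_current: bool,
    deceased: bool = False,
) -> Optional[str]:
    """Return "current", "ended", "unknown", or None if not considered a PEP."""
    if end is not None:
        if end >= today:
            return "current"
        # Left office: still a PEP only within the 5-year window.
        return "ended" if today - end <= AFTER_OFFICE else None
    if no_end_implies_current:
        # The source only lists current holders, so no end date means current.
        return "current"
    # Unclear whether the position is still held: apply the 40-year rule.
    # Assumption: without a start date the rule cannot be confirmed.
    if deceased or start is None:
        return None
    return "unknown" if today - start <= MAX_UNKNOWN else None


today = date(2024, 1, 1)
print(occupancy_status(today, date(2020, 1, 1), None, True))
print(occupancy_status(today, date(2015, 1, 1), date(2021, 6, 1), True))
print(occupancy_status(today, date(1990, 1, 1), None, False))
```

A return value of None corresponds to the case where no occupancy should be emitted for that role.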

Only emit if the person is a PEP

Occupancies and positions should only be emitted for instances where these conditions are met. Persons should only be emitted if at least one occupancy exists to indicate they meet our criteria for being considered a PEP.

The make_occupancy helper will only return occupancies if they still meet these conditions, taking the PEP duration into account. You can use this to create occupancies, automatically set the correct status, and determine whether the occupancy meets our criteria and should be emitted.

Example

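from zavod import helpers as h
from zavod.logic.pep import categorise

# ... inside a function called for each person in a province ...
# This example skips people with a recorded death date.
if person_data.pop("death_date", None):
    return
person = context.make("Person")
source_id = person_data.pop("id")
# Give the person a stable id derived from the source's own identifier.
person.id = context.make_slug(source_id)
person.add("country", "us")
person.add("name", person_data.pop("name"))
# ... more person properties ...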

pep_entities = []
for role in person_data.pop("roles"):
    position = h.make_position(
        context,
        f"Member of the {province} Legislature",
        country="us",
        subnational_area=province
    )
    categorisation = categorise(context, position, is_pep=True)
    if not categorisation.is_pep:
        continue
    occupancy = h.make_occupancy(
        context,
        person,
        position,
        True,  # no_end_implies_current: a missing end date means the role is current
        start_date=role.get("start_date", None),
        end_date=role.get("end_date", None),
        categorisation=categorisation
    )
    if occupancy:
        pep_entities.append(position)
        pep_entities.append(occupancy)

if pep_entities:
    person.add("topics", "role.pep")
    context.emit(person, target=True)
for entity in pep_entities:
    context.emit(entity)