Context

zavod.context.Context

The context is a utility object that is passed as an argument into crawlers and other runners.

It supports creating and emitting (storing) entities, accessing metadata and logging errors and warnings. It also has functions for fetching data from the web and storing it in the dataset's data folder.

cache: Cache property

A cache object for storing HTTP responses and other data.

data_time: datetime property writable

The data provenance time to be used for the emitted statements. This is used to set the first_seen and last_seen properties of statements to a time that may be different than the real run time of the crawler, e.g. when a coverage end is defined, or the data source itself states an update time.

Returns:

Type Description
datetime

The time to be used for the emitted statements.
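
As a minimal sketch of the "data source itself states an update time" case: the helper below assigns a parsed timestamp to the writable data_time property. The function name apply_source_time and the updated_at argument are hypothetical, and the sketch assumes the source publishes an ISO 8601 timestamp.

```python
from datetime import datetime


def apply_source_time(context, updated_at: str) -> None:
    """Use the update time stated by the source as the provenance time.

    `context` is the Context object passed to the crawler; `updated_at`
    is a hypothetical ISO 8601 string taken from the source data.
    """
    context.data_time = datetime.fromisoformat(updated_at)
```

Statements emitted after this assignment would then carry the source's own time rather than the crawler's run time.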

data_time_iso: str cached property

String representation of data_time in ISO format.

data_url: str property

The URL of the source data for the dataset.

lang: Optional[str] = None instance-attribute

Default language for statements emitted from this dataset.

timestamps: TimeStampIndex property

An index of the first_seen time of every statement previously emitted by the dataset. This is used to determine whether a statement is new.

version: Version property

The current version of the dataset.

audit_data(data, ignore=[])

Print the formatted data object if it contains any fields not explicitly excluded by the ignore list. The intended pattern is to remove the expected fields from the mapping one by one as they are processed, then pass the remainder to this method so that any unexpected data in the source produces a warning.

Parameters:

Name Type Description Default
data Dict[Any, Any]

A mapping which is to be checked.

required
ignore List[Any]

List of string keys to be skipped when checking the mapping.

[]
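
The pattern described above might look roughly like this in a crawler. The function name split_row and the field names (name, country, internal_id) are hypothetical; only the audit_data call itself comes from this page.

```python
def split_row(context, row: dict) -> tuple:
    """Pop the known fields from a source record and audit what is left.

    `context` is the Context object passed to the crawler. The field
    names used here are illustrative assumptions.
    """
    # Remove the fields the crawler knows how to handle, one by one.
    name = row.pop("name", None)
    country = row.pop("country", None)
    # Whatever remains is unexpected, except keys deliberately ignored.
    context.audit_data(row, ignore=["internal_id"])
    return name, country
```

Using pop rather than plain key access is what makes the audit meaningful, since handled fields no longer appear in the mapping that is checked.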

begin(clear=False)

Prepare the context for running the exporter.

Parameters:

Name Type Description Default
clear bool

Remove the existing resources and issues from the dataset.

False

clear_url(fingerprint)

Remove a given URL from the cache using its request fingerprint.

Parameters:

fingerprint: The unique fingerprint of the request.

Returns: None

close()

Flush and tear down the context.

debug_lookups()

Output a list of unused lookup options.

emit(entity, target=False, external=False)

Send an entity from the crawling/runner process to be stored.

Parameters:

Name Type Description Default
entity Entity

The entity to be stored.

required
target bool

Whether the entity is a target of the dataset.

False
external bool

Whether the entity is an enrichment candidate or already part of the dataset.

False
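
A minimal sketch of how make, make_id and emit could fit together in a crawler. The function name crawl_records and the record keys are hypothetical. The add method is inherited from the entity base classes and is not listed on this page, and the property names assume the Person schema.

```python
def crawl_records(context, records: list) -> None:
    """Turn simple source records into entities and emit them.

    `context` is the Context object passed to the crawler; `records`
    is a hypothetical list of dicts with `name` and `birth_date` keys.
    """
    for record in records:
        entity = context.make("Person")
        # Entities are created without an ID, so assign one before emitting.
        entity.id = context.make_id(record["name"], record["birth_date"])
        entity.add("name", record["name"])
        entity.add("birthDate", record["birth_date"])
        context.emit(entity, target=True)
```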

export_resource(path, mime_type=None, title=None)

Register a file as a data resource exported by the dataset.

Parameters:

Name Type Description Default
path Path

The file path of the exported resource.

required
mime_type Optional[str]

MIME type of the resource; it is guessed if not given.

None
title Optional[str]

A human-readable description.

None

Returns:

Type Description
DataResource

The generated resource object which has been saved.

fetch_html(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)

Execute an HTTP request using the context's session and return an HTML DOM object based on the response. If a cache_days argument is provided, a cache will be used for the given number of days.

Parameters:

Name Type Description Default
url str

The URL to be fetched.

required
params ParamsType

URL query parameters to be included in the URL.

None
headers _Headers

HTTP request headers to be included.

None
auth _Auth

HTTP basic authorization username and password to be included.

None
cache_days Optional[int]

Number of days to retain cached responses for.

None
method str

The HTTP method to use for the request.

'GET'
data _Body

The data to be sent in the request body.

None

Returns: An lxml-based DOM of the web page that has been returned.
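
As a rough illustration, the returned DOM can be walked with the usual element-tree methods. The function name crawl_table is hypothetical, and the sketch assumes the page contains a plain HTML table whose data cells are td elements holding only text.

```python
def crawl_table(context, url: str) -> list:
    """Fetch a page and return the text of each table row's cells.

    `context` is the Context object passed to the crawler. The page
    layout (rows of plain-text `td` cells) is an assumption.
    """
    doc = context.fetch_html(url, cache_days=1)
    rows = []
    for tr in doc.findall(".//tr"):
        cells = [(td.text or "").strip() for td in tr.findall("./td")]
        # Header rows made only of `th` cells yield no data cells.
        if cells:
            rows.append(cells)
    return rows
```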

fetch_json(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)

Execute an HTTP request using the context's session and return a JSON-decoded object based on the response. If a cache_days argument is provided, a cache will be used for the given number of days.

Parameters:

Name Type Description Default
url str

The URL to be fetched.

required
params ParamsType

URL query parameters to be included in the URL.

None
headers _Headers

HTTP request headers to be included.

None
auth _Auth

HTTP basic authorization username and password to be included.

None
cache_days Optional[int]

Number of days to retain cached responses for.

None
method str

The HTTP method to use for the request.

'GET'
data _Body

The data to be sent in the request body.

None

Returns:

Type Description
Any

The JSON-decoded response body.
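
A sketch of using params and cache_days to page through an API. The function name fetch_all_pages, the page query parameter and the results key are all assumptions about a hypothetical source, not part of zavod.

```python
def fetch_all_pages(context, url: str, max_pages: int = 100) -> list:
    """Collect items from a paginated JSON API.

    `context` is the Context object passed to the crawler. The `page`
    query parameter and the `results` key are hypothetical.
    """
    items = []
    for page in range(1, max_pages + 1):
        data = context.fetch_json(url, params={"page": page}, cache_days=1)
        results = data.get("results", [])
        # Stop at the first empty page.
        if not results:
            break
        items.extend(results)
    return items
```

The max_pages bound is there so that a source which never returns an empty page cannot keep the crawler looping forever.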

fetch_resource(name, url, auth=None, headers=None, method='GET', data=None)

Fetch a URL into a file located in the current run folder, unless that file already exists.
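
A possible way to combine this with export_resource, sketched under the assumption that fetch_resource returns the path of the local file (this page does not state its return value). The function name fetch_source, the file name and the title are hypothetical.

```python
def fetch_source(context):
    """Download the source file and register it as a dataset resource.

    `context` is the Context object passed to the crawler. This assumes
    `fetch_resource` returns the local file path.
    """
    path = context.fetch_resource("source.xml", context.data_url)
    context.export_resource(path, mime_type="text/xml", title="Source data")
    return path
```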

fetch_response(url, headers=None, auth=None, method='GET', data=None)

Execute an HTTP request using the context's session.

Parameters:

Name Type Description Default
url str

The URL to be fetched.

required
headers _Headers

HTTP request headers to be included.

None
auth _Auth

HTTP basic authorization username and password to be included.

None
method str

The HTTP method to use for the request.

'GET'
data _Body

The data to be sent in the request body.

None

Returns: A response object.

fetch_text(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)

Execute an HTTP request using the context's session and return the decoded response body. If a cache_days argument is provided, a cache will be used for the given number of days.

Parameters:

Name Type Description Default
url str

The URL to be fetched.

required
params ParamsType

URL query parameters to be included in the URL.

None
headers _Headers

HTTP request headers to be included.

None
auth _Auth

HTTP basic authorization username and password to be included.

None
cache_days Optional[int]

Number of days to retain cached responses for. None to disable.

None
method str

The HTTP method to use for the request.

'GET'
data _Body

The data to be sent in the request body.

None

Returns:

Type Description
Optional[str]

The decoded response body as a string.
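
Since the return type is Optional[str], callers may want to handle a missing body explicitly. The sketch below parses a CSV source with the standard library. The function name fetch_csv_rows is hypothetical, and treating None as "no rows" is one possible choice rather than documented behaviour.

```python
import csv
import io


def fetch_csv_rows(context, url: str) -> list:
    """Fetch a CSV document and return its rows as dictionaries.

    `context` is the Context object passed to the crawler.
    """
    text = context.fetch_text(url, cache_days=1)
    if text is None:
        # No body was returned; treat it as an empty source.
        return []
    return list(csv.DictReader(io.StringIO(text)))
```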

get_resource_path(name)

Get the path to a file in the dataset data folder.

Parameters:

Name Type Description Default
name PathLike

The name of the file, relative to the dataset data folder.

required

Returns:

Type Description
Path

The full path to the file.

inspect(obj)

Display an object in a form suitable for inspection.

Parameters:

Name Type Description Default
obj Any

The object to be logged in pretty print.

required

lookup_value(lookup, value, default=None)

Invoke a datapatch lookup defined in the dataset metadata.

Parameters:

Name Type Description Default
lookup str

The name of the lookup, i.e. the key under the dataset's lookups property.

required
value Optional[str]

The data value to look up.

required
default Optional[str]

The default value to use if the lookup doesn't match the value.

None
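
A small usage sketch, assuming the dataset metadata defines a lookup named status (a hypothetical name) and that the method returns the mapped value or the given default. The return value is not described on this page, so that part is an assumption.

```python
def normalise_status(context, raw):
    """Map a raw source value through a lookup from the dataset metadata.

    `context` is the Context object passed to the crawler. The lookup
    name `status` is hypothetical. Falling back to the raw value keeps
    unmapped inputs visible in the output.
    """
    return context.lookup_value("status", raw, default=raw)
```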

make(schema)

Make a new entity with some dataset context set.

Parameters:

Name Type Description Default
schema Union[str, Schema]

The entity's type name.

required

Returns:

Type Description
Entity

A newly created entity object of the given type, with no ID.

make_id(*parts, prefix=None, hash_prefix=None)

Make a hash-based entity ID from a list of strings, prefixed with the dataset prefix.

Parameters:

Name Type Description Default
prefix Optional[str]

Use this prefix in the slug, but not the hash.

None
hash_prefix Optional[str]

Use this prefix in the hash, but not the slug.

None

make_slug(*parts, strict=True, prefix=None)

Make a slug-based entity ID from a list of strings, using the dataset prefix.
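
One way to choose between the two ID helpers, as a sketch: use a slug when the source provides a stable identifier, and fall back to a hash of descriptive fields otherwise. The function name assign_id and the record keys are hypothetical.

```python
def assign_id(context, entity, record: dict) -> None:
    """Assign an entity ID from a source record.

    `context` is the Context object passed to the crawler. The keys
    `registration_number`, `name` and `country` are illustrative.
    """
    reg_nr = record.get("registration_number")
    if reg_nr:
        # A stable source identifier gives a readable, slug-based ID.
        entity.id = context.make_slug("reg", reg_nr)
    else:
        # Otherwise derive a hash-based ID from descriptive fields.
        entity.id = context.make_id(record["name"], record["country"])
```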

parse_resource_xml(name)

Parse a file in the resource folder into an XML tree.

Parameters:

Name Type Description Default
name PathLike

The resource name or relative file path.

required

Returns:

Type Description
_ElementTree

An lxml element tree of the parsed XML.
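
The returned tree can be queried with the standard element-tree methods. The sketch below assumes a hypothetical file source.xml that has already been placed in the resource folder and contains entry elements, each with a name child. The function name read_entries is also hypothetical.

```python
def read_entries(context, name: str = "source.xml") -> list:
    """Parse an XML resource and collect the text of each entry's name.

    `context` is the Context object passed to the crawler. The element
    names `entry` and `name` are assumptions about the source format.
    """
    doc = context.parse_resource_xml(name)
    names = []
    for entry in doc.findall(".//entry"):
        names.append(entry.findtext("name"))
    return names
```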

zavod.entity.Entity

Bases: CompositeEntity

Entity for sanctions list entries and adjacent objects.

Adds utility methods to the EntityProxy for extracting data from sanctions lists and for reporting parsing errors to structured logging.

add_cast(schema, prop, values, cleaned=False, fuzzy=False, format=None, lang=None, original_value=None)

Set a property on an entity. If the entity is of a schema that doesn't have the given property, also modify the schema (e.g. if something has a birthDate, assume it's a Person, not a LegalEntity).

add_schema(schema)

Try to apply the given schema to the current entity, making it more specific (e.g. turning a LegalEntity into a Company). This raises an exception if the current and new type are incompatible.
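
A brief sketch of how add_schema and add_cast might be used while parsing a record. The function name add_details and the record keys is_company and birth_date are hypothetical; the schema and property names follow the examples given in the descriptions above.

```python
def add_details(entity, record: dict) -> None:
    """Refine an entity's schema based on what the source record contains.

    `entity` is an Entity created via the context. The record keys
    used here are illustrative assumptions.
    """
    if record.get("is_company"):
        # Make a generic entity more specific; per the description
        # above, this raises if the two schemata are incompatible.
        entity.add_schema("Company")
    if record.get("birth_date"):
        # Setting a birth date implies the entity is a Person.
        entity.add_cast("Person", "birthDate", record["birth_date"])
```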

unsafe_add(prop, value, cleaned=False, fuzzy=False, format=None, quiet=False, schema=None, dataset=None, seen=None, lang=None, original_value=None)

Add a statement to the entity, possibly the value.