Context

`zavod.context.Context`

The context is a utility object that is passed as an argument into crawlers and other runners.

It supports creating and emitting (storing) entities, accessing metadata and logging errors and warnings. It also has functions for fetching data from the web and storing it in the dataset's data folder.

`cache` `property`

A cache object for storing HTTP responses and other data.

`conn` `property`

Expose a database connection to the ETL store.

`data_url` `property`

The URL of the source data for the dataset.

`lang = None` `instance-attribute`

Default language for statements emitted from this dataset.

`timestamps` `property`

An index of the first_seen time of every statement previously emitted by the dataset. This is used to determine whether a statement is new.

`version` `property`

The current version of the dataset.

`audit_data(data, ignore=[])`

Print the formatted data object if it contains any fields not explicitly excluded by the ignore list. This is used to warn about unexpected data in the source by removing the fields one by one and then inspecting the rest.

Parameters:

Name	Type	Description	Default
`data`	`Dict[Any, Any]`	A mapping which is to be checked.	required
`ignore`	`List[Any]`	List of string keys to be skipped when checking the mapping	`[]`
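
The "remove the fields one by one and then inspect the rest" pattern can be sketched in plain Python. This is an illustrative stand-in, not zavod's implementation: `unexpected_fields` is a hypothetical helper, and skipping empty values is an assumption.

```python
from typing import Any, Dict, List, Optional


def unexpected_fields(
    data: Dict[Any, Any], ignore: Optional[List[Any]] = None
) -> Dict[Any, Any]:
    """Return the fields of `data` that are not covered by `ignore`."""
    remaining = dict(data)  # copy so the caller's mapping is not mutated
    for key in ignore or []:
        remaining.pop(key, None)
    # Assumption: empty values (None, "") are not worth reporting.
    return {k: v for k, v in remaining.items() if v not in (None, "")}


# A crawler pops every field it knows how to handle...
row = {"name": "ACME Ltd", "country": "pa", "fax": "555-0100", "internal_id": "7"}
name = row.pop("name")
country = row.pop("country")

# ...and audits what is left, skipping fields it deliberately ignores.
leftover = unexpected_fields(row, ignore=["internal_id"])
if leftover:
    print("Unexpected data found:", leftover)
```

In a real crawler the equivalent call would be `context.audit_data(row, ignore=["internal_id"])` after the known fields have been popped.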

`begin(clear=False)`

Prepare the context for running the exporter.

Parameters:

Name	Type	Description	Default
`clear`	`bool`	Remove the existing resources and issues from the dataset.	`False`

`clear_url(fingerprint)`

Remove a given URL from the cache using its request fingerprint.

Parameters:

Name	Description	Default
`fingerprint`	The unique fingerprint of the request.	required

Returns: None

`close()`

Flush and tear down the context.

`debug_lookups()`

Output a list of unused lookup options.

`emit(entity, external=False, origin=None)`

Send an entity from the crawling/runner process to be stored.

Parameters:

Name	Type	Description	Default
`entity`	`Entity`	The entity to be stored.	required
`external`	`bool`	Whether the entity is an enrichment candidate or already part of the dataset.	`False`
`origin`	`Optional[str]`	Set the origin for statements where none has been provided.	`None`
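
A minimal sketch of how a crawler might combine `make`, an ID helper and `emit`. `StubContext` and `StubEntity` are hypothetical stand-ins written so the example runs without zavod installed; they are not the library's classes, and both the slug format and the rejection of entities without an ID are assumptions.

```python
from typing import Dict, List, Optional


class StubEntity:
    """Hypothetical stand-in for an entity: a schema, an ID and properties."""

    def __init__(self, schema: str) -> None:
        self.schema = schema
        self.id: Optional[str] = None
        self.properties: Dict[str, List[str]] = {}

    def add(self, prop: str, value: str) -> None:
        self.properties.setdefault(prop, []).append(value)


class StubContext:
    """Hypothetical stand-in mimicking a small part of the Context interface."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.emitted: List[StubEntity] = []

    def make(self, schema: str) -> StubEntity:
        return StubEntity(schema)  # new entity, no ID yet

    def make_slug(self, *parts: str) -> str:
        # Assumed slug format: prefix and parts joined by dashes, lowercased.
        return "-".join([self.prefix, *parts]).lower().replace(" ", "-")

    def emit(self, entity: StubEntity, external: bool = False,
             origin: Optional[str] = None) -> None:
        if entity.id is None:  # assumption: entities need an ID to be stored
            raise ValueError("Entity has no ID")
        self.emitted.append(entity)


def crawl(context: StubContext) -> None:
    """Create one Person per source row and hand it to the context."""
    for name, country in [("Jane Doe", "gb"), ("John Roe", "us")]:
        entity = context.make("Person")
        entity.id = context.make_slug(name)
        entity.add("name", name)
        entity.add("country", country)
        context.emit(entity)


context = StubContext(prefix="demo")
crawl(context)
print([entity.id for entity in context.emitted])
```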

`export_resource(path, mime_type=None, title=None)`

Register a file as a data resource exported by the dataset.

Parameters:

Name	Type	Description	Default
`path`	`Path`	The file path of the exported resource	required
`mime_type`	`Optional[str]`	MIME type of the resource, will be guessed otherwise	`None`
`title`	`Optional[str]`	A human-readable description.	`None`

Returns:

Type	Description
`DataResource`	The generated resource object which has been saved.

`fetch_html(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None, absolute_links=False)`

Execute an HTTP request using the context's session and return an HTML DOM object based on the response. If a cache_days argument is provided, a cache will be used for the given number of days.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL to be fetched.	required
`params`	`ParamsType`	URL query parameters to be included in the URL.	`None`
`headers`	`_Headers`	HTTP request headers to be included.	`None`
`auth`	`_Auth`	HTTP basic authorization username and password to be included.	`None`
`cache_days`	`Optional[int]`	Number of days to retain cached responses for.	`None`
`method`	`str`	The HTTP method to use for the request.	`'GET'`
`data`	`Optional[_Body]`	The data to be sent in the request body.	`None`
`absolute_links`	`bool`	Whether to convert relative links to absolute links.	`False`

Returns: An lxml-based DOM of the web page that has been returned.

`fetch_json(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)`

Execute an HTTP request using the context's session and return a JSON-decoded object based on the response. If a cache_days argument is provided, a cache will be used for the given number of days.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL to be fetched.	required
`params`	`ParamsType`	URL query parameters to be included in the URL.	`None`
`headers`	`_Headers`	HTTP request headers to be included.	`None`
`auth`	`_Auth`	HTTP basic authorization username and password to be included.	`None`
`cache_days`	`Optional[int]`	Number of days to retain cached responses for.	`None`
`method`	`str`	The HTTP method to use for the request.	`'GET'`
`data`	`Optional[_Body]`	The data to be sent in the request body.	`None`

Returns:

Type	Description
`Any`	The decoded response body as a JSON-decoded object.

`fetch_resource(name, url, auth=None, headers=None, method='GET', data=None)`

Fetch a URL into a file located in the current run folder, if it does not exist.
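
The "fetch only if the file does not exist" behaviour can be sketched as below. This is an assumption-laden illustration rather than zavod's code: `fetch_resource_sketch` and the injected `download` callable are hypothetical, and a temporary directory stands in for the run folder.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_resource_sketch(
    folder: Path, name: str, download: Callable[[], bytes]
) -> Path:
    """Store the downloaded bytes under folder/name unless the file exists."""
    path = folder / name
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(download())
    return path


downloads = []


def fake_download() -> bytes:
    # Stand-in for the HTTP request; records how often it is invoked.
    downloads.append(1)
    return b"<list/>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_resource_sketch(Path(tmp), "source.xml", fake_download)
    second = fetch_resource_sketch(Path(tmp), "source.xml", fake_download)
    same_path = first == second
    content = second.read_bytes()

print(len(downloads), same_path)
```

The second call returns the existing file without invoking the downloader again, which is why re-running a crawler in the same run folder does not repeat the download.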

`fetch_response(url, headers=None, auth=None, method='GET', data=None)`

Execute an HTTP request using the context's session.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL to be fetched.	required
`headers`	`_Headers`	HTTP request headers to be included.	`None`
`auth`	`_Auth`	HTTP basic authorization username and password to be included.	`None`
`method`	`str`	The HTTP method to use for the request.	`'GET'`
`data`	`Optional[_Body]`	The data to be sent in the request body.	`None`

Returns: A response object.

`fetch_text(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)`

Execute an HTTP request using the context's session and return the decoded response body. If a cache_days argument is provided, a cache will be used for the given number of days.

Parameters:

Name	Type	Description	Default
`url`	`str`	The URL to be fetched.	required
`params`	`ParamsType`	URL query parameters to be included in the URL.	`None`
`headers`	`_Headers`	HTTP request headers to be included.	`None`
`auth`	`_Auth`	HTTP basic authorization username and password to be included.	`None`
`cache_days`	`Optional[int]`	Number of days to retain cached responses for. `None` to disable.	`None`
`method`	`str`	The HTTP method to use for the request.	`'GET'`
`data`	`Optional[_Body]`	The data to be sent in the request body.	`None`

Returns:

Type	Description
`Optional[str]`	The decoded response body as a string.
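
The effect of `cache_days` on the fetch helpers can be illustrated with a small time-based cache. This is a sketch under stated assumptions, not the library's cache: `cached_fetch`, the in-memory `_cache` dictionary and the injected `fetch` callable are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory cache: URL -> (time stored, response text).
_cache: Dict[str, Tuple[datetime, str]] = {}


def cached_fetch(
    url: str,
    fetch: Callable[[str], str],
    cache_days: Optional[int] = None,
    now: Optional[datetime] = None,
) -> str:
    """Return a cached body if it is younger than cache_days, else fetch."""
    now = now or datetime.now(timezone.utc)
    if cache_days is not None:
        hit = _cache.get(url)
        if hit is not None and now - hit[0] < timedelta(days=cache_days):
            return hit[1]
    text = fetch(url)
    if cache_days is not None:
        _cache[url] = (now, text)  # only store when caching was requested
    return text


calls = []


def fake_fetch(url: str) -> str:
    # Stand-in for the HTTP request; each call yields a new body.
    calls.append(url)
    return f"body {len(calls)}"


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
url = "https://example.com/list"
first = cached_fetch(url, fake_fetch, cache_days=7, now=t0)
second = cached_fetch(url, fake_fetch, cache_days=7, now=t0 + timedelta(days=3))
third = cached_fetch(url, fake_fetch, cache_days=7, now=t0 + timedelta(days=8))
print(first, second, third, sep=" | ")
```

The request on day 3 is served from the cache, while the one on day 8 falls outside the seven-day window and triggers a new fetch.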

`flush()`

Flush the context to ensure all data is written to disk.

`get_resource_path(name)`

Get the path to a file in the dataset data folder.

Parameters:

Name	Type	Description	Default
`name`	`PathLike`	The name of the file, relative to the dataset data folder.	required

Returns:

Type	Description
`Path`	The full path to the file.

`inspect(obj)`

Display an object in a form suitable for inspection.

Parameters:

Name	Type	Description	Default
`obj`	`Any`	The object to be logged in pretty print.	required

`lookup(lookup, value, *, warn_unmatched=False)`

Invoke a datapatch lookup defined in the dataset metadata.

Parameters:

Name	Type	Description	Default
`lookup`	`str`	The name of the lookup. The key under the dataset lookups property.	required
`value`	`Optional[str]`	The data value to look up.	required
`warn_unmatched`	`bool`	Whether to log a warning if no match is found.	`False`

`lookup_value(lookup, value, default=None, *, warn_unmatched=False)`

Invoke a datapatch lookup defined in the dataset metadata, returning the value attribute.

Parameters:

Name	Type	Description	Default
`lookup`	`str`	The name of the lookup. The key under the dataset lookups property.	required
`value`	`Optional[str]`	The data value to look up.	required
`default`	`Optional[str]`	The default value to use if the lookup doesn't match the value.	`None`
`warn_unmatched`	`bool`	Whether to log a warning if no match is found.	`False`
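
The relationship between `value`, `default` and `warn_unmatched` can be sketched with a plain dictionary. This stand-in does not use datapatch and ignores its matching options; `LOOKUPS` and `lookup_value_sketch` are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical lookup tables; in zavod these are defined in the dataset metadata.
LOOKUPS: Dict[str, Dict[str, str]] = {
    "country": {"Burma": "mm", "Holland": "nl"},
}


def lookup_value_sketch(
    lookup: str,
    value: Optional[str],
    default: Optional[str] = None,
    *,
    warn_unmatched: bool = False,
) -> Optional[str]:
    """Return the mapped value, or `default` when nothing matches."""
    table = LOOKUPS[lookup]
    if value is not None and value in table:
        return table[value]
    if warn_unmatched:
        print(f"Lookup {lookup!r} has no match for {value!r}")
    return default


matched = lookup_value_sketch("country", "Holland")
fallback = lookup_value_sketch("country", "Narnia", default="zz")
missing = lookup_value_sketch("country", None, warn_unmatched=True)
print(matched, fallback, missing)
```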

`make(schema)`

Make a new entity with some dataset context set.

Parameters:

Name	Type	Description	Default
`schema`	`Union[str, Schema]`	The entity's type name.	required

Returns:

Type	Description
`Entity`	A newly created entity object of the given type, with no ID.

`make_id(*parts, prefix=None, hash_prefix=None)`

Make a hash-based entity ID from a list of strings, prefixed with the dataset prefix.

Parameters:

Name	Type	Description	Default
`prefix`	`Optional[str]`	Use this prefix in the slug, but not the hash.	`None`
`hash_prefix`	`Optional[str]`	Use this prefix in the hash, but not the slug.	`None`
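
The difference between `prefix` (affects the slug, not the hash) and `hash_prefix` (affects the hash, not the slug) can be sketched as follows. The SHA-1 digest, the dash separator and returning `None` for empty input are assumptions made for illustration; `make_id_sketch` and its `dataset_prefix` argument are hypothetical.

```python
import hashlib
from typing import Optional


def make_id_sketch(
    *parts: Optional[str],
    dataset_prefix: str,
    prefix: Optional[str] = None,
    hash_prefix: Optional[str] = None,
) -> Optional[str]:
    """Build '<slug prefix>-<hash of parts>' from the given strings."""
    digest = hashlib.sha1()
    # The hash is seeded with hash_prefix, falling back to the dataset prefix.
    digest.update((hash_prefix or dataset_prefix).encode("utf-8"))
    seen = False
    for part in parts:
        if part:  # skip None and empty strings
            digest.update(part.encode("utf-8"))
            seen = True
    if not seen:
        return None  # assumption: no usable parts means no ID
    # The visible slug uses prefix, falling back to the dataset prefix.
    return f"{prefix or dataset_prefix}-{digest.hexdigest()}"


base = make_id_sketch("Jane Doe", "1970-01-01", dataset_prefix="xx")
reslugged = make_id_sketch(
    "Jane Doe", "1970-01-01", dataset_prefix="xx", prefix="yy"
)
rehashed = make_id_sketch(
    "Jane Doe", "1970-01-01", dataset_prefix="xx", hash_prefix="zz"
)
print(base, reslugged, rehashed, sep="\n")
```

Here `reslugged` shares its hash with `base` but starts with `yy-`, while `rehashed` keeps the `xx-` slug prefix and carries a different hash.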

`make_slug(*parts, strict=True, prefix=None)`

Make a slug-based entity ID from a list of strings, using the dataset prefix.

`parse_resource_xml(name)`

Parse a file in the resource folder into an XML tree.

Parameters:

Name	Type	Description	Default
`name`	`PathLike`	The resource name or relative file path.	required

Returns:

Type	Description
`_ElementTree`	An lxml element tree of the parsed XML.

`zavod.entity.Entity`

Bases: StatementEntity

Entity for sanctions list entries and adjacent objects.

Add utility methods to the EntityProxy for extracting data from sanctions lists and for auditing parsing errors to structured logging.

`add_cast(schema, prop, values, cleaned=False, fuzzy=False, format=None, lang=None, original_value=None)`

Set a property on an entity. If the entity is of a schema that doesn't have the given property, also modify the schema (e.g. if something has a birthDate, assume it's a Person, not a LegalEntity).

`add_schema(schema)`

Try to apply the given schema to the current entity, making it more specific (e.g. turning a LegalEntity into a Company). This raises an exception if the current and new type are incompatible.
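
The "more specific schema or exception" rule can be sketched with a hand-written hierarchy. The `PARENTS` table is a simplified, single-parent stand-in for the real schema model, and `more_specific` is a hypothetical helper, not the method's implementation.

```python
from typing import Dict, List, Optional

# Simplified, single-parent stand-in for the schema hierarchy.
PARENTS: Dict[str, Optional[str]] = {
    "Thing": None,
    "LegalEntity": "Thing",
    "Person": "LegalEntity",
    "Organization": "LegalEntity",
    "Company": "Organization",
}


def ancestors(schema: str) -> List[str]:
    """Return the schema followed by all of its parents, most specific first."""
    chain = []
    current: Optional[str] = schema
    while current is not None:
        chain.append(current)
        current = PARENTS[current]
    return chain


def more_specific(current: str, new: str) -> str:
    """Pick the more specific of two schemata, or fail if they are unrelated."""
    if current in ancestors(new):
        return new  # new schema refines the current one
    if new in ancestors(current):
        return current  # current schema is already at least as specific
    raise ValueError(f"Incompatible schemata: {current}, {new}")


print(more_specific("LegalEntity", "Company"))
print(more_specific("Company", "LegalEntity"))
try:
    more_specific("Person", "Company")
except ValueError as exc:
    print(exc)
```

A `LegalEntity` can become a `Company`, but a `Person` cannot, because neither schema is an ancestor of the other in this table.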

`unsafe_add(prop, value, cleaned=False, fuzzy=False, format=None, quiet=False, schema=None, dataset=None, seen=None, lang=None, original_value=None, origin=None)`

Add a statement to the entity, possibly the value.
