Context
zavod.context.Context
The context is a utility object that is passed as an argument into crawlers and other runners.
It supports creating and emitting (storing) entities, accessing metadata and logging errors and warnings. It also has functions for fetching data from the web and storing it in the dataset's data folder.
cache: Cache
property
A cache object for storing HTTP responses and other data.
data_time: datetime
property
writable
The data provenance time to be used for the emitted statements. This is used to set the first_seen and last_seen properties of statements to a time that may be different than the real run time of the crawler, e.g. when a coverage end is defined, or the data source itself states an update time.
Returns:
Type | Description |
---|---|
datetime | The time to be used for the emitted statements. |
data_time_iso: str
cached
property
String representation of data_time in ISO format.
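To illustrate what an ISO representation of a provenance time looks like, the sketch below formats a datetime with the standard library. The sample timestamp and the second-level precision are assumptions made for illustration, not confirmed details of zavod's output.

```python
from datetime import datetime

# Hypothetical provenance time, e.g. an update time stated by the data source.
data_time = datetime(2024, 3, 1, 12, 30, 45)

# ISO 8601 string; second-level precision is an assumption made for this sketch.
data_time_iso = data_time.isoformat(sep="T", timespec="seconds")
print(data_time_iso)
```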
data_url: str
property
The URL of the source data for the dataset.
lang: Optional[str] = None
instance-attribute
Default language for statements emitted from this dataset.
timestamps: TimeStampIndex
property
An index of the first_seen time of every statement previously emitted by the dataset. This is used to determine if a statement is new or not.
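The idea of a first_seen index can be sketched with a plain dictionary. This is a simplified illustration, not the TimeStampIndex implementation: the function name, the statement IDs and the use of datetime values are all hypothetical.

```python
from datetime import datetime
from typing import Dict


def resolve_first_seen(
    index: Dict[str, datetime], statement_id: str, data_time: datetime
) -> datetime:
    """Return the recorded first_seen if the statement is known, else data_time."""
    return index.get(statement_id, data_time)


# Hypothetical index built from a previous run of the dataset.
index = {"stmt-1": datetime(2023, 1, 1)}
now = datetime(2024, 3, 1)

known = resolve_first_seen(index, "stmt-1", now)  # seen before: keeps the old time
fresh = resolve_first_seen(index, "stmt-2", now)  # new statement: uses the data time
print(known, fresh)
```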
version: Version
property
The current version of the dataset.
audit_data(data, ignore=[])
Print the formatted data object if it contains any fields not explicitly excluded by the ignore list. This is used to warn about unexpected data in the source by removing the fields one by one and then inspecting the rest.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Dict[Any, Any] | A mapping which is to be checked. | required |
ignore | List[Any] | List of string keys to be skipped when checking the mapping | [] |
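The auditing pattern described above (consume the fields you understand, then check what is left) can be sketched as a standalone function. This is an illustration under assumptions, not zavod's code: the helper name `unexpected_fields` is hypothetical, and treating None and empty strings as "nothing to report" is an assumption.

```python
from typing import Any, Dict, List, Optional


def unexpected_fields(
    data: Dict[Any, Any], ignore: Optional[List[Any]] = None
) -> Dict[Any, Any]:
    """Return the fields of `data` that are neither ignored nor empty."""
    ignored = ignore or []
    leftover = {}
    for key, value in data.items():
        if key in ignored:
            continue
        # Assumption: empty values are not worth a warning.
        if value is None or value == "":
            continue
        leftover[key] = value
    return leftover


# Made-up source row for illustration.
row = {"name": "ACME Ltd", "country": "de", "fax": "123", "notes": ""}
name = row.pop("name")  # fields the crawler handles are removed one by one
leftover = unexpected_fields(row, ignore=["country"])
if leftover:
    print("Unexpected data found:", leftover)
```

In this sketch only the unhandled `fax` field remains, which is the kind of field a warning would draw attention to.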
begin(clear=False)
Prepare the context for running the exporter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
clear | bool | Remove the existing resources and issues from the dataset. | False |
clear_url(fingerprint)
Remove a given URL from the cache using its request fingerprint.
Args: fingerprint: The unique fingerprint of the request.
Returns: None
close()
Flush and tear down the context.
debug_lookups()
Output a list of unused lookup options.
emit(entity, target=False, external=False)
Send an entity from the crawling/runner process to be stored.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
entity | Entity | The entity to be stored. | required |
target | bool | Whether the entity is a target of the dataset. | False |
external | bool | Whether the entity is an enrichment candidate or already part of the dataset. | False |
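The make-then-emit flow of a crawler can be sketched without zavod installed by using small stand-in classes. `StubEntity` and `StubContext` below are hypothetical; they only mimic the shape of the `make` and `emit` calls documented here, and the rule that an entity without an ID is rejected is an assumption of this sketch.

```python
from typing import Dict, List, Optional


class StubEntity:
    """Hypothetical stand-in for an entity; not zavod's Entity class."""

    def __init__(self, schema: str) -> None:
        self.schema = schema
        self.id: Optional[str] = None
        self.properties: Dict[str, List[str]] = {}

    def add(self, prop: str, value: str) -> None:
        self.properties.setdefault(prop, []).append(value)


class StubContext:
    """Hypothetical stand-in that only records what a crawler emits."""

    def __init__(self) -> None:
        self.emitted: List[StubEntity] = []

    def make(self, schema: str) -> StubEntity:
        return StubEntity(schema)

    def emit(
        self, entity: StubEntity, target: bool = False, external: bool = False
    ) -> None:
        # Assumption for this sketch: entities must have an ID to be stored.
        if entity.id is None:
            raise ValueError("Entity has no ID")
        self.emitted.append(entity)


def crawl(context: StubContext) -> None:
    # Made-up rows; a real crawler would fetch these from the data source.
    rows = [{"id": "p-1", "name": "Jane Doe"}, {"id": "p-2", "name": "John Roe"}]
    for row in rows:
        entity = context.make("Person")
        entity.id = row["id"]
        entity.add("name", row["name"])
        context.emit(entity)


context = StubContext()
crawl(context)
print([(e.id, e.properties) for e in context.emitted])
```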
export_resource(path, mime_type=None, title=None)
Register a file as a data resource exported by the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | The file path of the exported resource | required |
mime_type | Optional[str] | MIME type of the resource, will be guessed otherwise | None |
title | Optional[str] | A human-readable description. | None |
Returns:
Type | Description |
---|---|
DataResource | The generated resource object which has been saved. |
fetch_html(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)
Execute an HTTP request using the context's session and return an HTML DOM object based on the response. If a cache_days argument is provided, a cache will be used for the given number of days.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to be fetched. | required |
params | ParamsType | URL query parameters to be included in the URL. | None |
headers | _Headers | HTTP request headers to be included. | None |
auth | _Auth | HTTP basic authorization username and password to be included. | None |
cache_days | Optional[int] | Number of days to retain cached responses for. | None |
method | str | The HTTP method to use for the request. | 'GET' |
data | _Body | The data to be sent in the request body. | None |
Returns: An lxml-based DOM of the web page that has been returned.
fetch_json(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)
Execute an HTTP request using the context's session and return a JSON-decoded object based on the response. If a cache_days argument is provided, a cache will be used for the given number of days.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to be fetched. | required |
params | ParamsType | URL query parameters to be included in the URL. | None |
headers | _Headers | HTTP request headers to be included. | None |
auth | _Auth | HTTP basic authorization username and password to be included. | None |
cache_days | Optional[int] | Number of days to retain cached responses for. | None |
method | str | The HTTP method to use for the request. | 'GET' |
data | _Body | The data to be sent in the request body. | None |
Returns:
Type | Description |
---|---|
Any | The decoded response body as a JSON-decoded object. |
fetch_resource(name, url, auth=None, headers=None, method='GET', data=None)
Fetch a URL into a file located in the current run folder, if it does not exist.
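The "download only if the file does not exist" behaviour can be sketched with the standard library. The helper name `fetch_if_missing`, the `download` callable and the temporary folder are hypothetical stand-ins for the HTTP request and the run folder; this is not zavod's implementation.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_if_missing(folder: Path, name: str, download: Callable[[], bytes]) -> Path:
    """Write the downloaded bytes to folder/name unless the file already exists."""
    path = folder / name
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(download())
    return path


calls: List[int] = []


def fake_download() -> bytes:
    # Stand-in for an HTTP request; records how often it is invoked.
    calls.append(1)
    return b"<data/>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_if_missing(Path(tmp), "source.xml", fake_download)
    second = fetch_if_missing(Path(tmp), "source.xml", fake_download)
    content = second.read_bytes()

print(len(calls))
```

The second call finds the file from the first call and skips the download, so repeated runs inside the same folder do not hit the source again.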
fetch_response(url, headers=None, auth=None, method='GET', data=None)
Execute an HTTP request using the context's session.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to be fetched. | required |
headers | _Headers | HTTP request headers to be included. | None |
auth | _Auth | HTTP basic authorization username and password to be included. | None |
method | str | The HTTP method to use for the request. | 'GET' |
data | _Body | The data to be sent in the request body. | None |
Returns: A response object.
fetch_text(url, params=None, headers=None, auth=None, cache_days=None, method='GET', data=None)
Execute an HTTP request using the context's session and return the decoded response body. If a cache_days argument is provided, a cache will be used for the given number of days.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL to be fetched. | required |
params | ParamsType | URL query parameters to be included in the URL. | None |
headers | _Headers | HTTP request headers to be included. | None |
auth | _Auth | HTTP basic authorization username and password to be included. | None |
cache_days | Optional[int] | Number of days to retain cached responses for. | None |
method | str | The HTTP method to use for the request. | 'GET' |
data | _Body | The data to be sent in the request body. | None |
Returns:
Type | Description |
---|---|
Optional[str] | The decoded response body as a string. |
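The effect of cache_days on the fetch methods can be sketched as a time-bounded cache lookup in front of the request. Everything below is a simplified illustration: `DayCache`, `fetch_text_cached` and the injected `fetch` callable are hypothetical and do not reflect how zavod's Cache object stores responses.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional, Tuple


class DayCache:
    """Hypothetical in-memory cache keyed by URL; not zavod's Cache class."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[datetime, str]] = {}

    def get(self, url: str, max_age_days: int, now: datetime) -> Optional[str]:
        entry = self._store.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if now - stored_at > timedelta(days=max_age_days):
            return None  # entry is older than the retention window
        return body

    def set(self, url: str, body: str, now: datetime) -> None:
        self._store[url] = (now, body)


def fetch_text_cached(
    url: str,
    fetch: Callable[[str], str],
    cache: DayCache,
    cache_days: Optional[int],
    now: datetime,
) -> str:
    """Use the cache only when cache_days is given; otherwise always fetch."""
    if cache_days is not None:
        cached = cache.get(url, cache_days, now)
        if cached is not None:
            return cached
    body = fetch(url)
    if cache_days is not None:
        cache.set(url, body, now)
    return body


calls: List[str] = []


def fake_fetch(url: str) -> str:
    calls.append(url)  # stand-in for the real HTTP request
    return "hello"


cache = DayCache()
start = datetime(2024, 1, 1)
url = "https://example.com/a"
fetch_text_cached(url, fake_fetch, cache, 7, start)                       # miss
fetch_text_cached(url, fake_fetch, cache, 7, start + timedelta(days=3))   # hit
fetch_text_cached(url, fake_fetch, cache, 7, start + timedelta(days=10))  # expired
print(len(calls))
```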
get_resource_path(name)
Get the path to a file in the dataset data folder.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | PathLike | The name of the file, relative to the dataset data folder. | required |
Returns:
Type | Description |
---|---|
Path | The full path to the file. |
inspect(obj)
Display an object in a form suitable for inspection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
obj | Any | The object to be logged in pretty print. | required |
lookup_value(lookup, value, default=None)
Invoke a datapatch lookup defined in the dataset metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
lookup | str | The name of the lookup. The key under the dataset lookups property. | required |
value | Optional[str] | The data value to look up. | required |
default | Optional[str] | The default value to use if the lookup doesn't match the value. | None |
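The role of the default argument can be sketched with a dictionary-based lookup. This is only an approximation of the idea: the `LOOKUPS` mapping, the function name `lookup_value_sketch` and the case-insensitive matching are assumptions, and real datapatch lookups are configured in the dataset metadata rather than in code.

```python
from typing import Dict, Optional

# Hypothetical lookups, standing in for the lookups section of dataset metadata.
LOOKUPS: Dict[str, Dict[str, str]] = {
    "gender": {"m": "male", "f": "female"},
}


def lookup_value_sketch(
    lookup: str, value: Optional[str], default: Optional[str] = None
) -> Optional[str]:
    """Map a raw source value to a clean one, or return the default on no match."""
    if value is None:
        return default
    options = LOOKUPS.get(lookup, {})
    # Assumption: matching ignores surrounding whitespace and letter case.
    return options.get(value.strip().lower(), default)


print(lookup_value_sketch("gender", " M "))
print(lookup_value_sketch("gender", "unknown", default="other"))
```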
make(schema)
Make a new entity with some dataset context set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
schema | Union[str, Schema] | The entity's type name | required |
Returns:
Type | Description |
---|---|
Entity | A newly created entity object of the given type, with no ID. |
make_id(*parts, prefix=None, hash_prefix=None)
Make a hash-based entity ID from a list of strings, prefixed with the dataset prefix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prefix | Optional[str] | Use this prefix in the slug, but not the hash. | None |
hash_prefix | Optional[str] | Use this prefix in the hash, but not the slug. | None |
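A hash-based ID of this kind can be sketched with hashlib. The sketch is an assumption-laden illustration, not zavod's algorithm: the SHA-1 digest, the handling of empty parts, the explicit `dataset_prefix` argument and the exact role of the two prefixes are all guesses guided only by the parameter descriptions above.

```python
import hashlib
from typing import Optional


def make_id_sketch(
    dataset_prefix: str,
    *parts: Optional[str],
    prefix: Optional[str] = None,
    hash_prefix: Optional[str] = None,
) -> Optional[str]:
    """Build '<prefix>-<hex digest>' from the non-empty parts, or None if none."""
    digest = hashlib.sha1()
    # hash_prefix changes the hash but never appears in the visible prefix.
    if hash_prefix:
        digest.update(hash_prefix.encode("utf-8"))
    used = False
    for part in parts:
        if part is None or not part.strip():
            continue  # assumption: empty parts do not contribute to the ID
        digest.update(part.strip().encode("utf-8"))
        used = True
    if not used:
        return None
    # prefix changes the visible part of the ID but not the hash.
    return f"{prefix or dataset_prefix}-{digest.hexdigest()}"


first = make_id_sketch("ex", "Jane Doe", "1970-01-01")
again = make_id_sketch("ex", "Jane Doe", "1970-01-01")
other = make_id_sketch("ex", "John Roe", "1970-01-01")
print(first == again, first == other)
```

The point of deriving the ID from stable source values is that the same record yields the same ID on every run, so entities keep their identity across crawls.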
make_slug(*parts, strict=True, prefix=None)
Make a slug-based entity ID from a list of strings, using the dataset prefix.
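A slug-based ID can be sketched as normalising each part and joining the results behind a prefix. The normalisation rules and the explicit `dataset_prefix` argument below are assumptions, and the `strict` flag is left out; this is not zavod's slug function.

```python
import re
from typing import List, Optional


def make_slug_sketch(dataset_prefix: str, *parts: Optional[str]) -> Optional[str]:
    """Join lowercased, hyphen-separated parts behind the dataset prefix."""
    cleaned: List[str] = []
    for part in parts:
        if part is None:
            continue
        # Collapse every run of non-alphanumeric characters into one hyphen.
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            cleaned.append(slug)
    if not cleaned:
        return None
    return "-".join([dataset_prefix, *cleaned])


slug = make_slug_sketch("ex", "John Doe", "1970")
print(slug)
```

Unlike a hash-based ID, a slug stays human-readable, which helps when the source already provides a stable identifier such as a registration number.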
parse_resource_xml(name)
Parse a file in the resource folder into an XML tree.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | PathLike | The resource name or relative file path. | required |
Returns:
Type | Description |
---|---|
_ElementTree | An lxml element tree of the parsed XML. |
zavod.entity.Entity
Bases: CompositeEntity
Entity for sanctions list entries and adjacent objects.
Add utility methods to the EntityProxy for extracting data from sanctions lists and for auditing parsing errors to structured logging.
add_cast(schema, prop, values, cleaned=False, fuzzy=False, format=None, lang=None, original_value=None)
Set a property on an entity. If the entity is of a schema that doesn't have the given property, also modify the schema (e.g. if something has a birthDate, assume it's a Person, not a LegalEntity).
add_schema(schema)
Try to apply the given schema to the current entity, making it more specific (e.g. turning a LegalEntity into a Company). This raises an exception if the current and new type are incompatible.
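The specialisation rule can be sketched on a toy schema hierarchy. The `PARENTS` table and the function names are hypothetical, the hierarchy is reduced to a single parent per schema, and the real model is defined by the underlying data model library rather than by this code.

```python
from typing import Dict, List, Optional

# Toy hierarchy for illustration only: child schema -> parent schema.
PARENTS: Dict[str, Optional[str]] = {
    "Thing": None,
    "LegalEntity": "Thing",
    "Company": "LegalEntity",
    "Person": "LegalEntity",
}


def ancestors(schema: str) -> List[str]:
    """Return the schema itself followed by all of its parents."""
    chain = [schema]
    parent = PARENTS[schema]
    while parent is not None:
        chain.append(parent)
        parent = PARENTS[parent]
    return chain


def more_specific(current: str, new: str) -> str:
    """Pick the more specific of two schemata, or fail if they are unrelated."""
    if current in ancestors(new):
        return new  # new is a descendant of current: specialise
    if new in ancestors(current):
        return current  # current is already at least as specific
    raise ValueError(f"Incompatible schemata: {current} and {new}")


print(more_specific("LegalEntity", "Company"))
```

In this toy model a Person cannot become a Company, because neither schema is an ancestor of the other.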
unsafe_add(prop, value, cleaned=False, fuzzy=False, format=None, quiet=False, schema=None, dataset=None, seen=None, lang=None, original_value=None)
Add a statement to the entity, possibly the value.