Great Expectations
This guide helps you set up and configure DataHubValidationAction in Great Expectations to send assertions (expectations) and their results to DataHub using DataHub's Python REST emitter.
Capabilities
DataHubValidationAction pushes assertion metadata to DataHub. This includes:
- Assertion Details: Details of assertions (i.e. expectations) set on a Dataset (Table).
- Assertion Results: Evaluation results for an assertion tracked over time.
This integration supports v3 API datasources using SqlAlchemyExecutionEngine.
Limitations
This integration does not support
- v2 Datasources such as SqlAlchemyDataset
- v3 Datasources using an execution engine other than SqlAlchemyExecutionEngine (Spark, Pandas)
- Cross-dataset expectations (those involving > 1 table)
Setting up
- Install the required dependency in your Great Expectations environment.
pip install 'acryl-datahub-gx-plugin'
- To add DataHubValidationAction to a Great Expectations Checkpoint, add the following configuration to the action_list of your Great Expectations Checkpoint. For more details on setting action_list, see Checkpoints and Actions.
    action_list:
      - name: datahub_action
        action:
          module_name: datahub_gx_plugin.action
          class_name: DataHubValidationAction
          server_url: http://localhost:8080 # DataHub server URL
Configuration options:
- server_url (required): URL of the DataHub GMS endpoint.
- env (optional, defaults to "PROD"): Environment to use in namespace when constructing dataset URNs.
- exclude_dbname (optional): Exclude dbname / catalog when constructing dataset URNs. (Highly applicable to Trino / Presto, where we want to omit the catalog, e.g. hive.)
- platform_alias (optional): Platform alias when constructing dataset URNs, e.g. the main data platform is presto-on-hive but trino is used to run the test.
- platform_instance_map (optional): Platform instance mapping to use when constructing dataset URNs. Maps the GX 'data source' name to a platform instance on DataHub, e.g. platform_instance_map: { "datasource_name": "warehouse" }
- graceful_exceptions (defaults to true): If set to true, most runtime errors in the lineage backend will be suppressed and will not cause the overall checkpoint to fail. Note that configuration issues will still throw exceptions.
- token (optional): Bearer token used for authentication.
- timeout_sec (optional): Per-HTTP request timeout.
- retry_status_codes (optional): Retry the HTTP request on these status codes as well.
- retry_max_times (optional): Maximum number of times to retry if the HTTP request fails. The delay between retries is increased exponentially.
- extra_headers (optional): Extra headers which will be added to the DataHub request.
- parse_table_names_from_sql (defaults to false): The integration can use an SQL parser to try to parse the datasets being asserted. This parsing is disabled by default, but can be enabled by setting parse_table_names_from_sql: True. The parser is based on the sqllineage package.
- convert_urns_to_lowercase (optional): Whether to convert dataset URNs to lowercase.
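Several of the options above (env, exclude_dbname, platform_alias, platform_instance_map, convert_urns_to_lowercase) only affect how the dataset URN is built. The sketch below is a rough illustration of how they might combine, not the plugin's actual code: the function sketch_dataset_urn and its parameters are hypothetical, the dotted dbname.schema.table naming is an assumption, and applying lowercasing before the platform instance prefix is also an assumption. The URN layout itself follows DataHub's standard dataset URN format.

```python
from typing import Optional


def sketch_dataset_urn(
    platform: str,
    table: str,
    schema: Optional[str] = None,
    dbname: Optional[str] = None,
    env: str = "PROD",
    exclude_dbname: bool = False,
    platform_alias: Optional[str] = None,
    platform_instance: Optional[str] = None,
    convert_urns_to_lowercase: bool = False,
) -> str:
    """Hypothetical helper: approximates how the options shape a dataset URN."""
    # platform_alias, when given, replaces the platform used to run the test.
    effective_platform = platform_alias or platform

    # Build the dotted dataset name, optionally dropping the dbname / catalog.
    parts = [] if exclude_dbname else [dbname]
    parts += [schema, table]
    name = ".".join(p for p in parts if p)

    if convert_urns_to_lowercase:
        name = name.lower()

    # Assumption: the platform instance (looked up from platform_instance_map
    # by GX data source name) is prefixed to the dataset name.
    if platform_instance:
        name = f"{platform_instance}.{name}"

    return f"urn:li:dataset:(urn:li:dataPlatform:{effective_platform},{name},{env})"


if __name__ == "__main__":
    # Tests run through trino, but the dataset lives under presto-on-hive.
    print(
        sketch_dataset_urn(
            "trino",
            "Orders",
            schema="sales",
            dbname="hive",
            exclude_dbname=True,
            platform_alias="presto-on-hive",
            convert_urns_to_lowercase=True,
        )
    )
```

Comparing the URN produced by a sketch like this with the URN of the dataset already ingested into DataHub is a quick way to see which option needs changing when assertions do not show up on the expected dataset.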
Debugging
Set the environment variable DATAHUB_DEBUG (default false) to true to enable debug logging for DataHubValidationAction.
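When the checkpoint is launched from a Python script rather than a shell, the variable can also be set in-process. This is a minimal sketch that assumes the action reads DATAHUB_DEBUG after this point in the program; exporting the variable in the shell before starting Python is the safer choice if that assumption does not hold.

```python
import os

# Set the flag before the checkpoint (and therefore the action) is loaded
# and run, so that the action sees it. Assumption: the value is read from
# the process environment at or after this point.
os.environ["DATAHUB_DEBUG"] = "true"

# ... load the Great Expectations context and run the checkpoint as usual ...
```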
Learn more
To see the Great Expectations integration in action, check out this demo from the Feb 2022 townhall.