Great expectations is a Python library for data validation. It tries to profile a sample dataset for what are the attributes, and the statistics of the attributes (e.g., if numeric, what are min, max, mean, variance, etc.). If you read the documentation, it would suggest you to create a data repository with it, then load the data and let it do the profiling. When you get a new version of the same data (e.g., data of a different date), you can reuse the profile to validate for any outliers or unexpected attributes.
But this design is a violation of the Unix philosophy of single responsibility: Why it has to assume how do I store my data? In fact, you do not need to use great expectation as your data manager. But the library’s documentation didn’t make it clear.
Work flow
Let’s see how to use great expectation the minimal way. First let’s check the sample data (Taiwan Credit):
import pandas as pd
# Load data as pandas DataFrame
df_credit = pd.read_csv("data/TaiwanCredit.csv", low_memory=False)
with pd.option_context('display.max_rows', 200, 'display.max_columns', 200, 'display.width', 0):
print(pd.DataFrame({"dtype":df_credit.dtypes})
.join(df_credit.describe().T)
)
print()
print(df_credit.sample(3).T)
print()
df_credit.info()
its output will be:
dtype count mean std min 25% 50% 75% max
LIMIT_BAL float64 30000.0 167484.322667 129747.661567 10000.000000 50000.000000 140000.000000 240000.000000 1000000.000000
SEX float64 30000.0 1.603733 0.489129 1.000000 1.000000 2.000000 2.000000 2.000000
EDUCATION float64 30000.0 1.779867 0.728486 0.000000 1.000000 2.000000 2.000000 3.000000
MARRIAGE float64 30000.0 1.519567 0.524175 0.000000 1.000000 2.000000 2.000000 2.000000
AGE float64 30000.0 35.485500 9.217904 21.000000 28.000000 34.000000 41.000000 79.000000
PAY_1 float64 30000.0 0.167233 0.931323 -1.000000 0.000000 0.000000 0.000000 8.000000
PAY_2 float64 30000.0 0.118367 0.965844 -1.000000 0.000000 0.000000 0.000000 8.000000
PAY_3 float64 30000.0 0.106133 0.950876 -1.000000 0.000000 0.000000 0.000000 8.000000
PAY_4 float64 30000.0 0.069200 0.911613 -1.000000 0.000000 0.000000 0.000000 8.000000
PAY_5 float64 30000.0 0.036867 0.864561 -1.000000 0.000000 0.000000 0.000000 8.000000
PAY_6 float64 30000.0 0.035233 0.867919 -1.000000 0.000000 0.000000 0.000000 8.000000
BILL_AMT1 float64 30000.0 3.885958 1.547037 -5.219010 3.551419 4.349909 4.826671 5.984308
BILL_AMT2 float64 30000.0 3.795420 1.654534 -4.843718 3.475053 4.326357 4.806229 5.992965
BILL_AMT3 float64 30000.0 3.739384 1.696840 -5.196632 3.426064 4.302969 4.779349 6.221177
BILL_AMT4 float64 30000.0 3.670062 1.733904 -5.230452 3.366936 4.279964 4.736452 5.950164
BILL_AMT5 float64 30000.0 3.599323 1.762540 -4.910277 3.246499 4.257811 4.700630 5.967160
BILL_AMT6 float64 30000.0 3.507501 1.833062 -5.530973 3.099335 4.232285 4.691958 5.983024
PAY_AMT1 float64 30000.0 2.879374 1.411589 0.000000 3.000434 3.322426 3.699578 5.941289
PAY_AMT2 float64 30000.0 2.850124 1.423945 0.000000 2.921166 3.303196 3.699057 6.226409
PAY_AMT3 float64 30000.0 2.728754 1.454766 0.000000 2.592177 3.255514 3.653791 5.952328
PAY_AMT4 float64 30000.0 2.639519 1.475124 0.000000 2.472756 3.176381 3.603604 5.793092
PAY_AMT5 float64 30000.0 2.619748 1.495708 0.000000 2.403976 3.176381 3.605574 5.629950
PAY_AMT6 float64 30000.0 2.576596 1.532231 0.000000 2.074631 3.176381 3.602169 5.723182
FlagDefault float64 30000.0 0.221200 0.415062 0.000000 0.000000 0.000000 0.000000 1.000000
28247 15710 8917
LIMIT_BAL 30000.000000 270000.000000 30000.000000
SEX 2.000000 1.000000 2.000000
EDUCATION 2.000000 1.000000 2.000000
MARRIAGE 1.000000 1.000000 2.000000
AGE 32.000000 45.000000 27.000000
PAY_1 0.000000 1.000000 0.000000
PAY_2 0.000000 -1.000000 0.000000
PAY_3 0.000000 -1.000000 0.000000
PAY_4 0.000000 -1.000000 0.000000
PAY_5 0.000000 -1.000000 0.000000
PAY_6 0.000000 -1.000000 0.000000
BILL_AMT1 3.538448 0.000000 4.419377
BILL_AMT2 -0.477121 2.445604 4.439917
BILL_AMT3 -0.477121 2.784617 4.458970
BILL_AMT4 -0.477121 2.694605 4.443185
BILL_AMT5 2.850646 2.499687 4.451832
BILL_AMT6 2.973590 0.000000 4.473136
PAY_AMT1 0.000000 2.445604 3.301247
PAY_AMT2 0.000000 2.784617 3.302331
PAY_AMT3 0.000000 2.694605 3.000434
PAY_AMT4 2.851870 2.499687 3.000434
PAY_AMT5 3.018284 0.000000 3.352375
PAY_AMT6 0.000000 2.260071 0.000000
FlagDefault 0.000000 0.000000 1.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LIMIT_BAL 30000 non-null float64
1 SEX 30000 non-null float64
2 EDUCATION 30000 non-null float64
3 MARRIAGE 30000 non-null float64
4 AGE 30000 non-null float64
5 PAY_1 30000 non-null float64
6 PAY_2 30000 non-null float64
7 PAY_3 30000 non-null float64
8 PAY_4 30000 non-null float64
9 PAY_5 30000 non-null float64
10 PAY_6 30000 non-null float64
11 BILL_AMT1 30000 non-null float64
12 BILL_AMT2 30000 non-null float64
13 BILL_AMT3 30000 non-null float64
14 BILL_AMT4 30000 non-null float64
15 BILL_AMT5 30000 non-null float64
16 BILL_AMT6 30000 non-null float64
17 PAY_AMT1 30000 non-null float64
18 PAY_AMT2 30000 non-null float64
19 PAY_AMT3 30000 non-null float64
20 PAY_AMT4 30000 non-null float64
21 PAY_AMT5 30000 non-null float64
22 PAY_AMT6 30000 non-null float64
23 FlagDefault 30000 non-null float64
dtypes: float64(24)
memory usage: 5.5 MB
In above, you can see the basic statistics of the data and the data are all numeric. A few samples are displayed as well.
For the great expectation part, it works as follows: There is a great expectation object that holds the data. It remembers at least the data schema. In below, we create one using a pandas DataFrame. Hence it holds all data with it as well. This GE object behaves like a pandas DataFrame, indeed, and with some additional GE-specific methods are added:
#!/usr/bin/env python
# coding: utf-8
import json
import pandas as pd
import great_expectations as ge
# Load data as pandas DataFrame
df_credit = pd.read_csv("data/TaiwanCredit.csv", low_memory=False)
# Load into Great Expectation
df_ge = ge.from_pandas(df_credit)
# These are the list of functions added by great expectations
print([x for x in dir(df_ge) if not x.startswith("_") and x not in dir(df_credit)])
They are:
['add_citation',
'attempt_allowing_relative_error',
'batch_id',
'batch_kwargs',
'batch_markers',
'batch_parameters',
'caching',
'column_aggregate_expectation',
'column_map_expectation',
'column_pair_map_expectation',
'default_expectation_args',
'discard_failing_expectations',
'discard_subset_failing_expectations',
'expect_column_bootstrapped_ks_test_p_value_to_be_greater_than',
'expect_column_chisquare_test_p_value_to_be_greater_than',
'expect_column_distinct_values_to_be_in_set',
'expect_column_distinct_values_to_contain_set',
'expect_column_distinct_values_to_equal_set',
'expect_column_kl_divergence_to_be_less_than',
'expect_column_max_to_be_between',
'expect_column_mean_to_be_between',
'expect_column_median_to_be_between',
'expect_column_min_to_be_between',
'expect_column_most_common_value_to_be_in_set',
'expect_column_pair_cramers_phi_value_to_be_less_than',
'expect_column_pair_values_A_to_be_greater_than_B',
'expect_column_pair_values_to_be_equal',
'expect_column_pair_values_to_be_in_set',
'expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than',
'expect_column_proportion_of_unique_values_to_be_between',
'expect_column_quantile_values_to_be_between',
'expect_column_stdev_to_be_between',
'expect_column_sum_to_be_between',
'expect_column_to_exist',
'expect_column_unique_value_count_to_be_between',
'expect_column_value_lengths_to_be_between',
'expect_column_value_lengths_to_equal',
'expect_column_values_to_be_between',
'expect_column_values_to_be_dateutil_parseable',
'expect_column_values_to_be_decreasing',
'expect_column_values_to_be_in_set',
'expect_column_values_to_be_in_type_list',
'expect_column_values_to_be_increasing',
'expect_column_values_to_be_json_parseable',
'expect_column_values_to_be_null',
'expect_column_values_to_be_of_type',
'expect_column_values_to_be_unique',
'expect_column_values_to_match_json_schema',
'expect_column_values_to_match_regex',
'expect_column_values_to_match_regex_list',
'expect_column_values_to_match_strftime_format',
'expect_column_values_to_not_be_in_set',
'expect_column_values_to_not_be_null',
'expect_column_values_to_not_match_regex',
'expect_column_values_to_not_match_regex_list',
'expect_compound_columns_to_be_unique',
'expect_multicolumn_sum_to_equal',
'expect_multicolumn_values_to_be_unique',
'expect_select_column_values_to_be_unique_within_record',
'expect_table_column_count_to_be_between',
'expect_table_column_count_to_equal',
'expect_table_columns_to_match_ordered_list',
'expect_table_columns_to_match_set',
'expect_table_row_count_to_be_between',
'expect_table_row_count_to_equal',
'expectation',
'expectation_suite_name',
'from_dataset',
'get_binned_values',
'get_column_count',
'get_column_count_in_range',
'get_column_hist',
'get_column_max',
'get_column_mean',
'get_column_median',
'get_column_min',
'get_column_modes',
'get_column_nonnull_count',
'get_column_partition',
'get_column_quantiles',
'get_column_stdev',
'get_column_sum',
'get_column_unique_count',
'get_column_value_counts',
'get_config_value',
'get_crosstab',
'get_default_expectation_arguments',
'get_evaluation_parameter',
'get_expectation_suite',
'get_row_count',
'get_table_columns',
'hashable_getters',
'list_available_expectation_types',
'multicolumn_map_expectation',
'profile',
'save_expectation_suite',
'set_config_value',
'set_default_expectation_argument',
'set_evaluation_parameter',
'test_column_aggregate_expectation_function',
'test_column_map_expectation_function',
'test_expectation_function',
'validate']
So a lot of methods prefixed with “expect”.
GE works by creating rules in the object. Creating a rule doesn’t perform the validation immediately, but the GE object will remember the rule. Examples:
...
# unique values in this column
df_ge.expect_column_values_to_be_unique(column="EDUCATION")
# a column should be within a set of values
df_ge.expect_column_values_to_be_in_set(
column = "FlagDefault",
value_set = [1.0, 0.0]
)
Instead of creating rules in code one by one, we can also ask GE to profile the sample data and generate the rules by itself. The output will be extremely verbose and a lot of junk rules will be created. But it will be a good start:
from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler
...
# Generate rules automatically by learning the dataset, i.e., generate a "suite"
profiler = UserConfigurableProfiler(df_ge)
# may take long time to profile
# also see: https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/profile/index.html#great_expectations.profile.BasicDatasetProfiler._profile
# suite, validation_result = BasicSuiteBuilderProfiler().profile(
# dataset, {
# “columns”: [“id”, “username”, “address”],
# “excluded_expectations”: [“expect_column_mean_to_be_between”, “expect_column_median_to_be_between”, “expect_column_quantile_values_to_be_between”,],
# “included_expectations”: [“expect_column_to_not_be_null”, “expect_column_values_to_be_in_set”,],
# }
# )
suite = profiler.build_suite()
# save all the learned rules in JSON format
with open("data/TaiwanCredit.suite.json", "w") as fp:
fp.write(json.dumps(suite.to_json_dict(), indent=4, sort_keys=True))
This will save all the rules (206 of them in this case) into a JSON file. A rule is in the format like the following:
{
"expectation_type": "expect_column_proportion_of_unique_values_to_be_between",
"kwargs": {
"column": "BILL_AMT3",
"min_value": 0.7342,
"max_value": 0.7342
},
"meta": {}
}
The object suite
is a ExpectationSuite
object. It is detached from the
dataframe. You can see it has the following methods:
['add_citation',
'add_expectation',
'add_expectation_configurations',
'append_expectation',
'data_asset_type',
'evaluation_parameters',
'exclude_field_names',
'expectation_suite_name',
'expectations',
'find_expectation_indexes',
'find_expectations',
'ge_cloud_id',
'get',
'get_citations',
'get_column_expectations',
'get_column_pair_expectations',
'get_evaluation_parameter_dependencies',
'get_grouped_and_ordered_expectations_by_column',
'get_grouped_and_ordered_expectations_by_domain_type',
'get_grouped_and_ordered_expectations_by_expectation_type',
'get_multicolumn_expectations',
'get_table_expectations',
'include_field_names',
'isEquivalentTo',
'items',
'keys',
'meta',
'name',
'patch_expectation',
'property_names',
'remove_all_expectations_of_type',
'remove_expectation',
'render',
'replace_expectation',
'send_usage_event',
'show_expectations_by_domain_type',
'show_expectations_by_expectation_type',
'to_dict',
'to_json_dict',
'to_raw_dict',
'values']
You can get these with the following print statements (together with the rules count):
print(suite.expectations[123])
print([x for x in dir(suite) if not x.startswith("_")])
print(len(suite.expectations))
print(len(suite.get_column_expectations()))
The idea of writing the rules into a JSON file is that you can edit it. Assume you edited the rules, trimmed down some, and save it to another JSON file of the same format, you can at anytime load the rules into a schema and create a new suite. The suite has no data associated with it. But you can use it to validate a data. This time, most likely not the same as you profiled:
# Load expectation rules into a suite
with open("data/TaiwanCredit.suite2.json", "r") as fp:
schema = json.loads(fp.read())
new_suite = ge.core.expectation_suite.ExpectationSuite(
"Edited Rules",
expectations=schema["expectations"])
# Validate data using the new rules
df_ge.validate(new_suite)