validation module
Module containing the Validation class, and the VCheck class and its subclasses.
-
class validation.Err(message)
Bases: validation.VCheck
VCheck subclass representing a serious problem in data validation that prevents validation.
Examples
>>> Err("This is a data validation error").expand()
Tier                                 Error
Bullet                                 [!]
Level                                    1
Title                               ERRORS
Message    This is a data validation error
dtype: object
-
bullet()
Abstract property; must be overridden. Should be a str representing a bullet point.
-
-
class validation.Passing(message)
Bases: validation.VCheck
VCheck subclass representing a passed check in data validation, where there is no problem.
Examples
>>> Passing("This is a passing data validation check").expand()
Tier                                       Passing
Bullet                                         [X]
Level                                            4
Title                                CHECKS PASSED
Message    This is a passing data validation check
dtype: object
-
bullet()
Abstract property; must be overridden. Should be a str representing a bullet point.
-
-
class validation.Suggest(message)
Bases: validation.VCheck
VCheck subclass representing a minor problem with data that does not prevent data validation.
Examples
>>> Suggest("This is a data validation suggestion").expand()
Tier                                 Suggestion
Bullet                                      [i]
Level                                         3
Title                               SUGGESTIONS
Message    This is a data validation suggestion
dtype: object
-
bullet()
Abstract property; must be overridden. Should be a str representing a bullet point.
-
-
class validation.VCheck(message)
Bases: object
Abstract class for a single validation check.
-
bullet
Abstract property; must be overridden. Should be a str representing a bullet point.
-
expand()
Expands VCheck information as a Pandas Series.
Parameters: None
Returns: the VCheck attributes as a Pandas Series
Return type: Pandas Series
Examples
>>> Err("Error Message").expand()
Tier               Error
Bullet               [!]
Level                  1
Title             ERRORS
Message    Error Message
dtype: object
-
level
Abstract property; must be overridden. Should be an int representing the VCheck tier.
-
tier
Abstract property; must be overridden. Should be a str representing the name of the VCheck tier.
-
title
Abstract property; must be overridden. Should be a str representing the title of the VCheck type.
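The abstract-property contract described for VCheck can be sketched with the standard library's abc module. This is a minimal illustration under stated assumptions, not the module's source: the names SketchCheck and Note are hypothetical, and expand returns a plain dict here where the documented method returns a Pandas Series.

```python
from abc import ABC, abstractmethod


class SketchCheck(ABC):
    """Hypothetical stand-in for VCheck: a single check carrying a message."""

    def __init__(self, message):
        self.message = message

    @property
    @abstractmethod
    def tier(self):
        """Name of the tier, e.g. 'Error'."""

    @property
    @abstractmethod
    def bullet(self):
        """Bullet point string, e.g. '[!]'."""

    @property
    @abstractmethod
    def level(self):
        """Integer rank of the tier."""

    @property
    @abstractmethod
    def title(self):
        """Section title, e.g. 'ERRORS'."""

    def expand(self):
        # Assumption: a dict keeps the sketch dependency-free; the documented
        # expand() returns a Pandas Series with these same five labels.
        return {
            "Tier": self.tier,
            "Bullet": self.bullet,
            "Level": self.level,
            "Title": self.title,
            "Message": self.message,
        }


class Note(SketchCheck):
    """Hypothetical subclass; plain class attributes override each abstract property."""

    tier = "Note"
    bullet = "[n]"
    level = 5
    title = "NOTES"


note = Note("hello")
expanded = note.expand()
```

Instantiating SketchCheck directly raises TypeError, because its abstract properties are not overridden.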
-
-
class validation.Validation(name='')
Bases: object
Validation object represents an organized dataframe of validation checks.
Variables: vchecks (Pandas DataFrame) – a dataframe containing the expanded form of the VCheck instances that have been added.
-
affected_by_absence(missing_grped)
Adds a validation check as Warn describing the items in missing_grped, which detail the impact that missing columns have on newly created mappings.
Parameters: missing_grped (Pandas Series) – series where the index is the name of the missing source column, and the values are a list of affected values.
Returns: None
-
all_valid(given, valid, definition)
Adds a validation check where all values in given must be in valid to pass. fail_check is Err (fails validation).
Parameters: - given (Pandas Series) – the items representing input given
- valid (Pandas Series) – list of all possible valid items accepted in given
- definition (str) – string describing what makes an item in given be in valid
Returns: None
Examples
>>> v = Validation()
>>> v.all_valid(pd.Series(["a","b"], name="example input"), pd.Series(["a","b","c"], name="valid value(s)"), "pre-defined")
>>> v.all_valid(pd.Series(["a","b","c"], name="example input"), pd.Series(["a","d"], name="valid value(s)"), "'a' or 'd'")
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] All values in example input are valid.
<BLANKLINE>
ERRORS
[!] 2 values in example input were invalid ('b', and 'c'). These must be 'a' or 'd' to be valid.
-
check_na(df)
Adds a validation check flagging the rows in every column of df that are None.
Parameters: df (Pandas DataFrame) – a Pandas DataFrame with columns that should have no NA values
Returns: None
Examples
>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e",None]})
>>> v.check_na(test_df)
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] No NA's in column A detected.
<BLANKLINE>
WARNINGS
[?] 1 NA's in column B detected in row(s) #2.
-
fix_alnum(df)
Adds a validation check flagging the rows in every column of df that contain non-alphanumeric characters. A regex removes all characters that are not alphanumeric, but leaves periods that are part of a number.
Parameters: df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only alphanumeric characters
Returns: df where non-alphanumeric characters are removed
Return type: Pandas DataFrame
Examples
>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","3.0","c"], "B":["??.test","test<>!",";test_data"]})
>>> v.fix_alnum(test_df)
     A          B
0    a       test
1  3.0       test
2    c  test_data
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] No non-alphanumeric value(s) in column A detected.
<BLANKLINE>
WARNINGS
[?] 3 non-alphanumeric value(s) in column B detected in row(s) #0, #1, and #2. This text should be alphanumeric. Non-alphanumeric characters will be removed.
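The cleaning rule described for fix_alnum can be sketched with the standard re module. The pattern below is an assumption chosen to reproduce the documented example (it keeps underscores, and keeps a period only when it sits between two digits); it is not the module's own regex, and strip_non_alnum is a hypothetical helper name.

```python
import re

# Assumed pattern, not the module's regex:
#   (?<!\d)\.   a period not preceded by a digit
#   \.(?!\d)    a period not followed by a digit
#   [^\w.]      any character that is not a letter, digit, underscore, or period
NON_ALNUM = re.compile(r"(?<!\d)\.|\.(?!\d)|[^\w.]")


def strip_non_alnum(text):
    """Return text with the characters matched by NON_ALNUM removed."""
    return NON_ALNUM.sub("", text)


samples = ["a", "3.0", "??.test", "test<>!", ";test_data"]
cleaned = [strip_non_alnum(s) for s in samples]
# cleaned == ["a", "3.0", "test", "test", "test_data"]
```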
-
fix_lowcase(df)
Adds a validation check flagging the rows in every column of df that contain lowercase characters.
Parameters: df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only uppercase characters
Returns: df where all characters are uppercase
Return type: Pandas DataFrame
Examples
>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e","F"]})
>>> v.fix_lowcase(test_df)
   A  B
0  A  D
1  B  E
2  C  F
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
WARNINGS
[?] 2 lower case value(s) in column A detected in row(s) #0, and #2. Convention to have this text be uppercase. Lower case text will be made uppercase.
[?] 1 lower case value(s) in column B detected in row(s) #1. Convention to have this text be uppercase. Lower case text will be made uppercase.
-
fix_upcase(df)
Adds a validation check flagging the rows in every column of df that contain uppercase characters.
Parameters: df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only lowercase characters
Returns: df where all characters are lowercase
Return type: Pandas DataFrame
Examples
>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e","F"]})
>>> v.fix_upcase(test_df)
   A  B
0  a  d
1  b  e
2  c  f
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
WARNINGS
[?] 1 upper case value(s) in column A detected in row(s) #1. Convention is to have this text be lowercase. Upper case text will be made lowercase.
[?] 2 upper case value(s) in column B detected in row(s) #0, and #2. Convention is to have this text be lowercase. Upper case text will be made lowercase.
-
fix_whitespace(df)
Adds a validation check flagging the rows in every column of df that contain whitespace.
Parameters: df (Pandas DataFrame) – a Pandas DataFrame with columns that should have no whitespace
Returns: df where leading/trailing spaces are removed and remaining whitespace is replaced with an underscore
Return type: Pandas DataFrame
Examples
>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a"," B ","Test Data"], "B":["D"," e","F "]})
>>> v.fix_whitespace(test_df)
           A  B
0          a  D
1          B  e
2  Test_Data  F
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] No whitespace in column B detected.
<BLANKLINE>
WARNINGS
[?] 1 leading/trailing spaces column A detected in row(s) #1. Leading/trailing spaces will be removed.
[?] 2 leading/trailing spaces column B detected in row(s) #1, and #2. Leading/trailing spaces will be removed.
[?] 1 whitespace in column A detected in row(s) #2. Whitespace will be converted to '_'
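The two-step fix shown in the fix_whitespace example (trim the ends, then convert inner whitespace) can be sketched on plain strings. This is an illustration only: normalize_whitespace is a hypothetical name, and collapsing a run of several whitespace characters into a single underscore is an assumption the documentation does not confirm.

```python
import re


def normalize_whitespace(text):
    """Trim leading/trailing whitespace, then replace inner whitespace with '_'."""
    trimmed = text.strip()
    # Assumption: each run of whitespace becomes one underscore.
    return re.sub(r"\s+", "_", trimmed)


column_a = [normalize_whitespace(s) for s in ["a", " B ", "Test Data"]]
column_b = [normalize_whitespace(s) for s in ["D", " e", "F "]]
# column_a == ["a", "B", "Test_Data"]; column_b == ["D", "e", "F"]
```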
-
flag_elements(flag_where, flag_elements, criteria)
Adds a validation check seeing if any values in flag_where are true, and then reports on the corresponding items in flag_elements.
Parameters: - flag_where (Pandas Series) – a boolean Pandas Series where True represents a failed check
- flag_elements (Pandas Series) – a Pandas Series listing the elements that are affected by True values in flag_where
- criteria (str) – a brief description of what elements are being flagged and reported on
Returns: None
Examples
>>> v = Validation("element test")
>>> v.flag_elements(pd.Series([False, False]), pd.Series(["A", "B"]), "red flag(s)")
>>> v.flag_elements(pd.Series([False, True]), pd.Series(["A", "B"]), "blue flag(s)")
>>> v.report(verbose=4)
Validating element test . . .
<BLANKLINE>
CHECKS PASSED
[X] No red flag(s) in element test detected.
<BLANKLINE>
WARNINGS
[?] 1 blue flag(s) in element test detected. These ('B') will be treated as NA.
-
flag_rows(flag_where, flag_criteria, flag_action='', flag_tier=<class 'validation.Warn'>)
Adds a validation check seeing if any values in flag_where are true, where fail_check is of type flag_tier. Note that rows are reported counting from 0.
Parameters: - flag_where (Pandas Series) – a boolean Pandas Series where True represents a failed check.
- flag_criteria (str) – a noun clause describing the criteria for an item to be flagged in flag_where
- flag_action (str) – string describing the action to be taken if an item is flagged. Defaults to “”.
- flag_tier (VCheck) – the seriousness of the failed check; should be Suggest, Warn, or Err. Defaults to Warn.
Returns: None
Examples
>>> v = Validation()
>>> v.flag_rows(pd.Series([False, False]), flag_criteria="true values")
>>> v.flag_rows(pd.Series([False, True]), flag_criteria="true values")
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] No true values detected.
<BLANKLINE>
WARNINGS
[?] 1 true values detected in row(s) #1.
-
is_valid()
Checks to see if the instance is valid.
Parameters: None
Returns: True if the instance is valid (has no errors in vchecks); False if the instance has errors or vchecks is empty.
Return type: bool
Examples
>>> Validation().is_valid()
False
>>> v = Validation()
>>> v.must_contain(pd.Series(["A", "B"]), pd.Series(["B"]))
>>> v.is_valid()
True
>>> v.must_contain(pd.Series(["A", "B"]), pd.Series(["C"]))
>>> v.is_valid()
False
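The rule stated for is_valid can be sketched over a plain list of check levels. This is a hypothetical reduction, not the module's code: is_valid_sketch is an invented name, the list stands in for the Level column of vchecks, and level 1 is taken to mean Err, as in the expand examples.

```python
ERR_LEVEL = 1  # Level reported by Err(...).expand() in the examples


def is_valid_sketch(levels):
    """Return True only if at least one check was recorded and none is an Err."""
    if not levels:
        # No checks recorded yet: treated as not valid.
        return False
    return all(level != ERR_LEVEL for level in levels)


empty_result = is_valid_sketch([])          # False
passing_result = is_valid_sketch([4])       # True
failing_result = is_valid_sketch([4, 1])    # False
```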
-
must_contain(given, required, passing_msg='', fail=<class 'validation.Err'>)
Adds a validation check where given must contain every item in required at least once to pass. If the check fails, fail_check is of type fail (Err by default, which fails validation).
Parameters: - given (Pandas Series) – the items representing input given
- required (Pandas Series) – the items required to be in given
- passing_msg (str) – message to return if all items in required are listed in given. Defaults to “”.
- fail (VCheck) – the outcome if the check fails. Default is Err.
- impact (Pandas Series) – a corresponding series to required that represents the affected information when
Returns: None
Examples
>>> v = Validation()
>>> v.must_contain(pd.Series(["a","b","c"], name="example input"), pd.Series(["a","b"], name="example requirement(s)"), "all included")
>>> v.must_contain(pd.Series(["a","b","c"], name="example input"), pd.Series(["a","b","d"], name="example requirement(s)"))
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] all included
<BLANKLINE>
ERRORS
[!] 1 (33.3%) example requirement(s) ('d') were not found in example input. Their values will be NA.
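The count and percentage in the failure message above ("1 (33.3%)") are consistent with dividing the number of missing items by the number of required items. A minimal sketch of that arithmetic follows, with plain lists standing in for the Pandas Series; missing_required is a hypothetical helper, not part of the module.

```python
def missing_required(given, required):
    """Return the required items absent from given and their share of required (%)."""
    present = set(given)
    missing = [item for item in required if item not in present]
    # Guard against an empty requirement list to avoid dividing by zero.
    share = 100 * len(missing) / len(required) if required else 0.0
    return missing, round(share, 1)


missing, percent = missing_required(["a", "b", "c"], ["a", "b", "d"])
# missing == ["d"], percent == 33.3
```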
-
no_duplicates(my_series)
Adds a validation check as Err if any items in my_series are duplicates. Intended to alert users of issues where there are duplicate columns before an exception is raised.
Parameters: my_series (Pandas Series) – series where there should not be duplicates
Returns: None
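No example is given for no_duplicates, so the following only illustrates the underlying idea of spotting repeated items, using a plain list in place of the Pandas Series. find_duplicates is a hypothetical helper and does not show how the module builds its check or message.

```python
from collections import Counter


def find_duplicates(items):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(items)
    return [value for value, count in counts.items() if count > 1]


column_names = ["id", "name", "id", "date", "name"]
dupes = find_duplicates(column_names)
# dupes == ["id", "name"]
```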
-
no_extraneous(given, relevant, value_type)
Adds a validation check where all values in given should also be in relevant to pass. fail_check is Warn.
Parameters: - given (Pandas Series) – the items representing input given
- relevant (Pandas Series) – all items in given that will be used
- value_type (str) – string describing the kind of noun that is listed in given
Returns: None
Examples
>>> v = Validation()
>>> v.no_extraneous(pd.Series(["a","b"], name="example input"), pd.Series(["a","b","c"], name="relevant value(s)"), "example")
>>> v.no_extraneous(pd.Series(["a","b","c"], name="example input"), pd.Series(["a","d"], name="relevant value(s)"), "example")
>>> v.report(verbose=4)
Validating . . .
<BLANKLINE>
CHECKS PASSED
[X] No extraneous example found in example input.
<BLANKLINE>
ERRORS
[!] 2 extraneous example(s) found in example input ('b', and 'c') Extraneous example(s) will be ommitted.
-
report(verbose=2)
Prints the checks in the vchecks attribute.
Parameters: verbose (int) – controls how much to print by keeping only the vchecks rows whose level is less than or equal to verbose. Defaults to 2 (print only Warn and Err checks).
Returns: None
Examples
>>> v = Validation("Testing Tests")
>>> v._add_condition(pd.Series([False, False, False]), Passing("Passed test"), Err("Failed test"))
>>> v._add_condition(pd.Series([False, False, False]), Passing("Passed test 2"), Err("Failed test"))
>>> v._add_condition(pd.Series([False, False, True]), Passing("Passed test"), Err("Error test"))
>>> v._add_condition(pd.Series([False, False, True]), Passing("Passed test"), Warn("Warn test"))
>>> v._add_condition(pd.Series([False, False, True]), Passing(""), Suggest("Suggest test"))
>>> v.report(verbose=1)
Validating Testing Tests . . .
<BLANKLINE>
ERRORS
[!] Error test
>>> v.report(verbose=4)
Validating Testing Tests . . .
<BLANKLINE>
CHECKS PASSED
[X] Passed test
[X] Passed test 2
<BLANKLINE>
ERRORS
[!] Error test
<BLANKLINE>
SUGGESTIONS
[i] Suggest test
<BLANKLINE>
WARNINGS
[?] Warn test
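The verbose filter can be sketched over a list of dicts shaped like the expand output. This is an illustration under assumptions, not the module's implementation: report_lines is a hypothetical name, it returns lines instead of printing, and the alphabetical ordering of titles is inferred from the example output above rather than stated in the documentation.

```python
def report_lines(checks, verbose=2):
    """Return the report body for checks whose Level is <= verbose."""
    shown = [c for c in checks if c["Level"] <= verbose]
    lines = []
    # Assumption: sections appear in alphabetical order of their titles.
    for title in sorted({c["Title"] for c in shown}):
        lines.append(title)
        for check in shown:
            if check["Title"] == title:
                lines.append(f'{check["Bullet"]} {check["Message"]}')
    return lines


checks = [
    {"Level": 4, "Title": "CHECKS PASSED", "Bullet": "[X]", "Message": "Passed test"},
    {"Level": 1, "Title": "ERRORS", "Bullet": "[!]", "Message": "Error test"},
    {"Level": 2, "Title": "WARNINGS", "Bullet": "[?]", "Message": "Warn test"},
    {"Level": 3, "Title": "SUGGESTIONS", "Bullet": "[i]", "Message": "Suggest test"},
]

errors_only = report_lines(checks, verbose=1)
default_view = report_lines(checks)
everything = report_lines(checks, verbose=4)
```

With verbose=1 only the ERRORS section remains; the default of 2 adds WARNINGS, and 4 includes every tier.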
-
-
class validation.Warn(message)
Bases: validation.VCheck
VCheck subclass representing a problem in data validation that can be fixed in place, but would otherwise prevent validation.
Examples
>>> Warn("This is a data validation warning").expand()
Tier                                 Warning
Bullet                                   [?]
Level                                      2
Title                               WARNINGS
Message    This is a data validation warning
dtype: object
-
bullet()
Abstract property; must be overridden. Should be a str representing a bullet point.
-
-
validation.report_row(flag_where)
A helper function that returns an English explanation of which rows have been flagged with a failed validation check.
Parameters: flag_where (Pandas Series) – boolean Pandas Series representing failed validation checks.
Returns: a string reporting the index of the flagged rows
Return type: str
Examples
>>> report_row(pd.Series([True, True, False, True, False]))
'#0, #1, and #3'
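The output format of report_row can be reproduced on a plain list of booleans. This sketch is not the module's implementation: describe_rows is a hypothetical name, and it numbers rows by position, whereas the documented function reports the index of a Pandas Series.

```python
def describe_rows(flags):
    """Join the positions of True values as '#i' labels."""
    labels = [f"#{i}" for i, flagged in enumerate(flags) if flagged]
    if len(labels) <= 1:
        # Zero or one flagged row: nothing to join.
        return "".join(labels)
    # The documented outputs put ', and ' before the last label, even for two rows.
    return ", ".join(labels[:-1]) + ", and " + labels[-1]


three_rows = describe_rows([True, True, False, True, False])
two_rows = describe_rows([False, True, True])
one_row = describe_rows([False, True])
# three_rows == "#0, #1, and #3"; two_rows == "#1, and #2"; one_row == "#1"
```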