validation module¶

Module containing the Validation class, and the VCheck class and its subclasses.

class validation.Err(message)[source]¶

Bases: validation.VCheck

VCheck subclass representing a serious problem in data validation that prevents validation.

Examples

>>> Err("This is a data validation error").expand()
Tier                                 Error
Bullet                                 [!]
Level                                    1
Title                               ERRORS
Message    This is a data validation error
dtype: object

bullet()[source]¶: abstract property, must be overridden. Should be a str representing a bullet point

level()[source]¶: abstract property, must be overridden. Should be an int representing the VCheck tier

tier()[source]¶: abstract property, must be overridden. Should be a str representing the name of the VCheck tier

title()[source]¶: abstract property, must be overridden. Should be a str representing the title of the VCheck type

class validation.Passing(message)[source]¶

Bases: validation.VCheck

VCheck subclass representing a passed check in data validation, where there is no problem.

Examples

>>> Passing("This is a passing data validation check").expand()
Tier                                       Passing
Bullet                                         [X]
Level                                            4
Title                                CHECKS PASSED
Message    This is a passing data validation check
dtype: object

bullet()[source]¶: abstract property, must be overridden. Should be a str representing a bullet point

level()[source]¶: abstract property, must be overridden. Should be an int representing the VCheck tier

tier()[source]¶: abstract property, must be overridden. Should be a str representing the name of the VCheck tier

title()[source]¶: abstract property, must be overridden. Should be a str representing the title of the VCheck type

class validation.Suggest(message)[source]¶

Bases: validation.VCheck

VCheck subclass representing a minor problem with data that does not prevent data validation.

Examples

>>> Suggest("This is a data validation suggestion").expand()
Tier                                 Suggestion
Bullet                                      [i]
Level                                         3
Title                               SUGGESTIONS
Message    This is a data validation suggestion
dtype: object

bullet()[source]¶: abstract property, must be overridden. Should be a str representing a bullet point

level()[source]¶: abstract property, must be overridden. Should be an int representing the VCheck tier

tier()[source]¶: abstract property, must be overridden. Should be a str representing the name of the VCheck tier

title()[source]¶: abstract property, must be overridden. Should be a str representing the title of the VCheck type

class validation.VCheck(message)[source]¶

Bases: object

Abstract class for a single validation check

bullet¶: abstract property, must be overridden. Should be a str representing a bullet point

expand()[source]¶

Expands VCheck information as a Pandas Series

Parameters:	None –
Returns:	the VCheck attributes (Tier, Bullet, Level, Title, and Message) as a Pandas Series
Return type:	Pandas Series

Examples

>>> Err("Error Message").expand()
Tier               Error
Bullet               [!]
Level                  1
Title             ERRORS
Message    Error Message
dtype: object

level¶: abstract property, must be overridden. Should be an int representing the VCheck tier

tier¶: abstract property, must be overridden. Should be a str representing the name of the VCheck tier

title¶: abstract property, must be overridden. Should be a str representing the title of the VCheck type
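
The abstract-property pattern described above can be sketched as follows. This is a minimal illustration, not the module's source: the class names `SketchCheck` and `SketchErr` are hypothetical, and `expand()` returns a plain dict here (rather than a Pandas Series) so the block has no dependencies. The attribute values are taken from the Err example on this page.

```python
from abc import ABC, abstractmethod


class SketchCheck(ABC):
    """Hypothetical stand-in for VCheck; not the module's actual code."""

    def __init__(self, message):
        self.message = message

    @property
    @abstractmethod
    def tier(self):
        """Name of the tier, e.g. 'Error'."""

    @property
    @abstractmethod
    def bullet(self):
        """Bullet point string, e.g. '[!]'."""

    @property
    @abstractmethod
    def level(self):
        """Integer tier level, e.g. 1."""

    @property
    @abstractmethod
    def title(self):
        """Section title, e.g. 'ERRORS'."""

    def expand(self):
        # Assumption: a dict stands in for the Pandas Series
        # that the documented expand() returns.
        return {
            "Tier": self.tier,
            "Bullet": self.bullet,
            "Level": self.level,
            "Title": self.title,
            "Message": self.message,
        }


class SketchErr(SketchCheck):
    # Plain class attributes are enough to override the abstract properties.
    tier = "Error"
    bullet = "[!]"
    level = 1
    title = "ERRORS"


print(SketchErr("Error Message").expand())
```

Instantiating `SketchCheck` directly raises `TypeError`, because the four properties are still abstract on the base class.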

class validation.Validation(name='')[source]¶

Bases: object

A Validation object represents an organized DataFrame of validation checks.

Variables:	vchecks (Pandas DataFrame) – a dataframe containing the expanded form of the VCheck instances that have been added.

affected_by_absence(missing_grped)[source]¶

Adds a validation check as Warn describing the items in missing_grped, which detail the impact that missing columns have on newly created mappings.

Parameters:	missing_grped (Pandas Series) – series where the index is the name of the missing source column, and the values are a list of affected values.
Returns:	None

all_valid(given, valid, definition)[source]¶

Adds a validation check where all values in given must be in valid to pass. fail_check is Err (fails validation).

Parameters:

given (Pandas Series) – the items representing input given
valid (Pandas Series) – list of all possible valid items accepted in given
definition (str) – string describing what makes an item in given be in valid

Returns:

None

Examples

>>> v = Validation()
>>> v.all_valid(pd.Series(["a","b"], name="example input"),  pd.Series(["a","b","c"],  name="valid value(s)"), "pre-defined")
>>> v.all_valid(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","d"],  name="valid value(s)"), "'a' or 'd'")
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          All values in example input are valid.
<BLANKLINE>
 ERRORS
[!]          2 values in example input were invalid ('b', and 'c').
These must be 'a' or 'd' to be valid.
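
The membership test behind this check can be sketched in plain Python. This is an illustration under assumptions, not the module's code: `invalid_values` is a hypothetical helper, and lists stand in for the Pandas Series arguments. How the real method treats repeated values is not stated on this page, so the sketch simply keeps every offending item.

```python
def invalid_values(given, valid):
    """Return the items of `given` that do not appear in `valid`, in order."""
    allowed = set(valid)  # set lookup keeps the membership test cheap
    return [item for item in given if item not in allowed]


# Mirrors the two calls in the example above.
print(invalid_values(["a", "b"], ["a", "b", "c"]))  # nothing invalid: check passes
print(invalid_values(["a", "b", "c"], ["a", "d"]))  # 'b' and 'c' are invalid
```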

check_na(df)[source]¶

Adds a validation check flagging the rows in every column of df that are None

Parameters:	df (Pandas DataFrame) – a Pandas DataFrame with columns that should have no NA values
Returns:	None

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e",None]})
>>> v.check_na(test_df)
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          No NA's in column A detected.
<BLANKLINE>
 WARNINGS
[?]          1 NA's in column B detected in row(s) #2.

fix_alnum(df)[source]¶

Adds a validation check flagging the rows in every column of df that contain non-alphanumeric characters. Regex removes all characters that are not alpha-numeric, but leaves periods that are part of a number.

Parameters:	df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only alphanumeric characters
Returns:	df where non-alphanumeric characters are removed
Return type:	Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","3.0","c"],  "B":["??.test","test<>!",";test_data"]})
>>> v.fix_alnum(test_df)
   A          B
0  a       test
1  3.0     test
2  c  test_data
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]      No non-alphanumeric value(s) in column A detected.
<BLANKLINE>
 WARNINGS
[?]      3 non-alphanumeric value(s) in column B detected in row(s)
#0, #1, and #2. This text should be alphanumeric. Non-alphanumeric
characters will be removed.
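
One way to express the rule "remove everything that is not alphanumeric, but keep periods that are part of a number" is sketched below. The pattern is an assumption chosen to reproduce the example output above (including the underscore kept in `test_data`); the regex the module actually uses is not shown on this page, and `strip_non_alnum` is a hypothetical name.

```python
import re

# Hypothetical pattern with three alternatives:
#   (?<!\d)\.  a period that is not preceded by a digit
#   \.(?!\d)   a period that is not followed by a digit
#   [^\w.]     any character that is neither a word character nor a period
# Note that \w includes the underscore, so underscores survive.
NON_ALNUM = re.compile(r"(?<!\d)\.|\.(?!\d)|[^\w.]")


def strip_non_alnum(text):
    """Remove non-alphanumeric characters, keeping periods between digits."""
    return NON_ALNUM.sub("", text)


for value in ["a", "3.0", "??.test", "test<>!", ";test_data"]:
    print(repr(value), "->", repr(strip_non_alnum(value)))
```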

fix_lowcase(df)[source]¶

Adds a validation check flagging the rows in every column of df that contain lowercase characters.

Parameters:	df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only uppercase characters
Returns:	df where all characters are uppercase
Return type:	Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e","F"]})
>>> v.fix_lowcase(test_df)
   A  B
0  A  D
1  B  E
2  C  F
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 WARNINGS
[?]      2 lower case value(s) in  column A detected in row(s) #0,
and #2. Convention to have this text be uppercase. Lower case text
will be made uppercase.
[?]      1 lower case value(s) in  column B detected in row(s) #1.
Convention to have this text be uppercase. Lower case text will be
made uppercase.

fix_upcase(df)[source]¶

Adds a validation check flagging the rows in every column of df that contain uppercase characters.

Parameters:	df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only lowercase characters
Returns:	df where all characters are lowercase
Return type:	Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e","F"]})
>>> v.fix_upcase(test_df)
   A  B
0  a  d
1  b  e
2  c  f
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 WARNINGS
[?]      1 upper case value(s) in column A detected in row(s) #1.
Convention is to have this text be lowercase. Upper case text will
be made lowercase.
[?]      2 upper case value(s) in column B detected in row(s) #0,
and #2. Convention is to have this text be lowercase. Upper case
text will be made lowercase.

fix_whitespace(df)[source]¶

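Adds a validation check flagging the rows in every column of df that contain whitespace. As the example shows, leading and trailing spaces are removed and remaining whitespace is converted to an underscore.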

Parameters:	df (Pandas DataFrame) – a Pandas DataFrame with columns that should have no whitespace
Returns:	df where whitespace is replaced with an underscore
Return type:	Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a"," B ","Test Data"],  "B":["D"," e","F "]})
>>> v.fix_whitespace(test_df)
           A  B
0          a  D
1          B  e
2  Test_Data  F
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          No whitespace in column B detected.
<BLANKLINE>
 WARNINGS
[?]          1 leading/trailing spaces column A detected in row(s) #1.  Leading/trailing spaces will be removed.
[?]          2 leading/trailing spaces column B detected in row(s)  #1, and #2. Leading/trailing spaces will be removed.
[?]          1 whitespace in column A detected in row(s) #2.  Whitespace will be converted to '_'

flag_elements(flag_where, flag_elements, criteria)[source]¶

Adds a validation check seeing if any values in flag_where are true, and then reports on the corresponding items in flag_elements.

Parameters:

flag_where (Pandas Series) – a boolean Pandas Series where True represents a failed check
flag_elements (Pandas Series) – a Pandas Series listing the elements that are affected by True values in flag_where
criteria (str) – a brief description of what elements are being flagged and reported on

Returns:

None

Examples

>>> v = Validation("element test")
>>> v.flag_elements(pd.Series([False, False]),  pd.Series(["A", "B"]), "red flag(s)")
>>> v.flag_elements(pd.Series([False, True]),  pd.Series(["A", "B"]), "blue flag(s)")
>>> v.report(verbose=4)
Validating element test . . .
<BLANKLINE>
 CHECKS PASSED
[X]          No red flag(s) in element test detected.
<BLANKLINE>
 WARNINGS
[?]          1 blue flag(s) in element test detected. These ('B') will be treated as NA.

flag_rows(flag_where, flag_criteria, flag_action='', flag_tier=<class 'validation.Warn'>)[source]¶

Adds a validation check seeing if any values in flag_where are true, where fail_check is of type flag_tier. Note that rows are reported counting from 0.

Parameters:

flag_where (Pandas Series) – a boolean Pandas Series where True represents a failed check.
flag_criteria (str) – a noun clause describing the criteria for an item to be flagged in flag_where
flag_action (str) – string describing the action to be taken if an item is flagged. Defaults to “”.
flag_tier (VCheck) – the seriousness of the failed check. Should be Suggest, Warn, or Err. Defaults to Warn.

Returns:

None

Examples

>>> v = Validation()
>>> v.flag_rows(pd.Series([False, False]),  flag_criteria="true values")
>>> v.flag_rows(pd.Series([False, True]),  flag_criteria="true values")
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
CHECKS PASSED
[X]      No true values detected.
<BLANKLINE>
WARNINGS
[?]      1 true values detected in row(s) #1.

is_valid()[source]¶

Checks to see if instance is valid.

Parameters:	None –
Returns:	True if the instance is valid (vchecks contains no errors); False if the instance has errors or if vchecks is empty.
Return type:	bool

Examples

>>> Validation().is_valid()
False
>>> v = Validation()
>>> v.must_contain(pd.Series(["A", "B"]), pd.Series(["B"]))
>>> v.is_valid()
True
>>> v.must_contain(pd.Series(["A", "B"]), pd.Series(["C"]))
>>> v.is_valid()
False
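
The rule above reduces to two conditions: at least one check has been recorded, and none of them is an error. A minimal sketch, assuming the recorded checks are represented only by their Level values (1 for Err, as in the Err example on this page); `sketch_is_valid` is a hypothetical name.

```python
ERR_LEVEL = 1  # Level shown for Err in the examples on this page


def sketch_is_valid(levels):
    """True only if some checks exist and none of them is an Err."""
    return len(levels) > 0 and ERR_LEVEL not in levels


print(sketch_is_valid([]))      # empty: not valid
print(sketch_is_valid([4]))     # one passing check: valid
print(sketch_is_valid([4, 1]))  # an error was added: not valid
```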

must_contain(given, required, passing_msg='', fail=<class 'validation.Err'>)[source]¶

Adds a validation check where given must contain every item in required at least once to pass. fail_check is of type fail (Err by default, which fails validation).

Parameters:

given (Pandas Series) – the items representing input given
required (Pandas Series) – the items required to be in given
passing_msg (str) – Message to return if all items in required are listed in given. Defaults to “”.
fail (VCheck) – the outcome if the check fails. Default is Err.
impact (Pandas Series) – a corresponding series to required that represents the affected information when

Returns:

None

Examples

>>> v = Validation()
>>> v.must_contain(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","b"],  name="example requirement(s)"),  "all included")
>>> v.must_contain(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","b","d"],  name="example requirement(s)"))
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          all included
<BLANKLINE>
 ERRORS
 [!]          1 (33.3%) example requirement(s) ('d') were not found in example input. Their values will be NA.

no_duplicates(my_series)[source]¶

Adds a validation check as Err if any items in my_series are duplicates. Intended to alert users to duplicate columns before an exception is raised.

Parameters:	my_series (Pandas Series) – series in which there should be no duplicates
Returns:	None
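
No example is given for this method, so the following only illustrates the kind of duplicate detection it describes. It is a sketch under assumptions: `duplicated_items` is a hypothetical helper, a list of column names stands in for the Pandas Series, and the wording of the resulting Err message is not reproduced because it is not documented here.

```python
from collections import Counter


def duplicated_items(items):
    """Return each item that occurs more than once, in first-seen order."""
    counts = Counter(items)
    return [item for item, count in counts.items() if count > 1]


# Hypothetical column names, for illustration only.
print(duplicated_items(["id", "name", "id", "date", "name"]))
```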

no_extraneous(given, relevant, value_type)[source]¶

Adds a validation check where all values in given should also be in relevant to pass. fail_check is Warn.

Parameters:

given (Pandas Series) – the items representing input given
relevant (Pandas Series) – all items in given that will be used
value_type (str) – string describing the kind of noun that is listed in given

Returns:

None

Examples

>>> v = Validation()
>>> v.no_extraneous(pd.Series(["a","b"], name="example input"),  pd.Series(["a","b","c"],  name="relevant value(s)"), "example")
>>> v.no_extraneous(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","d"],  name="relevant value(s)"), "example")
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
CHECKS PASSED
[X]      No extraneous example found in example input.
<BLANKLINE>
ERRORS
[!]      2 extraneous example(s) found in example input
('b', and 'c') Extraneous example(s) will be ommitted.

report(verbose=2)[source]¶

Prints the checks in the vchecks attribute

Parameters:	verbose (int) – controls how much is printed: only the rows of vchecks whose level is less than or equal to verbose are shown. Defaults to 2 (prints only Err and Warn checks).
Returns:	None

Examples

>>> v = Validation("Testing Tests")
>>> v._add_condition(pd.Series([False, False, False]),  Passing("Passed test"), Err("Failed test"))
>>> v._add_condition(pd.Series([False, False, False]),  Passing("Passed test 2"), Err("Failed test"))
>>> v._add_condition(pd.Series([False, False, True]),  Passing("Passed test"), Err("Error test"))
>>> v._add_condition(pd.Series([False, False, True]),  Passing("Passed test"), Warn("Warn test"))
>>> v._add_condition(pd.Series([False, False, True]),  Passing(""), Suggest("Suggest test"))
>>> v.report(verbose=1)
Validating Testing Tests . . .
<BLANKLINE>
 ERRORS
[!]      Error test
>>> v.report(verbose=4)
Validating Testing Tests . . .
<BLANKLINE>
 CHECKS PASSED
[X]      Passed test
[X]      Passed test 2
<BLANKLINE>
 ERRORS
[!]      Error test
<BLANKLINE>
 SUGGESTIONS
[i]      Suggest test
<BLANKLINE>
 WARNINGS
[?]      Warn test
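
The effect of verbose can be sketched as a filter on each check's level followed by grouping on its title. This is an illustration, not the module's implementation: `select_checks` is a hypothetical helper, (level, title, message) tuples stand in for the rows of vchecks, and sorting the groups alphabetically by title is an assumption based only on the section order visible in the example output.

```python
def select_checks(checks, verbose=2):
    """Group messages by title for checks whose level is <= verbose.

    `checks` is a list of (level, title, message) tuples, a hypothetical
    stand-in for the rows of the vchecks DataFrame.
    """
    grouped = {}
    # sorted() is stable, so messages keep their original order within a title.
    for level, title, message in sorted(checks, key=lambda check: check[1]):
        if level <= verbose:
            grouped.setdefault(title, []).append(message)
    return grouped


example_checks = [
    (4, "CHECKS PASSED", "Passed test"),
    (4, "CHECKS PASSED", "Passed test 2"),
    (1, "ERRORS", "Error test"),
    (2, "WARNINGS", "Warn test"),
    (3, "SUGGESTIONS", "Suggest test"),
]

for title, messages in select_checks(example_checks, verbose=4).items():
    print(title, messages)
```

With `verbose=1` only the ERRORS group remains, and with the default of 2 the WARNINGS group is included as well.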

class validation.Warn(message)[source]¶

Bases: validation.VCheck

VCheck subclass representing a problem in data validation that can be fixed in place, but would otherwise prevent validation.

Examples

>>> Warn("This is a data validation warning").expand()
Tier                                 Warning
Bullet                                   [?]
Level                                      2
Title                               WARNINGS
Message    This is a data validation warning
dtype: object

bullet()[source]¶: abstract property, must be overridden. Should be a str representing a bullet point

level()[source]¶: abstract property, must be overridden. Should be an int representing the VCheck tier

tier()[source]¶: abstract property, must be overridden. Should be a str representing the name of the VCheck tier

title()[source]¶: abstract property, must be overridden. Should be a str representing the title of the VCheck type

validation.report_row(flag_where)[source]¶

A helper function that returns an English description of which rows have been flagged with a failed validation check.

Parameters:	flag_where (Pandas Series) – boolean Pandas Series representing failed validation checks.
Returns:	a string reporting the index of the flagged rows
Return type:	str

Examples

>>> report_row(pd.Series([True, True, False, True, False]))
'#0, #1, and #3'
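
The output format can be reproduced with a short sketch. It assumes rows are reported by position, counting from 0, and that the last label is prefixed with "and" whenever more than one row is flagged, which matches the two-row form `#0, and #2` seen in the report output elsewhere on this page. `describe_flagged_rows` is a hypothetical name, and a list of booleans stands in for the Pandas Series.

```python
def describe_flagged_rows(flag_where):
    """Return a string such as '#0, #1, and #3' for the True positions."""
    labels = [f"#{position}" for position, flagged in enumerate(flag_where) if flagged]
    if len(labels) > 1:
        # Prefix the final label so the list reads as an English enumeration.
        labels[-1] = "and " + labels[-1]
    return ", ".join(labels)


print(describe_flagged_rows([True, True, False, True, False]))
```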