validation module

Module containing the Validation class, and the VCheck class and its subclasses.

class validation.Err(message)[source]

Bases: validation.VCheck

VCheck subclass representing a serious problem in the data that causes validation to fail.

Examples

>>> Err("This is a data validation error").expand()
Tier                                 Error
Bullet                                 [!]
Level                                    1
Title                               ERRORS
Message    This is a data validation error
dtype: object
bullet()[source]

Abstract property; must be overridden. Should be a str representing a bullet point.

level()[source]

Abstract property; must be overridden. Should be an int representing the VCheck tier.

tier()[source]

Abstract property; must be overridden. Should be a str representing the name of the VCheck tier.

title()[source]

Abstract property; must be overridden. Should be a str representing the title of the VCheck type.

class validation.Passing(message)[source]

Bases: validation.VCheck

VCheck subclass representing a passed check in data validation, where there is no problem.

Examples

>>> Passing("This is a passing data validation check").expand()
Tier                                       Passing
Bullet                                         [X]
Level                                            4
Title                                CHECKS PASSED
Message    This is a passing data validation check
dtype: object
bullet()[source]

Abstract property; must be overridden. Should be a str representing a bullet point.

level()[source]

Abstract property; must be overridden. Should be an int representing the VCheck tier.

tier()[source]

Abstract property; must be overridden. Should be a str representing the name of the VCheck tier.

title()[source]

Abstract property; must be overridden. Should be a str representing the title of the VCheck type.

class validation.Suggest(message)[source]

Bases: validation.VCheck

VCheck subclass representing a minor problem with data that does not prevent data validation.

Examples

>>> Suggest("This is a data validation suggestion").expand()
Tier                                 Suggestion
Bullet                                      [i]
Level                                         3
Title                               SUGGESTIONS
Message    This is a data validation suggestion
dtype: object
bullet()[source]

Abstract property; must be overridden. Should be a str representing a bullet point.

level()[source]

Abstract property; must be overridden. Should be an int representing the VCheck tier.

tier()[source]

Abstract property; must be overridden. Should be a str representing the name of the VCheck tier.

title()[source]

Abstract property; must be overridden. Should be a str representing the title of the VCheck type.

class validation.VCheck(message)[source]

Bases: object

Abstract class for a single validation check.

bullet

Abstract property; must be overridden. Should be a str representing a bullet point.

expand()[source]

Expands VCheck information as a Pandas Series

Parameters:None
Returns:the VCheck attributes (Tier, Bullet, Level, Title, and Message)
Return type:Pandas Series

Examples

>>> Err("Error Message").expand()
Tier               Error
Bullet               [!]
Level                  1
Title             ERRORS
Message    Error Message
dtype: object
level

Abstract property; must be overridden. Should be an int representing the VCheck tier.

tier

Abstract property; must be overridden. Should be a str representing the name of the VCheck tier.

title

Abstract property; must be overridden. Should be a str representing the title of the VCheck type.
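
The abstract-property pattern described for VCheck can be sketched roughly as follows. This is an illustration only, not the module's source: the class names SketchCheck and SketchErr are hypothetical, and a plain dict stands in for the Pandas Series that expand() returns. The attribute values are copied from the Err example in this module.

```python
from abc import ABC, abstractmethod


class SketchCheck(ABC):
    """Hypothetical stand-in for VCheck: a single check carrying a message."""

    def __init__(self, message):
        self.message = message

    @property
    @abstractmethod
    def tier(self):
        """Name of the tier (str)."""

    @property
    @abstractmethod
    def bullet(self):
        """Bullet point shown in reports (str)."""

    @property
    @abstractmethod
    def level(self):
        """Numeric tier (int)."""

    @property
    @abstractmethod
    def title(self):
        """Heading under which checks of this type are grouped (str)."""

    def expand(self):
        # Assumption: a dict is used here in place of a Pandas Series.
        return {
            "Tier": self.tier,
            "Bullet": self.bullet,
            "Level": self.level,
            "Title": self.title,
            "Message": self.message,
        }


class SketchErr(SketchCheck):
    # Overriding every abstract property makes the subclass instantiable.
    tier = "Error"
    bullet = "[!]"
    level = 1
    title = "ERRORS"


print(SketchErr("Error Message").expand())
```

Instantiating SketchCheck directly raises TypeError, because its abstract properties have not been overridden.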

class validation.Validation(name='')[source]

Bases: object

Validation object represents an organized dataframe of validation checks

Variables:vchecks (Pandas DataFrame) – a dataframe containing the expanded form of the VCheck instances that have been added.
affected_by_absence(missing_grped)[source]

Adds a validation check as Warn describing the items in missing_grped, which detail the impact that missing columns have on newly created mappings.

Parameters:missing_grped (Pandas Series) – series where the index is the name of the missing source column, and the values are a list of affected values
Returns:None
all_valid(given, valid, definition)[source]

Adds a validation check where all values in given must be in valid to pass. A failed check is recorded as Err (fails validation).

Parameters:
  • given (Pandas Series) – the items representing input given
  • valid (Pandas Series) – list of all possible valid items accepted in given
  • definition (str) – string describing what makes an item in given be in valid
Returns:

None

Examples

>>> v = Validation()
>>> v.all_valid(pd.Series(["a","b"], name="example input"),  pd.Series(["a","b","c"],  name="valid value(s)"), "pre-defined")
>>> v.all_valid(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","d"],  name="valid value(s)"), "'a' or 'd'")
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          All values in example input are valid.
<BLANKLINE>
 ERRORS
[!]          2 values in example input were invalid ('b', and 'c').
These must be 'a' or 'd' to be valid.
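
The membership test behind a check like all_valid can be sketched as below. The function name invalid_values is hypothetical, plain lists stand in for the Pandas Series arguments, and reporting each invalid value only once is an assumption; the module's actual implementation is not shown in this reference.

```python
def invalid_values(given, valid):
    """Return the distinct items of `given` that are not in `valid`,
    in the order they first appear."""
    allowed = set(valid)
    found = []
    for item in given:
        if item not in allowed and item not in found:
            found.append(item)
    return found


# Mirrors the two calls in the example above.
print(invalid_values(["a", "b"], ["a", "b", "c"]))   # []
print(invalid_values(["a", "b", "c"], ["a", "d"]))   # ['b', 'c']
```

An empty result corresponds to a passing check; a non-empty result corresponds to an Err.
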
check_na(df)[source]

Adds a validation check flagging the rows in every column of df that are None

Parameters:df (Pandas DataFrame) – a Pandas DataFrame with columns that should have no NA values
Returns:None

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e",None]})
>>> v.check_na(test_df)
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          No NA's in column A detected.
<BLANKLINE>
 WARNINGS
[?]          1 NA's in column B detected in row(s) #2.
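
A rough sketch of how the flagged rows might be located. The helper na_rows is hypothetical, and a dict of lists stands in for the DataFrame; only None is treated as missing here, whereas pandas may also treat other values (such as NaN) as NA.

```python
def na_rows(columns):
    """Map each column name to the row positions (counting from 0)
    whose value is None."""
    return {
        name: [row for row, value in enumerate(values) if value is None]
        for name, values in columns.items()
    }


print(na_rows({"A": ["a", "B", "c"], "B": ["D", "e", None]}))
# {'A': [], 'B': [2]}
```
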
fix_alnum(df)[source]

Adds a validation check flagging the rows in every column of df that contain non-alphanumeric characters. Regex removes all characters that are not alpha-numeric, but leaves periods that are part of a number.

Parameters:df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only alphanumeric characters
Returns:df where non-alphanumeric characters are removed
Return type:Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","3.0","c"],  "B":["??.test","test<>!",";test_data"]})
>>> v.fix_alnum(test_df)
   A          B
0  a       test
1  3.0     test
2  c  test_data
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]      No non-alphanumeric value(s) in column A detected.
<BLANKLINE>
 WARNINGS
[?]      3 non-alphanumeric value(s) in column B detected in row(s)
#0, #1, and #2. This text should be alphanumeric. Non-alphanumeric
characters will be removed.
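
One regular expression that reproduces the cleaned values in the example above is sketched below. The pattern and the name strip_non_alnum are assumptions; the regex the module actually uses is not given in this reference. Underscores are kept because the example output keeps test_data intact.

```python
import re

# Hypothetical pattern: remove a period unless it has a digit on both sides,
# and remove any other character that is not a letter, digit, or underscore.
NON_ALNUM = re.compile(r"(?<!\d)\.|\.(?!\d)|[^\w.]")


def strip_non_alnum(text):
    """Return `text` with the characters matched by NON_ALNUM removed."""
    return NON_ALNUM.sub("", text)


for value in ["a", "3.0", "??.test", "test<>!", ";test_data"]:
    print(repr(strip_non_alnum(value)))
# 'a', '3.0', 'test', 'test', 'test_data'
```
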
fix_lowcase(df)[source]

Adds a validation check flagging the rows in every column of df that contain lowercase characters.

Parameters:df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only uppercase characters
Returns:df where all characters are uppercase
Return type:Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e","F"]})
>>> v.fix_lowcase(test_df)
   A  B
0  A  D
1  B  E
2  C  F
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 WARNINGS
[?]      2 lower case value(s) in  column A detected in row(s) #0,
and #2. Convention to have this text be uppercase. Lower case text
will be made uppercase.
[?]      1 lower case value(s) in  column B detected in row(s) #1.
Convention to have this text be uppercase. Lower case text will be
made uppercase.
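
The flag-then-fix behaviour described for fix_lowcase can be sketched for a single column as follows. Both helper names are hypothetical, and a list of strings stands in for a DataFrame column.

```python
def lowercase_rows(values):
    """Row positions (counting from 0) whose text contains a lowercase character."""
    return [
        row for row, text in enumerate(values)
        if any(ch.islower() for ch in text)
    ]


def to_upper(values):
    """Copy of the column with every value uppercased."""
    return [text.upper() for text in values]


column_a = ["a", "B", "c"]
print(lowercase_rows(column_a))  # [0, 2]
print(to_upper(column_a))        # ['A', 'B', 'C']
```
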
fix_upcase(df)[source]

Adds a validation check flagging the rows in every column of df that contain uppercase characters

Parameters:df (Pandas DataFrame) – a Pandas DataFrame with columns that should have only lowercase characters
Returns:df where all characters are lowercase
Return type:Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a","B","c"], "B":["D","e","F"]})
>>> v.fix_upcase(test_df)
   A  B
0  a  d
1  b  e
2  c  f
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 WARNINGS
[?]      1 upper case value(s) in column A detected in row(s) #1.
Convention is to have this text be lowercase. Upper case text will
be made lowercase.
[?]      2 upper case value(s) in column B detected in row(s) #0,
and #2. Convention is to have this text be lowercase. Upper case
text will be made lowercase.
fix_whitespace(df)[source]

Adds a validation check flagging the rows in every column of df that contain whitespace

Parameters:df (Pandas DataFrame) – a Pandas DataFrame with columns that should have no whitespace
Returns:df where leading/trailing spaces are removed and remaining whitespace is replaced with an underscore
Return type:Pandas DataFrame

Examples

>>> v = Validation()
>>> test_df = pd.DataFrame({"A":["a"," B ","Test Data"],  "B":["D"," e","F "]})
>>> v.fix_whitespace(test_df)
           A  B
0          a  D
1          B  e
2  Test_Data  F
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          No whitespace in column B detected.
<BLANKLINE>
 WARNINGS
[?]          1 leading/trailing spaces column A detected in row(s) #1.  Leading/trailing spaces will be removed.
[?]          2 leading/trailing spaces column B detected in row(s)  #1, and #2. Leading/trailing spaces will be removed.
[?]          1 whitespace in column A detected in row(s) #2.  Whitespace will be converted to '_'
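
The two-step clean-up suggested by the report messages (trim first, then replace what remains) might look like the sketch below. The name clean_whitespace is hypothetical, and collapsing a run of whitespace into a single underscore is an assumption not confirmed by this reference.

```python
import re


def clean_whitespace(text):
    """Strip leading/trailing whitespace, then replace each remaining
    run of whitespace with one underscore."""
    trimmed = text.strip()
    return re.sub(r"\s+", "_", trimmed)


for value in ["a", " B ", "Test Data", " e", "F "]:
    print(repr(clean_whitespace(value)))
# 'a', 'B', 'Test_Data', 'e', 'F'
```
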
flag_elements(flag_where, flag_elements, criteria)[source]

Adds a validation check seeing if any values in flag_where are true, and then reports on the corresponding items in flag_elements.

Parameters:
  • flag_where (Pandas Series) – a boolean Pandas Series where True represents a failed check
  • flag_elements (Pandas Series) – a Pandas Series listing the elements that are affected by True values in flag_where
  • criteria (str) – a brief description of what elements are being flagged and reported on
Returns:

None

Examples

>>> v = Validation("element test")
>>> v.flag_elements(pd.Series([False, False]),  pd.Series(["A", "B"]), "red flag(s)")
>>> v.flag_elements(pd.Series([False, True]),  pd.Series(["A", "B"]), "blue flag(s)")
>>> v.report(verbose=4)
Validating element test . . .
<BLANKLINE>
 CHECKS PASSED
[X]          No red flag(s) in element test detected.
<BLANKLINE>
 WARNINGS
[?]          1 blue flag(s) in element test detected. These ('B') will be treated as NA.
flag_rows(flag_where, flag_criteria, flag_action='', flag_tier=<class 'validation.Warn'>)[source]

Adds a validation check that fails if any values in flag_where are True; a failed check is recorded as flag_tier. Note that rows are reported counting from 0.

Parameters:
  • flag_where (Pandas Series) – a boolean Pandas Series where True represents a failed check.
  • flag_criteria (str) – a noun clause describing the criteria for an item to be flagged in flag_where
  • flag_action (str) – string describing the action to be taken if an item is flagged. Defaults to “”.
  • flag_tier (VCheck) – the seriousness of the failed check; should be Suggest, Warn, or Err. Defaults to Warn.
Returns:

None

Examples

>>> v = Validation()
>>> v.flag_rows(pd.Series([False, False]),  flag_criteria="true values")
>>> v.flag_rows(pd.Series([False, True]),  flag_criteria="true values")
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
CHECKS PASSED
[X]      No true values detected.
<BLANKLINE>
WARNINGS
[?]      1 true values detected in row(s) #1.
is_valid()[source]

Checks to see if instance is valid.

Parameters:None
Returns:True if the instance is valid (vchecks contains no errors); False if the instance has errors or vchecks is empty
Return type:bool

Examples

>>> Validation().is_valid()
False
>>> v = Validation()
>>> v.must_contain(pd.Series(["A", "B"]), pd.Series(["B"]))
>>> v.is_valid()
True
>>> v.must_contain(pd.Series(["A", "B"]), pd.Series(["C"]))
>>> v.is_valid()
False
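
The rule stated above (no errors, and at least one check recorded) can be sketched as follows. The function sketch_is_valid is hypothetical, a list of dicts stands in for the rows of vchecks, and Level 1 is taken to mean Err, as in the Err example in this module.

```python
ERROR_LEVEL = 1  # Level reported by Err(...).expand()


def sketch_is_valid(vchecks):
    """True only if at least one check exists and none is at error level."""
    if not vchecks:
        return False
    return all(check["Level"] != ERROR_LEVEL for check in vchecks)


print(sketch_is_valid([]))                            # False
print(sketch_is_valid([{"Level": 4}]))                # True
print(sketch_is_valid([{"Level": 4}, {"Level": 1}]))  # False
```
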
must_contain(given, required, passing_msg='', fail=<class 'validation.Err'>)[source]

Adds a validation check where given must contain every item in required at least once to pass. A failed check is recorded as fail (Err by default, which fails validation).

Parameters:
  • given (Pandas Series) – the items representing input given
  • required (Pandas Series) – the items required to be in given
  • passing_msg (str) – Message to return if all items in required are listed in given. Defaults to “”.
  • fail (VCheck) – the outcome if the check fails. Default is Err.
  • impact (Pandas Series) – a corresponding series to required that represents the affected information when
Returns:

None

Examples

>>> v = Validation()
>>> v.must_contain(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","b"],  name="example requirement(s)"),  "all included")
>>> v.must_contain(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","b","d"],  name="example requirement(s)"))
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
 CHECKS PASSED
[X]          all included
<BLANKLINE>
 ERRORS
 [!]          1 (33.3%) example requirement(s) ('d') were not found in example input. Their values will be NA.
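
The count and percentage in the failure message can be reproduced with a sketch like the one below. The name missing_required is hypothetical, plain lists stand in for the Series, and computing the percentage against the number of required items is inferred from the example output (1 of 3 gives 33.3%).

```python
def missing_required(given, required):
    """Return the required items absent from `given` and the share of
    `required` they represent, as a percentage rounded to one decimal."""
    present = set(given)
    missing = [item for item in required if item not in present]
    total = len(required)
    percent = round(100 * len(missing) / total, 1) if total else 0.0
    return missing, percent


print(missing_required(["a", "b", "c"], ["a", "b"]))       # ([], 0.0)
print(missing_required(["a", "b", "c"], ["a", "b", "d"]))  # (['d'], 33.3)
```
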
no_duplicates(my_series)[source]

Adds a validation check as Err if any items in my_series are duplicates. Intended to alert users of issues where there are duplicate columns before an exception is raised.

Parameters:my_series (Pandas Series) – series where there should not be duplicates
Returns:None
no_extraneous(given, relevant, value_type)[source]

adds a validation check where all values in given should also be in relevant to pass. fail_check is Warn

Parameters:
  • given (Pandas Series) – the items representing input given
  • relevant (Pandas Series) – all items in given that will be used
  • value_type (str) – string describing the kind of noun that is listed in given
Returns:

None

Examples

>>> v = Validation()
>>> v.no_extraneous(pd.Series(["a","b"], name="example input"),  pd.Series(["a","b","c"],  name="relevant value(s)"), "example")
>>> v.no_extraneous(pd.Series(["a","b","c"], name="example input"),  pd.Series(["a","d"],  name="relevant value(s)"), "example")
>>> v.report(verbose=4)
Validating  . . .
<BLANKLINE>
CHECKS PASSED
[X]      No extraneous example found in example input.
<BLANKLINE>
ERRORS
[!]      2 extraneous example(s) found in example input
('b', and 'c') Extraneous example(s) will be ommitted.
report(verbose=2)[source]

Prints the checks in the vchecks attribute

Parameters:verbose (int) – controls how much is printed: only the rows of vchecks whose level is less than or equal to verbose are shown. Defaults to 2 (print only Warn and Err checks)
Returns:None

Examples

>>> v = Validation("Testing Tests")
>>> v._add_condition(pd.Series([False, False, False]),  Passing("Passed test"), Err("Failed test"))
>>> v._add_condition(pd.Series([False, False, False]),  Passing("Passed test 2"), Err("Failed test"))
>>> v._add_condition(pd.Series([False, False, True]),  Passing("Passed test"), Err("Error test"))
>>> v._add_condition(pd.Series([False, False, True]),  Passing("Passed test"), Warn("Warn test"))
>>> v._add_condition(pd.Series([False, False, True]),  Passing(""), Suggest("Suggest test"))
>>> v.report(verbose=1)
Validating Testing Tests . . .
<BLANKLINE>
 ERRORS
[!]      Error test
>>> v.report(verbose=4)
Validating Testing Tests . . .
<BLANKLINE>
 CHECKS PASSED
[X]      Passed test
[X]      Passed test 2
<BLANKLINE>
 ERRORS
[!]      Error test
<BLANKLINE>
 SUGGESTIONS
[i]      Suggest test
<BLANKLINE>
 WARNINGS
[?]      Warn test
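
The effect of verbose can be sketched as a filter on Level followed by grouping on Title. The function select_for_report and the list-of-dicts layout are hypothetical stand-ins for the vchecks DataFrame, and sorting the groups by title is an assumption inferred from the order of the headings in the example output.

```python
def select_for_report(vchecks, verbose=2):
    """Keep checks whose Level is <= verbose and group their messages by Title."""
    grouped = {}
    for check in vchecks:
        if check["Level"] <= verbose:
            grouped.setdefault(check["Title"], []).append(check["Message"])
    # Assumption: headings are printed in alphabetical order of title.
    return dict(sorted(grouped.items()))


# Levels follow the expand() examples: Err 1, Warn 2, Suggest 3, Passing 4.
example_checks = [
    {"Level": 4, "Title": "CHECKS PASSED", "Message": "Passed test"},
    {"Level": 4, "Title": "CHECKS PASSED", "Message": "Passed test 2"},
    {"Level": 1, "Title": "ERRORS", "Message": "Error test"},
    {"Level": 2, "Title": "WARNINGS", "Message": "Warn test"},
    {"Level": 3, "Title": "SUGGESTIONS", "Message": "Suggest test"},
]

print(select_for_report(example_checks, verbose=1))
# {'ERRORS': ['Error test']}
print(list(select_for_report(example_checks, verbose=4)))
# ['CHECKS PASSED', 'ERRORS', 'SUGGESTIONS', 'WARNINGS']
```
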
class validation.Warn(message)[source]

Bases: validation.VCheck

VCheck subclass representing a problem in data validation that can be fixed in place, but would otherwise prevent validation.

Examples

>>> Warn("This is a data validation warning").expand()
Tier                                 Warning
Bullet                                   [?]
Level                                      2
Title                               WARNINGS
Message    This is a data validation warning
dtype: object
bullet()[source]

Abstract property; must be overridden. Should be a str representing a bullet point.

level()[source]

Abstract property; must be overridden. Should be an int representing the VCheck tier.

tier()[source]

Abstract property; must be overridden. Should be a str representing the name of the VCheck tier.

title()[source]

Abstract property; must be overridden. Should be a str representing the title of the VCheck type.

validation.report_row(flag_where)[source]

A helper function that returns an English explanation of which rows have been flagged with a failed validation check.

Parameters:flag_where (Pandas Series) – boolean Pandas Series representing failed validation checks.
Returns:a string reporting the index of the flagged rows
Return type:str

Examples

>>> report_row(pd.Series([True, True, False, True, False]))
'#0, #1, and #3'
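
A sketch that reproduces the string format seen in this module's examples, including the ", and" used even for two rows (as in "#0, and #2"). The name describe_rows is hypothetical; it reports positions counting from 0 rather than index labels, and returning an empty string when nothing is flagged is an assumption.

```python
def describe_rows(flag_where):
    """Return e.g. '#0, #1, and #3' for the positions where flag_where is true."""
    rows = [f"#{pos}" for pos, flagged in enumerate(flag_where) if flagged]
    if len(rows) <= 1:
        # Zero or one flagged row: nothing to join.
        return "".join(rows)
    return ", ".join(rows[:-1]) + ", and " + rows[-1]


print(describe_rows([True, True, False, True, False]))  # #0, #1, and #3
print(describe_rows([True, False, True]))               # #0, and #2
print(describe_rows([False, True]))                     # #1
```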