transform module

Defines the main CrossVA function, transform, which maps raw VA data into data for use with a VA algorithm in OpenVA.

transform.transform(mapping, raw_data, raw_data_id=None, verbose=2, preserve_na=True, result_values={'Absent': 'n', 'NA': '.', 'Present': 'y'})

Transforms raw VA data (raw_data) into data suitable for use with a VA algorithm, according to the transformations specified in mapping.

Parameters:

  • mapping (string, tuple or Pandas DataFrame) – Should be either a tuple in the form (input, output), a path to a CSV file containing configuration data, or a Pandas DataFrame containing configuration data
  • raw_data (string or Pandas DataFrame) – raw verbal autopsy data to process
  • raw_data_id (string) – column name with record ID
  • verbose (int) – integer from 0 to 5, controlling how much status detail is printed to console. Silent if 0. Defaults to 2, which will print only errors and warnings.
  • preserve_na (bool) – whether to preserve NAs in data, or to count them as False. Overridden with True for InSilicoVA, False for InterVA4 when mapping is given as a tuple. Defaults to True, which allows NA values to propagate through the data.
  • result_values (dict) – available as a simple customization option if the user would like presence, absence, and NAs to be mapped to particular values.

Returns:

The raw data transformed according to the specifications given in the mapping data. Default values are y where a symptom is present and n where it is absent. If NAs are preserved, they are represented in the data by the NA value (. by default). If NAs are not preserved, they are considered to be false / absent / 0.

Return type:

Pandas DataFrame


You can specify the mapping as a tuple ('input', 'output') and give the path to the CSV data as a string:

>>> transform(("2016WHOv151", "InterVA4"), "resources/sample_data/2016WHO_mock_data_1.csv").loc[range(5),["ACUTE","CHRONIC","TUBER"]]
0      y        n      .
1      y        n      .
2      n        y      .
3      n        y      .
4      y        n      .

You can also give the data and mapping as Pandas DataFrames:

>>> my_special_data = pd.read_csv("resources/sample_data/2016WHO_mock_data_1.csv")
>>> my_special_mapping = pd.read_csv("resources/mapping_configuration_files/2016WHOv151_to_InSilicoVA.csv")
>>> transform(my_special_mapping, my_special_data).loc[range(5),["ACUTE","CHRONIC","TUBER"]]
0      y        n      .
1      y        n      .
2      n        y      .
3      n        y      .
4      y        n      .

Note that by default, preserve_na is True and NA values are left in. If preserve_na is False, or if the algorithm does not preserve NAs, then NA values are filled in as absent (false / 0), as they are in the first InterVA4 example above.
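The effect of preserve_na can be sketched in plain Python. This is only an illustration of the rule described above, not CrossVA's implementation: the helper name fill_missing is hypothetical, and None stands in for an NA value.

```python
from typing import List, Optional


def fill_missing(flags: List[Optional[bool]], preserve_na: bool = True) -> List[Optional[bool]]:
    """Hypothetical helper: None stands in for NA.

    With preserve_na=True the NA entries are left untouched.
    With preserve_na=False every NA entry is counted as False (absent).
    """
    if preserve_na:
        return list(flags)
    return [False if flag is None else flag for flag in flags]


symptom = [True, None, False]
kept = fill_missing(symptom, preserve_na=True)
filled = fill_missing(symptom, preserve_na=False)
print(kept)    # [True, None, False]
print(filled)  # [True, False, False]
```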

The user can also pass in a different mapping dictionary for result_values to change the values from their defaults of n (False / Absent), y (True / Present), and . (No data / missing), if they need their results in a different format.

>>> transform(("2016WHOv151", "InterVA4"), "resources/sample_data/2016WHO_mock_data_1.csv", result_values={"Absent":"A","Present":"P","NA":"Missing"}).loc[range(5),["ACUTE","CHRONIC","TUBER"]]
0     P       A  Missing
1     P       A  Missing
2     A       P  Missing
3     A       P  Missing
4     P       A  Missing
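As a rough sketch of what the result_values recoding amounts to, the snippet below maps present / absent / NA flags to labels. The function recode and the use of None for NA are assumptions made for illustration. Merging a partial dictionary over the defaults is a choice made for this sketch, not documented CrossVA behavior.

```python
from typing import Dict, List, Optional

# Defaults taken from the transform signature shown at the top of this page.
DEFAULT_RESULT_VALUES = {"Present": "y", "Absent": "n", "NA": "."}


def recode(flags: List[Optional[bool]],
           result_values: Optional[Dict[str, str]] = None) -> List[str]:
    """Hypothetical helper: turn True / False / None into output labels."""
    labels = {**DEFAULT_RESULT_VALUES, **(result_values or {})}
    recoded = []
    for flag in flags:
        if flag is None:
            recoded.append(labels["NA"])
        elif flag:
            recoded.append(labels["Present"])
        else:
            recoded.append(labels["Absent"])
    return recoded


default_out = recode([True, False, None])
custom_out = recode([True, False, None],
                    {"Absent": "A", "Present": "P", "NA": "Missing"})
print(default_out)  # ['y', 'n', '.']
print(custom_out)   # ['P', 'A', 'Missing']
```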

The mapping-data relationship is designed to be as flexible as possible, while still emphasizing traceability and alerting the user to data integrity issues.

Not every source column in the mapping needs to be represented in the data. If source columns are missing in the source data, then those columns will be created and filled with NA values.

>>> transform(("2016WHOv151", "InSilicoVA"), "resources/sample_data/2016WHO_mock_data_2.csv").loc[range(5),["ACUTE","FEMALE","MARRIED"]]
Validating Mapping-Data Relationship . . .
[?]          3 (1.3%) expected source column IDs listed in mapping file ('-ageInDaysNeonate', '-Id10019', and '-Id10059') were not found in the input data columns. Their values will be NA.
[?]          '-Id10019' is missing, which affects the creation of  column(s) 'FEMALE', and 'MALE'
[?]          '-Id10059' is missing, which affects the creation of  column(s) 'MARRIED'
[?]          '-ageInDaysNeonate' is missing, which affects the creation of  column(s) 'DIED_D1', 'DIED_D23', 'DIED_D36', 'DIED_W1', and 'NEONATE'
0      y       .        .
1      y       .        .
2      y       .        .
3      y       .        .
4      y       .        .
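The check behind these warnings can be approximated as follows. This is a sketch under stated assumptions, not the library's code: the helper find_missing_sources and the toy data are hypothetical, and a source column ID is assumed to match an input column when the column name ends with that ID.

```python
from typing import Dict, List


def find_missing_sources(mapping: Dict[str, List[str]],
                         data_columns: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper.

    mapping maps each source column ID to the new columns built from it.
    Returns the source IDs with no matching input column, together with
    the new columns that would therefore be filled with NA.
    """
    missing = {}
    for source_id, new_columns in mapping.items():
        if not any(column.endswith(source_id) for column in data_columns):
            missing[source_id] = sorted(new_columns)
    return missing


# Toy mapping and column names, invented for illustration only.
toy_mapping = {
    "-Id10019": ["FEMALE", "MALE"],
    "-Id10059": ["MARRIED"],
    "-Id10004": ["EXAMPLE_COLUMN"],
}
toy_columns = ["form-Id10004"]

missing_sources = find_missing_sources(toy_mapping, toy_columns)
for source_id, affected in missing_sources.items():
    print(f"{source_id!r} is missing, which affects {affected}")
```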

transform will also accept mapping configurations with missing values, where new columns are listed but have no source columns. These new columns will be created so that the final result has the correct expected columns for the algorithm, but filled with NA values to indicate the lack of information. If preserve_na is set to False, then the NA values will also be False.

This situation is common between certain questionnaire sources and algorithms. For example, in the mapping from the PHRMC Short questionnaire to InterVA5, there are 107 InterVA5 variables that are listed in the mapping configuration to be created, but have no corresponding question in PHRMC Short.

For example, variables i004a and i004b have no specifications in the mapping below. They are still listed under “New Column Name” so CrossVA knows that they should be created in the final result, but because they have no logic defined, they will be left as their default value of NA.

>>> phrmc_to_interva5 = pd.read_csv('resources/mapping_configuration_files/PHRMCShort_to_InterVA5.csv')
>>> phrmc_to_interva5.iloc[:5,[0,2,4,-1]]
  New Column Name Source Column ID Relationship Meta: Notes
0           i004a              NaN          NaN   Not asked
1           i004b              NaN          NaN   Not asked
2           i019a          gen_5_2           eq         NaN
3           i019b          gen_5_2           eq         NaN
4           i022a         gen_5_4h           ge         NaN

The transform function will warn the user of this behavior.

>>> transform(phrmc_to_interva5, "resources/sample_data/PHRMC_mock_data_1.csv").iloc[:5,:5]
Validating Mapping Configuration . . .
[?]      124 new column(s) listed but not defined in Mapping Configuration detected. These ('i004a', 'i004b', 'i059o', 'i082o', 'i087o', 'i091o', 'i092o', 'i093o', 'i094o', 'i095o', etc) will be treated as NA.
Validating Mapping-Data Relationship . . .
[?]      9 (5.7%) expected source column IDs listed in mapping file ('child_6_2', 'child_4_4', 'child_4_20', 'child_4_7a', 'child_4_40', 'child_4_28', 'child_4_30', 'child_1_5a', and 'child_5_1') were not found in the input data columns. Their values will be NA.
[?]      'child_1_5a' is missing, which affects the creation of  column(s) 'i358a'
[?]      'child_4_20' is missing, which affects the creation of  column(s) 'i171o'
[?]      'child_4_28' is missing, which affects the creation of  column(s) 'i208o'
[?]      'child_4_30' is missing, which affects the creation of  column(s) 'i233o'
[?]      'child_4_4' is missing, which affects the creation of  column(s) 'i150a'
[?]      'child_4_40' is missing, which affects the creation of  column(s) 'i200o'
[?]      'child_4_7a' is missing, which affects the creation of  column(s) 'i183o'
[?]      'child_5_1' is missing, which affects the creation of  column(s) 'i418o'
[?]      'child_6_2' is missing, which affects the creation of  column(s) 'i130o'
   ID  i004a  i004b  i019a  i019b
0   1      .      .      y      n
1   2      .      .      n      n
2   3      .      .      n      n
3   4      .      .      y      n
4   5      .      .      n      n
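A minimal sketch of how "listed but not defined" columns could be identified is shown below. It uses a list of dictionaries in place of the mapping DataFrame. The helper undefined_new_columns is hypothetical, and None stands in for the NaN cells shown in the configuration excerpt above.

```python
from typing import Dict, List, Optional

# The first five rows of the excerpt above, with None in place of NaN.
rows: List[Dict[str, Optional[str]]] = [
    {"New Column Name": "i004a", "Source Column ID": None, "Relationship": None},
    {"New Column Name": "i004b", "Source Column ID": None, "Relationship": None},
    {"New Column Name": "i019a", "Source Column ID": "gen_5_2", "Relationship": "eq"},
    {"New Column Name": "i019b", "Source Column ID": "gen_5_2", "Relationship": "eq"},
    {"New Column Name": "i022a", "Source Column ID": "gen_5_4h", "Relationship": "ge"},
]


def undefined_new_columns(mapping_rows: List[Dict[str, Optional[str]]]) -> List[str]:
    """Hypothetical helper: new columns that never name a source column."""
    defined = {
        row["New Column Name"]
        for row in mapping_rows
        if row["Source Column ID"] is not None
    }
    undefined: List[str] = []
    for row in mapping_rows:
        name = row["New Column Name"]
        if name not in defined and name not in undefined:
            undefined.append(name)
    return undefined


na_columns = undefined_new_columns(rows)
print(na_columns)  # ['i004a', 'i004b']
```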

However, the mapping-data relationship must be valid. For example, if the source column IDs are not unique for the input data - that is, if multiple columns in the input data contain the same source ID - then validation will fail.

For example, bad_data contains columns named A-Id10004 and B-Id10004, but the 2016 WHO mapping is looking for just -Id10004 as a source ID. CrossVA cannot tell which column should be used, so validation fails.

>>> bad_data = pd.read_csv("resources/sample_data/2016WHO_bad_data_1.csv")
>>> transform(("2016WHOv151", "InSilicoVA"), bad_data)
Validating Mapping-Data Relationship . . .
[!]      1 source column IDs ('-Id10004') were found multiple times in the input data. Each source column ID should only occur once as part of an input data column name. It should be a unique identifier at the end of an input data column name. Source column IDs are case sensitive. Please revise your mapping configuration or your input data so that this condition is satisfied.
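The uniqueness condition in this error message can be checked before calling transform. The following is only a sketch: find_ambiguous_sources is a hypothetical helper, and it assumes, as the message describes, that a source column ID matches any input column whose name ends with that ID, compared case-sensitively.

```python
from typing import Dict, List


def find_ambiguous_sources(source_ids: List[str],
                           data_columns: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: source IDs matched by more than one input column."""
    ambiguous = {}
    for source_id in source_ids:
        # str.endswith is case sensitive, like the matching described above.
        matches = [column for column in data_columns if column.endswith(source_id)]
        if len(matches) > 1:
            ambiguous[source_id] = matches
    return ambiguous


# Column names modeled on the bad_data example above.
bad_columns = ["A-Id10004", "B-Id10004", "A-Id10019"]
problems = find_ambiguous_sources(["-Id10004", "-Id10019"], bad_columns)
print(problems)  # {'-Id10004': ['A-Id10004', 'B-Id10004']}
```

Renaming or dropping one of the conflicting input columns so that each source ID matches a single column would satisfy the condition in this sketch.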