data_util
Data handling utilities involving dataframes.
Functions
add_columns | Add specified columns to df if not there.
check_match | Check two Pandas data series have the same values.
delete_columns | Delete the specified columns from a dataframe.
delete_rows_by_column | Delete rows where columns have this value.
get_eligible_values | Return a list of the items from values that are in values_included or None if no values_included.
get_key_hash | Calculate a hash key for tuple of values.
get_new_dataframe | Get a new dataframe representing a tsv file.
get_row_hash | Get a hash key from key column values for row.
get_value_dict | Get a dictionary of two columns of a dataframe.
make_info_dataframe | Get a dataframe from selected columns.
reorder_columns | Create a new dataframe with columns reordered.
| Replace (in place) the n/a with np.nan taking care of categorical columns.
replace_values | Replace string values in specified columns.
| Get target values from the target_values list.
- add_columns(df, column_list, value='n/a')
Add the specified columns to df if they are not already present.
- Parameters:
df (DataFrame) – Pandas dataframe.
column_list (list) – List of columns to append to the dataframe.
value (str) – Default fill value for the column.
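The behavior described for add_columns can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name `add_missing_columns` is hypothetical, and it assumes the columns are added in place.

```python
import pandas as pd


def add_missing_columns(df, column_list, value="n/a"):
    """Hypothetical stand-in: add each missing column, filled with value, in place."""
    for column in column_list:
        if column not in df.columns:
            df[column] = value  # existing columns are left untouched


df = pd.DataFrame({"onset": [0.5, 1.5]})
add_missing_columns(df, ["onset", "trial_type"])
print(list(df.columns))
```

Here "onset" keeps its values and only "trial_type" is created with the fill value.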
- check_match(ds1, ds2, numeric=False)
Check whether two pandas Series have the same values.
- Parameters:
ds1 (Series) – Pandas Series to check.
ds2 (Series) – Pandas Series to check.
numeric (bool) – If True, treat the values as numeric and do a close-to comparison.
- Returns:
Error messages indicating the mismatch or empty if the series match.
- Return type:
list
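The difference between exact and close-to comparison can be sketched as follows. The helper `series_mismatches` and its message wording are hypothetical, and the use of `numpy.isclose` with its default tolerances is an assumption about what "close-to" means here.

```python
import numpy as np
import pandas as pd


def series_mismatches(ds1, ds2, numeric=False):
    """Hypothetical stand-in: return a list of messages, empty if the Series match."""
    if len(ds1) != len(ds2):
        return [f"Series have different lengths: {len(ds1)} and {len(ds2)}"]
    if numeric:
        # Tolerant comparison for floating-point data.
        same = np.isclose(ds1.to_numpy(dtype=float), ds2.to_numpy(dtype=float))
    else:
        # Exact comparison of the string forms of the values.
        same = ds1.astype(str).to_numpy() == ds2.astype(str).to_numpy()
    if same.all():
        return []
    return [f"Series differ at {int((~same).sum())} position(s)"]


exact = series_mismatches(pd.Series(["a", "b"]), pd.Series(["a", "c"]))
close = series_mismatches(pd.Series([0.1 + 0.2]), pd.Series([0.3]), numeric=True)
```

With numeric=True, the rounding difference between 0.1 + 0.2 and 0.3 is not reported as a mismatch.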
- delete_columns(df, column_list)
Delete the specified columns from a dataframe.
- Parameters:
df (DataFrame) – Pandas dataframe from which to delete columns.
column_list (list) – List of candidate column names for deletion.
Notes
The deletion of columns is done in place.
This does not raise an error if df does not have a column in the list.
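The two notes above (in-place deletion, no error for absent columns) correspond to options of the standard pandas `DataFrame.drop` method. The sketch below shows that pandas behavior directly; whether delete_columns is implemented this way is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"onset": [0.5], "duration": [1.0], "response": ["left"]})

# errors="ignore" skips names that are not columns of df;
# inplace=True modifies df rather than returning a new dataframe.
df.drop(columns=["response", "not_a_column"], errors="ignore", inplace=True)
print(list(df.columns))
```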
- delete_rows_by_column(df, value, column_list=None)
Delete rows where columns have this value.
- Parameters:
df (DataFrame) – Pandas dataframe from which to delete rows.
value (str) – Specified value to indicate row should be deleted.
column_list (list) – List of columns to search for value.
Notes
All values are converted to string before testing.
Deletion is done in place.
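A rough sketch of this kind of row deletion in plain pandas follows. The helper `drop_rows_with_value` is hypothetical, and it makes two assumptions the description does not settle: a row is dropped if any of the listed columns holds the value, and a column_list of None means all columns.

```python
import pandas as pd


def drop_rows_with_value(df, value, column_list=None):
    """Hypothetical stand-in: drop rows, in place, where a listed column equals value."""
    columns = list(df.columns) if column_list is None else column_list
    columns = [col for col in columns if col in df.columns]
    # Compare string forms, as in the note that all values are converted to string.
    mask = df[columns].astype(str).eq(str(value)).any(axis=1)
    df.drop(index=df.index[mask.to_numpy()], inplace=True)


df = pd.DataFrame({"trial_type": ["go", "n/a", "stop"],
                   "response": ["left", "right", "n/a"]})
drop_rows_with_value(df, "n/a", ["trial_type"])
```

Only the middle row is removed, because the search is limited to "trial_type".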
- get_eligible_values(values, values_included)
Return a list of the items from values that are in values_included or None if no values_included.
- Parameters:
values (list) – List of strings against which to test.
values_included (list) – List of items to be selected from values if they are present.
- Returns:
List of selected values, or None if values_included is empty or None.
- Return type:
list or None
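The selection rule can be illustrated in pure Python. The function `eligible_values` is a hypothetical stand-in, and preserving the order of values in the result is an assumption.

```python
def eligible_values(values, values_included):
    """Hypothetical stand-in: items of values that also appear in values_included."""
    if not values_included:  # covers both None and an empty list
        return None
    return [item for item in values if item in values_included]


selected = eligible_values(["onset", "duration", "trial_type"], ["trial_type", "onset"])
nothing = eligible_values(["onset", "duration"], [])
```

Note the distinction between the two "empty" outcomes: an empty or missing values_included gives None, while a non-empty values_included with no overlap gives an empty list.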
- get_key_hash(key_tuple)
Calculate a hash key for tuple of values.
- Parameters:
key_tuple (tuple, list) – The key values in the correct order for lookup.
- Returns:
A hash key for the tuple.
- Return type:
int
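One simple way to obtain an integer key from an ordered group of values is Python's built-in `hash` applied to a tuple. The sketch below assumes that approach; `key_hash` is a hypothetical name and the library may compute its key differently.

```python
def key_hash(key_tuple):
    """Hypothetical stand-in: hash the key values in order, accepting a tuple or a list."""
    return hash(tuple(key_tuple))


from_list = key_hash(["sub-01", 2])
from_tuple = key_hash(("sub-01", 2))
reversed_key = key_hash((2, "sub-01"))
```

The list and tuple forms give the same key. Because Python randomizes string hashes per process, keys produced this way are suitable for in-memory lookup but not for storing between runs.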
- get_new_dataframe(data)
Get a new dataframe representing a tsv file.
- Parameters:
data (DataFrame or str) – DataFrame or filename representing a tsv file.
- Returns:
A dataframe containing the contents of the tsv file or, if data was a DataFrame to start with, a new copy of the DataFrame.
- Return type:
DataFrame
- Raises:
If a filename is given and it cannot be read into a DataFrame.
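The two input cases (an existing DataFrame or a tsv filename) can be sketched as below. The helper `new_dataframe` is hypothetical, and reading with `pandas.read_csv` and default type inference is an assumption; the library may read columns differently and raises its own error type.

```python
import os
import tempfile

import pandas as pd


def new_dataframe(data):
    """Hypothetical stand-in: copy a DataFrame, or read a tab-separated file."""
    if isinstance(data, pd.DataFrame):
        return data.copy()
    return pd.read_csv(data, sep="\t")


original = pd.DataFrame({"onset": [0.5, 1.5], "trial_type": ["go", "stop"]})
copied = new_dataframe(original)
copied.loc[0, "trial_type"] = "changed"  # the original is not affected

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "events.tsv")
    original.to_csv(tsv_path, sep="\t", index=False)
    loaded = new_dataframe(tsv_path)
```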
- get_row_hash(row, key_list)
Get a hash key from key column values for row.
- Parameters:
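row (Series) – Pandas Series representing one row of a dataframe.
key_list (list) – List of the column names whose values in row make up the key.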
- Returns:
Hash key constructed from the entries of row in the columns specified by key_list.
- Return type:
str
- Raises:
HedFileError – If row does not have all the columns in key_list.
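As an illustration of keying rows by a subset of columns, the sketch below hashes the key column values of each row. The helper `row_hash` is hypothetical, it uses Python's built-in `hash` as an assumption, and it raises KeyError where the library documents HedFileError.

```python
import pandas as pd


def row_hash(row, key_list):
    """Hypothetical stand-in: hash the values of row in the key_list columns."""
    missing = [key for key in key_list if key not in row.index]
    if missing:
        # The library documents HedFileError here; KeyError keeps the sketch standalone.
        raise KeyError(f"Row is missing key columns: {missing}")
    return hash(tuple(row[key] for key in key_list))


df = pd.DataFrame({"subject": ["01", "01"], "run": [1, 1], "onset": [0.5, 2.0]})
hashes = [row_hash(row, ["subject", "run"]) for _, row in df.iterrows()]
```

Both rows share the same key values, so they get the same hash even though their "onset" values differ.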
- get_value_dict(tsv_path, key_col='file_basename', value_col='sampling_rate')
Get a dictionary of two columns of a dataframe.
- Parameters:
tsv_path (str) – Path to a tsv file with a header row to be read into a DataFrame.
key_col (str) – Name of the column which should be the key.
value_col (str) – Name of the column which should be the value.
- Returns:
Dictionary with key_col values as the keys and the corresponding value_col values as the values.
- Return type:
dict
- Raises:
When tsv_path does not correspond to a file that can be read into a DataFrame.
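The mapping from one column to another can be sketched with plain pandas. The example reads tab-separated text from memory instead of a file path so that it is self-contained; the data values are made up for illustration, and the library's own reading options may differ.

```python
import io

import pandas as pd

# Illustrative tsv content with a header row.
tsv_text = "file_basename\tsampling_rate\nsub-01_run-1\t250\nsub-01_run-2\t500\n"
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Pair each key_col value with the value_col value from the same row.
value_dict = dict(zip(df["file_basename"], df["sampling_rate"]))
```

If the key column contains duplicates, a dictionary built this way keeps the value from the last matching row.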
- make_info_dataframe(col_info, selected_col)
Get a dataframe from selected columns.
- Parameters:
col_info (dict) – Dictionary of dictionaries of column values and counts.
selected_col (str) – Name of the column used as top level key for col_info.
- Returns:
A two-column dataframe whose first column contains the values from the dictionary keyed by selected_col and whose second column contains the corresponding counts. The returned value is None if selected_col is not a top-level key in col_info.
- Return type:
DataFrame or None
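The shape of col_info and of the result can be illustrated as follows. The helper `info_dataframe` is hypothetical, and the output column names (the selected column name and "count") are assumptions, since the description does not name them.

```python
import pandas as pd


def info_dataframe(col_info, selected_col):
    """Hypothetical stand-in: two-column dataframe of values and their counts."""
    counts = col_info.get(selected_col)
    if counts is None:
        return None
    return pd.DataFrame(list(counts.items()), columns=[selected_col, "count"])


col_info = {"trial_type": {"go": 10, "stop": 4}, "response": {"left": 7, "right": 7}}
info = info_dataframe(col_info, "trial_type")
absent = info_dataframe(col_info, "not_a_column")
```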
- reorder_columns(data, col_order, skip_missing=True)
Create a new dataframe with columns reordered.
- Parameters:
data (DataFrame, str) – Dataframe or filename of dataframe whose columns are to be reordered.
col_order (list) – List of column names in desired order.
skip_missing (bool) – If True, columns in col_order that are missing from data are skipped; otherwise an error is raised.
- Returns:
A new reordered dataframe.
- Return type:
DataFrame
- Raises:
If col_order contains columns not in data and skip_missing is False.
If data corresponds to a filename from which a dataframe cannot be created.
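The role of skip_missing can be sketched for the DataFrame case. The helper `reordered` is hypothetical; it raises KeyError as a placeholder for the library's error, and it assumes that columns of data not named in col_order are left out of the result, which the description does not state.

```python
import pandas as pd


def reordered(df, col_order, skip_missing=True):
    """Hypothetical stand-in: new dataframe with the col_order columns, in that order."""
    missing = [col for col in col_order if col not in df.columns]
    if missing and not skip_missing:
        raise KeyError(f"Columns not in dataframe: {missing}")
    present = [col for col in col_order if col in df.columns]
    return df[present].copy()


df = pd.DataFrame({"onset": [0.5], "duration": [1.0], "trial_type": ["go"]})
new_df = reordered(df, ["trial_type", "sample", "onset"])
```

With the default skip_missing=True, the absent "sample" column is ignored and the original dataframe is left unchanged.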
- replace_values(df, values=None, replace_value='n/a', column_list=None)
Replace string values in specified columns.
- Parameters:
df (DataFrame) – Dataframe whose values will be replaced.
values (list, None) – List of strings to replace. If None, only empty strings are replaced.
replace_value (str) – String replacement value.
column_list (list, None) – List of columns in which to do replacement. If None all columns are processed.
- Returns:
Number of values replaced.
- Return type:
int
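A minimal sketch of counting and replacing string values is shown below. The helper `replace_strings` is hypothetical and assumes the replacement happens in place, which the integer return value suggests but the description does not state.

```python
import pandas as pd


def replace_strings(df, values=None, replace_value="n/a", column_list=None):
    """Hypothetical stand-in: replace matching strings in place and count them."""
    targets = [""] if values is None else values
    columns = list(df.columns) if column_list is None else column_list
    replaced = 0
    for column in columns:
        mask = df[column].isin(targets)
        if mask.any():
            replaced += int(mask.sum())
            df.loc[mask, column] = replace_value
    return replaced


df = pd.DataFrame({"trial_type": ["go", "", "stop"], "response": ["", "left", ""]})
count = replace_strings(df)
```

With values left as None, only the three empty strings are replaced and the count reflects that.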