data_util

Data handling utilities involving dataframes.

Functions

`add_columns`(df, column_list[, value])	Add specified columns to df if not there.
`check_match`(ds1, ds2[, numeric])	Check whether two pandas Series have the same values.
`delete_columns`(df, column_list)	Delete the specified columns from a dataframe.
`delete_rows_by_column`(df, value[, column_list])	Delete rows where columns have this value.
`get_eligible_values`(values, values_included)	Return the items from values that are in values_included, or None if values_included is empty or None.
`get_indices`(df, column, start, stop)
`get_key_hash`(key_tuple)	Calculate a hash key for tuple of values.
`get_new_dataframe`(data)	Get a new dataframe representing a tsv file.
`get_row_hash`(row, key_list)	Get a hash key from key column values for row.
`get_value_dict`(tsv_path[, key_col, value_col])	Get a dictionary of two columns of a dataframe.
`make_info_dataframe`(col_info, selected_col)	Get a dataframe from selected columns.
`reorder_columns`(data, col_order[, skip_missing])	Create a new dataframe with columns reordered.
`replace_values`(df[, values, replace_value, ...])	Replace string values in specified columns.
`separate_values`(values, target_values)	Get target values from the target_values list.
`tuple_to_range`(tuple_list, inclusion)

add_columns(df, column_list, value='n/a')

Add specified columns to df if not there.

Parameters:

df (DataFrame) – Pandas dataframe.
column_list (list) – List of columns to append to the dataframe.
value (str) – Default fill value for the column.
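
The behavior described above can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: the name `add_columns_sketch` is hypothetical, and adding the columns in place (rather than returning a new dataframe) is an assumption.

```python
import pandas as pd


def add_columns_sketch(df, column_list, value="n/a"):
    """Add each column in column_list that df lacks, filled with value.

    Hypothetical stand-in for add_columns; modifying df in place is an assumption.
    """
    for column in column_list:
        if column not in df.columns:
            df[column] = value


events = pd.DataFrame({"onset": [0.5, 1.0], "trial_type": ["go", "stop"]})
# "trial_type" already exists and is left untouched; "response" is added.
add_columns_sketch(events, ["trial_type", "response"])
```

Existing columns keep their values, so the call is safe to repeat with the same column list.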

check_match(ds1, ds2, numeric=False)

Check whether two pandas Series have the same values.

Parameters:

ds1 (Series) – First pandas Series to compare.
ds2 (Series) – Second pandas Series to compare.
numeric (bool) – If True, treat the values as numeric and do a close-to (tolerance-based) comparison.

Returns:

Error messages describing the mismatch, or an empty list if the series match.

Return type:

list
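
As a rough illustration of the comparison described above, the following sketch returns a list of messages that is empty when the series agree. The name `check_match_sketch`, the message wording, and the use of `numpy.isclose` with its default tolerances for the numeric case are all assumptions; the source does not specify them.

```python
import numpy as np
import pandas as pd


def check_match_sketch(ds1, ds2, numeric=False):
    """Return a list of mismatch messages; the list is empty when the series agree.

    Hypothetical stand-in for check_match; messages and tolerances are assumptions.
    """
    if len(ds1) != len(ds2):
        return [f"Series have different lengths: {len(ds1)} and {len(ds2)}"]
    if numeric:
        # Tolerance-based comparison using numpy's default tolerances.
        first = pd.to_numeric(ds1).to_numpy(dtype=float)
        second = pd.to_numeric(ds2).to_numpy(dtype=float)
        same = np.isclose(first, second)
    else:
        # Exact, position-by-position comparison.
        same = np.array([a == b for a, b in zip(ds1, ds2)], dtype=bool)
    num_bad = int((~same).sum())
    if num_bad == 0:
        return []
    return [f"Series differ at {num_bad} position(s)"]


# 0.1 + 0.2 is not exactly 0.3 in floating point, but it is close to it.
numeric_errors = check_match_sketch(pd.Series([0.1 + 0.2, 1.0]), pd.Series([0.3, 1.0]), numeric=True)
string_errors = check_match_sketch(pd.Series(["a", "b"]), pd.Series(["a", "c"]))
```

The numeric flag matters for values that went through a round trip to text, where exact equality can fail on the last digit.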

delete_columns(df, column_list)

Delete the specified columns from a dataframe.

Parameters:

df (DataFrame) – Pandas dataframe from which to delete columns.
column_list (list) – List of candidate column names for deletion.

Notes

The deletion of columns is done in place.
This does not raise an error if df does not have a column in the list.

delete_rows_by_column(df, value, column_list=None)

Delete rows where columns have this value.

Parameters:

df (DataFrame) – Pandas dataframe from which to delete rows.
value (str) – Specified value to indicate row should be deleted.
column_list (list) – List of columns to search for value.

Notes

All values are converted to string before testing.
Deletion is done in place.
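
The notes above (string conversion before testing, in-place deletion) can be sketched as follows. The name `delete_rows_by_column_sketch` is hypothetical, and two behaviors are assumptions not stated in the source: a `column_list` of None is taken to mean all columns, and a row is deleted when any searched column matches.

```python
import pandas as pd


def delete_rows_by_column_sketch(df, value, column_list=None):
    """Drop, in place, every row in which a searched column equals value as a string.

    Hypothetical stand-in for delete_rows_by_column.
    Assumptions: None means all columns; a match in any one column deletes the row.
    """
    if column_list is None:
        columns = list(df.columns)
    else:
        columns = [col for col in column_list if col in df.columns]
    for column in columns:
        # Convert to string first so that the number 1 matches the value "1".
        matches = df[column].astype(str) == str(value)
        df.drop(df.index[matches], inplace=True)


table = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "1", "y"]})
# Row 0 matches in column "a" (1 becomes "1"); row 1 matches in column "b".
delete_rows_by_column_sketch(table, "1")
```

Because values are compared as strings, a numeric cell and a text cell with the same printed form are treated alike.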

get_eligible_values(values, values_included)

Return the items from values that are in values_included, or None if values_included is empty or None.

Parameters:

values (list) – List of strings against which to test.
values_included (list) – List of items to be selected from values if they are present.

Returns:

List of selected values, or None if values_included is empty or None.

Return type:

list
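
A minimal pure-Python sketch of the selection described above. The name `get_eligible_values_sketch` is hypothetical, and keeping the selected items in the order they appear in `values` is an assumption.

```python
def get_eligible_values_sketch(values, values_included):
    """Return the items of values found in values_included, or None if there is no filter.

    Hypothetical stand-in for get_eligible_values; result order follows values (assumption).
    """
    if not values_included:
        return None
    included = set(values_included)
    return [item for item in values if item in included]


selected = get_eligible_values_sketch(["onset", "duration", "trial_type"], ["trial_type", "onset", "response"])
no_filter = get_eligible_values_sketch(["onset", "duration"], None)
```

Returning None rather than an empty list lets a caller tell "no filter was given" apart from "the filter matched nothing".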

get_indices(df, column, start, stop)

get_key_hash(key_tuple)

Calculate a hash key for tuple of values.

Parameters:

key_tuple (tuple, list) – The key values in the correct order for lookup.

Returns:

A hash key for the tuple.

Return type:

int
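
One way to obtain an integer key from an ordered collection of values is Python's built-in `hash` applied to a tuple. The sketch below assumes that approach; the name `get_key_hash_sketch` is hypothetical and the source does not say how the library computes the hash.

```python
def get_key_hash_sketch(key_tuple):
    """Return an int hash for an ordered collection of key values.

    Hypothetical stand-in for get_key_hash; using the built-in hash is an assumption.
    """
    # Converting to a tuple lets a list and a tuple with the same items share a key.
    return hash(tuple(key_tuple))


key_from_list = get_key_hash_sketch(["sub-01", "run-1"])
key_from_tuple = get_key_hash_sketch(("sub-01", "run-1"))
```

With this approach the order of the values matters, and because Python salts string hashes per process by default, such keys are stable within one run but should not be stored and compared across runs.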

get_new_dataframe(data)

Get a new dataframe representing a tsv file.

Parameters:

data (DataFrame or str) – DataFrame or filename representing a tsv file.

Returns:

A dataframe containing the contents of the tsv file or, if data was a DataFrame to start with, a new copy of the DataFrame.

Return type:

DataFrame

Raises:

HedFileError – If a filename is given and it cannot be read into a DataFrame.

get_row_hash(row, key_list)

Get a hash key from key column values for row.

Parameters:

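row (Series) – Pandas Series representing the row to be hashed.
key_list (list) – List of the names of the columns whose values form the key.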

Returns:

Hash key constructed from the entries of row in the columns specified by key_list.

Return type:

str

Raises:

HedFileError – If row doesn’t have all the columns in key_list.

get_value_dict(tsv_path, key_col='file_basename', value_col='sampling_rate')

Get a dictionary of two columns of a dataframe.

Parameters:

tsv_path (str) – Path to a tsv file with a header row to be read into a DataFrame.
key_col (str) – Name of the column which should be the key.
value_col (str) – Name of the column which should be the value.

Returns:

Dictionary with key_col values as the keys and the corresponding value_col values as the values.

Return type:

dict

Raises:

HedFileError – If tsv_path does not correspond to a file that can be read into a DataFrame.
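
The mapping described above can be sketched with `pandas.read_csv` and `zip`. This is an illustration under stated assumptions: `get_value_dict_sketch` is a hypothetical name, the sample data is invented for the example, and an in-memory buffer stands in for a tsv file on disk. The library's own error handling (HedFileError) is not reproduced.

```python
import io

import pandas as pd


def get_value_dict_sketch(tsv_source, key_col="file_basename", value_col="sampling_rate"):
    """Read a tab-separated table with a header row and map key_col values to value_col values.

    Hypothetical stand-in for get_value_dict; does not reproduce its error handling.
    """
    df = pd.read_csv(tsv_source, sep="\t")
    return dict(zip(df[key_col], df[value_col]))


# Invented sample contents; io.StringIO stands in for a path to a tsv file.
tsv_text = "file_basename\tsampling_rate\nsub-01_eeg\t250\nsub-02_eeg\t500\n"
rates = get_value_dict_sketch(io.StringIO(tsv_text))
```

If a key value occurs more than once in the key column, a dictionary built this way keeps only the last corresponding value.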

make_info_dataframe(col_info, selected_col)

Get a dataframe from selected columns.

Parameters:

col_info (dict) – Dictionary of dictionaries of column values and counts.
selected_col (str) – Name of the column used as top level key for col_info.

Returns:

A two-column dataframe whose first column contains the values from the dictionary whose key is selected_col and whose second column contains the corresponding counts. The returned value is None if selected_col is not a top-level key in col_info.

Return type:

DataFrame

reorder_columns(data, col_order, skip_missing=True)

Create a new dataframe with columns reordered.

Parameters:

data (DataFrame, str) – Dataframe or filename of dataframe whose columns are to be reordered.
col_order (list) – List of column names in desired order.
skip_missing (bool) – If True, columns in col_order that are missing from data are skipped; otherwise an error is raised.

Returns:

A new reordered dataframe.

Return type:

DataFrame

Raises:

HedFileError – If col_order contains columns not in data and skip_missing is False.
HedFileError – If data corresponds to a filename from which a dataframe cannot be created.
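
The role of skip_missing can be illustrated with a small pandas sketch. Several details here are assumptions rather than documented behavior: the name `reorder_columns_sketch` is hypothetical, `ValueError` is used as a stand-in for HedFileError, only a DataFrame (not a filename) is accepted, and columns of the input that are not listed in `col_order` are dropped from the result.

```python
import pandas as pd


def reorder_columns_sketch(df, col_order, skip_missing=True):
    """Return a new dataframe whose columns follow col_order.

    Hypothetical stand-in for reorder_columns.
    Assumptions: ValueError replaces HedFileError; unlisted columns are dropped.
    """
    missing = [col for col in col_order if col not in df.columns]
    if missing and not skip_missing:
        raise ValueError(f"Columns not in dataframe: {missing}")
    present = [col for col in col_order if col in df.columns]
    # Copy so that the caller's dataframe is never modified.
    return df[present].copy()


events = pd.DataFrame({"duration": [0.2], "onset": [1.5], "trial_type": ["go"]})
# "response" is absent from events and is skipped because skip_missing defaults to True.
reordered = reorder_columns_sketch(events, ["onset", "duration", "response"])
```

Calling the sketch with `skip_missing=False` and the same column list raises instead of silently skipping `response`.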

replace_values(df, values=None, replace_value='n/a', column_list=None)

Replace string values in specified columns.

Parameters:

df (DataFrame) – Dataframe whose values will be replaced.
values (list, None) – List of strings to replace. If None, only empty strings are replaced.
replace_value (str) – String replacement value.
column_list (list, None) – List of columns in which to do replacement. If None all columns are processed.

Returns:

Number of values replaced.

Return type:

int
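
A minimal sketch of the replacement and counting described above, written against string columns only. The name `replace_values_sketch` is hypothetical, and ignoring names in `column_list` that are not in the dataframe is an assumption.

```python
import pandas as pd


def replace_values_sketch(df, values=None, replace_value="n/a", column_list=None):
    """Replace the target strings in place and return how many cells were changed.

    Hypothetical stand-in for replace_values; unknown column names are ignored (assumption).
    """
    targets = [""] if values is None else list(values)
    if column_list is None:
        columns = list(df.columns)
    else:
        columns = [col for col in column_list if col in df.columns]
    num_replaced = 0
    for column in columns:
        mask = df[column].isin(targets)
        if mask.any():
            num_replaced += int(mask.sum())
            df.loc[mask, column] = replace_value
    return num_replaced


table = pd.DataFrame({"a": ["", "x", ""], "b": ["y", "", "z"]})
# With values=None only empty strings are replaced.
count = replace_values_sketch(table)
```

The returned count makes it easy to report how many cells were filled in without comparing the dataframe before and after.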

separate_values(values, target_values)

Get target values from the target_values list.

Parameters:

values (list) – List of values to be tested.
target_values (list) – List of desired values.

tuple_to_range(tuple_list, inclusion)