data_util

Data handling utilities involving dataframes.

Functions

add_columns(df, column_list[, value])

Add specified columns to df if not there.

check_match(ds1, ds2[, numeric])

Check whether two pandas Series have the same values.

delete_columns(df, column_list)

Delete the specified columns from a dataframe.

delete_rows_by_column(df, value[, column_list])

Delete rows where columns have this value.

get_eligible_values(values, values_included)

Return a list of the items from values that are in values_included, or None if values_included is empty or None.

get_key_hash(key_tuple)

Calculate a hash key for tuple of values.

get_new_dataframe(data)

Get a new dataframe representing a tsv file.

get_row_hash(row, key_list)

Get a hash key from key column values for row.

get_value_dict(tsv_path[, key_col, value_col])

Get a dictionary of two columns of a dataframe.

make_info_dataframe(col_info, selected_col)

Get a dataframe from selected columns.

reorder_columns(data, col_order[, skip_missing])

Create a new dataframe with columns reordered.

replace_na(df)

Replace n/a values with np.nan in place, taking care of categorical columns.

replace_values(df[, values, replace_value, ...])

Replace string values in specified columns.

separate_values(values, target_values)

Get target values from the target_values list.

add_columns(df, column_list, value='n/a')

Add specified columns to df if not there.

Parameters:
  • df (DataFrame) – Pandas dataframe.

  • column_list (list) – List of columns to append to the dataframe.

  • value (str) – Default fill value for the column.
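The behavior described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: `add_columns_sketch` is a hypothetical stand-in, and it assumes the columns are added in place, which the entry does not state explicitly.

```python
import pandas as pd


def add_columns_sketch(df, column_list, value="n/a"):
    """Hypothetical stand-in: add each missing column, filled with value."""
    for col in column_list:
        if col not in df.columns:
            df[col] = value  # columns that already exist are left untouched


events = pd.DataFrame({"onset": [0.5, 1.5], "duration": [0.1, 0.2]})
add_columns_sketch(events, ["duration", "trial_type"])
print(events)
```

Here `duration` already exists and keeps its values, while `trial_type` is appended and filled with the default `'n/a'`.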

check_match(ds1, ds2, numeric=False)

Check whether two pandas Series have the same values.

Parameters:
  • ds1 (Series) – First pandas Series to compare.

  • ds2 (Series) – Second pandas Series to compare.

  • numeric (bool) – If True, treat as numeric and do close-to comparison.

Returns:

Error messages indicating the mismatch or empty if the series match.

Return type:

list
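The difference between the exact and the numeric comparison can be illustrated as follows. This is only a sketch under stated assumptions: `check_match_sketch` is a hypothetical stand-in, the message wording is invented for illustration, and the tolerances are the NumPy defaults of `np.isclose`, which may differ from what the library uses.

```python
import numpy as np
import pandas as pd


def check_match_sketch(s1, s2, numeric=False):
    """Hypothetical stand-in: return a list of mismatch messages (empty if equal)."""
    if len(s1) != len(s2):
        return [f"Series differ in length: {len(s1)} vs {len(s2)}"]
    if numeric:
        # Tolerance-based comparison; NumPy default tolerances are an assumption.
        same = np.isclose(s1.to_numpy(dtype=float), s2.to_numpy(dtype=float), equal_nan=True)
    else:
        # Exact comparison of the string forms of the values.
        same = np.array([str(a) == str(b) for a, b in zip(s1, s2)], dtype=bool)
    return [f"Position {i}: {s1.iloc[i]!r} != {s2.iloc[i]!r}" for i in np.flatnonzero(~same)]


a = pd.Series([1.0, 2.0])
b = pd.Series([1.0, 2.0 + 1e-12])
numeric_errors = check_match_sketch(a, b, numeric=True)  # tiny difference is tolerated
exact_errors = check_match_sketch(a, b)                  # tiny difference is reported
```

An empty list signals a match, so callers can test the result directly with `if not errors:`.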

delete_columns(df, column_list)

Delete the specified columns from a dataframe.

Parameters:
  • df (DataFrame) – Pandas dataframe from which to delete columns.

  • column_list (list) – List of candidate column names for deletion.

Notes

  • The deletion of columns is done in place.

  • This does not raise an error if df does not have a column in the list.
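The two notes above (in-place deletion, no error for absent columns) can be sketched with plain pandas. `delete_columns_sketch` is a hypothetical stand-in written for illustration, not the library's code.

```python
import pandas as pd


def delete_columns_sketch(df, column_list):
    """Hypothetical stand-in: drop the listed columns that exist, in place."""
    present = [col for col in column_list if col in df.columns]
    df.drop(columns=present, inplace=True)  # names not in df are silently ignored


events = pd.DataFrame({"onset": [0.5], "duration": [0.1], "sample": [125]})
delete_columns_sketch(events, ["sample", "not_a_column"])
print(list(events.columns))  # ['onset', 'duration']
```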

delete_rows_by_column(df, value, column_list=None)

Delete rows where columns have this value.

Parameters:
  • df (DataFrame) – Pandas dataframe from which to delete rows.

  • value (str) – Specified value to indicate row should be deleted.

  • column_list (list) – List of columns to search for value.

Notes

  • All values are converted to string before testing.

  • Deletion is done in place.
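A minimal sketch of the string-based matching described in the notes follows. It is not the library's implementation: `delete_rows_by_column_sketch` is a hypothetical stand-in, and it assumes that `column_list=None` means all columns are searched and that a match in any searched column removes the row.

```python
import pandas as pd


def delete_rows_by_column_sketch(df, value, column_list=None):
    """Hypothetical stand-in: drop rows in place where a searched column equals value."""
    if column_list is None:
        cols = list(df.columns)  # assumption: None means search every column
    else:
        cols = [col for col in column_list if col in df.columns]
    # Compare string forms, mirroring the note that values are converted to string.
    mask = df[cols].astype(str).eq(str(value)).any(axis=1)
    df.drop(index=df.index[mask], inplace=True)


events = pd.DataFrame({"code": [1, 2, 3], "label": ["x", "y", "2"]})
only_code = events.copy()
delete_rows_by_column_sketch(events, 2)                           # matches in code and label
delete_rows_by_column_sketch(only_code, 2, column_list=["code"])  # matches in code only
```

Because of the string conversion, the integer `2` in `code` and the string `"2"` in `label` both count as matches.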

get_eligible_values(values, values_included)

Return a list of the items from values that are in values_included, or None if values_included is empty or None.

Parameters:
  • values (list) – List of strings against which to test.

  • values_included (list) – List of items to be selected from values if they are present.

Returns:

List of selected values, or None if values_included is empty or None.

Return type:

list
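The selection rule can be sketched in pure Python. `get_eligible_values_sketch` is a hypothetical stand-in; the ordering of the result is an assumption, since the entry above does not specify it.

```python
def get_eligible_values_sketch(values, values_included):
    """Hypothetical stand-in for the documented selection behavior."""
    if not values_included:
        return None  # covers both None and an empty list
    included = set(values_included)
    # Assumption: results follow the order of values; the entry does not specify order.
    return [item for item in values if item in included]


columns = ["onset", "duration", "trial_type"]
selected = get_eligible_values_sketch(columns, ["duration", "trial_type", "response"])
unfiltered = get_eligible_values_sketch(columns, None)
print(selected)    # ['duration', 'trial_type']
print(unfiltered)  # None
```

Note that None (no filter supplied) is distinct from an empty list (a filter was supplied but nothing matched).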

get_key_hash(key_tuple)

Calculate a hash key for tuple of values.

Parameters:

key_tuple (tuple, list) – The key values in the correct order for lookup.

Returns:

A hash key for the tuple.

Return type:

int
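One plausible way to obtain an int key from a tuple or list is Python's built-in `hash`. The sketch below assumes that approach; `get_key_hash_sketch` is a hypothetical stand-in and the library may compute the key differently.

```python
def get_key_hash_sketch(key_tuple):
    """Hypothetical stand-in: hash the key values as a tuple (assumes built-in hash)."""
    return hash(tuple(key_tuple))


from_list = get_key_hash_sketch(["sub-01", "run-1"])
from_tuple = get_key_hash_sketch(("sub-01", "run-1"))
lookup = {from_tuple: "row 0"}
print(lookup[from_list])  # row 0
```

If the built-in `hash` is indeed what is used, keys containing strings vary between Python processes unless `PYTHONHASHSEED` is fixed, so they are suitable for in-memory lookup but not for storage.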

get_new_dataframe(data)

Get a new dataframe representing a tsv file.

Parameters:

data (DataFrame or str) – DataFrame or filename representing a tsv file.

Returns:

A dataframe containing the contents of the tsv file or, if data was a DataFrame to start with, a new copy of that DataFrame.

Return type:

DataFrame

Raises:

HedFileError

  • If a filename is given and it cannot be read into a DataFrame.
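The two input cases (an existing DataFrame versus a filename) can be sketched as below. This is an illustration only: `get_new_dataframe_sketch` is a hypothetical stand-in, the `read_csv` options are assumptions, and a plain `ValueError` stands in for the documented `HedFileError`.

```python
import os
import tempfile

import pandas as pd


def get_new_dataframe_sketch(data):
    """Hypothetical stand-in: copy a DataFrame or read a tsv file into a new one."""
    if isinstance(data, pd.DataFrame):
        return data.copy()
    try:
        return pd.read_csv(data, sep="\t")  # read options are an assumption
    except Exception as err:
        # The documented function raises HedFileError; ValueError is a stand-in here.
        raise ValueError(f"Cannot read {data!r} as a tsv file") from err


original = pd.DataFrame({"onset": [0.5, 1.5], "trial_type": ["go", "stop"]})
duplicate = get_new_dataframe_sketch(original)
duplicate.loc[0, "trial_type"] = "changed"  # does not affect original

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "events.tsv")
    original.to_csv(tsv_path, sep="\t", index=False)
    from_file = get_new_dataframe_sketch(tsv_path)
```

Returning a copy means callers can modify the result without changing the DataFrame they passed in.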

get_row_hash(row, key_list)

Get a hash key from key column values for row.

Parameters:
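  • row (Series) – Pandas Series representing the row from which the key is built.

  • key_list (list) – List of the column names whose values form the key.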

Returns:

Hash key constructed from the entries of row in the columns specified by key_list.

Return type:

str

Raises:

HedFileError

  • If row does not have all the columns in key_list.

get_value_dict(tsv_path, key_col='file_basename', value_col='sampling_rate')

Get a dictionary of two columns of a dataframe.

Parameters:
  • tsv_path (str) – Path to a tsv file with a header row to be read into a DataFrame.

  • key_col (str) – Name of the column which should be the key.

  • value_col (str) – Name of the column which should be the value.

Returns:

Dictionary with key_col values as the keys and the corresponding value_col values as the values.

Return type:

dict

Raises:

HedFileError

  • When tsv_path does not correspond to a file that can be read into a DataFrame.
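The mapping from two columns to a dictionary can be sketched as follows. `get_value_dict_sketch` is a hypothetical stand-in; the file contents are invented for illustration, and the value types produced by the real function may differ (for example, strings instead of integers).

```python
import os
import tempfile

import pandas as pd


def get_value_dict_sketch(tsv_path, key_col="file_basename", value_col="sampling_rate"):
    """Hypothetical stand-in: map key_col entries to value_col entries."""
    df = pd.read_csv(tsv_path, sep="\t")
    return dict(zip(df[key_col], df[value_col]))


with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "rates.tsv")
    with open(tsv_path, "w", encoding="utf-8") as fp:
        fp.write("file_basename\tsampling_rate\n")
        fp.write("sub-01_task-rest\t250\n")
        fp.write("sub-02_task-rest\t500\n")
    rates = get_value_dict_sketch(tsv_path)

print(rates["sub-01_task-rest"])
```

If key_col contains repeated values, this sketch keeps the last one; how the library treats duplicates is not stated in the entry.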

make_info_dataframe(col_info, selected_col)

Get a dataframe from selected columns.

Parameters:
  • col_info (dict) – Dictionary of dictionaries of column values and counts.

  • selected_col (str) – Name of the column used as top level key for col_info.

Returns:

A two-column dataframe whose first column contains the values from the dictionary keyed by selected_col and whose second column contains the corresponding counts. The returned value is None if selected_col is not a top-level key in col_info.

Return type:

DataFrame
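The shape of col_info and of the result can be illustrated with a short sketch. `make_info_dataframe_sketch` is a hypothetical stand-in, and the output column names (the selected column's name and `count`) are assumptions made for this example.

```python
import pandas as pd


def make_info_dataframe_sketch(col_info, selected_col):
    """Hypothetical stand-in: build a two-column frame of values and counts."""
    counts = col_info.get(selected_col)
    if counts is None:
        return None  # selected_col is not a top-level key
    # Column names are assumptions; the entry does not name them.
    return pd.DataFrame({selected_col: list(counts.keys()), "count": list(counts.values())})


col_info = {
    "trial_type": {"go": 12, "stop": 4},
    "response": {"left": 9, "right": 7},
}
info = make_info_dataframe_sketch(col_info, "trial_type")
missing = make_info_dataframe_sketch(col_info, "not_a_column")
print(info)
```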

reorder_columns(data, col_order, skip_missing=True)

Create a new dataframe with columns reordered.

Parameters:
  • data (DataFrame, str) – Dataframe or filename of dataframe whose columns are to be reordered.

  • col_order (list) – List of column names in desired order.

  • skip_missing (bool) – If True, columns in col_order that are missing from data are skipped; otherwise an error is raised.

Returns:

A new reordered dataframe.

Return type:

DataFrame

Raises:

HedFileError

  • If col_order contains columns not in data and skip_missing is False.

  • If data corresponds to a filename from which a dataframe cannot be created.
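The effect of skip_missing can be sketched for the DataFrame case. `reorder_columns_sketch` is a hypothetical stand-in: it handles only DataFrame input, uses `KeyError` in place of the documented `HedFileError`, and assumes that only the columns named in col_order are kept, which the entry does not state.

```python
import pandas as pd


def reorder_columns_sketch(data, col_order, skip_missing=True):
    """Hypothetical stand-in: return a new frame with columns in col_order."""
    missing = [col for col in col_order if col not in data.columns]
    if missing and not skip_missing:
        # The documented function raises HedFileError; KeyError is a stand-in here.
        raise KeyError(f"Columns not in data: {missing}")
    present = [col for col in col_order if col in data.columns]
    return data[present].copy()  # assumption: unlisted columns are not kept


events = pd.DataFrame({"onset": [0.5], "duration": [0.1], "trial_type": ["go"]})
reordered = reorder_columns_sketch(events, ["trial_type", "onset", "response", "duration"])
print(list(reordered.columns))  # ['trial_type', 'onset', 'duration']
```

With `skip_missing=False`, the same call would fail because `response` is not a column of `events`.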

replace_na(df)

Replace n/a values with np.nan in place, taking care of categorical columns.
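Categorical columns need separate handling because 'n/a' may be one of the categories rather than an ordinary cell value. The sketch below shows one way to treat both cases with pandas; `replace_na_sketch` is a hypothetical stand-in, not the library's implementation.

```python
import pandas as pd


def replace_na_sketch(df):
    """Hypothetical stand-in: turn 'n/a' entries into missing values, in place."""
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # Removing the category sets the affected cells to NaN.
            if "n/a" in df[col].cat.categories:
                df[col] = df[col].cat.remove_categories("n/a")
        else:
            # mask() puts a missing value wherever the condition is True.
            df[col] = df[col].mask(df[col] == "n/a")


events = pd.DataFrame({
    "trial_type": ["go", "n/a"],
    "response": pd.Series(["n/a", "left"], dtype="category"),
})
replace_na_sketch(events)
print(events.isna())
```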

replace_values(df, values=None, replace_value='n/a', column_list=None)

Replace string values in specified columns.

Parameters:
  • df (DataFrame) – Dataframe whose values will be replaced.

  • values (list, None) – List of strings to replace. If None, only empty strings are replaced.

  • replace_value (str) – String replacement value.

  • column_list (list, None) – List of columns in which to do replacement. If None all columns are processed.

Returns:

Number of values replaced.

Return type:

int
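The defaults described above (empty strings replaced by 'n/a' in all columns) can be sketched as follows. `replace_values_sketch` is a hypothetical stand-in; it assumes the replacement happens in place, which the entry implies but does not state.

```python
import pandas as pd


def replace_values_sketch(df, values=None, replace_value="n/a", column_list=None):
    """Hypothetical stand-in: replace listed strings in place and count replacements."""
    targets = [""] if values is None else list(values)
    if column_list is None:
        cols = list(df.columns)
    else:
        cols = [col for col in column_list if col in df.columns]
    num_replaced = 0
    for col in cols:
        mask = df[col].isin(targets)
        if mask.any():
            num_replaced += int(mask.sum())
            df.loc[mask, col] = replace_value
    return num_replaced


events = pd.DataFrame({"trial_type": ["go", "", "stop"], "response": ["", "left", ""]})
empty_count = replace_values_sketch(events)  # defaults: '' becomes 'n/a' everywhere
stop_count = replace_values_sketch(
    events, values=["stop"], replace_value="halt", column_list=["trial_type"]
)
print(empty_count, stop_count)  # 3 1
```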

separate_values(values, target_values)

Get target values from the target_values list.

Parameters:
  • values (list) – List of values to be tested.

  • target_values (list) – List of desired values.