Splitting Data

These are auxiliary functions used to split data before sending it to the API for processing.

split(dataframe, columns, sort: bool = False, prefix: str = 'id_')

Splits the given columns out of a dataframe. Returns one dataframe of unique values for each specified column and replaces each original column with the corresponding index into its new dataframe.

Parameters:

dataframe (pd.DataFrame) – Dataframe to be factored.
columns (str, list of str or dict) – Column or columns to be separated from the dataset. If a column holds multiple values per cell, use a dict with the column name as key and the separator as value; use None as the value if no separator is needed.
sort (bool, optional) – Sort values of the split data.
prefix (str, optional) – Prefix added to the split column names.
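The three accepted shapes of `columns` can be written out as below. This only illustrates the argument forms described in the parameter list; the column names (`category`, `brand`, `tags`) and the `;` separator are hypothetical.

```python
# Hypothetical column names; only the shapes follow the parameter description.

# A single column, given as a string.
columns_single = "category"

# Several columns, given as a list of strings.
columns_many = ["category", "brand"]

# A dict mapping column name -> separator. Here "tags" is assumed to hold
# several values per cell joined by ";", while "category" holds a single
# value per cell and therefore uses None as its separator.
columns_with_separator = {"tags": ";", "category": None}
```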

Returns:

bases (list of pd.DataFrame) – list of dataframes with each base extracted.
dataframe (pd.DataFrame) – original dataframe with columns replaced by the ids of the correlated base.

Example

>>> from jai.utilities import split
...
>>> split_bases, main_base = split(df, ["split_column"])
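To make the idea concrete, here is a minimal sketch of the factoring described above for a single column, written with plain pandas. It is not the jai implementation: `split_column_sketch` and the sample data are hypothetical, and the id column name (`prefix` + column name) is an assumption based on the `prefix` parameter description.

```python
import pandas as pd


def split_column_sketch(dataframe, column, sort=False, prefix="id_"):
    """Illustrative stand-in for the behaviour described above, not jai's code."""
    # codes[i] is the position of row i's value within `uniques`.
    codes, uniques = pd.factorize(dataframe[column], sort=sort)
    # The extracted base: one row per unique value, indexed by its id.
    base = pd.DataFrame({column: uniques})
    # The main dataframe: the original column is replaced by an id column.
    main = dataframe.drop(columns=[column]).copy()
    main[prefix + column] = codes
    return base, main


# Hypothetical sample data.
df = pd.DataFrame(
    {"item": ["a", "b", "c", "d"], "color": ["red", "blue", "red", "green"]}
)
base, main = split_column_sketch(df, "color")
```

In this sketch `base` holds the unique colors in order of first appearance and `main` keeps `item` plus an `id_color` column pointing into `base`. Note that `pd.factorize` encodes missing values as -1, which the real function may handle differently.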

split_recommendation(dataframe, split_config: Dict[str, List[str]], columns: str, as_index: bool | Dict[str, str] = False, sort: bool = False, prefix: str = 'id_')

Splits the data into the 3 datasets used for recommendation. It also splits the given columns, returning the datasets for the pretrained bases and replacing each original column with the corresponding index into its new dataframe.

Parameters:

dataframe (pd.DataFrame) – Dataframe to be factored.
split_config (Dict[str, List[str]]) – Dictionary with length 2.
- keys: db_names of each of the child Recommendation databases created during the Recommendation System’s setup.
- values: list of columns of those databases.
columns (str, list of str or dict) – Column or columns to be separated from the dataset. If a column holds multiple values per cell, use a dict with the column name as key and the separator as value; use None as the value if no separator is needed.
as_index (False or Dict[str, str]) – Either False (the default) or a dictionary with length 2.
- keys: database name.
- values: column name to be used as id for that database.
sort (bool) – Sort values of the split data. See split function.
prefix (str) – Prefix added to the split column names. See split function. Also used as prefix for the id columns of the child Recommendation databases.
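As an illustration of the two dictionary arguments, the shapes described above might look like this. All database and column names are hypothetical, and the sketch assumes that the keys of `as_index` are the same database names used in `split_config`.

```python
# Hypothetical database and column names; only the shapes follow the
# parameter descriptions.

# Two child Recommendation databases, each with its list of columns.
split_config = {
    "users": ["user_id", "age", "city"],
    "products": ["product_id", "category", "price"],
}

# For each database, the column to be used as its id (assumed here to be
# keyed by the same database names as split_config).
as_index = {"users": "user_id", "products": "product_id"}
```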

Returns:

main_bases (list of pd.DataFrame) – list of the recommendation dataframes, with the split columns replaced by the ids of the correlated base.
pretrained_bases (list of pd.DataFrame) – list of dataframes with each base extracted.

Example

>>> from jai.utilities import split_recommendation
...
>>> main_bases, pretrained_bases = split_recommendation(df, split_config, columns)
