JAI Python Class#

class Jai(auth_key: str | None = None, environment: str = 'default', env_var: str = 'JAI_AUTH', safe_mode: bool = False)#

General class for communication with the Jai API.

Used as foundation for more complex applications for data validation such as matching tables, resolution of duplicated values, filling missing values and more.

An authorization key is needed to use the Jai API.

Contains the implementation of most functionalities from the API.

Parameters:
  • environment (str) – Jai environment id or name to use. Defaults to “default”.

  • env_var (str) – Name of the Environment Variable to get the value of your auth key. Defaults to “JAI_AUTH”.

  • safe_mode (bool) – When safe_mode is True, responses from Jai API are validated. If the validation fails, the current version you are using is probably incompatible with the current API version. We advise updating it to a newer version. If the problem persists and you are on the latest SDK version, please open an issue so we can work on a fix. Defaults to False.

add_data(name: str, data, batch_size: int = 1048576, frequency_seconds: int = 1)#

Insert raw data and extract their latent representation.

This method should be used when a database has already been set up with setup() and you want to create vector representations of new data using the model already trained for that database.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (pandas.DataFrame or pandas.Series) – Data to be inserted and used for training.

  • batch_size (int) – Size of batch to send the data. Default is 2**20 (1,048,576).

  • frequency_seconds (int) – Time in between each check of status. Default is 1.

Returns:

insert_responses – Dictionary of responses for each batch. Each response contains information of whether or not that particular batch was successfully inserted.

Return type:

dict
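The per-batch responses can be scanned for failures before proceeding. The helper below is a hypothetical sketch; the exact layout of each batch response is an assumption based on the description above, so adapt the check to what your API version actually returns.

```python
# Hypothetical helper: scan add_data's per-batch responses for failed batches.
# The response contents below are mocked for illustration only.
def failed_batches(insert_responses: dict) -> list:
    """Return the keys of batches whose response does not report success."""
    return [
        batch
        for batch, response in insert_responses.items()
        if "Success" not in str(response)
    ]

# Mocked responses in place of j.add_data(...)'s real output:
responses = {0: "Success: batch inserted", 1: "Error: invalid row"}
print(failed_batches(responses))  # [1]
```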

append(name: str, data, batch_size: int = 1048576, frequency_seconds: int = 1)#

Another name for add_data.

delete_database(name: str)#

Delete a database and everything that goes with it (I thank you all).

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

response – Dictionary with the API response.

Return type:

dict

Example

>>> name = 'chosen_name'
>>> j = Jai()
>>> j.delete_database(name=name)
'Bombs away! We nuked database chosen_name!'
delete_ids(name, ids)#

Delete the specified ids from database.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • ids (list) – List of ids to be removed from database.

Returns:

response – Dictionary with the API response.

Return type:

dict

delete_raw_data(name: str)#

Delete raw data. It is good practice to do this after training a model.

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

response – Dictionary with the API response.

Return type:

dict

Example

>>> name = 'chosen_name'
>>> j = Jai()
>>> j.delete_raw_data(name=name)
'All raw data from database 'chosen_name' was deleted!'
describe(name: str)#

Get the database hyperparameters and parameters of a specific database.

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

response – Dictionary with database description.

Return type:

dict

download_vectors(name: str)#

Download vectors from a particular database.

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

vector – Numpy array with all vectors.

Return type:

np.array

Example

>>> name = 'chosen_name'
>>> j = Jai()
>>> vectors = j.download_vectors(name=name)
>>> print(vectors)
[[ 0.03121682  0.2101511  -0.48933393 ...  0.05550333  0.21190546  0.19986008]
[-0.03121682 -0.21015109  0.48933393 ...  0.2267401   0.11074653  0.15064166]
...
[-0.03121682 -0.2101511   0.4893339  ...  0.00758727  0.15916921  0.1226602 ]]
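Once downloaded, the vectors are ordinary numeric arrays and can be compared locally. As a sketch independent of the JAI API, a cosine similarity between two rows can be computed with the standard library (small stand-in vectors are used here in place of real output from download_vectors):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-ins for vectors[0] and vectors[1] from j.download_vectors(name):
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```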
embedding(name: str, data, db_type='TextEdit', batch_size: int = 1048576, frequency_seconds: int = 1, hyperparams=None, overwrite=False)#

Quick embedding for columns with a high number of categories.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (pd.Series) – Data for your text based model.

  • db_type (str, optional) – type of model to be trained. The default is ‘TextEdit’.

  • hyperparams (optional) – See setup documentation for the db_type used.

Returns:

name – Name of the database where the data was embedded.

Return type:

str

environments()#

Return names of available environments.

fields(name: str)#

Get the table fields for a Supervised/SelfSupervised database.

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

response – Dictionary with table fields.

Return type:

dict

Example

>>> name = 'chosen_name'
>>> j = Jai()
>>> fields = j.fields(name=name)
>>> print(fields)
{'id': 0, 'feature1': 0.01, 'feature2': 'string', 'feature3': 0}
fill(name: str, data, column: str, batch_size: int = 1048576, db_type='TextEdit', **kwargs)#

Experimental

Fills the column in data with the most likely value given the other columns.

Only works with categorical columns; it cannot fill missing values in numerical columns.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (pd.DataFrame) – data to fill NaN.

  • column (str) – name of the column to be filled.

  • db_type (str or dict) – Which db_type to use for embedding high-dimensional categorical columns. If a string is provided, all columns are embedded using that db_type; if a dict-like structure {“col1”: “TextEdit”, “col2”: “FastText”, …} is provided, the specified columns are embedded with their respective db_types, and columns not in the dict are embedded with “TextEdit” by default.

  • **kwargs – Extra args for the supervised model. See the setup method.

Returns:

List of dicts with possible filling values for each id with column NaN.

Return type:

list of dicts

Example

>>> import pandas as pd
>>> from jai.processing import predict2df
>>>
>>> j = Jai()
>>> results = j.fill(name, data, COL_TO_FILL)
>>> processed = predict2df(results)
>>> pd.DataFrame(processed).sort_values('id')
          id   sanity_prediction    confidence_level (%)
   0       1             value_1                    70.9
   1       4             value_1                    67.3
   2       7             value_1                    80.2
filters(name: str)#

Gets the valid values of filters.

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

response – List of valid filter values.

Return type:

list of strings

fit(*args, **kwargs)#

Another name for setup.

generate_name(length: int = 8, prefix: str = '', suffix: str = '')#

Generate a random string. You can pass a prefix and/or suffix. In this case, the generated string will be a concatenation of prefix + random + suffix.

Parameters:
  • length (int) – Length for the desired string. Default is 8.

  • prefix (str) – Prefix of your string. Default is empty.

  • suffix (str) – Suffix of your string. Default is empty.

Returns:

A random string.

Return type:

str

Example

>>> j.generate_name()
13636a8b
>>> j.generate_name(length=16, prefix="company")
companyb8bbd445d
static get_auth_key(email: str, firstName: str, lastName: str, company: str = '')#

Request an auth key to use JAI-SDK with.

This method will be deprecated. Please use get_auth_key function.

Parameters:
  • email (str) – A valid email address where the auth key will be sent to.

  • firstName (str) – User’s first name.

  • lastName (str) – User’s last name.

  • company (str) – User’s company.

Returns:

response – Dictionary with whether or not the auth key was created.

Return type:

dict

get_dtype(name: str)#

Return the database type.

Parameters:

name (str) – String with the name of a database in your JAI environment.

Raises:

ValueError – If the name is not valid.

Returns:

db_type – The name of the type of the database.

Return type:

str

ids(name: str, mode: Mode = 'simple')#

Get id information of a given database.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • mode (str, optional) – Level of detail of the id information: ‘simple’, ‘summarized’ or ‘complete’. Default is ‘simple’.

Returns:

response – List with the actual ids (mode: ‘complete’) or a summary of ids (‘simple’/’summarized’) of the given database.

Return type:

list

Example

>>> name = 'chosen_name'
>>> j = Jai()
>>> ids = j.ids(name)
>>> print(ids)
['891 items from 0 to 890']
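If the ‘simple’ summary needs to be turned back into numbers, the string can be parsed locally. The format below is an assumption taken from the example output above and may differ across API versions:

```python
import re

def parse_id_summary(summary: str):
    """Parse a summary like '891 items from 0 to 890' into (count, first, last).

    The string format is assumed from the documented example output.
    """
    match = re.fullmatch(r"(\d+) items from (\d+) to (\d+)", summary)
    if match is None:
        raise ValueError(f"unrecognized summary: {summary!r}")
    count, first, last = map(int, match.groups())
    return count, first, last

print(parse_id_summary("891 items from 0 to 890"))  # (891, 0, 890)
```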
import_database(database_name: str, owner_id: str, owner_email: str, import_name: str | None = None)#
property info#

Get name and type of each database in your environment.

Returns:

Pandas dataframe with name, type, creation date and parent databases of each database in your environment.

Return type:

pandas.DataFrame

Example

>>> j.info
                        db_name           db_type
0                  jai_database              Text
1            jai_selfsupervised    SelfSupervised
2                jai_supervised        Supervised
insert_vectors(data, name, batch_size: int = 1048576, overwrite: bool = False, append: bool = False)#

Insert raw vectors directly into JAI without any need of fit.

Parameters:
  • data (pd.DataFrame, pd.Series or np.ndarray) – Database data to be inserted.

  • name (str) – String with the name of a database in your JAI environment.

  • batch_size (int, optional) – Size of batch to send the data. Default is 2**20 (1,048,576).

  • overwrite (bool, optional) – If True, the vector database is always recreated. Default is False.

  • append (bool, optional) – If True, the inserted data is appended to the existing database. Default is False.

Returns:

insert_responses – Dictionary of responses for each batch. Each response contains information of whether or not that particular batch was successfully inserted.

Return type:

dict
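The batch_size parameter controls how many rows go into each request. The batching this implies can be sketched locally (an illustration only, not JAI internals):

```python
def batch_bounds(n_rows: int, batch_size: int = 2 ** 20):
    """Yield (start, end) row indices for each batch of at most batch_size rows."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 2.5 million rows with the default batch size fit into 3 batches.
bounds = list(batch_bounds(2_500_000))
print(len(bounds))   # 3
print(bounds[0])     # (0, 1048576)
print(bounds[-1])    # (2097152, 2500000)
```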

is_valid(name: str)#

Check if a given name is a valid database name (i.e., if it is in your environment).

Parameters:

name (str) – String with the name of a database in your JAI environment.

Returns:

response – True if name is in your environment. False, otherwise.

Return type:

bool

Example

>>> name = 'chosen_name'
>>> j = Jai()
>>> check_valid = j.is_valid(name)
>>> print(check_valid)
True
match(name: str, data_left, data_right, top_k: int = 100, batch_size: int = 1048576, threshold: float | None = None, original_data: bool = False, db_type='TextEdit', hyperparams=None, overwrite: bool = False)#

Match two datasets with their possible equal values.

Queries data_right to get the most similar results in data_left.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data_left (pd.Series) – data to be matched.

  • data_right (pd.Series) – data to be matched.

  • top_k (int, optional) – Number of similars to query. Default is 100.

  • threshold (float, optional) – Distance threshold to decide if the result is the same item or not. Smaller distances give more strict results. Default is None. The threshold is automatically set by default, but may need manual setting for more accurate results.

  • original_data (bool, optional) – If True, returns the values of the original data along with the ids. Default is False.

  • db_type (str, optional) – type of model to be trained. The default is ‘TextEdit’.

  • hyperparams (dict, optional) – See setup documentation for the db_type used.

  • overwrite (bool, optional) – If True, then the model is always retrained. Default is False.

Returns:

Returns a dataframe with the matching ids of data_left and data_right.

Return type:

pd.DataFrame

Example

>>> import pandas as pd
>>> from jai.processing import process_similar
>>>
>>> j = Jai()
>>> match = j.match(name, data1, data2)
>>> match
          id_left     id_right     distance
   0            1            2         0.11
   1            2            1         0.11
   2            3          NaN          NaN
   3            4          NaN          NaN
   4            5            5         0.15
property names#

Retrieves databases already created for the provided Auth Key.

Return type:

List with the sorted names of the databases created so far.

Example

>>> j.names
['jai_database', 'jai_selfsupervised', 'jai_supervised']
predict(name: str, data, predict_proba: bool = False, as_frame: bool = False, batch_size: int = 1048576, max_workers: int | None = None)#

Predict the output of new data for a given database.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (pd.Series or pd.DataFrame) – Data to be queried for similar inputs in your database.

  • predict_proba (bool) – Whether or not to return the probabilities of each prediction if it is a classification task. Default is False.

  • batch_size (int) – Size of batches to send the data. Default is 2**20 (1,048,576).

  • max_workers (int) – Number of workers to use to parallelize the process. If None, use all workers. Defaults to None.

Returns:

results – List of dictionaries with ‘id’ of the input data and ‘predict’ as predictions for the data passed as input.

Return type:

list of dicts

Example

>>> name = 'chosen_name'
>>> DATA_ITEM = # data in the format of the database
>>> j = Jai()
>>> preds = j.predict(name, DATA_ITEM)
>>> print(preds)
[{"id":0, "predict": "class1"}, {"id":1, "predict": "class0"}]
>>> preds = j.predict(name, DATA_ITEM, predict_proba=True)
>>> print(preds)
[{"id": 0, "predict": {"class0": 0.1, "class1": 0.6, "class2": 0.3}}]
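With predict_proba=True, each prediction is a class-to-probability mapping, so the most likely class can be recovered locally. This sketch assumes the output shape shown in the example above:

```python
def top_class(prediction: dict) -> str:
    """Return the class with the highest predicted probability."""
    return max(prediction, key=prediction.get)

# Probability dict mirroring the documented predict_proba output:
probs = {"class0": 0.1, "class1": 0.6, "class2": 0.3}
print(top_class(probs))  # class1
```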
recommendation(name: str, data: list | ndarray | Index | Series | DataFrame, top_k: int = 5, orient: str = 'nested', filters: List[str] | None = None, max_workers: int | None = None, batch_size: int = 1048576)#

Query a database in search for the top_k most recommended entries for each input data passed as argument.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (list, np.ndarray, pd.Series or pd.DataFrame) – Data to be queried for recommendation in your database.

  • top_k (int) – Number of recommendations to return. Default is 5.

  • orient ("nested" or "flat") – Changes the output format. Default is “nested”.

  • filters (List of strings) – Filters to use on the similarity query. Default is None.

  • max_workers (int) – Number of workers to use to parallelize the process. If None, use all workers. Defaults to None.

  • batch_size (int) – Size of batches to send the data. Default is 2**20 (1,048,576).

Returns:

results – A list with one dictionary for each input value, identified by ‘query_id’, whose ‘result’ entry is a list of the ‘top_k’ most recommended item dictionaries; each contains the ‘id’ from the previously set up database and the ‘distance’ between that ‘id’ and the ‘query_id’.

Return type:

list of dicts

Example

>>> name = 'chosen_name'
>>> DATA_ITEM = # data in the format of the database
>>> TOP_K = 3
>>> j = Jai()
>>> df_index_distance = j.recommendation(name, DATA_ITEM, TOP_K)
>>> print(pd.DataFrame(df_index_distance['recommendation']))
   id  distance
10007       0.0
45568    6995.6
 8382    7293.2
rename(original_name: str, new_name: str)#
report(name, verbose: int = 2, return_report: bool = False)#

Get a report about the training model.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • verbose (int, optional) – Level of description. The default is 2. Use verbose 2 to get the loss graph, verbose 1 to get only the metrics result.

Returns:

Dictionary with the information.

Return type:

dict

resolution(name: str, data, top_k: int = 20, batch_size: int = 1048576, threshold: float | None = None, return_self: bool = True, original_data: bool = False, db_type='TextEdit', hyperparams=None, overwrite=False)#

Experimental

Find possible duplicated values within the data.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (pd.Series) – data to find duplicates.

  • top_k (int, optional) – Number of similars to query. Default is 20.

  • threshold (float, optional) – Distance threshold to decide if the result is the same item or not. Smaller distances give more strict results. Default is None. The threshold is automatically set by default, but may need manual setting for more accurate results.

  • return_self (bool, optional) – If True, returns the ids when resolution_id is the same as id. Default is True.

  • original_data (bool, optional) – If True, returns the values of the original data along with the ids. Default is False.

  • db_type (str, optional) – type of model to be trained. The default is ‘TextEdit’.

  • hyperparams (dict, optional) – See setup documentation for the db_type used.

  • overwrite (bool, optional) – If True, then the model is always retrained. Default is False.

Returns:

Each id with its resolution id. More columns depending on parameters.

Return type:

pd.DataFrame

Example

>>> import pandas as pd
>>> from jai.processing import process_similar
>>>
>>> j = Jai()
>>> results = j.resolution(name, data)
>>> results
  id  resolution_id
   0              0
   1              0
   2              0
   3              3
   4              3
   5              5
sanity(name: str, data, batch_size: int = 1048576, columns_ref: list | None = None, db_type='TextEdit', **kwargs)#

Experimental

Validates consistency in the columns (columns_ref).

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • data (pd.DataFrame) – Data reference of sound data.

  • columns_ref (list, optional) – Columns that can have inconsistencies. By default, all non-numeric columns are used.

  • db_type (str or dict) – Which db_type to use for embedding high-dimensional categorical columns. If a string is provided, all columns are embedded using that db_type; if a dict-like structure {“col1”: “TextEdit”, “col2”: “FastText”, “col3”: “Text”, …} is provided, the specified columns are embedded with their respective db_types, and columns not in the dict are embedded with “TextEdit” by default.

  • kwargs

    Extra args for supervised model except label and split. See setup method. Also:

    • frac (float):

      Percentage of the original dataframe to be shuffled to create invalid samples for each column in columns_ref. Default is 0.1.

    • random_seed (int):

      random seed. Default is 42.

    • cat_threshold (int):

      threshold for processing categorical columns with fasttext model. Default is 512.

    • target (str):

      target validation column. If target is already in data, shuffling is skipped. Default is “is_valid”.

Returns:

Result of whether or not each data entry is valid.

Return type:

list of dicts

Example

>>> import pandas as pd
>>> from jai.processing import predict2df
>>>
>>> j = Jai()
>>> results = j.sanity(name, data)
>>> processed = predict2df(results)
>>> pd.DataFrame(processed).sort_values('id')
          id   sanity_prediction    confidence_level (%)
   0       1               Valid                    70.9
   1       4             Invalid                    67.3
   2       7             Invalid                    80.6
   3      13               Valid                    74.2
setup(name: str, data, db_type: str, batch_size: int = 1048576, max_insert_workers: int | None = None, frequency_seconds: int = 1, verbose: int = 1, **kwargs)#

Insert data and train model. This is JAI’s crème de la crème.

Parameters:
  • name (str) – Database name.

  • data (pandas.DataFrame or pandas.Series) – Data to be inserted and used for training.

  • db_type (str) – Database type. {RecommendationSystem, Supervised, SelfSupervised, Text, FastText, TextEdit, Image}

  • batch_size (int) – Size of batch to insert the data. Default is 2**20 (1,048,576).

  • max_insert_workers (int) – Number of workers to use in the insert data process. Default is None.

  • frequency_seconds (int) – Time in between each check of status. Default is 1.

  • verbose (int) – Level of information reported to the user. Default is 1.

  • **kwargs – Parameters that should be passed as a dictionary in compliance with the API methods. In other words, every kwarg argument should be passed as if it were in the body of a POST method. To check all possible kwargs in Jai.setup method, you can check the Fit Kwargs section.

Returns:

  • insert_response (dict) – Dictionary of responses for each data insertion.

  • setup_response (dict) – Setup response telling if the model started training.

Example

>>> name = 'chosen_name'
>>> data = # data in pandas.DataFrame format
>>> j = Jai()
>>> _, setup_response = j.setup(
        name=name,
        data=data,
        db_type="Supervised",
        label={
            "task": "metric_classification",
            "label_name": "my_label"
        }
    )
>>> print(setup_response)
{
    "Task": "Training",
    "Status": "Started",
    "Description": "Training of database chosen_name has started."
}
similar(name: str, data: list | ndarray | Index | Series | DataFrame, top_k: int = 5, orient: str = 'nested', filters: List[str] | None = None, max_workers: int | None = None, batch_size: int = 1048576)#

Query a database in search for the top_k most similar entries for each input data passed as argument.

Parameters:
  • data (list, np.ndarray, pd.Index, pd.Series or pd.DataFrame) – Data to be queried for similar inputs in your database. Use list, np.ndarray or pd.Index for ids; use pd.Series or pd.DataFrame for raw data.

  • top_k (int) – Number of similar items to return. Default is 5.

  • orient ("nested" or "flat") – Changes the output format. Default is “nested”.

  • filters (List of strings) – Filters to use on the similarity query. Default is None.

  • max_workers (int) – Number of workers to use to parallelize the process. If None, use all workers. Defaults to None.

  • batch_size (int) – Size of batches to send the data. Default is 2**20 (1,048,576).

Returns:

results – A list with one dictionary for each input value, identified by ‘query_id’, whose ‘result’ entry is a list of the ‘top_k’ most similar item dictionaries; each contains the ‘id’ from the previously set up database and the ‘distance’ between that ‘id’ and the ‘query_id’.

Return type:

list of dicts

Example

>>> name = 'chosen_name'
>>> DATA_ITEM = # data in the format of the database
>>> TOP_K = 3
>>> j = Jai()
>>> df_index_distance = j.similar(name, DATA_ITEM, TOP_K)
>>> print(pd.DataFrame(df_index_distance['similarity']))
   id  distance
10007       0.0
45568    6995.6
 8382    7293.2
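With orient=“nested” (the default), results follow the ‘query_id’/‘result’ layout described in the Returns section; flattening them into plain rows is a small local step. The dict shape below is taken from that description and mocked for illustration:

```python
def flatten_similar(results):
    """Flatten nested similar() results into (query_id, id, distance) rows."""
    rows = []
    for entry in results:
        for hit in entry["result"]:
            rows.append((entry["query_id"], hit["id"], hit["distance"]))
    return rows

# Mocked nested response matching the documented shape:
nested = [{"query_id": 0, "result": [{"id": 10007, "distance": 0.0},
                                     {"id": 45568, "distance": 6995.6}]}]
print(flatten_similar(nested))
# [(0, 10007, 0.0), (0, 45568, 6995.6)]
```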
status(max_tries=5, patience=5)#

Get the status of your JAI environment when training.

Returns:

response – A dictionary with the current status of the training tasks.

Return type:

dict

Example

>>> j.status()
{
    "Task": "Training",
    "Status": "Completed",
    "Description": "Training of database YOUR_DATABASE has ended."
}
transfer(original_name: str, to_environment: str, new_name: str | None = None, from_environment: str = 'default')#
update_database(name: str, display_name: str | None = None, project: str | None = None)#
user()#

User information.

Returns:

  • userId: str

  • email: str

  • firstName: str

  • lastName: str

  • memberRole: str

  • namespace: str

Return type:

dict

wait_setup(name: str, frequency_seconds: int = 1)#

Wait for the setup (model training) to finish.

Placeholder method for scripts.

Parameters:
  • name (str) – String with the name of a database in your JAI environment.

  • frequency_seconds (int, optional) – Number of seconds between each status check. Default is 1.

Return type:

None.