Data Cleaning with JAI#

Here are some applications developed using the basic operations, to allow a more direct approach to the specific goals listed bellow. They should work as standalone without the need of any of the basic operations

Matching values from two datasets#

Match two datasets with their possible equal values. This method matches similar values in between text columns of two databases. It queries data right to get the similar results in data left.

You can find a complete match example here.

>>> data1, data2 = dataframe1['name'], dataframe2['name']
>>>
>>> j = Jai()
>>> match = j.match(name, data1, data2)
>>> match

          id_left     id_right     distance
   0            1            2         0.11
   1            2            1         0.11
   2            3          NaN          NaN
   3            4          NaN          NaN
   4            5            5         0.15

Resolution of duplicated values#

Find possible duplicated values within the data. This method finds similar values in text columns of your database.

You can find a complete resoultion example here.

>>> data = dataframe['name']
>>> j = Jai()
>>> results = j.resolution(name, data)
>>> results
  id  resolution_id
   0              0
   1              0
   2              0
   3              3
   4              3
   5              5

Filling missing values#

Fills the column in data with the most likely value given the other columns.

You can find a complete fill example here.

>>> import pandas as pd
>>> from jai.processing import predict2df
...
>>> j = Jai()
>>> results = j.fill(name, data, COL_TO_FILL)
>>> processed = predict2df(results)
>>> pd.DataFrame(processed).sort_values('id')
          id   sanity_prediction    confidence_level (%)
   0       1             value_1                    70.9
   1       4             value_1                    67.3
   2       7             value_1                    80.2

Check data sanity#

Validates consistency in the columns (columns_ref).

You can find a complete sanity example here.

>>> import pandas as pd
>>> from jai.processing import predict2df
...
>>> j = Jai()
>>> results = j.sanity(name, data)
>>> processed = predict2df(results)
>>> pd.DataFrame(processed).sort_values('id')
          id   sanity_prediction    confidence_level (%)
   0       1               Valid                    70.9
   1       4             Invalid                    67.3
   2       7             Invalid                    80.6
   3      13               Valid                    74.2