Creating a new collection with filters#

This section demonstrates how to create a collection with filters and what that implies for similarity queries.

Note: You can only set one filter for each collection.

Note: You can only set filters when creating a new collection. It’s not possible to add a filter to an existing collection.
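As a rough illustration of the first note, a client-side check along these lines could catch a features dictionary that declares more than one filter before anything is sent to JAI. The helper names `filter_columns` and `check_single_filter` are hypothetical and are not part of the jai package; the sketch only assumes the features layout used in this section (a dict whose values carry a `dtype` key).

```python
def filter_columns(features):
    """Return the names of all features declared with dtype "filter"."""
    return [name for name, spec in features.items() if spec.get("dtype") == "filter"]


def check_single_filter(features):
    """Return the single filter column (or None); raise if several are declared.

    Hypothetical pre-check, not part of the jai package.
    """
    found = filter_columns(features)
    if len(found) > 1:
        raise ValueError(f"Only one filter per collection is allowed, got: {found}")
    return found[0] if found else None


features = {"country": {"dtype": "filter"}}
print(check_single_filter(features))
```

Running the check locally is cheap, and it fails before any data is uploaded rather than after.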

The dataset used is Wine Reviews from Kaggle. We’ll work with a sample of the dataset so that the example runs quickly; processing the entire dataset takes around 10 minutes.

[1]:
import pandas as pd

# Let's take a sample here
SAMPLE_SIZE = 2000

# Let's take a look at the dataset
# If you wish to use the full dataset, just remove the nrows argument
df = pd.read_csv("data/wine-reviews/winemag-data-130k-v2.csv", index_col=0, nrows=SAMPLE_SIZE)

df.head()
[1]:
 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks
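Before choosing a column as a filter, it can help to look at how many distinct values it has and whether any are missing. The sketch below does this with plain pandas on a tiny stand-in frame built from the five rows shown above; the `sample` frame is only illustrative, and on the real data you would run the same two calls on `df["country"]`.

```python
import pandas as pd

# Tiny stand-in for the wine dataframe (countries and prices of the rows shown above).
sample = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US", "US"],
        "price": [None, 15.0, 14.0, 13.0, 65.0],
    }
)

# Frequency of each candidate filter value, keeping missing values visible.
counts = sample["country"].value_counts(dropna=False)

# Number of rows without a value in the candidate filter column.
n_missing = int(sample["country"].isna().sum())

print(counts)
print("missing:", n_missing)
```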

Text Example#

Here is a simple example of how the filters can be used.

We want to create a collection from the “description” column of our dataset. This lets us perform similarity searches on the descriptions and retrieve descriptions with similar meanings.

Now, if we add the column “country” as a filter for this collection, we can filter the results to a country or group of countries of our choice.

First, we’ll show how to set up the collection using the classes from the Task Module, and then repeat the same steps with the Jai class.

Text collection with filters using Trainer Class#

[2]:
from jai import Trainer

trainer = Trainer("wine_description")
trainer.set_parameters(db_type="Text", features={"country": {"dtype": "filter"}})


Recognized setup args:
- db_type: Text
- features:
  * country:
    - dtype: filter

Note: The fit_parameters attribute contains a feature named “text”. This is only a default; it is replaced with “description” during the fit process.

[3]:
trainer.fit_parameters
[3]:
{'db_type': 'Text',
 'features': {'text': {'name': 'text', 'dtype': 'text'},
  'country': {'name': 'country', 'dtype': 'filter'}}}
[4]:
df_description = df[["description", "country"]]
query = trainer.fit(df_description, overwrite=True)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.84it/s]

Recognized setup args:
- db_type: Text
- features:
  * country:
    - dtype: filter
JAI is working: 100%|██████████|12/12 [00:16]

Here, you can see that the feature “text” has been renamed to the correct value, “description”.

[5]:
query.describe()["features"]
[5]:
[{'name': 'description', 'dtype': 'text'},
 {'name': 'country', 'feature_id': 'country', 'dtype': 'filter'}]

Here’s the list of filters:

Note: The “_default” filter is always defined. It contains the rows where the filter value is NaN.

[6]:
query.filters()
[6]:
['_default',
 'Italy',
 'Portugal',
 'US',
 'Spain',
 'France',
 'Germany',
 'Argentina',
 'Chile',
 'Australia',
 'Austria',
 'South Africa',
 'New Zealand',
 'Israel',
 'Hungary',
 'Greece',
 'Romania',
 'Mexico',
 'Canada',
 'Turkey']
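To anticipate roughly which values the filter list will contain, one can collect the distinct non-null values of the filter column locally. The sketch below is only an approximation inferred from the output above: the helper `expected_filter_values` is hypothetical, and both the position of “_default” and the ordering of the remaining values are assumptions, not a description of how JAI builds the list.

```python
import pandas as pd


def expected_filter_values(column, default="_default"):
    """Approximate the filter list: the default entry plus distinct non-null values.

    Hypothetical helper; ordering follows first appearance in the column.
    """
    values = column.dropna().unique().tolist()
    return [default] + values


countries = pd.Series(["Italy", "Portugal", None, "US", "US"])
print(expected_filter_values(countries))
```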

An example of a normal similarity query:

[7]:
r = query.similar([0], orient="flat", top_k=10)
q = pd.DataFrame(r) # we'll structure the results in a dataframe for visualization
q
Similar: 100%|██████████| 1/1 [00:00<00:00,  6.10it/s]
[7]:
query_id id distance
0 0 0 0.000000
1 0 680 1.580637
2 0 1022 1.692475
3 0 1314 1.726672
4 0 1550 1.856125
5 0 1292 1.876980
6 0 223 1.935903
7 0 22 1.939301
8 0 1036 1.945883
9 0 1553 1.981671
[8]:
# the query result on the original data
df.loc[q["id"], ["country", "description"]]
[8]:
country description
0 Italy Aromas include tropical fruit, broom, brimston...
680 Spain Plum, prune and raspberry aromas are solid. Th...
1022 Italy Made entirely with Insolia, this opens with su...
1314 Italy This opens with smoke, mint, coconut, red berr...
1550 Italy Subtly scented, this opens with aromas of citr...
1292 Italy This vibrant wine opens with varietal aromas o...
223 Italy Bright and creamy, this savory white offers ar...
22 Italy Delicate aromas recall white flower and citrus...
1036 Italy An attractive fragrance of white flower, stone...
1553 Italy Made with Nero di Troia grapes, this has aroma...
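Looking up the result ids with `df.loc` drops the distance column. If you want the distances next to the original rows, one option is to join the flat results with the dataframe on the id. The sketch below uses three of the result rows shown above and a small stand-in frame; `attach_rows` is a hypothetical helper, not part of the jai package.

```python
import pandas as pd

# Three rows in the flat result format shown above.
flat_results = [
    {"query_id": 0, "id": 0, "distance": 0.000000},
    {"query_id": 0, "id": 680, "distance": 1.580637},
    {"query_id": 0, "id": 1022, "distance": 1.692475},
]

# Stand-in for the original dataframe, indexed by the same ids.
wines = pd.DataFrame({"country": ["Italy", "Spain", "Italy"]}, index=[0, 680, 1022])


def attach_rows(results, data, columns):
    """Join flat similarity results with selected columns of the original data."""
    hits = pd.DataFrame(results)
    # DataFrame.join matches the "id" column of hits against the index of data.
    return hits.join(data[columns], on="id")


joined = attach_rows(flat_results, wines, ["country"])
print(joined)
```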

Now the same query using a filter:

[9]:
q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["France"]))
df.loc[q["id"], ["country", "description"]]
Similar: 100%|██████████| 1/1 [00:00<00:00,  6.03it/s]
[9]:
country description
651 France Subtle notes of clean, fresh lemon zest promis...
911 France Juicy dark plum, cherry and boysenberry are up...
574 France This bend of 70% Syrah and 30% Grenache is nev...
82 France This fruity, sweet wine is immediately attract...
1301 France Lightly herbaceous, this is a ripe, lively win...
994 France An enticingly perfumed wine, with its white fl...
879 France A solid effort, with attractive, balanced blac...
1340 France Produced from organic grapes, the wine is ripe...
211 France This is taut and sinewy in profile, but shows ...
448 France This appealing blend of 50% Roussanne, 30% Gre...

You can also use multiple filters:

[10]:
q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["Spain", "France"]))
df.loc[q["id"], ["country", "description"]]
Similar: 100%|██████████| 1/1 [00:00<00:00,  6.20it/s]
[10]:
country description
680 Spain Plum, prune and raspberry aromas are solid. Th...
821 Spain Bready aromas feature melon as the main fruit ...
837 Spain Generic white-fruit and matchstick aromas are ...
1847 Spain Aromas of raw oak, paint and dark-berry fruits...
1531 Spain Cherry and blackberry aromas come with spice a...
836 Spain Apple and melon aromas are standard and light....
1755 Spain Aromas of rhubarb, herbs and strawberry mark t...
651 France Subtle notes of clean, fresh lemon zest promis...
5 Spain Blackberry and raspberry aromas show a typical...
809 Spain Mild raisin, black cherry, anise and burnt tir...
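The outputs above suggest the following semantics: only rows whose filter value is in the given list are candidates, and the candidates are then ranked by distance. This is also why the query row itself (id 0, Italy) disappears once the filter excludes Italy. The sketch below reproduces that behaviour with a brute-force search in NumPy. It is a conceptual illustration only; `filtered_top_k` is a hypothetical function, and nothing here describes how JAI implements filtering internally.

```python
import numpy as np


def filtered_top_k(query_vec, vectors, labels, allowed, top_k=3):
    """Brute-force nearest neighbours restricted to rows with an allowed label."""
    labels = np.asarray(labels)
    # Keep only the rows whose label is one of the allowed filter values.
    candidate_idx = np.flatnonzero(np.isin(labels, allowed))
    # Euclidean distance from the query to each remaining candidate.
    dists = np.linalg.norm(vectors[candidate_idx] - query_vec, axis=1)
    order = np.argsort(dists)[:top_k]
    return [(int(candidate_idx[i]), float(dists[i])) for i in order]


vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
labels = ["Italy", "Spain", "France", "Spain"]

hits = filtered_top_k(vectors[0], vectors, labels, allowed=["Spain", "France"], top_k=2)
print(hits)
```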

Text collection with filters using Jai Class#

Here is the setup to achieve the same results as before.

Note: The data needs some preprocessing here, because the Jai class removes null values when creating Text and Image collections.

[11]:
from jai import Jai

j = Jai()

j.fit(
    "wine_description",
    df_description.fillna("_default"),
    db_type="Text",
    features={"country": {"dtype": "filter"}},
    overwrite=True,
)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.95it/s]
Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: Text
- features:
  * country:
    - dtype: filter
JAI is working: 100%|██████████|12/12 [00:15]
[11]:
({0: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False}},
 {'Task': 'Training Model',
  'Status': 'Job Created',
  'Description': 'Check status after some time!',
  'kwargs': {'db_type': '"Text"',
   'features': '{"text": {"name": "text", "dtype": "text"}, "country": {"name": "country", "dtype": "filter"}}'}})
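The cell above fills missing values in every column of the dataframe. If you would rather leave the other columns untouched, a narrower variant is to fill only the filter column. The sketch below shows this with plain pandas; `fill_filter_nulls` is a hypothetical helper, and using the string “_default” as the placeholder simply mirrors the cell above.

```python
import pandas as pd


def fill_filter_nulls(data, filter_col, default="_default"):
    """Return a copy of data with nulls replaced only in the filter column."""
    return data.assign(**{filter_col: data[filter_col].fillna(default)})


sample = pd.DataFrame(
    {
        "description": ["Aromas include tropical fruit", "Tart and snappy"],
        "country": ["Italy", None],
    }
)

cleaned = fill_filter_nulls(sample, "country")
print(cleaned)
```

The original frame is left unchanged, so the same data can still be passed to the Trainer class as is.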
[12]:
j.filters("wine_description")
[12]:
['_default',
 'Italy',
 'Portugal',
 'US',
 'Spain',
 'France',
 'Germany',
 'Argentina',
 'Chile',
 'Australia',
 'Austria',
 'South Africa',
 'New Zealand',
 'Israel',
 'Hungary',
 'Greece',
 'Romania',
 'Mexico',
 'Canada',
 'Turkey']
[13]:
j.similar("wine_description", [0], filters=["Italy"], orient="flat")
Similar: 100%|██████████| 1/1 [00:00<00:00,  4.78it/s]
[13]:
[{'query_id': 0, 'id': 0, 'distance': 0.0},
 {'query_id': 0, 'id': 1022, 'distance': 1.6924750804901123},
 {'query_id': 0, 'id': 1314, 'distance': 1.7266716957092285},
 {'query_id': 0, 'id': 1550, 'distance': 1.856124997138977},
 {'query_id': 0, 'id': 1292, 'distance': 1.8769800662994385}]

Tabular Example#

Here is a slightly more complex example of how the filters can be used.

We’ll use the same dataset, but this time create a SelfSupervised collection that uses more columns.

[14]:
# The id mapping of the description collection uses the dataframe index.
# If you did the preprocessing with the Jai class, you may need to handle the NaN values.
df_tabular = df[["country", "province", "region_1"]].copy()
df_tabular.loc[:, 'id_description'] = df_description.index
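Since the `id_description` column is meant to point at rows of the description collection, it may be worth confirming locally that every id actually exists in the parent data before fitting. The sketch below does this with plain pandas on stand-in frames; `missing_parent_ids` is a hypothetical helper, and the check only compares against a local dataframe index, not against the collection stored in JAI.

```python
import pandas as pd


def missing_parent_ids(child, id_col, parent_index):
    """Return the ids in child[id_col] that are absent from parent_index."""
    ids = pd.Index(child[id_col])
    return ids[~ids.isin(parent_index)].unique().tolist()


# Stand-ins for the description data (parent) and the tabular data (child).
parent = pd.DataFrame({"description": ["a", "b", "c"]}, index=[0, 1, 2])
child = pd.DataFrame(
    {"country": ["Italy", "US", "US"], "id_description": [0, 2, 5]}
)

print(missing_parent_ids(child, "id_description", parent.index))
```

An empty list means every row of the tabular data can be linked to a description.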

Tabular - Filters using Trainer Class#

[15]:
trainer = Trainer("wine_tabular")
trainer.set_parameters(
    db_type="SelfSupervised",
    features={"country": {"dtype": "filter"}},
    # since we already processed the description column, we can reuse it:
    pretrained_bases=[{"id_name": "id_description", "db_parent": "wine_description"}],
)

Recognized setup args:
- db_type: SelfSupervised
- features:
  * country:
    - dtype: filter
- pretrained_bases:
  * id_name: id_description
    db_parent: wine_description
[16]:
query = trainer.fit(df_tabular, overwrite=True)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  5.23it/s]

Recognized setup args:
- db_type: SelfSupervised
- features:
  * country:
    - dtype: filter
- pretrained_bases:
  * id_name: id_description
    db_parent: wine_description
JAI is working: 100%|██████████|20/20 [00:11]

Setup Report:

Best model at epoch: 23 val_loss: 0.73
[17]:
q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["Spain", "France"]))
df.loc[q["id"], ["country", "description"]]
Similar: 100%|██████████| 1/1 [00:00<00:00,  5.99it/s]
[17]:
country description
1340 France Produced from organic grapes, the wine is ripe...
779 France This is a powerful, almost concentrated, very ...
1986 France For a Morgon, this is relatively light, showin...
193 France A citrus-dominated wine, lime and lemon giving...
1007 France Concentrated, with white fruits, a strong stre...
1727 France Laced with acidity, this is a complex, structu...
680 Spain Plum, prune and raspberry aromas are solid. Th...
958 Spain Crisp plum and red-bell-pepper aromas lead to ...
1590 France Now managed organically, this estate has produ...
1751 Spain Briny lemon-lime aromas are simplistic. This b...

Tabular - Filters using Jai Class#

[18]:
j = Jai()

j.fit(
    "wine_tabular",
    df_tabular,
    db_type="SelfSupervised",
    features={"country": {"dtype": "filter"}},
    pretrained_bases=[{"id_name": "id_description", "db_parent": "wine_description"}],
    overwrite=True,
)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.76it/s]
Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: SelfSupervised
- features:
  * country:
    - dtype: filter
- pretrained_bases:
  * id_name: id_description
    db_parent: wine_description
JAI is working: 100%|██████████|20/20 [00:10]

Setup Report:

Best model at epoch: 23 val_loss: 0.73
[18]:
({0: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False}},
 {'Task': 'Training Model',
  'Status': 'Job Created',
  'Description': 'Check status after some time!',
  'kwargs': {'db_type': '"SelfSupervised"',
   'features': '{"country": {"name": "country", "dtype": "filter"}}',
   'pretrained_bases': '[{"db_parent": "wine_description", "id_name": "id_description", "embedding_dim": 128, "aggregation_method": "sum"}]'}})