Creating a new collection with filters

This section demonstrates how to create a collection with filters and how those filters affect similarity queries.

Note: You can only set one filter for each collection.

Note: You can only set a filter when creating a new collection. It’s not possible to add a filter to an existing collection.
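
Given the one-filter limit, it can help to check a `features` dictionary before creating the collection. The helper below is a hypothetical client-side check and is not part of the jai package. It only assumes the `features` layout used in this section, where a filter column is declared as `{"dtype": "filter"}`.

```python
def filter_columns(features):
    """Return the names of the features declared with dtype "filter"."""
    return [name for name, spec in features.items() if spec.get("dtype") == "filter"]


def check_single_filter(features):
    """Return the filter column name (or None); raise if more than one is declared."""
    names = filter_columns(features)
    if len(names) > 1:
        raise ValueError(f"only one filter per collection is allowed, got: {names}")
    return names[0] if names else None


features = {"country": {"dtype": "filter"}}
print(check_single_filter(features))  # country
```

Running a check like this before fitting surfaces a misconfigured `features` argument locally, before any data is sent for processing.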

The dataset used is Wine Reviews from Kaggle. We’ll load only the first 2,000 rows so that the example runs quickly; processing the entire dataset takes around 10 minutes.

[1]:

import pandas as pd

# Let's take a sample here
SAMPLE_SIZE = 2000

# Let's take a look at the dataset
# If you wish to use the full dataset, just remove the nrows argument
df = pd.read_csv("data/wine-reviews/winemag-data-130k-v2.csv", index_col=0, nrows=SAMPLE_SIZE)

df.head()

[1]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

`Text` Example

Here is a simple example of how the filters can be used.

We want to create a collection from the “description” column of our dataset. We’ll be able to perform similarity searches on the descriptions and retrieve the ones with similar meanings.

Now, if we add the column “country” as a filter for this collection, we can filter the results to a country or group of countries of our choice.

First, we’ll see how to set up a collection using the classes from the Task Module, and then use the Jai class to do the same thing.

Text collection with filters using Trainer Class

[2]:

from jai import Trainer

trainer = Trainer("wine_description")
trainer.set_parameters(db_type="Text", features={"country": {"dtype": "filter"}})


Recognized setup args:
- db_type: Text
- features:
  * country:
    - dtype: filter

Note: In the fit_parameters attribute, there’s a feature named “text”. This is only a default placeholder; it is replaced with “description” during the fit process.

[3]:

trainer.fit_parameters

[3]:

{'db_type': 'Text',
 'features': {'text': {'name': 'text', 'dtype': 'text'},
  'country': {'name': 'country', 'dtype': 'filter'}}}

[4]:

df_description = df[["description", "country"]]
query = trainer.fit(df_description, overwrite=True)

Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.84it/s]


Recognized setup args:
- db_type: Text
- features:
  * country:
    - dtype: filter

JAI is working: 100%|██████████|12/12 [00:16]

Here, you can see that the feature “text” has been changed to the correct value.

[5]:

query.describe()["features"]

[5]:

[{'name': 'description', 'dtype': 'text'},
 {'name': 'country', 'feature_id': 'country', 'dtype': 'filter'}]

Here’s the list of filters:

Note: The “_default” filter is always defined. It contains the rows where the filter value is NaN.

[6]:

query.filters()

[6]:

['_default',
 'Italy',
 'Portugal',
 'US',
 'Spain',
 'France',
 'Germany',
 'Argentina',
 'Chile',
 'Australia',
 'Austria',
 'South Africa',
 'New Zealand',
 'Israel',
 'Hungary',
 'Greece',
 'Romania',
 'Mexico',
 'Canada',
 'Turkey']
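
To preview which filter values a column would produce before fitting, you can compute them locally. The sketch below is an approximation based only on the output above: it assumes that missing values are grouped under `_default`, that `_default` is listed first, and that the remaining values appear in order of first occurrence. The function name `preview_filters` is hypothetical and not part of jai.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the usual missing-value markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preview_filters(values):
    """List distinct filter values, with "_default" standing in for missing ones."""
    seen = ["_default"]  # assumed to be always present, as in the output above
    for value in values:
        if not is_missing(value) and value not in seen:
            seen.append(value)
    return seen


countries = ["Italy", "Portugal", "US", float("nan"), "US", "Spain", None]
print(preview_filters(countries))
# ['_default', 'Italy', 'Portugal', 'US', 'Spain']
```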

Here is an example of a regular similarity query, without filters:

[7]:

r = query.similar([0], orient="flat", top_k=10)
q = pd.DataFrame(r) # we'll structure the results in a dataframe for visualization
q

Similar: 100%|██████████| 1/1 [00:00<00:00,  6.10it/s]

[7]:

	id	distance
0	0	0.000000
1	680	1.580637
2	1022	1.692475
3	1314	1.726672
4	1550	1.856125
5	1292	1.876980
6	223	1.935903
7	22	1.939301
8	1036	1.945883
9	1553	1.981671

[8]:

# the query result on the original data
df.loc[q["id"], ["country", "description"]]

[8]:

	country	description
0	Italy	Aromas include tropical fruit, broom, brimston...
680	Spain	Plum, prune and raspberry aromas are solid. Th...
1022	Italy	Made entirely with Insolia, this opens with su...
1314	Italy	This opens with smoke, mint, coconut, red berr...
1550	Italy	Subtly scented, this opens with aromas of citr...
1292	Italy	This vibrant wine opens with varietal aromas o...
223	Italy	Bright and creamy, this savory white offers ar...
22	Italy	Delicate aromas recall white flower and citrus...
1036	Italy	An attractive fragrance of white flower, stone...
1553	Italy	Made with Nero di Troia grapes, this has aroma...

Now the same query, restricted with a filter:

[9]:

q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["France"]))
df.loc[q["id"], ["country", "description"]]

Similar: 100%|██████████| 1/1 [00:00<00:00,  6.03it/s]

[9]:

	country	description
651	France	Subtle notes of clean, fresh lemon zest promis...
911	France	Juicy dark plum, cherry and boysenberry are up...
574	France	This bend of 70% Syrah and 30% Grenache is nev...
82	France	This fruity, sweet wine is immediately attract...
1301	France	Lightly herbaceous, this is a ripe, lively win...
994	France	An enticingly perfumed wine, with its white fl...
879	France	A solid effort, with attractive, balanced blac...
1340	France	Produced from organic grapes, the wine is ripe...
211	France	This is taut and sinewy in profile, but shows ...
448	France	This appealing blend of 50% Roussanne, 30% Gre...

You can also use multiple filters:

[10]:

q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["Spain", "France"]))
df.loc[q["id"], ["country", "description"]]

Similar: 100%|██████████| 1/1 [00:00<00:00,  6.20it/s]

[10]:

	country	description
680	Spain	Plum, prune and raspberry aromas are solid. Th...
821	Spain	Bready aromas feature melon as the main fruit ...
837	Spain	Generic white-fruit and matchstick aromas are ...
1847	Spain	Aromas of raw oak, paint and dark-berry fruits...
1531	Spain	Cherry and blackberry aromas come with spice a...
836	Spain	Apple and melon aromas are standard and light....
1755	Spain	Aromas of rhubarb, herbs and strawberry mark t...
651	France	Subtle notes of clean, fresh lemon zest promis...
5	Spain	Blackberry and raspberry aromas show a typical...
809	Spain	Mild raisin, black cherry, anise and burnt tir...
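
Conceptually, a filtered query restricts the candidates to rows whose filter value is in the requested list and then ranks only those candidates by distance. The brute-force sketch below illustrates that idea with made-up two-dimensional vectors. It is not how JAI implements the search, and `filtered_similar` is a hypothetical name.

```python
import math


def filtered_similar(vectors, labels, query_id, filters=None, top_k=10):
    """Rank items by Euclidean distance to the query, keeping only allowed labels.

    vectors: dict mapping id -> list of floats
    labels:  dict mapping id -> filter value (for example, a country)
    filters: allowed filter values, or None to search every item
    """
    allowed = None if filters is None else set(filters)
    query = vectors[query_id]
    results = []
    for item_id, vector in vectors.items():
        if allowed is not None and labels[item_id] not in allowed:
            continue  # skip items outside the requested filters
        results.append({"id": item_id, "distance": math.dist(query, vector)})
    results.sort(key=lambda r: r["distance"])
    return results[:top_k]


# Toy data: ids, embedding vectors, and the country used as the filter.
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0], 3: [3.0, 4.0]}
labels = {0: "Italy", 1: "Spain", 2: "France", 3: "France"}

print(filtered_similar(vectors, labels, 0, filters=["France"]))
# [{'id': 2, 'distance': 2.0}, {'id': 3, 'distance': 5.0}]
print(filtered_similar(vectors, labels, 0, filters=["Spain", "France"], top_k=2))
# [{'id': 1, 'distance': 1.0}, {'id': 2, 'distance': 2.0}]
```

Note that the query item itself drops out of the results when its own filter value is not in the list, which matches the France-only query above, where id 0 (Italy) no longer appears.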

Text collection with filters using Jai Class

Here is the setup to achieve the same results as before.

Note: The data needs to be treated here, because the Jai class removes null values when creating Text and Image collections.

[11]:

from jai import Jai

j = Jai()

j.fit(
    "wine_description",
    df_description.fillna("_default"),
    db_type="Text",
    features={"country": {"dtype": "filter"}},
    overwrite=True,
)

Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.95it/s]

Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: Text
- features:
  * country:
    - dtype: filter

JAI is working: 100%|██████████|12/12 [00:15]

[11]:

({0: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False}},
 {'Task': 'Training Model',
  'Status': 'Job Created',
  'Description': 'Check status after some time!',
  'kwargs': {'db_type': '"Text"',
   'features': '{"text": {"name": "text", "dtype": "text"}, "country": {"name": "country", "dtype": "filter"}}'}})

[12]:

j.filters("wine_description")

[12]:

['_default',
 'Italy',
 'Portugal',
 'US',
 'Spain',
 'France',
 'Germany',
 'Argentina',
 'Chile',
 'Australia',
 'Austria',
 'South Africa',
 'New Zealand',
 'Israel',
 'Hungary',
 'Greece',
 'Romania',
 'Mexico',
 'Canada',
 'Turkey']

[13]:

j.similar("wine_description", [0], filters=["Italy"], orient="flat")

Similar: 100%|██████████| 1/1 [00:00<00:00,  4.78it/s]

[13]:

[{'query_id': 0, 'id': 0, 'distance': 0.0},
 {'query_id': 0, 'id': 1022, 'distance': 1.6924750804901123},
 {'query_id': 0, 'id': 1314, 'distance': 1.7266716957092285},
 {'query_id': 0, 'id': 1550, 'distance': 1.856124997138977},
 {'query_id': 0, 'id': 1292, 'distance': 1.8769800662994385}]
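
With `orient="flat"`, each result is a separate record that carries its own `query_id`, as in the output above. When querying several ids at once, it can be convenient to regroup those records per query. A minimal sketch, using a few records from the output above as literal input (`group_by_query` is a hypothetical helper, not part of jai):

```python
from collections import defaultdict


def group_by_query(flat_results):
    """Group flat similarity records into {query_id: [(id, distance), ...]}."""
    grouped = defaultdict(list)
    for record in flat_results:
        grouped[record["query_id"]].append((record["id"], record["distance"]))
    return dict(grouped)


flat = [
    {"query_id": 0, "id": 0, "distance": 0.0},
    {"query_id": 0, "id": 1022, "distance": 1.6924750804901123},
    {"query_id": 0, "id": 1314, "distance": 1.7266716957092285},
]
neighbors = group_by_query(flat)
print([item_id for item_id, _ in neighbors[0]])  # [0, 1022, 1314]
```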

Tabular Example

Here is a slightly more complex example of how the filters can be used.

We’ll use the same dataset, but this time we’ll create a SelfSupervised collection using more columns.

[14]:

# the id mapping of the description collection uses the dataframe index
# if you did the preprocessing with the Jai class, you may need to handle the NaN values.
df_tabular = df[["country", "province", "region_1"]].copy()
df_tabular.loc[:, 'id_description'] = df_description.index
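
The `id_description` column links each row to its entry in the `wine_description` collection, so every value in it should match an id in that collection. A quick local check could look like the sketch below. The helper name `missing_parent_ids` and the toy ids are illustrative assumptions; in practice the ids would come from `df_description.index` and `df_tabular["id_description"]`.

```python
def missing_parent_ids(child_ids, parent_ids):
    """Return the child ids that have no match among the parent ids, sorted."""
    return sorted(set(child_ids) - set(parent_ids))


parent_ids = [0, 1, 2, 3]      # e.g. list(df_description.index)
child_ids = [0, 1, 2, 3, 7]    # e.g. list(df_tabular["id_description"])
print(missing_parent_ids(child_ids, parent_ids))  # [7]
```

An empty list means every row points at an existing description.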

Tabular - Filters using Trainer Class

[15]:

trainer = Trainer("wine_tabular")
trainer.set_parameters(
    db_type="SelfSupervised",
    features={"country": {"dtype": "filter"}},
    # since we already processed the description column, why not reuse it:
    pretrained_bases=[{"id_name": "id_description", "db_parent": "wine_description"}],
)


Recognized setup args:
- db_type: SelfSupervised
- features:
  * country:
    - dtype: filter
- pretrained_bases:
  * id_name: id_description
    db_parent: wine_description

[16]:

query = trainer.fit(df_tabular, overwrite=True)

Insert Data: 100%|██████████| 1/1 [00:00<00:00,  5.23it/s]


Recognized setup args:
- db_type: SelfSupervised
- features:
  * country:
    - dtype: filter
- pretrained_bases:
  * id_name: id_description
    db_parent: wine_description

JAI is working: 100%|██████████|20/20 [00:11]


Setup Report:

Best model at epoch: 23 val_loss: 0.73

[17]:

q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["Spain", "France"]))
df.loc[q["id"], ["country", "description"]]

Similar: 100%|██████████| 1/1 [00:00<00:00,  5.99it/s]

[17]:

	country	description
1340	France	Produced from organic grapes, the wine is ripe...
779	France	This is a powerful, almost concentrated, very ...
1986	France	For a Morgon, this is relatively light, showin...
193	France	A citrus-dominated wine, lime and lemon giving...
1007	France	Concentrated, with white fruits, a strong stre...
1727	France	Laced with acidity, this is a complex, structu...
680	Spain	Plum, prune and raspberry aromas are solid. Th...
958	Spain	Crisp plum and red-bell-pepper aromas lead to ...
1590	France	Now managed organically, this estate has produ...
1751	Spain	Briny lemon-lime aromas are simplistic. This b...

Tabular - Filters using Jai Class

[18]:

j = Jai()

j.fit(
    "wine_tabular",
    df_tabular,
    db_type="SelfSupervised",
    features={"country": {"dtype": "filter"}},
    pretrained_bases=[{"id_name": "id_description", "db_parent": "wine_description"}],
    overwrite=True,
)

Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.76it/s]

Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: SelfSupervised
- features:
  * country:
    - dtype: filter
- pretrained_bases:
  * id_name: id_description
    db_parent: wine_description

JAI is working: 100%|██████████|20/20 [00:10]


Setup Report:

Best model at epoch: 23 val_loss: 0.73

[18]:

({0: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False}},
 {'Task': 'Training Model',
  'Status': 'Job Created',
  'Description': 'Check status after some time!',
  'kwargs': {'db_type': '"SelfSupervised"',
   'features': '{"country": {"name": "country", "dtype": "filter"}}',
   'pretrained_bases': '[{"db_parent": "wine_description", "id_name": "id_description", "embedding_dim": 128, "aggregation_method": "sum"}]'}})
