Creating a new collection with filters#
This section demonstrates how to create a collection with filters and what that implies for similarity queries.
Note: You can only set one filter for each collection.
Note: You can only set filters when creating a new collection. It’s not possible to add a filter to an existing collection.
The dataset used is Wine Reviews from Kaggle. We’ll take a sample of the dataset so that this example runs quickly (processing the entire dataset takes around 10 minutes).
[1]:
import pandas as pd
# Let's take a sample here
SAMPLE_SIZE = 2000
# Let's take a look at the dataset
# If you wish to use the full dataset, just remove the nrows argument
df = pd.read_csv("data/wine-reviews/winemag-data-130k-v2.csv", index_col=0, nrows=SAMPLE_SIZE)
df.head()
[1]:
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
Text Example#
Here is a simple example of how the filters can be used.
We want to create a collection with the “description” column of our dataset. We’ll be able to perform similarity searches on the descriptions and retrieve the descriptions with similar meanings.
Now, if we add the column “country” as a filter for this collection, we can filter the results to a country or group of countries of our choice.
First, we’ll set up a collection using the Trainer class from the Task Module, and then use the Jai class to do the same thing.
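To make the idea of a filter concrete, here is a minimal sketch of the concept: a filter restricts the candidate set before the candidates are ranked by distance. This is not JAI's implementation. The function `filtered_top_k`, the toy vectors, and the labels are hypothetical and rely only on NumPy.

```python
import numpy as np

def filtered_top_k(vectors, labels, query_idx, top_k=3, filters=None):
    """Brute-force nearest neighbours, optionally restricted to some labels.

    Hypothetical helper for illustration only; not part of the jai package.
    """
    labels = np.asarray(labels)
    # Euclidean distance from the query vector to every vector
    dists = np.linalg.norm(vectors - vectors[query_idx], axis=1)
    # Keep every row when no filter is given, otherwise only matching labels
    if filters is None:
        mask = np.ones(len(labels), dtype=bool)
    else:
        mask = np.isin(labels, filters)
    candidates = np.flatnonzero(mask)
    # Rank the surviving candidates by distance
    order = candidates[np.argsort(dists[candidates], kind="stable")]
    return [(int(i), float(dists[i])) for i in order[:top_k]]

# Toy data: five 2-D "embeddings", each with a country label
vectors = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.2, 0.1], [2.0, 2.0]])
labels = ["Italy", "Spain", "France", "Italy", "France"]

unfiltered = filtered_top_k(vectors, labels, query_idx=0)
only_france = filtered_top_k(vectors, labels, query_idx=0, filters=["France"])
```

In this sketch, the filtered call never returns the query item itself, because the query is labelled “Italy” and only “France” rows remain as candidates.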
Text collection with filters using Trainer Class#
[2]:
from jai import Trainer
trainer = Trainer("wine_description")
trainer.set_parameters(db_type="Text", features={"country": {"dtype": "filter"}})
Recognized setup args:
- db_type: Text
- features:
* country:
- dtype: filter
Note: The fit_parameters attribute contains a feature named “text”. This is only a default name; it is replaced with “description” during the fit process.
[3]:
trainer.fit_parameters
[3]:
{'db_type': 'Text',
'features': {'text': {'name': 'text', 'dtype': 'text'},
'country': {'name': 'country', 'dtype': 'filter'}}}
[4]:
df_description = df[["description", "country"]]
query = trainer.fit(df_description, overwrite=True)
Insert Data: 100%|██████████| 1/1 [00:00<00:00, 3.84it/s]
Recognized setup args:
- db_type: Text
- features:
* country:
- dtype: filter
JAI is working: 100%|██████████|12/12 [00:16]
Here, you can see that the feature “text” has been changed to the correct value.
[5]:
query.describe()["features"]
[5]:
[{'name': 'description', 'dtype': 'text'},
{'name': 'country', 'feature_id': 'country', 'dtype': 'filter'}]
Here’s the list of filters:
Note: the “_default” filter is always defined; it contains the rows where the filter value is NaN.
[6]:
query.filters()
[6]:
['_default',
'Italy',
'Portugal',
'US',
'Spain',
'France',
'Germany',
'Argentina',
'Chile',
'Australia',
'Austria',
'South Africa',
'New Zealand',
'Israel',
'Hungary',
'Greece',
'Romania',
'Mexico',
'Canada',
'Turkey']
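If you want to anticipate which filter values a column will produce, a rough local check with pandas could look like the sketch below. It assumes, following the note above, that missing values are grouped under “_default”. The toy `countries` series stands in for the real “country” column, and the order returned by the service is not guaranteed to match.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "country" column, including a missing value
countries = pd.Series(["Italy", "Portugal", "US", np.nan, "US", "Spain"], name="country")

# Assumption: missing values are grouped under "_default"
expected_filters = ["_default"] + countries.dropna().unique().tolist()

# Number of rows that would fall under each filter value
counts = countries.fillna("_default").value_counts()
```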
Here is an example of a regular similarity query, without filters:
[7]:
r = query.similar([0], orient="flat", top_k=10)
q = pd.DataFrame(r) # we'll structure the results in a dataframe for visualization
q
Similar: 100%|██████████| 1/1 [00:00<00:00, 6.10it/s]
[7]:
query_id | id | distance | |
---|---|---|---|
0 | 0 | 0 | 0.000000 |
1 | 0 | 680 | 1.580637 |
2 | 0 | 1022 | 1.692475 |
3 | 0 | 1314 | 1.726672 |
4 | 0 | 1550 | 1.856125 |
5 | 0 | 1292 | 1.876980 |
6 | 0 | 223 | 1.935903 |
7 | 0 | 22 | 1.939301 |
8 | 0 | 1036 | 1.945883 |
9 | 0 | 1553 | 1.981671 |
[8]:
# the query result on the original data
df.loc[q["id"], ["country", "description"]]
[8]:
country | description | |
---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... |
680 | Spain | Plum, prune and raspberry aromas are solid. Th... |
1022 | Italy | Made entirely with Insolia, this opens with su... |
1314 | Italy | This opens with smoke, mint, coconut, red berr... |
1550 | Italy | Subtly scented, this opens with aromas of citr... |
1292 | Italy | This vibrant wine opens with varietal aromas o... |
223 | Italy | Bright and creamy, this savory white offers ar... |
22 | Italy | Delicate aromas recall white flower and citrus... |
1036 | Italy | An attractive fragrance of white flower, stone... |
1553 | Italy | Made with Nero di Troia grapes, this has aroma... |
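The lookup above drops the distance column. If you prefer to keep the distances next to the original columns, one option is a pandas merge on the id. The sketch below uses made-up toy values in the same “flat” shape as the query result; `results` and `data` are hypothetical stand-ins for `r` and `df`.

```python
import pandas as pd

# Toy results in the same "flat" shape shown above (values are made up)
results = [
    {"query_id": 0, "id": 0, "distance": 0.0},
    {"query_id": 0, "id": 2, "distance": 1.5},
    {"query_id": 0, "id": 1, "distance": 1.7},
]
# Toy stand-in for the original dataframe, indexed like the collection ids
data = pd.DataFrame(
    {"country": ["Italy", "Spain", "Italy"], "description": ["a", "b", "c"]},
    index=[0, 1, 2],
)

# Join on the id so distances and original columns sit side by side
merged = pd.DataFrame(results).merge(data, left_on="id", right_index=True, how="left")
merged = merged.sort_values("distance").reset_index(drop=True)
```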
Now the same query, using a filter:
[9]:
q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["France"]))
df.loc[q["id"], ["country", "description"]]
Similar: 100%|██████████| 1/1 [00:00<00:00, 6.03it/s]
[9]:
country | description | |
---|---|---|
651 | France | Subtle notes of clean, fresh lemon zest promis... |
911 | France | Juicy dark plum, cherry and boysenberry are up... |
574 | France | This bend of 70% Syrah and 30% Grenache is nev... |
82 | France | This fruity, sweet wine is immediately attract... |
1301 | France | Lightly herbaceous, this is a ripe, lively win... |
994 | France | An enticingly perfumed wine, with its white fl... |
879 | France | A solid effort, with attractive, balanced blac... |
1340 | France | Produced from organic grapes, the wine is ripe... |
211 | France | This is taut and sinewy in profile, but shows ... |
448 | France | This appealing blend of 50% Roussanne, 30% Gre... |
You can also use multiple filters; the results may then come from any of the listed values:
[10]:
q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["Spain", "France"]))
df.loc[q["id"], ["country", "description"]]
Similar: 100%|██████████| 1/1 [00:00<00:00, 6.20it/s]
[10]:
country | description | |
---|---|---|
680 | Spain | Plum, prune and raspberry aromas are solid. Th... |
821 | Spain | Bready aromas feature melon as the main fruit ... |
837 | Spain | Generic white-fruit and matchstick aromas are ... |
1847 | Spain | Aromas of raw oak, paint and dark-berry fruits... |
1531 | Spain | Cherry and blackberry aromas come with spice a... |
836 | Spain | Apple and melon aromas are standard and light.... |
1755 | Spain | Aromas of rhubarb, herbs and strawberry mark t... |
651 | France | Subtle notes of clean, fresh lemon zest promis... |
5 | Spain | Blackberry and raspberry aromas show a typical... |
809 | Spain | Mild raisin, black cherry, anise and burnt tir... |
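This page does not show how the service reacts to a filter value that does not exist in the collection, so it can be worth checking the requested values locally first. The helper `check_filters` below is hypothetical. In practice, `available` would be the list returned by `query.filters()`.

```python
def check_filters(requested, available):
    """Return the requested filters, raising if any is not a known value.

    Hypothetical helper for illustration; not part of the jai package.
    """
    unknown = [f for f in requested if f not in available]
    if unknown:
        raise ValueError(f"Unknown filter value(s): {unknown}")
    return list(requested)

available = ["_default", "Italy", "Portugal", "US", "Spain", "France"]
ok = check_filters(["Spain", "France"], available)

# A misspelled value is caught before any query is sent
try:
    check_filters(["Spian"], available)
    caught = None
except ValueError as err:
    caught = str(err)
```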
Text collection with filters using Jai Class#
Here is the setup to achieve the same results as before.
Note: The data needs to be treated here, because the Jai class removes Null values when creating Text and Image collections.
[11]:
from jai import Jai
j = Jai()
j.fit(
"wine_description",
df_description.fillna("_default"),
db_type="Text",
features={"country": {"dtype": "filter"}},
overwrite=True,
)
Insert Data: 100%|██████████| 1/1 [00:00<00:00, 3.95it/s]
Training might finish early due to early stopping criteria.
Recognized setup args:
- db_type: Text
- features:
* country:
- dtype: filter
JAI is working: 100%|██████████|12/12 [00:15]
[11]:
({0: {'Task': 'Adding new data for tabular setup',
'Status': 'Completed',
'Description': 'Insertion completed.',
'Interrupted': False}},
{'Task': 'Training Model',
'Status': 'Job Created',
'Description': 'Check status after some time!',
'kwargs': {'db_type': '"Text"',
'features': '{"text": {"name": "text", "dtype": "text"}, "country": {"name": "country", "dtype": "filter"}}'}})
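The call above fills every column with “_default”, including “description”. If you would rather not embed a placeholder as text, one possible alternative is to drop the rows without a description and fill only the filter column. The sketch below uses a hypothetical toy dataframe in place of `df_description`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_description
toy = pd.DataFrame({
    "description": ["crisp and dry", None, "ripe plum"],
    "country": ["Italy", "France", np.nan],
})

# Drop rows that have no text to embed, then label missing countries explicitly
prepared = toy.dropna(subset=["description"]).copy()
prepared["country"] = prepared["country"].fillna("_default")
```

Dropping rows changes which ids end up in the collection, so any table that links to it by id has to be filtered the same way.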
[12]:
j.filters("wine_description")
[12]:
['_default',
'Italy',
'Portugal',
'US',
'Spain',
'France',
'Germany',
'Argentina',
'Chile',
'Australia',
'Austria',
'South Africa',
'New Zealand',
'Israel',
'Hungary',
'Greece',
'Romania',
'Mexico',
'Canada',
'Turkey']
[13]:
j.similar("wine_description", [0], filters=["Italy"], orient="flat")
Similar: 100%|██████████| 1/1 [00:00<00:00, 4.78it/s]
[13]:
[{'query_id': 0, 'id': 0, 'distance': 0.0},
{'query_id': 0, 'id': 1022, 'distance': 1.6924750804901123},
{'query_id': 0, 'id': 1314, 'distance': 1.7266716957092285},
{'query_id': 0, 'id': 1550, 'distance': 1.856124997138977},
{'query_id': 0, 'id': 1292, 'distance': 1.8769800662994385}]
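As a consistency check, the five ids returned with the “Italy” filter are the Italian rows of the unfiltered ranking shown earlier, in the same order. The sketch below reproduces that comparison with values copied from the outputs on this page. Post-filtering like this only works when the unfiltered results already contain enough matches, and it is not meant to describe how JAI applies filters internally.

```python
import pandas as pd

# Ids and countries copied from the unfiltered result table earlier on this page
unfiltered = pd.DataFrame({
    "id": [0, 680, 1022, 1314, 1550, 1292, 223, 22, 1036, 1553],
    "country": ["Italy", "Spain", "Italy", "Italy", "Italy",
                "Italy", "Italy", "Italy", "Italy", "Italy"],
})
# Ids returned by the filtered call above
filtered_ids = [0, 1022, 1314, 1550, 1292]

# Keep only Italian rows of the unfiltered ranking and take the first five
is_italy = unfiltered["country"] == "Italy"
post_filtered = unfiltered.loc[is_italy, "id"].head(5).tolist()
```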
Tabular Example#
Here is a slightly more complex example of how the filters can be used.
We’ll use the same dataset, but this time create a SelfSupervised collection using more columns.
[14]:
# the id mapping of the description collection uses the dataframe index
# if you preprocessed the data using the Jai class, you may need to treat the NaN values.
df_tabular = df[["country", "province", "region_1"]].copy()
df_tabular.loc[:, 'id_description'] = df_description.index
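Because the linking column must point at ids that exist in the parent collection, a quick local sanity check can help before fitting. The sketch below uses hypothetical toy frames `parent` and `child` in place of `df_description` and `df_tabular`.

```python
import pandas as pd

# Toy parent data (plays the role of df_description) and child table
parent = pd.DataFrame({"description": ["a", "b", "c"]}, index=[10, 11, 12])
child = pd.DataFrame({"country": ["Italy", "US", "Spain"]}, index=[10, 11, 12])

# Same pattern as above: the linking column carries the parent's index values
child["id_description"] = parent.index

# Sanity check: every linking id must exist in the parent index
missing = child.loc[~child["id_description"].isin(parent.index), "id_description"]
all_linked = missing.empty
```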
Tabular - Filters using Trainer Class#
[15]:
trainer = Trainer("wine_tabular")
trainer.set_parameters(
db_type="SelfSupervised",
features={"country": {"dtype": "filter"}},
# since we already processed the description column, why not reuse it:
pretrained_bases=[{"id_name": "id_description", "db_parent": "wine_description"}],
)
Recognized setup args:
- db_type: SelfSupervised
- features:
* country:
- dtype: filter
- pretrained_bases:
* id_name: id_description
db_parent: wine_description
[16]:
query = trainer.fit(df_tabular, overwrite=True)
Insert Data: 100%|██████████| 1/1 [00:00<00:00, 5.23it/s]
Recognized setup args:
- db_type: SelfSupervised
- features:
* country:
- dtype: filter
- pretrained_bases:
* id_name: id_description
db_parent: wine_description
JAI is working: 100%|██████████|20/20 [00:11]
Setup Report:
Best model at epoch: 23 val_loss: 0.73
[17]:
q = pd.DataFrame(query.similar([0], orient="flat", top_k=10, filters=["Spain", "France"]))
df.loc[q["id"], ["country", "description"]]
Similar: 100%|██████████| 1/1 [00:00<00:00, 5.99it/s]
[17]:
country | description | |
---|---|---|
1340 | France | Produced from organic grapes, the wine is ripe... |
779 | France | This is a powerful, almost concentrated, very ... |
1986 | France | For a Morgon, this is relatively light, showin... |
193 | France | A citrus-dominated wine, lime and lemon giving... |
1007 | France | Concentrated, with white fruits, a strong stre... |
1727 | France | Laced with acidity, this is a complex, structu... |
680 | Spain | Plum, prune and raspberry aromas are solid. Th... |
958 | Spain | Crisp plum and red-bell-pepper aromas lead to ... |
1590 | France | Now managed organically, this estate has produ... |
1751 | Spain | Briny lemon-lime aromas are simplistic. This b... |
Tabular - Filters using Jai Class#
[18]:
j = Jai()
j.fit(
"wine_tabular",
df_tabular,
db_type="SelfSupervised",
features={"country": {"dtype": "filter"}},
pretrained_bases=[{"id_name": "id_description", "db_parent": "wine_description"}],
overwrite=True,
)
Insert Data: 100%|██████████| 1/1 [00:00<00:00, 3.76it/s]
Training might finish early due to early stopping criteria.
Recognized setup args:
- db_type: SelfSupervised
- features:
* country:
- dtype: filter
- pretrained_bases:
* id_name: id_description
db_parent: wine_description
JAI is working: 100%|██████████|20/20 [00:10]
Setup Report:
Best model at epoch: 23 val_loss: 0.73
[18]:
({0: {'Task': 'Adding new data for tabular setup',
'Status': 'Completed',
'Description': 'Insertion completed.',
'Interrupted': False}},
{'Task': 'Training Model',
'Status': 'Job Created',
'Description': 'Check status after some time!',
'kwargs': {'db_type': '"SelfSupervised"',
'features': '{"country": {"name": "country", "dtype": "filter"}}',
'pretrained_bases': '[{"db_parent": "wine_description", "id_name": "id_description", "embedding_dim": 128, "aggregation_method": "sum"}]'}})
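The values under 'kwargs' in the output above are JSON-encoded strings rather than Python objects. If you need to inspect them programmatically, the standard `json` module can decode them. The dictionary below is copied from that output.

```python
import json

# Copy of the 'kwargs' entry shown in the output above
kwargs = {
    "db_type": '"SelfSupervised"',
    "features": '{"country": {"name": "country", "dtype": "filter"}}',
    "pretrained_bases": (
        '[{"db_parent": "wine_description", "id_name": "id_description", '
        '"embedding_dim": 128, "aggregation_method": "sum"}]'
    ),
}

# Each value is a JSON string, so json.loads turns it back into Python objects
decoded = {key: json.loads(value) for key, value in kwargs.items()}
```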