Similarity Search#
After fitting your database, you can perform similarity searches in two ways: by querying the ids of items already included in your fitted model, or by using new data.
Using existing index#
You can query items that have already been inserted by their ids. The example below shows how to find the five most similar values for ids 0 and 1.
>>> results = j.similar(name, [0, 1], top_k=5)
To find the 20 most similar values for every id in the range [0, 99]:
>>> ids = list(range(100))
>>> results = j.similar(name, ids, top_k=20)
To find the 100 most similar values for every input value, pass the index of the data, as in the example below.
>>> results = j.similar(name, data.index, top_k=100, batch_size=1024)
Using new data#
(All data should be in pandas.DataFrame or pandas.Series format.)
Find the 100 most similar values for every entry in new_data:
>>> results = j.similar(name, new_data, top_k=100, batch_size=1024)
The output is a list of dictionaries. In each dictionary, 'query_id' is the id of the value whose similar items you want to find, and 'results' is a list of top_k dictionaries, each holding an 'id' and the 'distance' between 'query_id' and that 'id'.
[
    {
        'query_id': 0,
        'results':
        [
            {'id': 0, 'distance': 0.0},
            {'id': 3836, 'distance': 2.298321008682251},
            {'id': 9193, 'distance': 2.545339584350586},
            {'id': 832, 'distance': 2.5819168090820312},
            {'id': 6162, 'distance': 2.638622283935547},
            ...
        ]
    },
    ...,
    {
        'query_id': 9,
        'results':
        [
            {'id': 9, 'distance': 0.0},
            {'id': 54, 'distance': 5.262974262237549},
            {'id': 101, 'distance': 5.634262561798096},
            ...
        ]
    },
    ...
]
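As a minimal sketch of how this structure could be consumed, the snippet below picks, for each query, the closest match other than the query itself (in the output above, each query id appears as its own first match with distance 0.0). The helper name `nearest_other` is hypothetical and not part of the library; the sample data is copied, truncated, from the output above.

```python
def nearest_other(results):
    """Map each query_id to its closest match other than itself.

    Hypothetical helper; assumes the nested output shape shown above.
    """
    nearest = {}
    for entry in results:
        query_id = entry["query_id"]
        # Drop the self-match, which has distance 0.0 in the example output.
        others = [m for m in entry["results"] if m["id"] != query_id]
        if others:
            nearest[query_id] = min(others, key=lambda m: m["distance"])
    return nearest


# Sample data copied (truncated) from the example output.
sample = [
    {"query_id": 0, "results": [
        {"id": 0, "distance": 0.0},
        {"id": 3836, "distance": 2.298321008682251},
        {"id": 9193, "distance": 2.545339584350586},
    ]},
    {"query_id": 9, "results": [
        {"id": 9, "distance": 0.0},
        {"id": 54, "distance": 5.262974262237549},
    ]},
]

closest = nearest_other(sample)
print(closest[0]["id"], closest[9]["id"])
```

Using `min` rather than taking the second element avoids relying on the matches being sorted by distance, which the example suggests but this sketch does not assume.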
Note
The method similar has a default batch_size=2**20, which will result in ceil(n_samples/batch_size) + 2 requests. We do NOT recommend changing the default value, as doing so could reduce the performance of the API.
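To make the request count in the note concrete, here is a small sketch that evaluates ceil(n_samples/batch_size) + 2 for two batch sizes. The function name `expected_requests` is hypothetical and only mirrors the formula stated above; it does not call the library.

```python
import math


def expected_requests(n_samples, batch_size=2**20):
    """Number of requests according to the formula in the note above."""
    return math.ceil(n_samples / batch_size) + 2


# With the default batch size, 100,000 samples fit in a single batch.
print(expected_requests(100_000))                   # 1 + 2 = 3
# With batch_size=1024 the same data is split into 98 batches.
print(expected_requests(100_000, batch_size=1024))  # 98 + 2 = 100
```

A smaller batch_size therefore multiplies the number of requests, which is consistent with the recommendation to keep the default.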
Output formatting#
There are two possible output formats for the similarity search.
Select the format with the orient parameter.
- orient: “nested” or “flat”
Changes the output format. Default is “nested”.
Examples of each format are shown below:
nested:
[
    {
        'query_id': 0,
        'results':
        [
            {'id': 0, 'distance': 0.0},
            {'id': 3836, 'distance': 2.298321008682251},
            {'id': 9193, 'distance': 2.545339584350586},
            {'id': 832, 'distance': 2.5819168090820312},
            {'id': 6162, 'distance': 2.638622283935547},
            ...
        ]
    },
    ...,
    {
        'query_id': 9,
        'results':
        [
            {'id': 9, 'distance': 0.0},
            {'id': 54, 'distance': 5.262974262237549},
            {'id': 101, 'distance': 5.634262561798096},
            ...
        ]
    },
    ...
]
flat:
[
{'query_id': 0, 'id': 0, 'distance': 0.0},
{'query_id': 0, 'id': 3836, 'distance': 2.298321008682251},
{'query_id': 0, 'id': 9193, 'distance': 2.545339584350586},
{'query_id': 0, 'id': 832, 'distance': 2.5819168090820312},
{'query_id': 0, 'id': 6162, 'distance': 2.638622283935547},
...
{'query_id': 9, 'id': 9, 'distance': 0.0},
{'query_id': 9, 'id': 54, 'distance': 5.262974262237549},
{'query_id': 9, 'id': 101, 'distance': 5.634262561798096},
...
]
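The two formats carry the same information, so one can be derived from the other. The sketch below converts a nested result into the flat layout. The helper `nested_to_flat` is hypothetical, written only to illustrate how the formats relate; it is not the library's implementation of orient.

```python
def nested_to_flat(nested):
    """Convert the nested layout into the flat layout (hypothetical helper)."""
    return [
        {"query_id": entry["query_id"], "id": match["id"], "distance": match["distance"]}
        for entry in nested
        for match in entry["results"]
    ]


# Sample data copied (truncated) from the nested example above.
nested_sample = [
    {"query_id": 0, "results": [
        {"id": 0, "distance": 0.0},
        {"id": 3836, "distance": 2.298321008682251},
    ]},
    {"query_id": 9, "results": [
        {"id": 9, "distance": 0.0},
    ]},
]

flat_sample = nested_to_flat(nested_sample)
for row in flat_sample:
    print(row)
```

The flat layout has one row per (query_id, id) pair, which may be more convenient when loading the results into a table.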