Similarity Search#

After fitting your database, you can perform similarity searches in two ways: by querying ids that are already indexed in your model, or by using new data.

Using existing index#

You can query items that have already been inserted by their ids. The example below shows how to find the five most similar values for ids 0 and 1.

>>> results = j.similar(name, [0, 1], top_k=5)

To find the 20 most similar values for every id in the range [0, 99]:

>>> ids = list(range(100))
>>> results = j.similar(name, ids, top_k=20)

Finding the 100 most similar values for every input value can be done as in the example below.

>>> results = j.similar(name, data.index, top_k=100, batch_size=1024)

Using new data#

(All data should be in pandas.DataFrame or pandas.Series format)

Find the 100 most similar values for every item in new_data.

>>> results = j.similar(name, new_data, top_k=100, batch_size=1024)

The output is a list of dictionaries. In each one, 'query_id' is the id of the value whose similar items you are looking for, and 'results' is a list of top_k dictionaries, each holding an 'id' and the 'distance' between 'query_id' and that 'id'.

[
    {
        'query_id': 0,
        'results':
        [
        {'id': 0, 'distance': 0.0},
        {'id': 3836, 'distance': 2.298321008682251},
        {'id': 9193, 'distance': 2.545339584350586},
        {'id': 832, 'distance': 2.5819168090820312},
        {'id': 6162, 'distance': 2.638622283935547},
        ...
        ]
    },
    ...,
    {
        'query_id': 9,
        'results':
        [
        {'id': 9, 'distance': 0.0},
        {'id': 54, 'distance': 5.262974262237549},
        {'id': 101, 'distance': 5.634262561798096},
        ...
        ]
    },
    ...
]
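In the sample output above, each queried id comes back as its own closest match with distance 0.0. When querying ids that are already indexed, you may want to drop that entry before using the neighbours. The snippet below is a minimal sketch of that post-processing step in plain Python; the helper name drop_self_matches and the small sample list are illustrative assumptions, not part of the API.

```python
def drop_self_matches(nested):
    """Return a copy of a nested result list without the query item itself.

    Hypothetical helper: assumes the nested structure shown above
    ('query_id' plus a 'results' list of {'id', 'distance'} dicts).
    """
    cleaned = []
    for entry in nested:
        # Keep only neighbours whose id differs from the queried id.
        neighbors = [r for r in entry["results"] if r["id"] != entry["query_id"]]
        cleaned.append({"query_id": entry["query_id"], "results": neighbors})
    return cleaned


# Small sample shaped like the output above (values copied from the example).
sample = [
    {"query_id": 0, "results": [{"id": 0, "distance": 0.0},
                                {"id": 3836, "distance": 2.298321008682251}]},
    {"query_id": 9, "results": [{"id": 9, "distance": 0.0},
                                {"id": 54, "distance": 5.262974262237549}]},
]
cleaned = drop_self_matches(sample)
```

This only makes sense when the query ids refer to items already in the index; with new data, a matching id does not imply the same item.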

Note

The similar method has a default batch_size=2**20, which results in ceil(n_samples/batch_size) + 2 requests. We DON’T recommend changing the default value, as doing so could reduce the performance of the API.
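To see how batch_size affects the number of requests, the formula from the note can be evaluated directly. This is a small sketch that only reproduces the arithmetic stated above; the function name estimated_requests is an assumption made for illustration and is not part of the API.

```python
import math


def estimated_requests(n_samples, batch_size=2**20):
    """Evaluate ceil(n_samples / batch_size) + 2, the formula quoted in the note."""
    return math.ceil(n_samples / batch_size) + 2


default_count = estimated_requests(10_000)       # one batch plus 2 -> 3
small_batch_count = estimated_requests(10_000, 1024)  # ten batches plus 2 -> 12
```

For 10,000 samples, the default batch size needs a single batch, while batch_size=1024 splits the same query into ten batches, which is why smaller values can slow things down.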

Output formatting#

There are two possible output formats for the similarity search. You can select the format with the orient parameter.

orient: “nested” or “flat”

Changes the output format. Default is “nested”.

Here are examples of each possible format:

  • nested:

[
    {
        'query_id': 0,
        'results':
        [
        {'id': 0, 'distance': 0.0},
        {'id': 3836, 'distance': 2.298321008682251},
        {'id': 9193, 'distance': 2.545339584350586},
        {'id': 832, 'distance': 2.5819168090820312},
        {'id': 6162, 'distance': 2.638622283935547},
        ...
        ]
    },
    ...,
    {
        'query_id': 9,
        'results':
        [
        {'id': 9, 'distance': 0.0},
        {'id': 54, 'distance': 5.262974262237549},
        {'id': 101, 'distance': 5.634262561798096},
        ...
        ]
    },
    ...
]

  • flat:

[
    {'query_id': 0, 'id': 0, 'distance': 0.0},
    {'query_id': 0, 'id': 3836, 'distance': 2.298321008682251},
    {'query_id': 0, 'id': 9193, 'distance': 2.545339584350586},
    {'query_id': 0, 'id': 832, 'distance': 2.5819168090820312},
    {'query_id': 0, 'id': 6162, 'distance': 2.638622283935547},
    ...
    {'query_id': 9, 'id': 9, 'distance': 0.0},
    {'query_id': 9, 'id': 54, 'distance': 5.262974262237549},
    {'query_id': 9, 'id': 101, 'distance': 5.634262561798096},
    ...
]
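The two formats carry the same information, so a nested result can be converted to the flat layout after the fact. The snippet below is a minimal sketch of that conversion in plain Python, assuming the structures shown above; nested_to_flat is a hypothetical helper, and in practice passing orient to similar is the direct route.

```python
def nested_to_flat(nested):
    """Flatten nested results into one dict per (query_id, id) pair.

    Hypothetical helper mirroring the 'flat' layout shown above.
    """
    return [
        {"query_id": entry["query_id"], "id": r["id"], "distance": r["distance"]}
        for entry in nested
        for r in entry["results"]
    ]


# Sample shaped like the 'nested' example (values copied from above).
nested = [
    {"query_id": 0, "results": [{"id": 0, "distance": 0.0},
                                {"id": 3836, "distance": 2.298321008682251}]},
    {"query_id": 9, "results": [{"id": 9, "distance": 0.0}]},
]
flat = nested_to_flat(nested)
```

The flat layout is a list of uniform records, which is convenient when each record should become one row of a table.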