Hi @Kishore Nallan, how should we make the JSONL file in or... | Typesense #community-help

Aadarsh

03/06/2024, 7:06 PM

Hi @Kishore Nallan, how should we make the JSONL file in order to implement hybrid search (auto embedding)? Right now, my JSONL documents have the fields `id`, `name`, and `category`. Do I explicitly need to include a field named `embedding`? I added this `embedding` field when creating the collection:

{
  "name": "embedding",
  "type": "float[]",
  "embed": {
      "from": ["name", "category"],
      "model_config": {"model_name": "all-MiniLM-L12-v2"},
  },
}

But running this with and without the embedding field in the JSONL documents gives me the error:

[Errno 400] Model file not found

Should I have the model stored in some directory within my project so that Typesense can access it?

Jason Bosco

03/06/2024, 7:27 PM

I see syntax errors in that schema... `all-MiniLM-L12-v2` should be `ts/all-MiniLM-L12-v2`

Here's a step-by-step guide you should be able to copy-paste: https://typesense.org/docs/guide/tips-for-searching-common-types-of-data.html#long-pieces-of-text

Jason Bosco

03/06/2024, 7:28 PM

Do I explicitly need to mention a field named embedding

You do need an explicit field in the collection schema, but it can be named anything

Jason Bosco

03/06/2024, 7:28 PM

Should I have the model stored in some directory within my project so that Typesense can access it?

No, Typesense will automatically download the model for you. The issue here is a syntax error; see my first message above

Aadarsh

03/06/2024, 7:59 PM

Hey @Jason Bosco, adding `ts/` didn't help; still the same error.

Jason Bosco

03/06/2024, 8:05 PM

Could you share the full curl command that creates the schema and throws the error?

Jason Bosco

03/06/2024, 8:05 PM

I can try running it locally

Aadarsh

03/06/2024, 8:12 PM

Here is the collection schema:

{
  "name": "TEST",
  "fields": [
    {
      "name": "testId",
      "type": "string",
      "facet": true
    },
    {
      "name": "name",
      "type": "string",
      "facet": true
    },
    {
      "name": "category",
      "type": "string",
      "facet": true
    },
    {
      "name": "userId",
      "type": "string",
      "facet": true,
      "optional": true
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "name",
          "category"
        ],
        "model_config": {
          "model_name": "ts/all-MiniLM-L12-v2"
        }
      }
    }
  ]
}

Aadarsh

03/06/2024, 8:13 PM

And a sample JSONL document:

{
  "testId": "1",
  "name": "Test String",
  "userId": "10",
  "category": "test-category",
  "embedding": []
}

Jason Bosco

03/06/2024, 8:16 PM

I just tried running it locally and it worked without any issues

Jason Bosco

03/06/2024, 8:17 PM

export TYPESENSE_API_KEY=xyz
    
mkdir $(pwd)/typesense-data

docker run -p 8108:8108 \
            -v$(pwd)/typesense-data:/data typesense/typesense:0.25.2 \
            --data-dir /data \
            --api-key=$TYPESENSE_API_KEY \
            --enable-cors

export TYPESENSE_API_KEY=xyz

curl "http://localhost:8108/debug" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"


curl "http://localhost:8108/collections" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '
{
    "name": "TEST",
    "fields": [
        {
            "name": "testId",
            "type": "string",
            "facet": true
        },
        {
            "name": "name",
            "type": "string",
            "facet": true
        },
        {
            "name": "category",
            "type": "string",
            "facet": true
        },
        {
            "name": "userId",
            "type": "string",
            "facet": true,
            "optional": true
        },
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": [
                    "name",
                    "category"
                ],
                "model_config": {
                    "model_name": "ts/all-MiniLM-L12-v2"
                }
            }
        }
    ]
}
'

Jason Bosco

03/06/2024, 8:17 PM

Could you try copy-pasting that?

Aadarsh

03/06/2024, 8:18 PM

Yes, on it

Aadarsh

03/06/2024, 8:25 PM

Response to

curl "http://localhost:8108/debug" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"


{"state":1,"version":"0.25.2"}

Response to the collection schema creation curl

{
  "message": "Model not found"
}

Aadarsh

03/06/2024, 8:27 PM

Do I need to install the GPU build of `typesense 0.25.2` as well?

Jason Bosco

03/06/2024, 8:31 PM

Hmm no, you don't need a GPU build

Jason Bosco

03/06/2024, 8:31 PM

Does your docker container have internet access?

Aadarsh

03/06/2024, 8:32 PM

I'm running the Typesense service on Linux

Jason Bosco

03/06/2024, 8:43 PM

Could you try using `0.26.0.rc62` of Typesense Server?

Aadarsh

03/07/2024, 5:20 AM

The error wasn't there in `0.26.0.rc62`, but I received an empty response.

### The response:

{
  "results": [
    {
      "facet_counts": [],
      "found": 0,
      "hits": [],
      "out_of": 0,
      "page": 1,
      "request_params": {
        "collection_name": "TEST",
        "first_q": "test",
        "per_page": 10,
        "q": "test"
      },
      "search_cutoff": false,
      "search_time_ms": 8
    }
  ]
}

### The collection schema

{
  "name": "TEST",
  "fields": [
    {
      "name": "testId",
      "type": "string",
      "facet": true
    },
    {
      "name": "name",
      "type": "string",
      "facet": true
    },
    {
      "name": "category",
      "type": "string",
      "facet": true
    },
    {
      "name": "userId",
      "type": "string",
      "facet": true,
      "optional": true
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "name",
          "category"
        ],
        "model_config": {
          "model_name": "ts/all-MiniLM-L12-v2"
        }
      }
    }
  ]
}

### SEARCH REQUESTS

{
  "searches": [
    {
      "collection": "TEST",
      "q": "test",
      "sort_by": "_vector_distance:asc,_text_match:desc",
      "prioritize_token_position": "true"
    }
  ]
}

### COMMON SEARCH PARAMS

{
  "query_by": "embedding,name,category"
}

### Sample Document from JSONL File

{
  "testId": "1",
  "name": "Test String",
  "userId": "10",
  "category": "test-category",
  "embedding": []
}

Aadarsh

03/07/2024, 5:21 AM

Supposedly, there's an issue with my code, because the response has `"out_of": 0`

Kishore Nallan

03/07/2024, 5:22 AM

Yes that means no doc has been indexed

Aadarsh

03/07/2024, 5:25 AM

Are my documents correct?

Aadarsh

03/07/2024, 5:25 AM

The structure

Kishore Nallan

03/07/2024, 5:26 AM

Check if you get error during indexing.
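
The documents import endpoint answers with one JSON object per line, each carrying a `success` flag and, on failure, an `error` message. Below is a minimal sketch for collecting the failures from such a response. The helper `failed_imports` is hypothetical, and the sample response text is illustrative, built from the error message that appears later in this thread.

```python
import json

def failed_imports(response_text):
    """Return (line_number, error) pairs for documents that failed to index."""
    failures = []
    for number, line in enumerate(response_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines in the response
        result = json.loads(line)
        if not result.get("success", False):
            failures.append((number, result.get("error", "unknown error")))
    return failures

# Illustrative response: first document indexed, second one rejected.
sample_response = "\n".join([
    json.dumps({"success": True}),
    json.dumps({"success": False,
                "error": "Field `embedding` contains an invalid embedding."}),
])
failures = failed_imports(sample_response)
for number, error in failures:
    print(f"document {number}: {error}")
```

If every document fails this way, `out_of` stays at 0 in later search responses, which matches what was reported here.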

Aadarsh

03/07/2024, 5:27 AM

Here's that:

['Field `embedding` contains an invalid embedding.']

Kishore Nallan

03/07/2024, 5:27 AM

Yeah so some invalid values

Aadarsh

03/07/2024, 5:28 AM

I'm not getting it. I need to keep the embedding field empty, right? Typesense will automatically populate that field.

Kishore Nallan

03/07/2024, 5:28 AM

Don't send that field at all if you are using auto embedding

Kishore Nallan

03/07/2024, 5:29 AM

An empty array is not valid; remove the field from the doc
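
As a minimal sketch of this advice: with an auto-embedding field in the schema, each JSONL line carries only the source fields, and the `embedding` key is left out entirely. The helper name `to_jsonl` and the sample record are hypothetical; only the field names come from the schema in this thread.

```python
import json

# Name of the auto-embedding field from the schema in this thread.
AUTO_EMBED_FIELD = "embedding"

def to_jsonl(records, drop_field=AUTO_EMBED_FIELD):
    """Serialize records as JSONL, omitting the auto-embedding field.

    Hypothetical helper: Typesense generates the vector itself, so the
    field should be absent from the document (an empty list is rejected).
    """
    lines = []
    for record in records:
        doc = {key: value for key, value in record.items() if key != drop_field}
        lines.append(json.dumps(doc))
    return "\n".join(lines)

records = [
    {"testId": "1", "name": "Test String", "userId": "10",
     "category": "test-category", "embedding": []},
]
jsonl = to_jsonl(records)
print(jsonl)
```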

Aadarsh

03/07/2024, 5:33 AM

Worked for both RC62 and `0.25.2`.

Aadarsh

03/07/2024, 5:46 AM

Now I'm getting the response sorted by vector distance, but it's returning all the documents. I don't want that...

Aadarsh

03/07/2024, 7:47 AM

If I apply hybrid search for the keyword `laptop`, then it should return only documents containing the keyword laptop and documents containing related keywords like computer, sorted by vector distance and rank fusion score. But currently it is returning all documents from the collection (even absolutely unrelated ones with no text match), sorted by vector distance.

Kishore Nallan

03/07/2024, 7:50 AM

You can set the alpha parameter to give more weightage to keyword matches. There's also a parameter for fetching only records with min vector similarity value.
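
As a rough sketch, both knobs are assumed here to go into the `vector_query` search parameter: `alpha` as the weight given to the vector ranking (so a lower value favours keyword matches) and `distance_threshold` as the maximum vector distance a hit may have. The values 0.1 and 0.5 are arbitrary placeholders, `hybrid_search_body` is a hypothetical helper, and the exact syntax is worth confirming against the vector search docs for your server version.

```python
import json

def hybrid_search_body(collection, q, alpha, distance_threshold,
                       embedding_field="embedding"):
    """Build a multi_search request body for a hybrid query.

    Assumption: alpha and distance_threshold are passed inside
    vector_query, with an empty vector so the server embeds `q` itself.
    """
    vector_query = (
        f"{embedding_field}:([], alpha: {alpha}, "
        f"distance_threshold: {distance_threshold})"
    )
    return {
        "searches": [
            {
                "collection": collection,
                "q": q,
                "query_by": f"{embedding_field},name,category",
                "vector_query": vector_query,
            }
        ]
    }

body = hybrid_search_body("TEST", "laptop", alpha=0.1, distance_threshold=0.5)
print(json.dumps(body, indent=2))
```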

Aadarsh

03/07/2024, 7:52 AM

Can you send me the reference links for those?

Kishore Nallan

03/07/2024, 7:53 AM

Everything is documented on our docs page on vector search.

Aadarsh

03/07/2024, 8:05 AM

Can we specify a threshold for rank fusion score? Also sorting based on the same param?

Kishore Nallan

03/07/2024, 8:19 AM

`_text_match` is already the rank fusion score in hybrid search, so you can just sort on `_text_match:desc`

Aadarsh

03/07/2024, 8:23 AM

They are different. Didn't get that part. Here's the response:

{
  "results": [
    {
      "facet_counts": [],
      "found": 2,
      "hits": [
        {
          "document": {
            "category": "test-category",
            "id": "2",
            "name": "Gift laptop to your friend",
            "testId": "1011",
            "userId": "15"
          },
          "highlight": {
            "name": {
              "matched_tokens": [
                "laptop"
              ],
              "snippet": "Gift <mark>laptop</mark> to your friend"
            }
          },
          "highlights": [
            {
              "field": "name",
              "matched_tokens": [
                "laptop"
              ],
              "snippet": "Gift <mark>laptop</mark> to your friend"
            }
          ],
          "hybrid_search_info": {
            "rank_fusion_score": 1
          },
          "text_match": 1060320051,
          "text_match_info": {
            "best_field_score": "517734",
            "best_field_weight": 102,
            "fields_matched": 3,
            "score": "1060320051",
            "tokens_matched": 0
          },
          "vector_distance": 0.39046764373779297
        },
        {
          "document": {
            "category": "test-category",
            "id": "3",
            "name": "Gift a gaming computer to your friend",
            "testId": "10111",
            "userId": "15"
          },
          "highlight": {},
          "highlights": [],
          "hybrid_search_info": {
            "rank_fusion_score": 0.15000000596046448
          },
          "text_match": 0,
          "text_match_info": {
            "best_field_score": "0",
            "best_field_weight": 0,
            "fields_matched": 0,
            "score": "0",
            "tokens_matched": 0
          },
          "vector_distance": 0.5736579895019531
        }
      ],
      "out_of": 15,
      "page": 1,
      "request_params": {
        "collection_name": "TEST",
        "per_page": 10,
        "q": "laptop"
      },
      "search_cutoff": false,
      "search_time_ms": 5
    }
  ]
}

Kishore Nallan

03/07/2024, 8:26 AM

In hybrid search, when you sort on `_text_match` you are actually sorting on the fusion score.
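
The two `rank_fusion_score` values in the response above are consistent with a weighted sum of reciprocal ranks, with weight 0.7 on the keyword rank and 0.3 on the vector rank. The sketch below reproduces them under that assumption. The weights and the helper `rank_fusion_score` are inferred for illustration and are not taken from the Typesense source.

```python
import math

def rank_fusion_score(keyword_rank, vector_rank, alpha=0.3):
    """Weighted sum of reciprocal ranks; a rank of None means 'no match'.

    Assumption: alpha is the weight on the vector ranking and
    (1 - alpha) the weight on the keyword ranking.
    """
    keyword_part = (1 - alpha) / keyword_rank if keyword_rank else 0.0
    vector_part = alpha / vector_rank if vector_rank else 0.0
    return keyword_part + vector_part

# "Gift laptop to your friend": first in both keyword and vector rankings.
laptop_score = rank_fusion_score(keyword_rank=1, vector_rank=1)
# "Gift a gaming computer to your friend": no keyword match, second by vector distance.
computer_score = rank_fusion_score(keyword_rank=None, vector_rank=2)

print(round(laptop_score, 2), round(computer_score, 2))
```

Under this reading, a document with no keyword match can still appear in the results, because its vector rank alone contributes to the fused score.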

Aadarsh

03/07/2024, 8:27 AM

Okay

Aadarsh

03/07/2024, 5:28 PM

Hi, I'm facing the error `[Errno 404] Model not found` on another laptop of mine that doesn't have a GPU, with both versions: RC62 and `0.25.2`.

Jason Bosco

03/07/2024, 6:26 PM

Could you run the same set of curl commands I shared with you yesterday on `0.26.0.rc62`, and share the output of each command and also the Typesense logs from the beginning of the process start?
