# community-help
a
Hi @Kishore Nallan, how should we structure the JSONL file in order to implement hybrid search (auto embedding)? Right now, the fields in my JSONL documents are `id`, `name`, and `category`. Do I explicitly need to include a field named `embedding`? I added this `embedding` field when creating the collection:
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
      "from": ["name", "category"],
      "model_config": {"model_name": "all-MiniLM-L12-v2"},
  },
}
But running this with or without the embedding field in the JSONL documents gives me the error:
[Errno 400] Model file not found
Should I store the model in some directory within my project so that Typesense can access it?
j
I see syntax errors in that schema... `all-MiniLM-L12-v2` should be `ts/all-MiniLM-L12-v2`
Here's a step-by-step guide you should be able to copy-paste: https://typesense.org/docs/guide/tips-for-searching-common-types-of-data.html#long-pieces-of-text
Do I explicitly need to mention a field named embedding
You do need an explicit field, but it can be named anything
Should I have the model stored in some directory within my project so that Typesense can access it?
No, Typesense will automatically download the model for you. The issue here is a syntax error; see my first message above.
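For reference, here is a minimal sketch of the corrected field definition, built as a Python dict so that `json.dumps` emits strictly valid JSON (no trailing commas). The field names and the `ts/` model name come from the messages above; the helper name `build_schema` and the collection name `my_collection` are hypothetical placeholders.

```python
import json


def build_schema(collection_name):
    """Return a collection schema with an auto-embedding field."""
    return {
        "name": collection_name,
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "category", "type": "string"},
            {
                "name": "embedding",
                "type": "float[]",
                "embed": {
                    # Source fields whose text is embedded automatically.
                    "from": ["name", "category"],
                    # Built-in models are referenced with the "ts/" prefix.
                    "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},
                },
            },
        ],
    }


# Serializing from a dict avoids hand-written JSON mistakes such as trailing commas.
schema_json = json.dumps(build_schema("my_collection"), indent=2)
print(schema_json)
```

The resulting string could be used as the request body when creating the collection.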
a
Hey @Jason Bosco, adding `ts/` didn't help; still the same error.
j
Could you share the full curl command that creates the schema and throws the error?
I can try running it locally
a
Here is the collection schema:
{
  "name": "TEST",
  "fields": [
    {
      "name": "testId",
      "type": "string",
      "facet": true
    },
    {
      "name": "name",
      "type": "string",
      "facet": true
    },
    {
      "name": "category",
      "type": "string",
      "facet": true
    },
    {
      "name": "userId",
      "type": "string",
      "facet": true,
      "optional": true
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "name",
          "category"
        ],
        "model_config": {
          "model_name": "ts/all-MiniLM-L12-v2"
        }
      }
    }
  ]
}
And a sample JSONL document:
{
  "testId": "1",
  "name": "Test String",
  "userId": "10",
  "category": "test-category",
  "embedding": []
}
j
I just tried running it locally and it worked without any issues
export TYPESENSE_API_KEY=xyz
    
mkdir $(pwd)/typesense-data

docker run -p 8108:8108 \
            -v$(pwd)/typesense-data:/data typesense/typesense:0.25.2 \
            --data-dir /data \
            --api-key=$TYPESENSE_API_KEY \
            --enable-cors
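# In a second terminal, once the server is running:
export TYPESENSE_API_KEY=xyz

# Check that the server is up and which version it reports
curl "http://localhost:8108/debug" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"

# Create the collection
curl "http://localhost:8108/collections" \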
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '
{
    "name": "TEST",
    "fields": [
        {
            "name": "testId",
            "type": "string",
            "facet": true
        },
        {
            "name": "name",
            "type": "string",
            "facet": true
        },
        {
            "name": "category",
            "type": "string",
            "facet": true
        },
        {
            "name": "userId",
            "type": "string",
            "facet": true,
            "optional": true
        },
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": [
                    "name",
                    "category"
                ],
                "model_config": {
                    "model_name": "ts/all-MiniLM-L12-v2"
                }
            }
        }
    ]
}
'
Could you try copy-pasting that?
a
Yes, on it
Response to
curl "http://localhost:8108/debug" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"


{"state":1,"version":"0.25.2"}
Response to the curl command that creates the collection schema:
{
  "message": "Model not found"
}
Do I need to install the GPU build of Typesense `0.25.2` as well?
j
Hmm no, you don't need a GPU build
Does your docker container have internet access?
a
I'm running the Typesense service on Linux
j
Could you try using `0.26.0.rc62` of Typesense Server?
a
The error wasn't there in `0.26.0.rc62`, but I received an empty response.
### The response:
{
  "results": [
    {
      "facet_counts": [],
      "found": 0,
      "hits": [],
      "out_of": 0,
      "page": 1,
      "request_params": {
        "collection_name": "TEST",
        "first_q": "test",
        "per_page": 10,
        "q": "test"
      },
      "search_cutoff": false,
      "search_time_ms": 8
    }
  ]
}
### The collection schema
{
  "name": "TEST",
  "fields": [
    {
      "name": "testId",
      "type": "string",
      "facet": true
    },
    {
      "name": "name",
      "type": "string",
      "facet": true
    },
    {
      "name": "category",
      "type": "string",
      "facet": true
    },
    {
      "name": "userId",
      "type": "string",
      "facet": true,
      "optional": true
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "name",
          "category"
        ],
        "model_config": {
          "model_name": "ts/all-MiniLM-L12-v2"
        }
      }
    }
  ]
}
### SEARCH REQUESTS
{
  "searches": [
    {
      "collection": "TEST",
      "q": "test",
      "sort_by": "_vector_distance:asc,_text_match:desc",
      "prioritize_token_position": "true"
    }
  ]
}
### COMMON SEARCH PARAMS
{
  "query_by": "embedding,name,category"
}
### Sample Document from JSONL File
{
  "testId": "1",
  "name": "Test String",
  "userId": "10",
  "category": "test-category",
  "embedding": []
}
Presumably there's an issue with my code, because the response has `"out_of": 0`
k
Yes, that means no documents have been indexed
a
Is the structure of my documents correct?
k
Check if you get an error during indexing.
a
Here's that:
['Field `embedding` contains an invalid embedding.']
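A rough sketch of how per-document indexing errors like this one could be surfaced programmatically. It assumes the import endpoint answers with one JSON object per line, each carrying a `success` flag and, on failure, an `error` message; the helper name `failed_imports` and the sample response are illustrative only.

```python
import json


def failed_imports(response_text):
    """Return (line_number, error) for each document that failed to index."""
    failures = []
    for line_no, line in enumerate(response_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        result = json.loads(line)
        # Assumption: each result line has a boolean "success" key.
        if not result.get("success", False):
            failures.append((line_no, result.get("error", "unknown error")))
    return failures


# Hypothetical import response: first document indexed, second rejected.
sample = "\n".join([
    '{"success": true}',
    '{"success": false, "error": "Field `embedding` contains an invalid embedding."}',
])
print(failed_imports(sample))
```

Checking this list after every import makes a silent `"out_of": 0` much easier to diagnose.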
k
Yeah, so the field contains some invalid values
a
I'm not getting it. I need to keep the embedding field empty, right? Typesense will automatically populate that field.
k
Don't send that field at all if you are using auto embedding.
An empty array is not a valid embedding, so remove the field from the document.
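As a small illustration of that advice, the sketch below drops the auto-embedded field from each document before writing JSONL. The sample document is the one posted earlier in this thread; the helper name `to_jsonl` and its `auto_embedded_fields` parameter are hypothetical.

```python
import json


def to_jsonl(documents, auto_embedded_fields=("embedding",)):
    """Serialize documents to JSONL, omitting fields that Typesense fills in itself."""
    lines = []
    for doc in documents:
        cleaned = {k: v for k, v in doc.items() if k not in auto_embedded_fields}
        lines.append(json.dumps(cleaned))
    return "\n".join(lines)


docs = [
    {
        "testId": "1",
        "name": "Test String",
        "userId": "10",
        "category": "test-category",
        "embedding": [],  # invalid for auto embedding; removed below
    }
]
jsonl = to_jsonl(docs)
print(jsonl)
```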
a
Worked for both RC62 and `0.25.2`.
Now I'm getting the response sorted by vector distance, but it's returning all the documents. I don't want that...
If I apply hybrid search for the keyword `laptop`, it should return only the documents containing the keyword laptop, plus the documents containing related keywords such as computer, sorted by vector distance and rank fusion score. But currently it returns all documents from the collection (even absolutely unrelated ones with no text match), sorted by vector distance.
k
You can set the alpha parameter to give more weightage to keyword matches. There's also a parameter for fetching only records with min vector similarity value.
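A hedged sketch of what such a request body might look like. It assumes `alpha` and `distance_threshold` are passed inside the `vector_query` parameter, and that `alpha` is the weight given to the vector rank (so a lower value favors keyword matches). The numeric values are arbitrary, the helper name `hybrid_search_body` is hypothetical, and how `distance_threshold` behaves in hybrid mode may depend on the server version, so check the vector search docs before relying on it.

```python
import json


def hybrid_search_body(query, alpha=0.3, distance_threshold=None):
    """Build a multi-search body for a hybrid (keyword + vector) query."""
    options = [f"alpha: {alpha}"]
    if distance_threshold is not None:
        # Assumption: only records within this vector distance are returned.
        options.append(f"distance_threshold: {distance_threshold}")
    return {
        "searches": [
            {
                "collection": "TEST",
                "q": query,
                "query_by": "embedding,name,category",
                # Empty [] asks the server to embed `q` itself (auto embedding).
                "vector_query": f"embedding:([], {', '.join(options)})",
            }
        ]
    }


body = hybrid_search_body("laptop", alpha=0.2, distance_threshold=0.5)
print(json.dumps(body, indent=2))
```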
a
Can you send me the reference links for those?
k
Everything is documented on our docs page on vector search.
a
Can we specify a threshold for the rank fusion score? And also sort based on the same param?
k
`_text_match` is already the rank fusion score in hybrid search, so you can just sort on `_text_match:desc`
a
They are different. I didn't get that part. Here's the response:
{
  "results": [
    {
      "facet_counts": [],
      "found": 2,
      "hits": [
        {
          "document": {
            "category": "test-category",
            "id": "2",
            "name": "Gift laptop to your friend",
            "testId": "1011",
            "userId": "15"
          },
          "highlight": {
            "name": {
              "matched_tokens": [
                "laptop"
              ],
              "snippet": "Gift <mark>laptop</mark> to your friend"
            }
          },
          "highlights": [
            {
              "field": "name",
              "matched_tokens": [
                "laptop"
              ],
              "snippet": "Gift <mark>laptop</mark> to your friend"
            }
          ],
          "hybrid_search_info": {
            "rank_fusion_score": 1
          },
          "text_match": 1060320051,
          "text_match_info": {
            "best_field_score": "517734",
            "best_field_weight": 102,
            "fields_matched": 3,
            "score": "1060320051",
            "tokens_matched": 0
          },
          "vector_distance": 0.39046764373779297
        },
        {
          "document": {
            "category": "test-category",
            "id": "3",
            "name": "Gift a gaming computer to your friend",
            "testId": "10111",
            "userId": "15"
          },
          "highlight": {},
          "highlights": [],
          "hybrid_search_info": {
            "rank_fusion_score": 0.15000000596046448
          },
          "text_match": 0,
          "text_match_info": {
            "best_field_score": "0",
            "best_field_weight": 0,
            "fields_matched": 0,
            "score": "0",
            "tokens_matched": 0
          },
          "vector_distance": 0.5736579895019531
        }
      ],
      "out_of": 15,
      "page": 1,
      "request_params": {
        "collection_name": "TEST",
        "per_page": 10,
        "q": "laptop"
      },
      "search_cutoff": false,
      "search_time_ms": 5
    }
  ]
}
k
In hybrid search, when you sort on `_text_match` you are actually sorting on the fusion score.
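The thread does not confirm a server-side threshold on the fusion score, so one possible workaround is a client-side post-filter on `hybrid_search_info.rank_fusion_score`, the field visible in the response above. This is only a sketch: the helper name `filter_by_fusion_score` and the cutoff of `0.5` are assumptions, and the sample data is trimmed from the posted response.

```python
def filter_by_fusion_score(response, min_score):
    """Keep only hits whose rank fusion score is at least min_score."""
    kept = []
    for result in response.get("results", []):
        for hit in result.get("hits", []):
            info = hit.get("hybrid_search_info", {})
            # Hits without a fusion score are treated as scoring 0.
            if info.get("rank_fusion_score", 0.0) >= min_score:
                kept.append(hit)
    return kept


# Trimmed version of the multi-search response posted in this thread.
response = {
    "results": [
        {
            "hits": [
                {"document": {"id": "2"},
                 "hybrid_search_info": {"rank_fusion_score": 1}},
                {"document": {"id": "3"},
                 "hybrid_search_info": {"rank_fusion_score": 0.15000000596046448}},
            ]
        }
    ]
}
kept_ids = [hit["document"]["id"] for hit in filter_by_fusion_score(response, 0.5)]
print(kept_ids)
```

Note that filtering on the client changes the effective page size, so pagination counts such as `found` will no longer match the filtered list.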
a
Okay
Hi, I'm facing the error `[Errno 404] Model not found` on another laptop of mine that doesn't have a GPU, with both versions, RC62 and 0.25.2.
j
Could you run the same set of curl commands I shared with you yesterday on `0.26.0.rc62`, and share the output of each command along with the Typesense logs from the start of the process?