Auto-embedding generation within Typesense is a gr...
# community-help
n
Auto-embedding generation within Typesense is a great feature and it works. Still, there are some questions. There are a lot of parameters at the schema and field level, and after reading the documentation it is still unclear how these parameters affect auto-embedding generation.
1. On schema level:
symbols_to_index
token_separators
2. On field level:
stem
- As far as I understand, it makes no sense to use both stemming and embeddings/LLM. Is that correct?
3. HTML content. In a field definition like this:
{
  "name": "embedding",
  "type": "float[]",
  "embed": {
    "from": [
      "title",
      "content"
    ],
    "model_config": {
      "model_name": "ts/e5-large-v2"
    }
  }
}
should I remove HTML tags from the fields "title" and "content"?
4. Highlighting. I am doing hybrid search, and on the client side I set this:
'query_by': 'title, content, embedding, organization.name',
'vector_query': 'embedding:([], alpha: 0.19, distance_threshold:0.25)',
As I understand it, highlight snippets are generated only for keyword matches. If a document is found by semantic search, there is no highlight. Is that correct?
And by the way: how are stopwords and synonyms related to auto-embedding generation?
k
symbols_to_index, token_separators, and stemming are used to process the input query first. The transformed query is then used for both keyword search and embedding. It does not make sense to use stemming for embedding, but since it's a common pre-processing step, it is applied anyway.
Yes, remove HTML tags.
Even for semantic search, we will highlight if any token in the query is found within the text fields of the documents returned in the response.
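For the HTML question, a minimal sketch of stripping tags on the client before indexing, assuming documents are plain Python dicts and using only the standard library. The helper names `strip_html` and `clean_document` are hypothetical and are not part of Typesense or its client.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent blocks (<p>a</p><p>b</p>)
        # do not fuse into one word, then collapse repeated whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(value):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    return parser.text()


def clean_document(doc, fields=("title", "content")):
    """Return a copy of doc with HTML removed from the given string fields."""
    cleaned = dict(doc)
    for field in fields:
        if isinstance(cleaned.get(field), str):
            cleaned[field] = strip_html(cleaned[field])
    return cleaned
```

The cleaned dict can then be indexed as usual, so the fields listed under "embed.from" contain only text. Note that this sketch does not filter the contents of script or style elements.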
n
The transformed query is used for both keyword search and embedding.
Thanks, Kishore! And what about stopwords and synonyms?
k
Stopwords are dropped from the query before embedding. Synonyms are used only for keyword search.
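To make the split concrete, here is a toy model of the behaviour described above. It is an illustration only, not Typesense's implementation: the function `prepare_query`, the whitespace tokenization, the lowercasing, and the single-token synonym expansion are all assumptions made for the example.

```python
def prepare_query(query, stopwords, synonyms):
    """Toy model: return (text_to_embed, keyword_query_variants).

    Assumptions for illustration only: tokens are split on whitespace and
    lowercased, and synonyms map one token to a list of alternative tokens.
    """
    tokens = query.lower().split()

    # Stopwords are dropped before the query is embedded.
    kept = [t for t in tokens if t not in stopwords]
    text_to_embed = " ".join(kept)

    # Synonyms only expand the keyword side; the embedded text is unchanged.
    variants = [kept]
    for i, token in enumerate(kept):
        for alternative in synonyms.get(token, []):
            variants.append(kept[:i] + [alternative] + kept[i + 1:])

    return text_to_embed, [" ".join(v) for v in variants]


text_to_embed, keyword_queries = prepare_query(
    "the blazer for men",
    stopwords={"the", "for"},
    synonyms={"blazer": ["coat"]},
)
```

In this model the synonym "coat" shows up only in the keyword variants, while the text sent to the embedding model is just the query with stopwords removed.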
n
Got it! I would say, just update the documentation with this info and it will be great ))