# community-help
s
Hi everyone, is it somehow possible to use tokenizers such as cl100k_base or others from LLM providers such as Azure OpenAI, or to use some custom open-source tokenizer? Furthermore, I saw in the documentation that Typesense also provides a "locale" setting to use tokenizers for specific languages, but unfortunately German is not included.
f
Hey there, to answer your second question first: German is auto-stemmed, so there's no need to provide a `locale` for it. Just enable stemming for that field by setting `stem: true`. So is your first question directed towards semantic search and generating embeddings?
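For reference, here's a minimal sketch of what enabling stemming per field could look like with the Python client, assuming a recent Typesense version that supports the `stem` field property; the host, API key, and collection name are placeholders:

```python
import typesense

# Placeholder connection details; adjust host, port, and API key for your setup.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "YOUR_TYPESENSE_API_KEY",
    "connection_timeout_seconds": 5,
})

# Stemming is enabled per field via the `stem` property, so no locale is
# needed for German content in these fields.
client.collections.create({
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string", "stem": True},
        {"name": "text",  "type": "string", "stem": True},
    ],
})
```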
s
Thx for your response, will try that out. Regarding the use case for question 1: I would like to implement a RAG system using hybrid search. For the text-search part, I don't want to search only in fields like title and text, but in the fields title, title_tkn, text, and text_tkn.
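Roughly something along these lines is what I have in mind (just a sketch; the collection name "articles" and the auto-embedded "embedding" vector field are assumptions on my side):

```python
# Sketch: a hybrid (keyword + vector) search across the four text fields,
# assuming a collection named "articles" with an "embedding" vector field.
search_params = {
    "q": "how do I configure stemming?",
    "query_by": "title,title_tkn,text,text_tkn,embedding",
    "exclude_fields": "embedding",  # keep the vector out of the response payload
}
results = client.collections["articles"].documents.search(search_params)
```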
f
So you'd like the conversational LLM to have access to the embeddings while searching? If that's the case, you can just create the embeddings normally with any embedding model of your choosing (either one from our Hugging Face repo, or an OpenAI or PaLM API one; our guide here will walk you through the process). But since the conversational search LLM will already contextualize both the conversation and the dataset, I'm not sure it's necessary to use the embeddings while querying through RAG.
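As a rough sketch of what that auto-embedding setup could look like in the schema (the model name and API key below are placeholders; the guide covers the exact options):

```python
# Sketch: Typesense fills in the "embedding" field automatically from the
# listed source fields using a remote embedding model.
client.collections.create({
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "text", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "text"],
                "model_config": {
                    "model_name": "openai/text-embedding-ada-002",
                    "api_key": "YOUR_OPENAI_API_KEY",
                },
            },
        },
    ],
})
```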