# community-help
s
Hi everyone, is it somehow possible to use tokenizers such as cl100k_base or others from LLM providers such as Azure OpenAI, or to use some custom open-source tokenizer? Furthermore, I saw in the documentation that Typesense also provides a "locale" setting to use tokenizers for specific languages, but unfortunately German is not included.
f
Hey there, to answer your second question first: German is auto-stemmed, so there's no need to provide a `locale` for it. Just enable stemming for that field by setting `stem: true`. So is your first question directed towards semantic search and generating embeddings?
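For reference, here's a minimal sketch of what enabling stemming per field could look like with the Python client, assuming a recent Typesense version that supports the `stem` field property; the host, API key, and collection name are placeholders:

```python
import typesense

# Placeholder connection details; adjust host, port, and API key for your setup.
client = typesense.Client({
    "nodes": [{"host": "localhost", "port": "8108", "protocol": "http"}],
    "api_key": "YOUR_TYPESENSE_API_KEY",
    "connection_timeout_seconds": 5,
})

# Stemming is enabled per field via the `stem` property, so no locale is
# needed for German content in these fields.
client.collections.create({
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string", "stem": True},
        {"name": "text",  "type": "string", "stem": True},
    ],
})
```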
s
Thx for your response, will try that out. Regarding the use case for question 1: I would like to implement a RAG system using hybrid search. For the text-search part, I don't want to search only in fields like title and text, but in the fields title, title_tkn, text, and text_tkn.
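Roughly something along these lines is what I have in mind (just a sketch; the collection name "articles" and the auto-embedded "embedding" vector field are assumptions on my side):

```python
# Sketch: a hybrid (keyword + vector) search across the four text fields,
# assuming a collection named "articles" with an "embedding" vector field.
search_params = {
    "q": "how do I configure stemming?",
    "query_by": "title,title_tkn,text,text_tkn,embedding",
    "exclude_fields": "embedding",  # keep the vector out of the response payload
}
results = client.collections["articles"].documents.search(search_params)
```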
f
So you'd like the conversational LLM to have access to the embeddings while searching? If that's the case, you can just create the embeddings normally with any embedding model of your choosing (either one from our Hugging Face repo, or an OpenAI or PaLM API one; our guide here will walk you through the process). But since the conversational search LLM will already contextualize both the conversation and the dataset, I'm not sure it's necessary to use the embeddings while querying through RAG.
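As a rough sketch of what that auto-embedding setup could look like in the schema (the model name and API key below are placeholders; the guide covers the exact options):

```python
# Sketch: Typesense fills in the "embedding" field automatically from the
# listed source fields using a remote embedding model.
client.collections.create({
    "name": "articles",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "text", "type": "string"},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["title", "text"],
                "model_config": {
                    "model_name": "openai/text-embedding-ada-002",
                    "api_key": "YOUR_OPENAI_API_KEY",
                },
            },
        },
    ],
})
```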