#community-help

Tokenization and Indexing Fields with Typesense

TLDR kam wanted to index certain fields as-is, without tokenization, so that a unique per-document value could be used to delete a document. Jason explained that storing the unique value in the id field lets you reference the document by ID for CRUD operations, that filter_by uses a separate non-tokenized index, and that tokenization only comes into play with the q (full text search) parameter.

Aug 28, 2023 (3 months ago)
kam
06:53 PM
Also, is it possible to control tokenization for certain fields? For example, index a field but do not tokenize it?

Jason
06:54 PM
Meaning, you only want to index a subset of fields and use the others only for display purposes?
kam
06:55 PM
I mean, for a field called, say, raw_field, I want to index the value as is, without tokenization
Jason
06:56 PM
There's a flag called pre_segmented_query: true documented here: https://typesense.org/docs/0.25.0/api/search.html#query-parameters
Jason
06:56 PM
Could you elaborate on your use-case though?
kam
06:57 PM
I have a field that is unique per doc. I would like to index it as is, so I can issue a delete query to delete a particular document
kam
06:58 PM
I see pre_segmented_query. Is there a similar setting for an indexed field that is applied during the indexing phase?
Jason
06:59 PM
Ah. If you set that unique value in a field called id when you send the document to Typesense, you can then reference that document by ID to do CRUD operations on it
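A minimal sketch of what Jason describes, using only the Python standard library to build (not send) the two REST requests: one that indexes a document whose unique value is stored in id, and one that deletes that document by ID. The host, API key, collection name (records), and the title field are hypothetical placeholders, not values from this thread.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical connection details, for illustration only.
BASE_URL = "http://localhost:8108"
API_KEY = "xyz"
COLLECTION = "records"


def build_create_request(document: dict) -> urllib.request.Request:
    """Build (but do not send) a request that indexes one document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/collections/{COLLECTION}/documents",
        data=json.dumps(document).encode("utf-8"),
        headers={
            "X-TYPESENSE-API-KEY": API_KEY,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_delete_request(doc_id: str) -> urllib.request.Request:
    """Build (but do not send) a request that deletes a document by its id."""
    # Percent-encode the id so characters such as "/" cannot alter the path.
    safe_id = urllib.parse.quote(doc_id, safe="")
    return urllib.request.Request(
        url=f"{BASE_URL}/collections/{COLLECTION}/documents/{safe_id}",
        headers={"X-TYPESENSE-API-KEY": API_KEY},
        method="DELETE",
    )


# The unique per-document value goes in the "id" field at indexing time.
create_req = build_create_request({"id": "rec-123", "title": "hello"})
# Later, the same value addresses the document directly.
delete_req = build_delete_request("rec-123")
```

Passing either request to urllib.request.urlopen would send it to a running server; the sketch stops short of that so it can be read and run without one.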
Jason
06:59 PM
pre_segmented_query has historically only been useful for CJK locales - most likely not required for the use-case you mentioned above
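For reference, a small sketch of how the pre_segmented_query flag would be passed as a search parameter. It only builds the query string; the title field and the sample query are hypothetical, and the assumption (per the linked search-parameter docs) is that the caller has already split the query into space-separated words.

```python
from urllib.parse import parse_qs, urlencode


def build_search_query(q: str, query_by: str, pre_segmented: bool = False) -> str:
    """Return a URL query string for a search request."""
    params = {"q": q, "query_by": query_by}
    if pre_segmented:
        # Signals that the caller has already split q into words.
        params["pre_segmented_query"] = "true"
    return urlencode(params)


# Hypothetical example: a query the caller segmented beforehand.
query_string = build_search_query("東京 タワー", "title", pre_segmented=True)
decoded = parse_qs(query_string)
```

As Jason notes, this is unlikely to be needed for the unique-value use-case discussed here.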
kam
07:00 PM
I see, so you are saying the engine treats the id field specially, such that it won't be tokenized?
kam
07:00 PM
field type is string
Jason
07:00 PM
Correct
kam
07:01 PM
Good to know. What if I have multiple such fields?
kam
07:01 PM
I have in fact 3 fields per doc
kam
07:01 PM
The first is the tenant id, the second is the tenant company, and the record id is the last one
kam
07:01 PM
The record id is unique within a tenant id and company id
Jason
07:02 PM
You'll most likely be using them in filter_by which doesn't do any tokenization and will use a separate non-tokenized index to filter results
Jason
07:02 PM
Tokenization only comes into the picture when you use the q (full text search) parameter
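A rough sketch of the filter_by approach for kam's tenant and company fields. The field names tenant_id and company_id, the sample values, and the title field are assumptions made for illustration; the fields would also need to be declared in the collection schema. The code only builds the query string and does not contact a server.

```python
from urllib.parse import parse_qs, urlencode


def build_filter(tenant_id: str, company_id: str) -> str:
    """Combine two exact-match conditions with AND (&&)."""
    return f"tenant_id:={tenant_id} && company_id:={company_id}"


def build_search_params(q: str, query_by: str, tenant_id: str, company_id: str) -> str:
    """Return a URL query string for a search scoped to one tenant and company."""
    return urlencode(
        {
            "q": q,  # only this parameter goes through tokenization
            "query_by": query_by,
            "filter_by": build_filter(tenant_id, company_id),
        }
    )


# q="*" matches every document, so the result set is driven by the filter alone.
params = build_search_params("*", "title", "acme", "emea")
decoded_filter = parse_qs(params)["filter_by"][0]
```

The values in filter_by are compared as whole values rather than as search tokens, which is why no per-field tokenization setting is needed for this use-case.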
kam
07:02 PM
got it
kam
07:03 PM
thanks for the help Jason
