Tokenization and Indexing Fields with Typesense
TLDR kam wanted to understand how to control tokenization and indexing for certain fields. Jason explained that tokenization is applied during search queries and not during the indexing phase, and shared how to delete a document using an indexed unique value under
Aug 28, 2023 (3 months ago)
`raw_field` I want to index the value as is without tokenization
`pre_segmented_query: true` documented here: https://typesense.org/docs/0.25.0/api/search.html#query-parameters
`pre_segmented_query`, is there similar thing for index field that is applied during indexing phase?
`id` when you send the document to Typesense, you can then reference that document by ID to do CRUD operations on it
`pre_segmented_query` has historically only been useful for CJK locales - most likely not required for the use-case you mentioned above
`id` field specific such that it won't tokenize?
`filter_by` which doesn't do any tokenization and will use a separate non-tokenized index to filter results
`q` (full text search) parameter
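A minimal sketch of the approaches mentioned in the thread, written in Python with only the standard library so it just builds the request URLs. The host, the collection name `items`, and the value `AB-123` are hypothetical; `raw_field`, `filter_by`, `q`, and the document `id` come from the discussion. It assumes Typesense's documented search and document endpoints and the `:=` exact-match filter operator, and is not necessarily what Jason shared.

```python
from urllib.parse import urlencode, quote

BASE_URL = "http://localhost:8108"  # hypothetical host
COLLECTION = "items"                # hypothetical collection name


def exact_match_params(field: str, value: str) -> dict:
    """Search parameters that select documents through filter_by only.

    q="*" is a wildcard query, so the filter does the selection
    rather than the full-text (tokenized) q parameter.
    """
    return {
        "q": "*",
        "query_by": field,
        "filter_by": f"{field}:={value}",  # := requests an exact match
    }


def search_url(params: dict) -> str:
    """URL for a GET request against the search endpoint."""
    return f"{BASE_URL}/collections/{COLLECTION}/documents/search?{urlencode(params)}"


def delete_by_id_url(doc_id: str) -> str:
    """URL for a DELETE request that removes one document by its id."""
    return f"{BASE_URL}/collections/{COLLECTION}/documents/{quote(doc_id, safe='')}"


def delete_by_filter_url(field: str, value: str) -> str:
    """URL for a DELETE request that removes documents matching a filter."""
    query = urlencode({"filter_by": f"{field}:={value}"})
    return f"{BASE_URL}/collections/{COLLECTION}/documents?{query}"
```

Each request would also need the `X-TYPESENSE-API-KEY` header. Setting the unique value as the document `id` at import time makes the delete-by-id form available; otherwise the filter form can target the indexed field.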
Custom Tokenization and Search Issues in Chinese Text
crapthings asked about a custom tokenizer for Chinese, which Kishore Nallan said is unsupported. They discussed how tokenization affects vector search and hybrid search. Testing by crapthings surfaced certain words that did not match and problems with larger documents. Kishore Nallan advised splitting larger documents before indexing and suggested `group_by=parent_doc_id` for deduplication.
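The splitting advice can be sketched roughly as follows. This is an illustration under assumptions, not the code from that thread: the `text` field name, the chunk size, and the `id` naming scheme are hypothetical, and only `parent_doc_id` and `group_by` come from the summary. Chunks are cut by character count because Chinese text is not whitespace-delimited. Typesense requires a `group_by` field to be declared as a facet in the collection schema.

```python
def split_document(doc_id: str, text: str, max_chars: int = 500) -> list:
    """Split `text` into chunks of at most `max_chars` characters.

    Each chunk becomes its own document that carries the original
    document's id in `parent_doc_id`.
    """
    chunks = []
    for start in range(0, len(text), max_chars):
        chunks.append({
            "id": f"{doc_id}_{start // max_chars}",  # hypothetical id scheme
            "parent_doc_id": doc_id,
            "text": text[start:start + max_chars],
        })
    return chunks


def grouped_search_params(query: str) -> dict:
    """Search parameters that return at most one hit per parent document."""
    return {
        "q": query,
        "query_by": "text",           # hypothetical field name
        "group_by": "parent_doc_id",  # collapse chunks of the same parent
        "group_limit": 1,
    }
```

A fixed-size cut can split a word across two chunks, so overlapping chunks or sentence-aware splitting may work better in practice.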
Troubleshooting "drop_tokens_threshold" and Typo Tolerance in Typesense
Joe had issues with "drop_tokens_threshold" = 0 and typo tolerance in Typesense, after which Kishore Nallan provided solutions and clarifications on feature functionality. Their issues with the search result limit and tokens were resolved after discussion and testing.
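For reference, a small sketch of search parameters that turn off both token dropping and typo correction. The summary does not record the settings Joe ended up with, so this only illustrates the two documented parameters; the function name and the example field are hypothetical.

```python
def strict_search_params(query: str, query_by: str) -> dict:
    """Search parameters that require every query token to match as typed."""
    return {
        "q": query,
        "query_by": query_by,
        "drop_tokens_threshold": 0,  # 0 disables dropping query tokens
        "num_typos": 0,              # 0 disables typo correction
    }
```

For example, `strict_search_params("red running shoes", "title")` would be passed as the query parameters of a search request.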
Resolving Typesense Search Issues
A conversation started by Maximilian about Typesense search behavior led Kishore Nallan and Mike to discuss and suggest a workaround, with Kishore Nallan promising an official solution soon. No final confirmation of resolution was provided.
Issue with Query Expectations on Typesense Search
Sean was having an issue with their search query on Typesense. Kishore Nallan suggested adjusting the `drop_tokens_threshold` parameter. After making the adjustment, Sean found an issue with the order of the results, which was resolved by updating the Typesense version.
Performance Characteristics of Filtering Search Results
Oskar asked about performance differences when filtering search results. Jason clarified how filters work and suggested performance improvements such as increasing vCPUs and sharding the collection. Kishore Nallan explained filter IDs and document ID matching. The thread concluded with a discussion of performance tradeoffs in the filter implementation.