#community-help

Custom Tokenization and Search Issues in Chinese Text

TLDR: crapthings asked about using a custom tokenizer dictionary for Chinese, which Kishore Nallan said is unsupported. They discussed how tokenization affects vector search and hybrid search. Testing by crapthings surfaced issues with certain words not matching and problems with larger documents. Kishore Nallan advised splitting larger documents before indexing and suggested group_by=parent_doc_id for deduplication.

Solved
Nov 15, 2023 (2 weeks ago)
crapthings
08:16 AM
Hello, how can I use a custom tokenizer dictionary for Chinese?
Kishore Nallan
08:16 AM
We don't support custom tokenization at the moment.
crapthings
08:36 AM
ok
crapthings
08:45 AM
What is the chs (Chinese) tokenizer? Is it cppjieba?
https://github.com/typesense/typesense/issues/267
Kishore Nallan
08:46 AM
We use the tokenizer in libicu

Nov 16, 2023 (1 week ago)
crapthings
07:51 AM
If, as mentioned in the docs, the locale is not specified when creating the schema and we tokenize the text ourselves, will that tokenization affect the embedding used for vector search?
crapthings
07:53 AM
> Segment both the indexed text and the search query string yourself

Is there an example available?
Kishore Nallan
07:56 AM
As long as you don't use a locale but provide space separation in your data, Typesense will use those spaces as token boundaries. Then, while searching, use the pre_segmented_query flag so we will once again use the spaces in the query to split tokens. That's all.
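The two steps above can be sketched like this. This is a minimal illustration, assuming segmentation is done outside Typesense with a tool such as jieba; the sample segments are hand-written for the example, not real tokenizer output.

```javascript
// Without a `locale` on the field, Typesense treats spaces as token
// boundaries, so we join pre-cut segments with spaces before indexing.
const segments = ['三国', '时期', '的', '孔明'] // output of your own segmenter
const indexedText = segments.join(' ')          // what gets stored in the document

// At search time, pre-segment the query the same way and set
// `pre_segmented_query: true` so Typesense also splits the query on spaces.
const searchParams = {
  q: '孔明', // already a single segment here
  query_by: 'text',
  pre_segmented_query: true
}

// The token boundaries Typesense will see at index time:
const indexTokens = indexedText.split(' ')
console.log(indexTokens.includes('孔明')) // true — the segment is a whole token
```

The same segmenter must be applied on both sides; if the index and the query are cut differently, the tokens will not line up.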
crapthings
07:58 AM
will this impact hybrid search?
Kishore Nallan
07:58 AM
Everything should work as expected. Can't think of a situation otherwise.
crapthings
07:59 AM
okay thanks will try it

crapthings
08:03 AM
So it looks like the highlight marks will be lost, because the tokenizer drops many unnecessary words and saves the segmented text to the data. Does the UI itself implement the highlighting based on the query results?
Kishore Nallan
08:04 AM
The highlighter tries to find the query's tokens in the text, so it might work. If it doesn't, please post a small reproducible example.

crapthings
08:04 AM
ok thanks
Nov 21, 2023 (1 week ago)
crapthings
06:29 AM
Hello, I have tested with pre_segmented_query.
crapthings
06:29 AM
I've found certain words don't work with search.

For example, 孔明
crapthings
06:31 AM
I used another tool to cut the words and saved the result to Typesense without a locale.
crapthings
06:32 AM
It still doesn't match the result.
crapthings
06:36 AM
```javascript
const fs = require('fs')
const Typesense = require('typesense')
const { load, cut } = require('@node-rs/jieba')

load()

const txt = fs.readFileSync('./5.txt', 'utf8')

const databaseName123 = 'okaytest123'

const client = new Typesense.Client({
  nodes: [{
    host: 'localhost',
    port: '8108',
    protocol: 'http'
  }],
  apiKey: 'xyz',
  connectionTimeoutSeconds: 300
})

const schema = {
  name: databaseName123,
  fields: [
    { name: 'title', type: 'string', locale: 'zh' },
    { name: 'text', type: 'string' },
    {
      name: 'embedding',
      type: 'float[]',
      embed: {
        from: ['title'],
        model_config: {
          model_name: 'ts/all-MiniLM-L12-v2'
        }
      }
    }
  ]
}

;(async function () {
  // Drop any collection left over from a previous run, then recreate it.
  try {
    await client.collections(databaseName123).delete()
  } catch {}

  try {
    await client.collections().create(schema)
  } catch (ex) {
    console.log(ex)
  }

  // Pre-segment the text with jieba, joining the tokens with spaces.
  const text = cut(txt).join(' ')

  fs.writeFileSync('./cut.txt', text)

  try {
    await client.collections(databaseName123).documents().import([
      { id: '1', title: txt, text }
    ])
  } catch (ex) {
    console.log(ex)
  }

  const resp = await client.collections(databaseName123).documents().search({
    q: '孔明',
    query_by: 'text',
    pre_segmented_query: true
    // also tried: query_by: 'embedding,title' / 'title',
    // drop_tokens_threshold: 0, typo_tokens_threshold: 0, prefix: false,
    // sort_by: '_vector_distance:desc'
  }).catch(console.log)

  console.log(JSON.stringify(resp, null, 2))
}())
```
crapthings
06:38 AM
reproduction
Kishore Nallan
07:44 AM
Thanks for sharing the files. I'll check and get back to you in a day or so.

crapthings
07:55 AM
I think something hidden in the 5.txt file breaks the search; I've tried other raw Chinese text files, which are okay.

Also, if I randomly delete large paragraphs and keep only some, search works too, but when I undo the deletion it's broken again.
crapthings
08:33 AM
It looks like the English locale has the same issue if the text is long.

If you load this document and search for “crapthings”, it returns random results.
Nov 22, 2023 (1 week ago)
Kishore Nallan
05:15 AM
> it returns random result
The issue is that for very large documents we have to wrap the word offsets after 64,000 positions. The highlighting does not take that into consideration, so you are getting correct results, but the wrong word is highlighted in the snippet because of the wrap-around.
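As a rough illustration of the wrap-around (the 64,000 limit comes from the message above; modeling the wrap as a modulo is an assumption made here for clarity):

```javascript
// Word positions in very large documents wrap after 64,000 positions.
// A word that really sits at position 70,000 is stored as if it were at
// 6,000, so the match is correct but the highlight points at the wrong word.
const OFFSET_LIMIT = 64000
const actualPosition = 70000                          // real word position in the doc
const storedPosition = actualPosition % OFFSET_LIMIT  // position after the wrap
console.log(storedPosition) // 6000
```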
crapthings
06:16 AM
Is this a bug, or is there a setting for this?
Kishore Nallan
06:17 AM
In general I don't recommend indexing such large docs. The correct practice would be to split the document into chunks and then do a group-by.
crapthings
06:19 AM
What is the best break point for such a case? I have a document of about 600,000 characters; I tried chunking it into 150,000-character pieces, and the search results look okay.
Kishore Nallan
06:20 AM
Maybe 10,000 words per document.
Kishore Nallan
06:20 AM
Can also be smaller. It depends on the use case and context.
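The splitting advice above can be sketched like this. The chunkDocument helper and its parameters are hypothetical; only the ~10,000-word guideline and the parent_doc_id field come from the thread.

```javascript
// Split a long document into sub-documents of at most `wordLimit`
// whitespace-separated tokens, each tagged with a parent_doc_id so the
// chunks can later be collapsed with group_by.
function chunkDocument (parentId, text, wordLimit = 10000) {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks = []
  for (let i = 0; i < words.length; i += wordLimit) {
    chunks.push({
      id: `${parentId}-${chunks.length}`,
      parent_doc_id: parentId,
      text: words.slice(i, i + wordLimit).join(' ')
    })
  }
  return chunks
}

// Tiny demonstration with a limit of 2 words per chunk:
const docs = chunkDocument('doc1', 'a b c d e', 2)
console.log(docs.length)  // 3
console.log(docs[2].text) // 'e'
```

For pre-segmented Chinese text the spaces inserted by the segmenter double as the word boundaries here, so the same helper applies.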
crapthings
06:30 AM
Should we compare and process highly similar texts outside Typesense, or should we use Typesense's built-in similarity feature and implement a deduplication mechanism?
crapthings
06:32 AM
Is there a way to deduplicate after partitioning or chunking?

crapthings
06:34 AM
Like, for each document, do I search like this and remove by score?
Kishore Nallan
08:24 AM
When you split a document into N sub-documents, all sub-documents can have a parent_doc_id field. Then, during searching, just do group_by=parent_doc_id.
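A sketch of what that looks like; the chunk contents and the commented-out client call are placeholders, while parent_doc_id and group_by come from the message above:

```javascript
// Index each chunk with a parent_doc_id pointing back at the original doc...
const chunks = [
  { id: 'doc1-0', parent_doc_id: 'doc1', text: 'first chunk of the document' },
  { id: 'doc1-1', parent_doc_id: 'doc1', text: 'second chunk of the document' }
]
// await client.collections('okaytest123').documents().import(chunks)

// ...then collapse hits from the same parent at search time:
const searchParams = {
  q: '孔明',
  query_by: 'text',
  pre_segmented_query: true,
  group_by: 'parent_doc_id' // one result group per original document
}
```

Each group in the response then represents one original document, with the matching chunks inside it.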
