Custom Tokenization and Search Issues in Chinese Text
TL;DR: crapthings asked about using a custom tokenizer for Chinese, which Kishore Nallan said is unsupported. They discussed how tokenization affects vector search and hybrid search. Testing by crapthings surfaced issues with certain words not matching and with very large documents. Kishore Nallan advised splitting large documents into chunks for indexing and suggested group_by=parent_doc_id for deduplication.
Nov 15, 2023

crapthings (08:16 AM)
Kishore Nallan (08:16 AM)
crapthings (08:36 AM)
crapthings (08:45 AM)
https://github.com/typesense/typesense/issues/267
Kishore Nallan (08:46 AM)
Nov 16, 2023

crapthings (07:51 AM)
crapthings (07:53 AM)
is there an example available?
Kishore Nallan (07:56 AM)
crapthings (07:58 AM)
Kishore Nallan (07:58 AM)
crapthings (07:59 AM)
crapthings (08:03 AM)
Kishore Nallan (08:04 AM)
crapthings (08:04 AM)

Nov 21, 2023
crapthings (06:29 AM)
crapthings (06:29 AM)
for example 孔明
crapthings (06:29 AM)
crapthings (06:29 AM)
crapthings (06:31 AM)
crapthings (06:32 AM)
crapthings (06:36 AM)
```javascript
const fs = require('fs')
const Typesense = require('typesense')
const { load, cut } = require('@node-rs/jieba')

// Initialize the jieba segmenter
load()

const txt = fs.readFileSync('./5.txt', 'utf8')
const databaseName123 = 'okaytest123'

const client = new Typesense.Client({
  nodes: [{ host: 'localhost', port: '8108', protocol: 'http' }],
  apiKey: 'xyz',
  connectionTimeoutSeconds: 300,
})

const schema = {
  name: databaseName123,
  fields: [
    { name: 'title', type: 'string', locale: 'zh' },
    { name: 'text', type: 'string' },
    {
      name: 'embedding',
      type: 'float[]',
      embed: {
        from: ['title'],
        model_config: { model_name: 'ts/all-MiniLM-L12-v2' },
      },
    },
  ],
}

;(async function () {
  // Recreate the collection from scratch
  try {
    await client.collections(databaseName123).delete()
  } catch {}

  try {
    await client.collections().create(schema)
  } catch (ex) {
    console.log(ex)
  }

  // Pre-segment the Chinese text with jieba, joining tokens with spaces
  const text = cut(txt).join(' ')
  fs.writeFileSync('./cut.txt', text)

  try {
    await client.collections(databaseName123).documents().import([
      { id: '1', title: txt, text },
    ])
  } catch (ex) {
    console.log(ex)
  }

  const resp = await client.collections(databaseName123).documents().search({
    q: '孔明',
    query_by: 'text',
    pre_segmented_query: true,
    // query_by: 'embedding,title',
    // query_by: 'title',
    // drop_tokens_threshold: 0, typo_tokens_threshold: 0, prefix: false
    // sort_by: '_vector_distance:desc'
  }).catch(console.log)

  console.log(JSON.stringify(resp, null, 2))
})()
```
crapthings (06:38 AM)
Kishore Nallan (07:44 AM)
crapthings (07:55 AM)
i've tried another raw Chinese text file, which is okay
i also randomly deleted large paragraphs and kept some; search works too, but when i undo the delete it breaks again
crapthings (08:33 AM)
if you load this document and search for "crapthings"
it returns random results
Nov 22, 2023

Kishore Nallan (05:15 AM)
The issue is that for very large documents we have to wrap the word offsets after 64,000 positions. The highlighting is not taking that into consideration. So you are getting correct results, but the wrong word is highlighted in the snippet because of the wrap-around.
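A minimal sketch of why the wrap-around mislocates highlights. The 64,000-position boundary comes from the message above; the modulo storage model is an illustrative assumption, not actual Typesense internals:

```javascript
// Assumed illustration: if word positions wrap at a fixed boundary,
// a token's stored offset is its true position modulo that boundary.
const WRAP = 64000 // wrap boundary mentioned above

function storedOffset(truePosition) {
  return truePosition % WRAP
}

// A match at token position 70,000 is stored as 6,000, so the highlighter
// builds its snippet around the wrong part of the document.
console.log(storedOffset(70000)) // 6000
```

This is why the hits themselves are correct (matching is unaffected) while only the highlighted snippet points at the wrong place.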
crapthings (06:16 AM)
Kishore Nallan (06:17 AM)
crapthings (06:19 AM)
i have a document of about 600,000 (60万) characters; i tried chunking it into 150,000-character pieces, and the search results look okay
Kishore Nallan (06:20 AM)
Kishore Nallan (06:20 AM)
crapthings (06:30 AM)
Should we compare and process highly similar texts outside Typesense, or should we use Typesense's built-in similarity feature and implement a deduplication mechanism?
crapthings (06:32 AM)
Is there a way to deduplicate after partitioning or chunking?
crapthings (06:34 AM)
Kishore Nallan (08:24 AM)
…parent_doc_id field. Then during searching just do a group_by=parent_doc_id
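The chunk-and-group approach suggested above can be sketched as follows. The parent_doc_id field name matches the suggestion; the chunk size, chunk_index field, and collection name are placeholders:

```javascript
// Split a large document into chunks indexed as separate child documents
// that all carry the same parent_doc_id.
function chunkDocument(parentId, text, chunkSize = 10000) {
  const chunks = []
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push({
      parent_doc_id: parentId,
      chunk_index: chunks.length,
      text: text.slice(i, i + chunkSize),
    })
  }
  return chunks
}

// At search time, collapse hits from the same parent document, e.g.:
// client.collections('docs').documents().search({
//   q: '孔明',
//   query_by: 'text',
//   group_by: 'parent_doc_id',
//   group_limit: 1,
// })
```

Keeping each chunk well under the offset wrap boundary also sidesteps the highlighting issue discussed earlier in the thread.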