#community-help

Solving Keyword Search in Document Chunks with Typesense

TLDR Dima was struggling with keyword search in divided document chunks. Kishore Nallan resolved the issue by suggesting adding a 'para_num' integer field to sorting criteria and trying the updated 0.24 RC builds.

Jan 12, 2023 (9 months ago)
Dima
08:55 AM
Hi everyone! Looking for advice 🌚

I have large text documents, each with a title and content, which I divided into manageable chunks (~200 words each). I’m running the search against title and content and then grouping the results by document_id, so I get no more than one chunk per document. When the keyword is in the content, it works pretty well: I get only the matching chunk in the response. But when only the title contains the keyword, I get a random chunk from the document, usually the last one loaded into the index. How can I make sure that when the keyword isn’t found in the content, I get the first chunk of the document and not a random one?
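
A minimal sketch of the chunking setup described in this message, for readers following along. It is not Dima's actual code: the helper name `chunk_document` and the `object_id` naming scheme are assumptions made for illustration, and the `pageviews` field is left out for brevity.

```python
from typing import Dict, List


def chunk_document(document_id: str, title: str, text: str,
                   words_per_chunk: int = 200) -> List[Dict[str, str]]:
    """Split `text` into chunks of at most `words_per_chunk` words.

    Every chunk repeats the document's title and document_id, so a match
    on either field can be grouped back to the source document.
    """
    words = text.split()
    rows = []
    for start in range(0, len(words), words_per_chunk):
        index = start // words_per_chunk
        rows.append({
            # Hypothetical id scheme: document id plus chunk index.
            "object_id": f"{document_id}-{index}",
            "document_id": document_id,
            "title": title,
            "content": " ".join(words[start:start + words_per_chunk]),
        })
    return rows
```

Because the title is copied onto every row, a keyword that appears only in the title matches all chunks of a document equally, which is the situation the question is about.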
Dima
09:01 AM
I’m thinking about adding an extra non-indexed field “brief” that holds the first chunk’s text in every index row, but it looks like that would roughly double the disk size.

I have also tried adding the chunk number as a sorting parameter, but ran into sort_by not working with 3+ parameters: https://github.com/typesense/typesense/issues/634
Kishore Nallan
09:09 AM
Can you share your schema?
Dima
09:21 AM
```
fields:
  - name: object_id
    type: string
    index: true

  - name: document_id
    type: string
    index: true
    facet: true

  - name: title
    type: string
    index: true

  - name: pageviews
    type: int32
    index: true
    sort: true
```

```
"query_by": "title,content",
"sort_by": "_text_match:desc,pageviews:desc",
"group_by": "document_id",
"group_limit": 1,
"highlight_full_fields": "title,content,path",
```
Kishore Nallan
09:22 AM
content is missing?
Dima
09:22 AM
Yes, you’re right

Fixed one:

```
fields:
  - name: object_id
    type: string
    index: true

  - name: document_id
    type: string
    index: true
    facet: true

  - name: title
    type: string
    index: true

  - name: pageviews
    type: int32
    index: true
    sort: true

  - name: content
    type: string
    index: true
```

```
"query_by": "title,content",
"sort_by": "_text_match:desc,pageviews:desc",
"group_by": "document_id",
"group_limit": 1,
"highlight_full_fields": "title,content",
```
Kishore Nallan
09:23 AM
Let's say you have a document with a title and 4 paragraphs, and you store each paragraph as a separate document in Typesense. Do you have the same title field on all 4 docs? Or is title just omitted?
Kishore Nallan
09:23 AM
I also don't follow this fully. Can you elaborate?

> But when only title contains keyword I got random chunk from document, usually last loaded into index.
Dima
09:24 AM
The same title for all those 4 index rows, right
Kishore Nallan
09:25 AM
Ok I think I got it. When title is found, you want the result with the first para to be returned for the match.

Dima
09:25 AM
If I search against title and content and the keyword appears only in the title, group_by returns just one index row per document, because the _text_match score is the same for all of them. But the row it returns is random (a random paragraph, in your example).
Kishore Nallan
09:27 AM
You can just add a para_num integer field and add it to the sorting criteria. The 3-way sorting issue you linked is fixed in recent 0.24 RC builds. You can use them (many people already run them in production). We will soon be releasing it fully.
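
A rough sketch of what this suggestion could look like, assuming the schema and search parameters shared earlier in the thread. The `int32` type for `para_num`, the `para_num:asc` direction, the placeholder query, and the helpers `add_para_num` and `pick_chunk` are all assumptions for illustration. `pick_chunk` is a plain-Python stand-in for the sort order and does not reproduce Typesense's internals.

```python
from typing import Dict, List

# Extra schema field suggested in the thread; int32 is an assumed type.
PARA_NUM_FIELD = {"name": "para_num", "type": "int32"}

# Search parameters from the thread, with para_num as a third sort key.
SEARCH_PARAMS = {
    "q": "keyword",  # placeholder query
    "query_by": "title,content",
    "sort_by": "_text_match:desc,pageviews:desc,para_num:asc",
    "group_by": "document_id",
    "group_limit": 1,
}


def add_para_num(chunks: List[Dict]) -> List[Dict]:
    """Return copies of `chunks` numbered 0, 1, 2, ... in document order."""
    return [{**chunk, "para_num": i} for i, chunk in enumerate(chunks)]


def pick_chunk(chunks: List[Dict], text_match: Dict[int, int]) -> Dict:
    """Local illustration: pick the top chunk of one document.

    `text_match` maps para_num to a made-up relevance score. The ordering
    mirrors the sort_by string above: score descending, pageviews
    descending, then para_num ascending as the tie-breaker.
    """
    return min(
        chunks,
        key=lambda c: (-text_match[c["para_num"]], -c["pageviews"], c["para_num"]),
    )
```

With equal scores and equal pageviews across all chunks of a document (the title-only match case), the third key decides and the chunk with the lowest `para_num` comes first. When one chunk's content matches better, its higher score still takes precedence.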
Dima
09:29 AM
Great! Do you have binaries for RC builds?
Kishore Nallan
09:31 AM
DEB / RPM or tar gz?
Dima
10:34 AM
tar.gz 🙏
Kishore Nallan
10:36 AM
Dima
10:36 AM
Thank you!
