Hi I am trying to perform this query it s taking more than 2 typesense #community-help

Hi, I am trying to perform this query. it's taking...

Manav Kothari

11/05/2024, 1:27 PM

Hi, I am trying to perform this query. it's taking more than 25 seconds. can I get some assistance on optimizing latency? people collection size: 100 M - (no of fields: 18) organizations collection size: 4M - (no of fields: 96) development collection size: 100K - (no of fields: 4) index built on title, city, employee_count_range, workspace_id

Copy code

{
  highlight_fields: 'none',
  collection: 'people,
  include_fields: 'id,organization_id,linkedin_url,title,seniority,job_start_date,first_name,last_name,name,city,state,country,country_code,country_region,$organizations(id,name,website_url, primary_domain, founded_year, linkedin_url, phone, industry_name) as organizations,$development(stage) AS "stage"',
  filter_by: "title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200']) && (id:* || $development(workspace_id : ws30ngrb03luqxcl27))",
  page: 2,
  per_page: 25,
  q: '*'
}

Kishore Nallan

11/05/2024, 1:29 PM

['51 to 200']

is not a valid Typesense range filter syntax. It must be

[51 .. 200]

Manav Kothari

11/05/2024, 1:30 PM

oh actually employee_count_range is string field it has values like

1 to 10

11 to 100

etc.

Kishore Nallan

11/05/2024, 3:20 PM

Ok I think this is because of the

id:*

condition. We've an idea to optimize that. I will get back to you in a few days with a new build where we address that.

👍 1

Manav Kothari

11/07/2024, 4:59 AM

Hi, @Kishore Nallan could you share relevant pr, so I can keep an eye on this update.

Harpreet Sangar

11/07/2024, 9:51 AM

@Manav Kothari Can you check with the following filter:

Copy code

"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"

Harpreet Sangar

11/07/2024, 9:58 AM

could you share relevant pr, so I can keep an eye on this update.

This PR does improve the evaluation of

id: *

filter when

enable_lazy_filter

is true but it won't work with your query since you're doing a wildcard search (

q: '*'

). The change in filter_by should be more relevant to you.

Manav Kothari

11/07/2024, 10:10 AM

Copy code

"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"

@Harpreet Sangar the meaning of this filter is different right as I want to perform given conditions on records as well as wanna fetch the details from the development table if that record exists

Harpreet Sangar

11/07/2024, 10:34 AM

This filter will help you achieve exactly that. Suppose

Copy code

title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])

is going to only match 1 document out of 100M, so this filter:

Copy code

title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200']) && (id:* || $development(workspace_id : ws30ngrb03luqxcl27))

is effectively doing:

Copy code

1 doc && (100M docs || ...)

which is wasteful. The docs mention doing

id: * || $JoinCollectionName(...)

for left join in case you don't have any filter to apply on the collection you're searching.

👍 1

Manav Kothari

11/07/2024, 10:45 AM

Okay let me check.

Manav Kothari

11/07/2024, 1:18 PM

Copy code

"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"

i have tried this filter but no luck actually it gets timed out.

Harpreet Sangar

11/07/2024, 2:02 PM

I'll try to replicate the issue.

Harpreet Sangar

11/07/2024, 2:04 PM

Can you check if it times out and what the

found

count is with:

Copy code

"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200']))"

Manav Kothari

11/07/2024, 2:08 PM

this is the no record with above query

Harpreet Sangar

11/08/2024, 2:46 AM

Also, what is the

found

count with just

Copy code

$development(workspace_id : ws30ngrb03luqxcl27)

Kishore Nallan

11/08/2024, 3:04 AM

One easy way to speed things up is to logically split your data across collections. For e.g. you can create 50 collections and then assign records in a given workspace or organization (depending on how your data is partitioned) into a given collection by doing org_id modulus 50. This way we don't have to search through all 100M docs because we will always only need into search into a partition

Manav Kothari

11/08/2024, 3:40 AM

@Harpreet Sangar found a count with

$development(workspace_id : ws30ngrb03luqxcl27)

is approx 100

Harpreet Sangar

11/08/2024, 3:44 AM

Then I don't see how

Copy code

"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"

filter_by times out. Can you check if sending

enable_lazy_filter: true

makes any difference?

Manav Kothari

11/08/2024, 3:53 AM

title:['Software Engineer'] ==> 1sec latency title:['Software Engineer'] && $organizations(country:['United States']) ==> 28 sec latency any idea why? i have make index: true, infix: true, facet: true in field

Manav Kothari

11/08/2024, 3:53 AM

enable_lazy_filter: true

no there is no improvement in latency with this

✅ 1

Harpreet Sangar

11/08/2024, 4:08 AM

any idea why? i have make index: true, infix: true, facet: true in field

filter_by

only requires to have

index: true

. Is it possible for you to share the data so I can analyse and figure out any possible improvement? Documents having only

title

from

people

and

country

field from

organizations

collection will do.

Manav Kothari

11/08/2024, 4:11 AM

sure

Manav Kothari

11/08/2024, 4:12 AM

Org table : 3M records.

Copy code

{
    
    "city": "",
    "country": "China",
    "country_code": "CN",
    "country_region": "APAC",
    "headquarters.country_name": "China",
    "headquarters.state": "",
    "id": "tu_oacxnseb3",
    "industry": "Import and Export",
    "industry_details.naics_code": "444180;423320",
    "industry_details.naics_description": "Other Building Material Dealers;Brick; Stone; and Related Construction Material Merchant Wholesalers",
    "industry_details.sic_code": "",
    "industry_details.sic_description": "",
    "is_linkedin_url_claimed": "true",
    "monthly_google_adspend": "0",
    "name": "Foshan Hotaqi Bath Ware Co.,ltd",
    "primary_domain": "<http://hotaqibath.com|hotaqibath.com>",
    "specialties": "",
    "state": "",
    "state_code": "",
    "street": ""m
    "total_reviews": "0",
    "valid_email_count": "",
    "website_traffic.monthly_organic": "0",
    "website_traffic.monthly_paid": "0",
    "website_traffic.total_monthly": "0",
    "website_url": "<http://www.hotaqibath.com>",
    ...other
  }

Manav Kothari

11/08/2024, 4:13 AM

people : 100M record

Copy code

{
  "city": "São Paulo",
  "country": "Brazil",
  "country_code": "BR",
  "country_region": "LATAM",
  "email": "",
  "email_status": "Not Available",
  "first_name": "Luciana",
  "id": "tu_pacfr6hnj",
  "job_start_date": "2023-07-01",
  "last_name": "Franco",
  "mobile_number": "",
  "name": "Luciana Ferreira Franco",
  "organization_id": "tu_oactbh3ln",
  "other_mobile_number": "",
  "seniority": "Staff",
  "state": "",
  "state_code": "",
  "title": "Coordenadora Técnica Laboratório Central Hsp"
}

Harpreet Sangar

11/08/2024, 4:20 AM

I would need the output of the following requests, if possible:

Copy code

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X GET \
    "<http://localhost:8108/collections/people/documents/export?include_fields=title>"

Copy code

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X GET \
    "<http://localhost:8108/collections/organizations/documents/export?include_fields=organizations>"

Manav Kothari

11/08/2024, 8:04 AM

HI @Harpreet Sangar exporting such a large dataset could be hectic, you can use this git repos to replicate this https://github.com/jneidel/job-titles/blob/master/job-titles.json https://github.com/kmmuntasir/random-name-dataset/blob/master/names/yob1880.txt

👀 1

Manav Kothari

11/08/2024, 8:10 AM

@Kishore Nallan I think partitioning won't work considering our query pattern: For example, we want to fetch the list of people who work with organizations which has an employee count of 10000+, and it's possible that these organizations easily exist across partitions right. which endsup scanning all the tables.

Kishore Nallan

11/08/2024, 8:11 AM

Yeah, if you want global counts then that's not possible.

✅ 1

Manav Kothari

11/11/2024, 5:36 AM

HI, I have denormalized (Merged people and org table) the data in order to achieve better latency I see a lot of improvement now (30 sec to 8 sec) but I am still struggling with optimizing left join here.

Copy code

title:['Manager'] && organizations.country:['United States'] && (id: * || $tuesday_development(workspace_id : ws30ngrb03luqxcl27))

any suggestion to improve this, I have already tried this but it's fetching the wrong results.

Copy code

(title:['Manager'] && organizations.country:['United States']) ||($tuesday_development(workspace_id : ws30ngrb03luqxcl27))

Harpreet Sangar

11/11/2024, 5:39 AM

Can you provide a small example that shows the wrong results being returned?

Manav Kothari

11/11/2024, 5:47 AM

Sure

Copy code

[
     {
            "document": {
                "city": "San Diego",
                "country": "United States",
                "country_code": "US",
                "country_region": "NORAM",
                "email": "<mailto:kiana.west@envedabio.com|kiana.west@envedabio.com>",
                "email_status": "Verified",
                "first_name": "Kiana",
                "id": "tu_padk6um2w",
                "job_start_date": "2022-11-01",
                "organizations.founded_year": "2019",
                "organizations.name": "Enveda Biosciences",
                "organizations.phone": "",
                "organizations.primary_domain": "<http://envedabio.com|envedabio.com>",
                "organizations.website_url": "<https://envedabio.com>",
                "seniority": "Manager",
                "state": "California",
                "title": "Product Manager"
            },
            "highlight": {},
            "highlights": []
        },
        {
            "document": {
                "city": "Longmont",
                "country": "United States",
                "country_code": "US",
                "country_region": "NORAM",
                "email": "",
                "email_status": "Not Available",
                "first_name": "Jackson",
                "id": "tu_pacw23pjg",
                "job_start_date": "2022-08-01",
                "last_name": "Starkey",
                "organizations.name": "Amazon",
                "organizations.phone": "",
                "organizations.primary_domain": "<http://amazon.com|amazon.com>",
                "organizations.website_url": "<https://www.amazon.com>",
                "seniority": "Staff",
                "state": "Colorado",
                "title": "Software Engineer",
                "development": {
                    "stage": "saved"
                }
            },
            "highlight": {},
            "highlights": []
        },
]

if you see the second result it gives the title = "software engineer" which is wrong.

Harpreet Sangar

11/11/2024, 5:52 AM

To recreate the bug I would need the following: • Schemas of the collections • Sample documents • The query that returns unexpected results preferably in this format: https://gist.github.com/jasonbosco/7c3432713216c378472f13e72246f46b

Manav Kothari

11/11/2024, 5:55 AM

okay will get back to you

Manav Kothari

11/11/2024, 6:13 AM

Copy code

### Run Typesense via Docker ########################################
export TYPESENSE_API_KEY=xyz
    
mkdir "$(pwd)"/typesense-data

docker run -p 8108:8108 \
            -v"$(pwd)"/typesense-data:/data typesense/typesense:27.1 \
            --data-dir /data \
            --api-key=$TYPESENSE_API_KEY \
            --enable-cors

### Reproduction Steps ###############################################
export TYPESENSE_API_KEY=xyz

curl "<http://localhost:8108/debug>" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"


curl "<http://localhost:8108/collections>" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
         "name": "companies",
         "fields": [
           {"name": "company_name", "type": "string" },
           {"name": "num_employees", "type": "int32" },
           {"name": "country", "type": "string", "facet": true }
         ],
         "default_sorting_field": "num_employees"
       }'
       
curl "<http://localhost:8108/collections/companies/documents/import?action=create>" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -H "Content-Type: text/plain" \
        -X POST \
        -d '{"id": "124","company_name": "Stark Industries","num_employees": 5215,"country": "USA"}
	{"id": "125","company_name": "Acme Corp 1","num_employees": 2133,"country": "CA"}
	{"id": "126","company_name": "Acme Corp 2","num_employees": 2133,"country": "USA"}
	{"id": "127","company_name": "Acme Corp 3","num_employees": 2133,"country": "INDIA"}
	{"id": "128","company_name": "Stark Industries 2","num_employees": 5215,"country": "USA"}
	{"id": "129","company_name": "Acme Corp 4","num_employees": 2133,"country": "CA"}
	{"id": "120","company_name": "Acme Corp 5","num_employees": 2133,"country": "USA"}
	{"id": "122","company_name": "Acme Corp 6","num_employees": 2133,"country": "INDIA"}'
		   
curl "<http://localhost:8108/collections>" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
	    "name": "development",
	    "fields": [
		{
		    "index": true,
		    "name": "workspace_id",
		    "type": "string"
		},
		{
		    "name": "user_id",
		    "type": "string"
		},
		{
		    "name": "company_id",
		    "reference": "companies.id",
		    "optional": true,
		    "index": true,
		    "type": "string"
		},
		{
		    "index": true,
		    "name": "stage",
		    "type": "string"
		}
	    ]
	}'
        
curl "<http://localhost:8108/collections/development/documents/import?action=create>" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -H "Content-Type: text/plain" \
        -X POST \
        -d '{"workspace_id": "a","user_id": "test","company_id": "124","stage": "saved"}
'

curl "<http://localhost:8108/multi_search>" \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{
          "searches": [
            {
              "collection": "companies",
              "q": "*",
              "filter_by":"(country:[`India`] && company_name:[`Acme Corp 6`]) || $development(workspace_id: 'a')"
            }
          ]
        }'

### Documentation ######################################################################################
# Visit the API reference section: <https://typesense.org/docs/27.1/api/collections.html>
# Click on the "Shell" tab under each API resource's docs, to get shell commands for other API endpoints

Manav Kothari

11/11/2024, 6:15 AM

Result with (Not expected) : (country:[

India

] && company_name:[

Acme Corp 6

]) || $development(workspace_id: 'a')

Copy code

{
  "facet_counts": [],
  "found": 2,
  "hits": [
    {
      "document": {
        "company_name": "Stark Industries",
        "country": "USA",
        "development": {
          "company_id": "124",
          "id": "1",
          "stage": "saved",
          "user_id": "test",
          "workspace_id": "a"
        },
        "id": "124",
        "num_employees": 5215
      },
      "highlight": {},
      "highlights": []
    },
    {
      "document": {
        "company_name": "Acme Corp 6",
        "country": "INDIA",
        "id": "122",
        "num_employees": 2133
      },
      "highlight": {},
      "highlights": []
    }
  ],
  "out_of": 8,
  "page": 1,
  "request_params": {
    "collection_name": "companies",
    "first_q": "*",
    "per_page": 10,
    "q": "*"
  },
  "search_cutoff": false,
  "search_time_ms": 0
}

Manav Kothari

11/11/2024, 6:15 AM

Result with(Expected): (country:[

India

] && company_name:[

Acme Corp 6

]) && (id:* || $development(workspace_id: 'a'))

Copy code

{
  "facet_counts": [],
  "found": 1,
  "hits": [
    {
      "document": {
        "company_name": "Acme Corp 6",
        "country": "INDIA",
        "id": "122",
        "num_employees": 2133
      },
      "highlight": {},
      "highlights": []
    }
  ],
  "out_of": 8,
  "page": 1,
  "request_params": {
    "collection_name": "companies",
    "first_q": "*",
    "per_page": 10,
    "q": "*"
  },
  "search_cutoff": false,
  "search_time_ms": 0
}

Manav Kothari

11/11/2024, 6:34 AM

BTW, i am using 27.1 version let me know if that's okay

Harpreet Sangar

11/11/2024, 9:49 AM

The correct way to get your desired result is with the following `filter_by`:

Copy code

country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a')

Now to understand how this is equivalent to the

filter_by

that produces the correct result:

Copy code

(country:[`India`] && company_name:[`Acme Corp 6`]) && (id:* || $development(workspace_id: 'a'))

Let's suppose we have 3 documents: • id: 0 that matches `country:[

India

] && company_name:[

Acme Corp 6

]` filter and references

workspace_id: 'a'

• id: 1 that matches `country:[

India

] && company_name:[

Acme Corp 6

]` filter but references

workspace_id: 'b'

• id: 2 that does not match `country:[

India

] && company_name:[

Acme Corp 6

]` filter and references

workspace_id: 'a'

This is how the

filter_by

is evaluated:

Copy code

(0, 1) && ((0, 1, 2) || (0, 2))

That further is evaluated as:

Copy code

(0, 1) && (0, 1, 2)

That finally returns only the following documents:

Copy code

0, 1

Harpreet Sangar

11/11/2024, 9:52 AM

So doing

&& (id:* || ...)

is wasted evaluation.

Manav Kothari

11/11/2024, 9:55 AM

but doing this does the inner join right? i want filter to still work if workspace_id: 'x' doesn't exist in development table.

Copy code

country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a')

Harpreet Sangar

11/11/2024, 10:16 AM

I can only think of two ways to achieve this right now • The simple way - Query

development

like:

Copy code

{
  q: *,
  filter_by: workspace_id: x,
  per_page: 0
}

and the check if the

found

count is 0 or not. If 0, your query will be

Copy code

country:[`India`] && company_name:[`Acme Corp 6`]

otherwise:

Copy code

country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a')

• If you wish to achieve the result with a single query, you'll have to send:

Copy code

(country:[`India`] && company_name:[`Acme Corp 6`]) || (country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a'))

✅ 1

Manav Kothari

11/14/2024, 4:52 AM

Hey guys, I have now denormalized the data and made some tradeoffs with left joins on our side, but now I am seeing this new issue where I am just executing this simple query but facing 12 sec of latency.

Copy code

"title:['Software Engineer'] && job_start_date_epoch :< 1638247099140"

this is on people's collection of 100M records any idea why it's happening or any way to optimize this. btw this doesn't happen with other filters like.

Copy code

"title:['Software Engineer'] && country :['india']"

Harpreet Sangar

11/14/2024, 4:57 AM

You can enable

range_index

on your

job_start_date_epoch

field so a range filter like

job_start_date_epoch :< 1638247099140

can evaluate faster.

👍 1

Harpreet Sangar

11/14/2024, 4:57 AM

https://typesense.org/docs/27.1/api/collections.html#field-parameters

Manav Kothari

11/14/2024, 5:02 AM

Thanks

Harpreet Sangar

11/15/2024, 1:06 PM

@Manav Kothari Can you share what's the response time for

Copy code

"title:['Software Engineer'] && job_start_date_epoch :< 1638247099140"

now with

range_index

Manav Kothari

11/15/2024, 1:24 PM

To enable the

range_index

i need to remigrate the data again, as we went to production we are carefully doing this, will update you on this Thanks

Harpreet Sangar

11/15/2024, 1:29 PM

Just a heads up,

range_index

will increase the ram usage.

👀 1

Manav Kothari

11/15/2024, 1:30 PM

do you know any number(banchmark) how much difference it can make?

Harpreet Sangar

11/15/2024, 1:39 PM

It will be at least 3x more than the normal numeric index. It also depends upon how sparse your dataset is but there should be significant improvement in the response time.

Manav Kothari

11/20/2024, 6:56 AM

hi is there a way to update the reference table without migration?

Harpreet Sangar

11/20/2024, 7:02 AM

No. Any schema changes related to references can only be implemented by dropping and re-indexing the collection.

👍 1

Manav Kothari

11/27/2024, 5:25 AM

Hey @Harpreet Sangar, how can I do left join on this

Copy code

title:['Software Engineer','Assistant Manager'] && $development_v2(workspace_id : '1' && stage:!['saved','blocked'])

Harpreet Sangar

11/27/2024, 5:35 AM

i want filter to still work if workspace_id: 'x' doesn't exist in development table.

Is this still a requirement?

Manav Kothari

11/27/2024, 5:37 AM

Yeah actually, want to get a record that is not in the development_v2 table. performance is not a constraint here. though.

Harpreet Sangar

11/27/2024, 5:40 AM

So you want a document that has either of the titles but is not present in development_v2?

Manav Kothari

11/27/2024, 5:42 AM

yes i don't want to fetch the record that are present on development_v2 table which met this condition

Copy code

$development_v2(workspace_id : '1' && stage:['saved','blocked'])

Harpreet Sangar

11/27/2024, 5:54 AM

We don't have a negation operator in the filter_by so you'll have to update your filter like:

Copy code

title:['Software Engineer','Assistant Manager'] && $development_v2(workspace_id :!= '1' || stage:!=['saved','blocked'])

following De Morgan's laws.

Manav Kothari

11/27/2024, 9:43 AM

this query doesn't work for me

Harpreet Sangar

11/27/2024, 10:34 AM

Can you provide an example?

Manav Kothari

11/27/2024, 11:14 AM

yes i mean from table one if there is a, b ,c and in table 2 c is present then I want to fetch a, b

Harpreet Sangar

11/27/2024, 11:17 AM

Give me some time to think about this. I'll get back to you.

👍 1

Manav Kothari

11/28/2024, 4:47 AM

Hi @Harpreet Sangar have you found any solution?

Harpreet Sangar

11/28/2024, 4:58 AM

To get clarity, in this example, there is no relation to

product_c

in the

CustomerProductPrices

collection. Is the requirement to get

product_c

in the result?

Untitled

Manav Kothari

11/28/2024, 5:02 AM

Yes

Harpreet Sangar

11/28/2024, 5:06 AM

We don't have a way to achieve this right now.

Harpreet Sangar

11/28/2024, 5:07 AM

Can you create a feature request regarding this on Github issues?

👀 1

Manav Kothari

11/28/2024, 5:08 AM

Okay

Manav Kothari

11/28/2024, 5:17 AM

is there any workaround?

Harpreet Sangar

11/28/2024, 5:20 AM

You can get the ids of the document first by sending this:

Copy code

filter_by: $development_v2(workspace_id : '1' && stage:['saved','blocked'])

and then send another query like:

Copy code

filter_by: id:!=[...]

Harpreet Sangar

11/28/2024, 5:25 AM

It will be better to use the export endpoint for the first query.

✅ 1

Manav Kothari

11/28/2024, 6:05 AM

Thanks

Open in Slack

Previous Next