Hi, I am trying to perform this query. it's taking...
# community-help
m
Hi, I am trying to perform this query. it's taking more than 25 seconds. can I get some assistance on optimizing latency? people collection size: 100 M - (no of fields: 18) organizations collection size: 4M - (no of fields: 96) development collection size: 100K - (no of fields: 4) index built on title, city, employee_count_range, workspace_id
Copy code
{
  highlight_fields: 'none',
  collection: 'people,
  include_fields: 'id,organization_id,linkedin_url,title,seniority,job_start_date,first_name,last_name,name,city,state,country,country_code,country_region,$organizations(id,name,website_url, primary_domain, founded_year, linkedin_url, phone, industry_name) as organizations,$development(stage) AS "stage"',
  filter_by: "title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200']) && (id:* || $development(workspace_id : ws30ngrb03luqxcl27))",
  page: 2,
  per_page: 25,
  q: '*'
}
k
['51 to 200']
is not a valid Typesense range filter syntax. It must be
[51 .. 200]
m
oh actually employee_count_range is string field it has values like
1 to 10
,
11 to 100
etc.
k
Ok I think this is because of the
id:*
condition. We've an idea to optimize that. I will get back to you in a few days with a new build where we address that.
👍 1
m
Hi, @Kishore Nallan could you share relevant pr, so I can keep an eye on this update.
h
@Manav Kothari Can you check with the following filter:
Copy code
"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"
could you share relevant pr, so I can keep an eye on this update.
This PR does improve the evaluation of
id: *
filter when
enable_lazy_filter
is true but it won't work with your query since you're doing a wildcard search (
q: '*'
). The change in filter_by should be more relevant to you.
m
Copy code
"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"
@Harpreet Sangar the meaning of this filter is different right as I want to perform given conditions on records as well as wanna fetch the details from the development table if that record exists
h
This filter will help you achieve exactly that. Suppose
Copy code
title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])
is going to only match 1 document out of 100M, so this filter:
Copy code
title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200']) && (id:* || $development(workspace_id : ws30ngrb03luqxcl27))
is effectively doing:
Copy code
1 doc && (100M docs || ...)
which is wasteful. The docs mention doing
id: * || $JoinCollectionName(...)
for left join in case you don't have any filter to apply on the collection you're searching.
👍 1
m
Okay let me check.
Copy code
"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"
i have tried this filter but no luck actually it gets timed out.
h
I'll try to replicate the issue.
Can you check if it times out and what the
found
count is with:
Copy code
"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200']))"
m
99497
this is the no record with above query
h
Also, what is the
found
count with just
Copy code
$development(workspace_id : ws30ngrb03luqxcl27)
k
One easy way to speed things up is to logically split your data across collections. For e.g. you can create 50 collections and then assign records in a given workspace or organization (depending on how your data is partitioned) into a given collection by doing org_id modulus 50. This way we don't have to search through all 100M docs because we will always only need into search into a partition
m
@Harpreet Sangar found a count with
$development(workspace_id : ws30ngrb03luqxcl27)
is approx 100
h
Then I don't see how
Copy code
"(title:['Sales Engineer'] && city:['mumbai'] && $organizations(employee_count_range:['51 to 200'])) || $development(workspace_id : ws30ngrb03luqxcl27)"
filter_by times out. Can you check if sending
enable_lazy_filter: true
makes any difference?
m
title:['Software Engineer'] ==> 1sec latency title:['Software Engineer'] && $organizations(country:['United States']) ==> 28 sec latency any idea why? i have make index: true, infix: true, facet: true in field
enable_lazy_filter: true
no there is no improvement in latency with this
1
h
any idea why? i have make index: true, infix: true, facet: true in field
filter_by
only requires to have
index: true
. Is it possible for you to share the data so I can analyse and figure out any possible improvement? Documents having only
title
from
people
and
country
field from
organizations
collection will do.
m
sure
Org table : 3M records.
Copy code
{
    
    "city": "",
    "country": "China",
    "country_code": "CN",
    "country_region": "APAC",
    "headquarters.country_name": "China",
    "headquarters.state": "",
    "id": "tu_oacxnseb3",
    "industry": "Import and Export",
    "industry_details.naics_code": "444180;423320",
    "industry_details.naics_description": "Other Building Material Dealers;Brick; Stone; and Related Construction Material Merchant Wholesalers",
    "industry_details.sic_code": "",
    "industry_details.sic_description": "",
    "is_linkedin_url_claimed": "true",
    "monthly_google_adspend": "0",
    "name": "Foshan Hotaqi Bath Ware Co.,ltd",
    "primary_domain": "<http://hotaqibath.com|hotaqibath.com>",
    "specialties": "",
    "state": "",
    "state_code": "",
    "street": ""m
    "total_reviews": "0",
    "valid_email_count": "",
    "website_traffic.monthly_organic": "0",
    "website_traffic.monthly_paid": "0",
    "website_traffic.total_monthly": "0",
    "website_url": "<http://www.hotaqibath.com>",
    ...other
  }
people : 100M record
Copy code
{
  "city": "São Paulo",
  "country": "Brazil",
  "country_code": "BR",
  "country_region": "LATAM",
  "email": "",
  "email_status": "Not Available",
  "first_name": "Luciana",
  "id": "tu_pacfr6hnj",
  "job_start_date": "2023-07-01",
  "last_name": "Franco",
  "mobile_number": "",
  "name": "Luciana Ferreira Franco",
  "organization_id": "tu_oactbh3ln",
  "other_mobile_number": "",
  "seniority": "Staff",
  "state": "",
  "state_code": "",
  "title": "Coordenadora Técnica Laboratório Central Hsp"
}
h
I would need the output of the following requests, if possible:
Copy code
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X GET \
    "<http://localhost:8108/collections/people/documents/export?include_fields=title>"
Copy code
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -X GET \
    "<http://localhost:8108/collections/organizations/documents/export?include_fields=organizations>"
m
HI @Harpreet Sangar exporting such a large dataset could be hectic, you can use this git repos to replicate this https://github.com/jneidel/job-titles/blob/master/job-titles.json https://github.com/kmmuntasir/random-name-dataset/blob/master/names/yob1880.txt
👀 1
@Kishore Nallan I think partitioning won't work considering our query pattern: For example, we want to fetch the list of people who work with organizations which has an employee count of 10000+, and it's possible that these organizations easily exist across partitions right. which endsup scanning all the tables.
k
Yeah, if you want global counts then that's not possible.
1
m
HI, I have denormalized (Merged people and org table) the data in order to achieve better latency I see a lot of improvement now (30 sec to 8 sec) but I am still struggling with optimizing left join here.
Copy code
title:['Manager'] && organizations.country:['United States'] && (id: * || $tuesday_development(workspace_id : ws30ngrb03luqxcl27))
any suggestion to improve this, I have already tried this but it's fetching the wrong results.
Copy code
(title:['Manager'] && organizations.country:['United States']) ||($tuesday_development(workspace_id : ws30ngrb03luqxcl27))
h
Can you provide a small example that shows the wrong results being returned?
m
Sure
Copy code
[
     {
            "document": {
                "city": "San Diego",
                "country": "United States",
                "country_code": "US",
                "country_region": "NORAM",
                "email": "<mailto:kiana.west@envedabio.com|kiana.west@envedabio.com>",
                "email_status": "Verified",
                "first_name": "Kiana",
                "id": "tu_padk6um2w",
                "job_start_date": "2022-11-01",
                "organizations.founded_year": "2019",
                "organizations.name": "Enveda Biosciences",
                "organizations.phone": "",
                "organizations.primary_domain": "<http://envedabio.com|envedabio.com>",
                "organizations.website_url": "<https://envedabio.com>",
                "seniority": "Manager",
                "state": "California",
                "title": "Product Manager"
            },
            "highlight": {},
            "highlights": []
        },
        {
            "document": {
                "city": "Longmont",
                "country": "United States",
                "country_code": "US",
                "country_region": "NORAM",
                "email": "",
                "email_status": "Not Available",
                "first_name": "Jackson",
                "id": "tu_pacw23pjg",
                "job_start_date": "2022-08-01",
                "last_name": "Starkey",
                "organizations.name": "Amazon",
                "organizations.phone": "",
                "organizations.primary_domain": "<http://amazon.com|amazon.com>",
                "organizations.website_url": "<https://www.amazon.com>",
                "seniority": "Staff",
                "state": "Colorado",
                "title": "Software Engineer",
                "development": {
                    "stage": "saved"
                }
            },
            "highlight": {},
            "highlights": []
        },
]
if you see the second result it gives the title = "software engineer" which is wrong.
h
To recreate the bug I would need the following: • Schemas of the collections • Sample documents • The query that returns unexpected results preferably in this format: https://gist.github.com/jasonbosco/7c3432713216c378472f13e72246f46b
m
okay will get back to you
Copy code
### Run Typesense via Docker ########################################
export TYPESENSE_API_KEY=xyz
    
mkdir "$(pwd)"/typesense-data

docker run -p 8108:8108 \
            -v"$(pwd)"/typesense-data:/data typesense/typesense:27.1 \
            --data-dir /data \
            --api-key=$TYPESENSE_API_KEY \
            --enable-cors

### Reproduction Steps ###############################################
export TYPESENSE_API_KEY=xyz

curl "<http://localhost:8108/debug>" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}"


curl "<http://localhost:8108/collections>" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
         "name": "companies",
         "fields": [
           {"name": "company_name", "type": "string" },
           {"name": "num_employees", "type": "int32" },
           {"name": "country", "type": "string", "facet": true }
         ],
         "default_sorting_field": "num_employees"
       }'
       
curl "<http://localhost:8108/collections/companies/documents/import?action=create>" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -H "Content-Type: text/plain" \
        -X POST \
        -d '{"id": "124","company_name": "Stark Industries","num_employees": 5215,"country": "USA"}
	{"id": "125","company_name": "Acme Corp 1","num_employees": 2133,"country": "CA"}
	{"id": "126","company_name": "Acme Corp 2","num_employees": 2133,"country": "USA"}
	{"id": "127","company_name": "Acme Corp 3","num_employees": 2133,"country": "INDIA"}
	{"id": "128","company_name": "Stark Industries 2","num_employees": 5215,"country": "USA"}
	{"id": "129","company_name": "Acme Corp 4","num_employees": 2133,"country": "CA"}
	{"id": "120","company_name": "Acme Corp 5","num_employees": 2133,"country": "USA"}
	{"id": "122","company_name": "Acme Corp 6","num_employees": 2133,"country": "INDIA"}'
		   
curl "<http://localhost:8108/collections>" \
       -X POST \
       -H "Content-Type: application/json" \
       -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
       -d '{
	    "name": "development",
	    "fields": [
		{
		    "index": true,
		    "name": "workspace_id",
		    "type": "string"
		},
		{
		    "name": "user_id",
		    "type": "string"
		},
		{
		    "name": "company_id",
		    "reference": "companies.id",
		    "optional": true,
		    "index": true,
		    "type": "string"
		},
		{
		    "index": true,
		    "name": "stage",
		    "type": "string"
		}
	    ]
	}'
        
curl "<http://localhost:8108/collections/development/documents/import?action=create>" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -H "Content-Type: text/plain" \
        -X POST \
        -d '{"workspace_id": "a","user_id": "test","company_id": "124","stage": "saved"}
'

curl "<http://localhost:8108/multi_search>" \
        -X POST \
        -H "Content-Type: application/json" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
        -d '{
          "searches": [
            {
              "collection": "companies",
              "q": "*",
              "filter_by":"(country:[`India`] && company_name:[`Acme Corp 6`]) || $development(workspace_id: 'a')"
            }
          ]
        }'

### Documentation ######################################################################################
# Visit the API reference section: <https://typesense.org/docs/27.1/api/collections.html>
# Click on the "Shell" tab under each API resource's docs, to get shell commands for other API endpoints
Result with (Not expected) : (country:[
India
] && company_name:[
Acme Corp 6
]) || $development(workspace_id: 'a')
Copy code
{
  "facet_counts": [],
  "found": 2,
  "hits": [
    {
      "document": {
        "company_name": "Stark Industries",
        "country": "USA",
        "development": {
          "company_id": "124",
          "id": "1",
          "stage": "saved",
          "user_id": "test",
          "workspace_id": "a"
        },
        "id": "124",
        "num_employees": 5215
      },
      "highlight": {},
      "highlights": []
    },
    {
      "document": {
        "company_name": "Acme Corp 6",
        "country": "INDIA",
        "id": "122",
        "num_employees": 2133
      },
      "highlight": {},
      "highlights": []
    }
  ],
  "out_of": 8,
  "page": 1,
  "request_params": {
    "collection_name": "companies",
    "first_q": "*",
    "per_page": 10,
    "q": "*"
  },
  "search_cutoff": false,
  "search_time_ms": 0
}
Result with(Expected): (country:[
India
] && company_name:[
Acme Corp 6
]) && (id:* || $development(workspace_id: 'a'))
Copy code
{
  "facet_counts": [],
  "found": 1,
  "hits": [
    {
      "document": {
        "company_name": "Acme Corp 6",
        "country": "INDIA",
        "id": "122",
        "num_employees": 2133
      },
      "highlight": {},
      "highlights": []
    }
  ],
  "out_of": 8,
  "page": 1,
  "request_params": {
    "collection_name": "companies",
    "first_q": "*",
    "per_page": 10,
    "q": "*"
  },
  "search_cutoff": false,
  "search_time_ms": 0
}
BTW, i am using 27.1 version let me know if that's okay
h
The correct way to get your desired result is with the following `filter_by`:
Copy code
country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a')
Now to understand how this is equivalent to the
filter_by
that produces the correct result:
Copy code
(country:[`India`] && company_name:[`Acme Corp 6`]) && (id:* || $development(workspace_id: 'a'))
Let's suppose we have 3 documents: • id: 0 that matches `country:[
India
] && company_name:[
Acme Corp 6
]` filter and references
workspace_id: 'a'
• id: 1 that matches `country:[
India
] && company_name:[
Acme Corp 6
]` filter but references
workspace_id: 'b'
• id: 2 that does not match `country:[
India
] && company_name:[
Acme Corp 6
]` filter and references
workspace_id: 'a'
This is how the
filter_by
is evaluated:
Copy code
(0, 1) && ((0, 1, 2) || (0, 2))
That further is evaluated as:
Copy code
(0, 1) && (0, 1, 2)
That finally returns only the following documents:
Copy code
0, 1
So doing
&& (id:* || ...)
is wasted evaluation.
m
but doing this does the inner join right? i want filter to still work if workspace_id: 'x' doesn't exist in development table.
Copy code
country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a')
h
I can only think of two ways to achieve this right now • The simple way - Query
development
like:
Copy code
{
  q: *,
  filter_by: workspace_id: x,
  per_page: 0
}
and the check if the
found
count is 0 or not. If 0, your query will be
Copy code
country:[`India`] && company_name:[`Acme Corp 6`]
otherwise:
Copy code
country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a')
• If you wish to achieve the result with a single query, you'll have to send:
Copy code
(country:[`India`] && company_name:[`Acme Corp 6`]) || (country:[`India`] && company_name:[`Acme Corp 6`] && $development(workspace_id: 'a'))
1
m
Hey guys, I have now denormalized the data and made some tradeoffs with left joins on our side, but now I am seeing this new issue where I am just executing this simple query but facing 12 sec of latency.
Copy code
"title:['Software Engineer'] && job_start_date_epoch :< 1638247099140"
this is on people's collection of 100M records any idea why it's happening or any way to optimize this. btw this doesn't happen with other filters like.
Copy code
"title:['Software Engineer'] && country :['india']"
h
You can enable
range_index
on your
job_start_date_epoch
field so a range filter like
job_start_date_epoch :< 1638247099140
can evaluate faster.
👍 1
m
Thanks
h
@Manav Kothari Can you share what's the response time for
Copy code
"title:['Software Engineer'] && job_start_date_epoch :< 1638247099140"
now with
range_index
?
m
To enable the
range_index
i need to remigrate the data again, as we went to production we are carefully doing this, will update you on this Thanks
h
Just a heads up,
range_index
will increase the ram usage.
👀 1
m
do you know any number(banchmark) how much difference it can make?
h
It will be at least 3x more than the normal numeric index. It also depends upon how sparse your dataset is but there should be significant improvement in the response time.
m
hi is there a way to update the reference table without migration?
h
No. Any schema changes related to references can only be implemented by dropping and re-indexing the collection.
👍 1
m
Hey @Harpreet Sangar, how can I do left join on this
Copy code
title:['Software Engineer','Assistant Manager'] && $development_v2(workspace_id : '1' && stage:!['saved','blocked'])
h
i want filter to still work if workspace_id: 'x' doesn't exist in development table.
Is this still a requirement?
m
Yeah actually, want to get a record that is not in the development_v2 table. performance is not a constraint here. though.
h
So you want a document that has either of the titles but is not present in development_v2?
m
yes i don't want to fetch the record that are present on development_v2 table which met this condition
Copy code
$development_v2(workspace_id : '1' && stage:['saved','blocked'])
h
We don't have a negation operator in the filter_by so you'll have to update your filter like:
Copy code
title:['Software Engineer','Assistant Manager'] && $development_v2(workspace_id :!= '1' || stage:!=['saved','blocked'])
following De Morgan's laws.
m
this query doesn't work for me
h
Can you provide an example?
m
yes i mean from table one if there is a, b ,c and in table 2 c is present then I want to fetch a, b
h
Give me some time to think about this. I'll get back to you.
👍 1
m
Hi @Harpreet Sangar have you found any solution?
h
To get clarity, in this example, there is no relation to
product_c
in the
CustomerProductPrices
collection. Is the requirement to get
product_c
in the result?
m
Yes
h
We don't have a way to achieve this right now.
Can you create a feature request regarding this on Github issues?
👀 1
m
Okay
is there any workaround?
h
You can get the ids of the document first by sending this:
Copy code
filter_by: $development_v2(workspace_id : '1' && stage:['saved','blocked'])
and then send another query like:
Copy code
filter_by: id:!=[...]
It will be better to use the export endpoint for the first query.
1
m
Thanks