Integrating OpenAI Embeddings with DocSearch Scraper
TLDR Marcos asked how to use OpenAI embeddings with the DocSearch scraper. Jason pointed him to an updated scraper config with an auto-embedding field, and suggested the built-in GTE model for generic use.
Sep 03, 2023 (3 months ago)
Marcos
02:19 PM Any recipes on how to use OpenAI embeddings with the DocSearch scraper?
Jason
03:35 PM Use typesense/docsearch-scraper:0.9.0.rc1, which adds support for setting custom field definitions. With these, you can add a new auto-embedding field to the scraper config, like below:
{
"index_name": "typesense_docs",
"start_urls": [
{
"url": "<version>.*?)/",
"variables": {
"version": [
"0.21.0"
]
}
}
],
"selectors": {
"default": {
"lvl0": ".content__default h1",
"lvl1": ".content__default h2",
"lvl2": ".content__default h3",
"lvl3": ".content__default h4",
"lvl4": ".content__default h5",
"text": ".content__default p, .content__default ul li, .content__default table tbody tr"
}
},
"custom_settings": {
"field_definitions": [ // <==== ADD THIS
{"name": "anchor", "type": "string", "optional": true},
{"name": "content", "type": "string", "optional": true},
{"name": "url", "type": "string", "facet": true},
{"name": "url_without_anchor", "type": "string", "facet": true, "optional": true},
{"name": "version", "type": "string[]", "facet": true, "optional": true},
{"name": "hierarchy.lvl0", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl1", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl2", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl3", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl4", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl5", "type": "string", "facet": true, "optional": true},
{"name": "hierarchy.lvl6", "type": "string", "facet": true, "optional": true},
{"name": "type", "type": "string", "facet": true, "optional": true},
{"name": ".*_tag", "type": "string", "facet": true, "optional": true},
{"name": "language", "type": "string", "facet": true, "optional": true},
{"name": "tags", "type": "string[]", "facet": true, "optional": true},
{"name": "item_priority", "type": "int64"},
{
"name": "embedding",
"type": "float[]",
"embed": {
"from": [
"content",
"hierarchy.lvl0",
"hierarchy.lvl1",
"hierarchy.lvl2",
"hierarchy.lvl3",
"hierarchy.lvl4",
"hierarchy.lvl5",
"hierarchy.lvl6",
"tags"
],
"model_config": {
"model_name": "openai/text-embedding-ada-002",
"api_key": "your_openai_api_key"
}
}
}
]
}
}
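The embedding field in the config above can also be generated rather than written by hand, which keeps the API key out of the committed file. The following is only a minimal sketch: `buildEmbeddingField` is a hypothetical helper (not part of the scraper), the `OPENAI_API_KEY` environment variable is an assumption, and `ts/gte-small` is assumed to be the identifier of the built-in GTE model mentioned in the TLDR.

```javascript
// Hypothetical helper (not part of docsearch-scraper): builds the
// auto-embedding entry for custom_settings.field_definitions.
function buildEmbeddingField(modelName, apiKey) {
  // Same source fields as in the config above: content, hierarchy.lvl0-6, tags.
  const sourceFields = ["content"];
  for (let level = 0; level <= 6; level++) {
    sourceFields.push(`hierarchy.lvl${level}`);
  }
  sourceFields.push("tags");

  // api_key is only included when one is passed (e.g. for OpenAI).
  const modelConfig = { model_name: modelName };
  if (apiKey) {
    modelConfig.api_key = apiKey;
  }

  return {
    name: "embedding",
    type: "float[]",
    embed: { from: sourceFields, model_config: modelConfig },
  };
}

// OpenAI variant; reading the key from OPENAI_API_KEY is an assumption.
const openAiField = buildEmbeddingField(
  "openai/text-embedding-ada-002",
  process.env.OPENAI_API_KEY || "your_openai_api_key"
);

// Built-in variant; "ts/gte-small" is an assumed model identifier.
const builtInField = buildEmbeddingField("ts/gte-small");

console.log(JSON.stringify(openAiField, null, 2));
```

The printed object can be pasted in as the last entry of `field_definitions`. Swapping in `builtInField` would avoid the OpenAI dependency, if that model name is correct for your Typesense version.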
Jason
03:38 PM Then add a query_by parameter that includes the embedding field to typesenseSearchParameters or typesenseSearchParams (depending on what it's called in that particular version of docsearch.js):
{
query_by: "hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,embedding",
prefix: false
}
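To avoid typos in the long `query_by` string, it can be assembled from the field names. This is a rough sketch only: `buildSearchParameters` is a hypothetical helper, and it assumes the embedding field is named `embedding` as in the scraper config.

```javascript
// Hypothetical helper: builds the search parameters object shown above.
function buildSearchParameters(embeddingFieldName = "embedding") {
  // Keyword fields first: hierarchy.lvl0 through hierarchy.lvl6, then content.
  const keywordFields = [];
  for (let level = 0; level <= 6; level++) {
    keywordFields.push(`hierarchy.lvl${level}`);
  }
  keywordFields.push("content");

  return {
    // The embedding field goes last in query_by.
    query_by: [...keywordFields, embeddingFieldName].join(","),
    // Per the thread, prefix: false is required for OpenAI.
    prefix: false,
  };
}

const searchParameters = buildSearchParameters();
console.log(searchParameters.query_by);
```

The resulting object would be passed as `typesenseSearchParameters` (or `typesenseSearchParams`, depending on the docsearch.js version).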
Jason
03:40 PM (prefix: false is required for OpenAI)