#community-help

Integrating OpenAI Embeddings with DocSearch Scraper

TLDR: Marcos asked how to use OpenAI embeddings with the DocSearch scraper. Jason shared a scraper config that adds an auto-embedding field, and suggested the built-in GTE model for generic use.

Sep 03, 2023
Marcos
02:19 PM
Hey folks!

Any recipes on how to use OpenAI embeddings with the DocSearch scraper?
Jason
03:35 PM
I just published typesense/docsearch-scraper:0.9.0.rc1, which adds support for custom field definitions. With it, you can add a new auto-embedding field to the scraper config, like below:

{
  "index_name": "typesense_docs",
  "start_urls": [
    {
      "url": "<version>.*?)/",
      "variables": {
        "version": [
          "0.21.0"
        ]
      }
    }
  ],
  "selectors": {
    "default": {
      "lvl0": ".content__default h1",
      "lvl1": ".content__default h2",
      "lvl2": ".content__default h3",
      "lvl3": ".content__default h4",
      "lvl4": ".content__default h5",
      "text": ".content__default p, .content__default ul li, .content__default table tbody tr"
    }
  },
  "custom_settings": {
    "field_definitions": [ // <==== ADD THIS
      {"name": "anchor", "type": "string", "optional": true},
      {"name": "content", "type": "string", "optional": true},
      {"name": "url", "type": "string", "facet": true},
      {"name": "url_without_anchor", "type": "string", "facet": true, "optional": true},
      {"name": "version", "type": "string[]", "facet": true, "optional": true},
      {"name": "hierarchy.lvl0", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl1", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl2", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl3", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl4", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl5", "type": "string", "facet": true, "optional": true},
      {"name": "hierarchy.lvl6", "type": "string", "facet": true, "optional": true},
      {"name": "type", "type": "string", "facet": true, "optional": true},
      {"name": ".*_tag", "type": "string", "facet": true, "optional": true},
      {"name": "language", "type": "string", "facet": true, "optional": true},
      {"name": "tags", "type": "string[]", "facet": true, "optional": true},
      {"name": "item_priority", "type": "int64"},
      {
        "name": "embedding",
        "type": "float[]",
        "embed": {
          "from": [
            "content",
            "hierarchy.lvl0",
            "hierarchy.lvl1",
            "hierarchy.lvl2",
            "hierarchy.lvl3",
            "hierarchy.lvl4",
            "hierarchy.lvl5",
            "hierarchy.lvl6",
            "tags"
          ],
          "model_config": {
            "model_name": "openai/text-embedding-ada-002",
            "api_key": "your_openai_api_key"
          }
        }
      }
    ]
  }
}
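
A minimal sketch of how the embedding field could be added to a config programmatically, so the OpenAI key is not committed to the config file. The helper name `withEmbeddingField` and the `OPENAI_API_KEY` environment variable are assumptions for illustration; neither is part of docsearch-scraper. Only the field shape is taken from the config above.

```javascript
// Hypothetical helper (not part of docsearch-scraper): returns a copy of a
// scraper config with an auto-embedding field appended to
// custom_settings.field_definitions. The input config is not modified.
function withEmbeddingField(config, { modelName, apiKey, from }) {
  const settings = config.custom_settings ?? {};
  const existing = settings.field_definitions ?? [];
  const embedding = {
    name: "embedding",
    type: "float[]",
    embed: {
      from,
      model_config: { model_name: modelName, api_key: apiKey },
    },
  };
  return {
    ...config,
    custom_settings: {
      ...settings,
      // Drop any earlier "embedding" entry so the field is defined only once.
      field_definitions: [
        ...existing.filter((field) => field.name !== "embedding"),
        embedding,
      ],
    },
  };
}

// Shortened example config; a real one would list all the fields shown above.
const baseConfig = {
  index_name: "typesense_docs",
  custom_settings: {
    field_definitions: [
      { name: "content", type: "string", optional: true },
      { name: "hierarchy.lvl0", type: "string", facet: true, optional: true },
    ],
  },
};

// OPENAI_API_KEY is an assumed variable name; the fallback is a placeholder.
const scraperConfig = withEmbeddingField(baseConfig, {
  modelName: "openai/text-embedding-ada-002",
  apiKey: process.env.OPENAI_API_KEY ?? "your_openai_api_key",
  from: ["content", "hierarchy.lvl0"],
});

console.log(JSON.stringify(scraperConfig, null, 2));
```

The printed JSON can be saved as the scraper config file. Every field named in `from` should also appear in `field_definitions`, as it does in the full config above.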
Jason
03:38 PM
Then, in your docsearch.js configuration, add a custom query_by parameter to typesenseSearchParameters or typesenseSearchParams (depending on what it's called in your version of docsearch.js):

{
  query_by: "hierarchy.lvl0,hierarchy.lvl1,hierarchy.lvl2,hierarchy.lvl3,hierarchy.lvl4,hierarchy.lvl5,hierarchy.lvl6,content,embedding",
  prefix: false
}
  prefix: false
}
Jason
03:40 PM
(The prefix: false is required for OpenAI)
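
The long query_by string is easy to mistype, so here is a small sketch that builds the same parameters from the hierarchy levels. The helper name `buildSearchParameters` is hypothetical and not part of docsearch.js.

```javascript
// Hypothetical helper: builds the search parameters shown above for a given
// number of hierarchy levels (lvl0 through lvl6 by default).
function buildSearchParameters(levelCount = 7) {
  const hierarchyFields = Array.from(
    { length: levelCount },
    (_, i) => `hierarchy.lvl${i}`
  );
  return {
    // Keyword fields first, then the auto-embedding field.
    query_by: [...hierarchyFields, "content", "embedding"].join(","),
    prefix: false,
  };
}

const searchParameters = buildSearchParameters();
console.log(searchParameters.query_by);
```

The resulting object would be passed as typesenseSearchParameters (or typesenseSearchParams) in the docsearch.js options.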
Jason
03:40 PM
Could you give it a shot and let me know how it goes? I can then update the docs
Marcos
10:06 PM
Quick question: is GPU required for using embeddings?
Jason
10:07 PM
No, it's not required. If you're using OpenAI, a GPU is unnecessary, since OpenAI generates the embeddings and sends them to Typesense.
Jason
10:07 PM
If you use one of the built-in models, then adding a GPU will speed things up, especially if you have, say, hundreds of thousands of docs or more. At that scale, the difference in embedding generation time between a CPU (slower) and a GPU (much, much faster) becomes significant.
Marcos
10:10 PM
Jason, is there any built-in model you consider great for docs? Have you tried any for such a purpose?
Sep 04, 2023
Jason
02:31 AM
I haven’t experimented with the docs use-case, but in general the GTE model seems to perform well for generic use cases
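
For reference, a sketch of what the embedding field might look like with a built-in GTE model instead of OpenAI. The model name `ts/gte-small` is an assumption, as is the helper name `builtInEmbeddingField`; check the Typesense documentation for the model names available in your version.

```javascript
// Sketch of an auto-embedding field that uses a built-in model instead of
// OpenAI. "ts/gte-small" is an assumed model name; verify it against the
// Typesense docs before use.
function builtInEmbeddingField(from, modelName = "ts/gte-small") {
  return {
    name: "embedding",
    type: "float[]",
    embed: {
      from,
      // No api_key here: built-in models run inside Typesense itself.
      model_config: { model_name: modelName },
    },
  };
}

const gteField = builtInEmbeddingField([
  "content",
  "hierarchy.lvl0",
  "hierarchy.lvl1",
]);
console.log(JSON.stringify(gteField, null, 2));
```

This object would replace the OpenAI-based "embedding" entry in field_definitions. Since the embeddings are then generated on the Typesense server, this is the setup where a GPU can make a difference.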
