#community-help

Converting finbert to ONNX for Typesense Model Repository

TLDR Walter asked Jason to convert finbert to ONNX for the Typesense model repository. They discussed how the finbert variants differ and what had to change for the model to produce embeddings. The conversion succeeded, and they agreed on a plan for working around a known re-embedding bug.


Sep 21, 2023 (2 months ago)
Walter
10:52 PM
Hey Jason, about a month ago you offered to do the conversion of finbert to onnx for the typesense model repo on hugging face. I started looking into it but the parameters for converting pytorch to onnx required quite a bit of learning. If you guys have capacity it would be awesome if you could add it, especially if you know what needs to be done for onnx conversion. This is the repo I was looking at (https://huggingface.co/yiyanghkust/finbert-pretrain)

Also no big deal if you can't! I already appreciate everything you guys have done.
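
For readers facing the same export parameters, here is a minimal sketch of a PyTorch-to-ONNX export using `torch.onnx.export` and Hugging Face `transformers`. It is not the procedure Typesense used (the thread does not describe it). The output path, sample sentence, axis labels, and opset version are assumptions chosen for illustration.

```python
def build_export_kwargs(opset_version=14):
    """Keyword arguments for torch.onnx.export (names here are illustrative)."""
    dynamic = {0: "batch", 1: "sequence"}  # let batch size and length vary
    return {
        "input_names": ["input_ids", "attention_mask"],
        "output_names": ["last_hidden_state"],
        "dynamic_axes": {
            "input_ids": dict(dynamic),
            "attention_mask": dict(dynamic),
            "last_hidden_state": dict(dynamic),
        },
        "opset_version": opset_version,  # assumed value, adjust as needed
    }


def export_to_onnx(model_name="yiyanghkust/finbert-pretrain",
                   out_path="finbert.onnx"):  # out_path is hypothetical
    # Heavy dependencies are imported lazily so the helper above can be
    # inspected without torch or transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # torchscript=True makes the model return plain tuples, which trace cleanly.
    model = AutoModel.from_pretrained(model_name, torchscript=True)
    model.eval()

    # A sample input is needed so the exporter can trace the graph.
    sample = tokenizer("Revenue grew 12% year over year.", return_tensors="pt")
    with torch.no_grad():
        torch.onnx.export(
            model,
            (sample["input_ids"], sample["attention_mask"]),
            out_path,
            **build_export_kwargs(),
        )
```

An export like this yields per-token hidden states, not a single sentence vector, so a pooling step is still needed before the output can serve as an embedding.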
Jason
10:56 PM
Happy to add it!
Jason
10:56 PM
Is that different from this model: https://huggingface.co/ProsusAI/finbert
Walter
11:10 PM
prosus finbert seems to be fine-tuned for sentiment e.g. positive, neutral, negative. finbert-pretrain was the original base finbert. But the one on hugging face is for mask-filling. I am not sure how you've converted the other ones to get the embeddings. Do they have to be tagged with the "sentence similarity" pipeline?
Sep 22, 2023 (2 months ago)
Jason
06:42 PM
Walter, we had to modify some of the internals to get it to generate embeddings. Could you now try using ts/finbert?
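
The thread does not say which internals were changed. One common way to turn a BERT-style model's per-token outputs into a single embedding is masked mean pooling, sketched below in plain Python. This is an assumption for illustration, not Jason's actual modification; the function name and sample values are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask entry is 1."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    # Padding tokens (mask == 0) are excluded from the average.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three tokens with 2-dimensional vectors; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = mean_pool(tokens, mask)  # [2.0, 3.0]
```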
Walter
09:19 PM
wow. you guys are legends.

I tried dropping our current embedding field and re-adding it with the model name replaced (swap e5-small for finbert). It says:

> Error: e: Request failed with HTTP code 400 | Server said: Schema change is incompatible with the type of documents already stored in this collection. error: Field embedding must have 768 dimensions.
My guess is that e5-small has fewer dimensions. that vector is still stored in the typesense document, and if I want to use the finbert embeddings I need to use a different field?
Jason
09:21 PM
Yeah, you would have to use a different field. Btw, there's a quirk with altering an existing collection and adding a new embedding field (it currently doesn't re-embed existing docs). We're fixing this bug.

So until then you want to create a new collection and reindex your docs in it
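
A rough sketch of that workaround: build a schema for a new collection whose auto-embedding field uses `ts/finbert`, and drop the old vectors from an exported JSONL dump before re-importing. The collection name, source field names, and helper names are hypothetical. The `embed` / `model_config` layout follows Typesense's auto-embedding schema format.

```python
import json


def finbert_collection_schema(name, source_fields, other_fields):
    """Schema for a new collection whose embedding field uses ts/finbert."""
    embedding_field = {
        "name": "embedding",
        "type": "float[]",
        "embed": {
            "from": list(source_fields),  # fields the embedding is built from
            "model_config": {"model_name": "ts/finbert"},
        },
    }
    return {"name": name, "fields": list(other_fields) + [embedding_field]}


def strip_field(jsonl_text, field="embedding"):
    """Parse a JSONL export and drop one field from every document."""
    docs = []
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines in the export
            doc = json.loads(line)
            doc.pop(field, None)  # old vectors should not be carried over
            docs.append(doc)
    return docs


# Hypothetical collection and field names, for illustration only.
schema = finbert_collection_schema(
    "articles_finbert",
    source_fields=["title"],
    other_fields=[{"name": "title", "type": "string"}],
)
```

With the `typesense` Python client this would roughly pair with `client.collections.create(schema)`, exporting the old collection via `documents.export()`, and loading the cleaned documents with `documents.import_(docs, {"action": "upsert"})`. Check the client documentation for the exact options before relying on it.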
Walter
09:24 PM
ok gotcha. if we update all the docs regularly, it should fix itself right?

Is that bug fix a few days or few weeks away? If a few days, I'll wait, if a few weeks, I'll probably create new collections.

Again, thanks for being so responsive and adding finbert so quickly 🙏


Jason
09:33 PM
> if we update all the docs regularly, it should fix itself right?
If the field values change, then it will fix itself. Otherwise, if it's an upsert with the same data, then the embeddings won't be regenerated.

The bug fix is about 1-2 weeks away
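
To estimate which documents a routine update would actually fix, the behavior Jason describes can be modeled with a small check. This is only an illustration of the rule as stated in the thread, not Typesense's implementation. It assumes "field values" means the fields the embedding is generated from, and the function and field names are hypothetical.

```python
def would_reembed(stored_doc, incoming_doc, source_fields):
    """True if any field that feeds the embedding differs between the docs."""
    return any(
        stored_doc.get(field) != incoming_doc.get(field)
        for field in source_fields
    )


stored = {"id": "1", "title": "Q3 earnings call", "views": 10}
same_text = {"id": "1", "title": "Q3 earnings call", "views": 11}
new_text = {"id": "1", "title": "Q3 earnings call transcript", "views": 11}

unchanged = would_reembed(stored, same_text, ["title"])  # source text is the same
changed = would_reembed(stored, new_text, ["title"])     # title differs
```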
Walter
09:34 PM
Ok great. I can wait for that.
