typesense

That said, I'm creating embeddings from the images themselves, not from captions.
Is this the best way to find images by description? Let's say I want to find images by searching for "sunset".
Or should I caption the images and create embeddings from the captions instead?

You can do either with Typesense: embed the images directly, or embed the caption text. Which one works better depends on your dataset and on what each model was trained on, so you'll have to experiment and see which works best for you.

Here's a live demo of the image search feature: <https://ai-image-search.typesense.org>

It uses direct image embeddings, which Typesense generates internally using CLIP.
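
For reference, here's a rough sketch of what a direct image-embedding setup could look like from Python. The collection name `images`, the field names, and the `run` helper are just placeholders for illustration, and `ts/clip-vit-b-p32` is assumed to be the built-in CLIP model you want; check the docs for your Typesense version before relying on it.

```python
# Sketch of a collection whose embedding field is generated from an image field.
# Names ("images", "name", "image", "embedding") are illustrative placeholders.
image_schema = {
    "name": "images",
    "fields": [
        {"name": "name", "type": "string"},
        # The image is sent base64-encoded; store=False asks Typesense
        # not to keep the original image data with the document.
        {"name": "image", "type": "image", "store": False},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["image"],
                # Assumed model name for the built-in CLIP model.
                "model_config": {"model_name": "ts/clip-vit-b-p32"},
            },
        },
    ],
}

# A plain-text query: it is embedded with the same model and compared
# against the stored image embeddings.
search_params = {"q": "sunset", "query_by": "embedding"}


def run(client):
    """Create the collection and run the search.

    `client` is expected to be a typesense.Client that is already configured.
    """
    client.collections.create(image_schema)
    return client.collections["images"].documents.search(search_params)
```

To try the caption route instead, you'd point `from` at a string field holding the caption and pick a text embedding model, then compare the results for queries like "sunset" on your own data.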

It’s also possible to create a combined embedding of two fields (image + caption).
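
A sketch of what the combined variant might look like, assuming the combination is expressed by listing both fields in `embed.from` (the same way multiple text fields are listed). That's an assumption on my side, as are the collection name `images_combined`, the field names, and the model name, so verify against the docs for your version.

```python
# Sketch: one embedding field built from an image field plus a caption field.
# Assumption: listing both fields in "from" produces the combined embedding.
combined_schema = {
    "name": "images_combined",  # illustrative collection name
    "fields": [
        {"name": "caption", "type": "string"},
        {"name": "image", "type": "image", "store": False},
        {
            "name": "embedding",
            "type": "float[]",
            "embed": {
                "from": ["image", "caption"],
                # Assumed model name for the built-in CLIP model.
                "model_config": {"model_name": "ts/clip-vit-b-p32"},
            },
        },
    ],
}

# Searching works the same way as with a single source field.
combined_search_params = {"q": "sunset", "query_by": "embedding"}
```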