We are excited to announce Nomic Embed Vision v1 and Nomic Embed Vision v1.5: high quality, fully replicable vision embedding models that share the same latent space as our popular Nomic Embed Text v1 and Nomic Embed Text v1.5 models.
This means that all existing Nomic Embed Text embeddings are now multimodal: Nomic Embed Text embeddings can be used to query the new Nomic Embed Vision embeddings out of the box, and vice versa. Together, Nomic Embed is the only unified embedding space that outperforms OpenAI CLIP on multimodal tasks and OpenAI Text Embedding 3 Small on text tasks.
You can use the Nomic Embed models for image, text, and multimodal tasks such as search, retrieval, and zero-shot classification.
With a vision encoder of just 92M parameters, Nomic Embed Vision is ideal for high-volume production use cases alongside the 137M-parameter Nomic Embed Text.
Additionally, we have open-sourced the training code and replication instructions for Nomic Embed Vision to enable researchers to reproduce and build upon our models.
Existing multimodal models like CLIP have impressive zero-shot multimodal capabilities. However, CLIP text encoders perform poorly outside of image retrieval tasks, for example on MTEB, a benchmark that measures the quality of text embedding models. Nomic Embed Vision is designed to overcome these limitations by aligning a vision encoder to the existing Nomic Embed Text latent space.
The result is a unified multimodal latent space that achieves high performance on image, text, and multimodal tasks, as measured by the Imagenet 0-Shot, MTEB, and Datacomp benchmarks. This unified latent space outperforms the modality-specific latent spaces of models like OpenAI CLIP and OpenAI Text Embedding 3 Small, making Nomic Embed the first open-weights model to do so.
| Model | Imagenet 0-shot | Datacomp Avg. | MTEB Avg. |
|---|---|---|---|
| Nomic Embed v1 | 70.70 | 56.7 | 62.39 |
| Nomic Embed v1.5 | 71.0 | 56.8 | 62.28 |
| OpenAI CLIP ViT B/16 | 68.34 | 56.26 | 43.82 |
| OpenAI Text Embedding 3 Small | N/A | N/A | 62.26 |
| Jina CLIP v1 | 59.08 | 52.20 | 60.12 |
Nomic Embed Vision powers multimodal search in Atlas. To showcase the power of multimodal vector search, we uploaded a dataset of 100,000 images and captions from CC3M and found all the animals that are cute to cuddle with.
This query demonstrates Nomic Embed has a semantic understanding of both the question being asked and the content of the images!
Try it out yourself: https://atlas.nomic.ai/data/nomic-multimodal-series/cc3m-100k-image-bytes-v15/map/ad24e7d9-4e82-484d-a52c-79fed0da2c60#iK2R
Multimodal embedding models are trained contrastively on large-scale image-caption datasets. These Contrastive Language-Image Pretraining (CLIP) models learn relationships between images and text by learning to predict which caption belongs with which image. As with text embedding models, CLIP models are trained with large batch sizes (16k-32k) and large datasets (400M image-text pairs and larger!).
For example, OpenAI's CLIP model was trained on 400 million image-text pairs for 32 epochs, totaling ~13 billion image-text pairs seen during training. While existing CLIP models excel on tasks like zero-shot multimodal classification and retrieval, they underperform on unimodal text tasks like semantic similarity and text retrieval.
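For intuition, here is a minimal sketch (in PyTorch, with the encoder outputs passed in as plain tensors) of the symmetric contrastive objective CLIP-style models optimize: each image embedding should score highest against its own caption within the batch, and vice versa. The large batch sizes mentioned above matter because every other example in the batch acts as a negative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_features, text_features: (batch_size, dim) tensors produced by the
    vision and text encoders for matching image-caption pairs.
    """
    # L2-normalize so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch_size, batch_size) similarity matrix; the diagonal holds true pairs
    logits = image_features @ text_features.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> caption and caption -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2
```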
To address this shortcoming, we trained Nomic Embed Vision, a vision encoder that is compatible with Nomic Embed Text, our existing long-context text encoder. Training Nomic Embed Vision requires aligning a vision encoder with an existing text encoder without destroying the downstream performance of the text encoder.
We iterated on several approaches to solving this challenge. In each approach, we initialized the text encoder with Nomic Embed Text and the vision encoder with a pretrained ViT, and experimented with different ways of training the two towers together.
Ultimately, we found that freezing the text encoder and training only the vision encoder on image-text pairs yielded the best results, with the added bonus of backward compatibility with existing Nomic Embed Text embeddings.
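A minimal sketch of that frozen-text-tower setup is below, assuming the encoders, data loader, and a contrastive loss (such as the one sketched earlier) are passed in as arguments; it is illustrative rather than the exact contrastors training loop.

```python
import torch

def align_vision_to_text(text_encoder, vision_encoder, dataloader, loss_fn, steps=1000):
    """Train only the vision tower against a frozen text tower.

    text_encoder: frozen model mapping captions -> embeddings (e.g. Nomic Embed Text).
    vision_encoder: trainable model mapping images -> embeddings of the same dim.
    loss_fn: a contrastive loss over (image_embeddings, text_embeddings).
    """
    # Freeze the text tower so its latent space, and therefore every existing
    # Nomic Embed Text embedding, stays exactly where it is.
    for param in text_encoder.parameters():
        param.requires_grad = False
    text_encoder.eval()

    # Only the vision tower receives gradient updates.
    optimizer = torch.optim.AdamW(vision_encoder.parameters(), lr=1e-3)

    for _, (images, captions) in zip(range(steps), dataloader):
        with torch.no_grad():
            text_emb = text_encoder(captions)   # fixed alignment targets
        image_emb = vision_encoder(images)      # trainable

        loss = loss_fn(image_emb, text_emb)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```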
We trained a ViT B/16 on image-text pairs from DFN-2B. Our crawled subset of DFN-2B ended up being ~1.5 billion image-text pairs, which we trained on for 3 full epochs, for a total of ~4.5B image-text pairs seen.
We initialized the vision encoder with Eva02 MIM ViT B/16 and the text encoder with Nomic Embed Text. We trained on 16 H100 GPUs, with a global batch size of 65,536, a peak learning rate of 1.0e-3, and a cosine learning rate schedule with 2000 warmup steps.
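As a rough illustration of that schedule (not the exact contrastors implementation), the per-step learning rate with a linear warmup followed by cosine decay can be computed as:

```python
import math

def lr_at_step(step, total_steps, peak_lr=1.0e-3, warmup_steps=2000, min_lr=0.0):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```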
For more details on the training hyperparameters and to replicate our model, please see the contrastors repository, or our forthcoming tech report.
Nomic Embed Vision was trained on Digital Ocean compute with early experiments on Lambda Labs compute clusters.
The original CLIP paper evaluated the multimodal model across 27 datasets, including Imagenet zero-shot accuracy. Since then, multimodal evaluation has progressed with the introduction of the Datacomp benchmark. Much like how MTEB is a collection of tasks used to evaluate text embeddings, Datacomp is a collection of 38 image classification and retrieval tasks used to evaluate multimodal embeddings. Following Datacomp, we evaluated Nomic Embed Vision on the 38 downstream tasks and found that it outperforms previous models, including OpenAI CLIP ViT B/16 and Jina CLIP v1.
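For reference, the Imagenet zero-shot protocol used by these benchmarks embeds one text prompt per class (for example, "a photo of a dog") with the text encoder and assigns each image the class whose prompt is most similar; a minimal numpy sketch over precomputed embeddings looks like this:

```python
import numpy as np

def zero_shot_classify(image_embeddings, class_prompt_embeddings):
    """Assign each image the class whose text prompt it is most similar to.

    image_embeddings: (n_images, dim) array from the vision encoder.
    class_prompt_embeddings: (n_classes, dim) array from the text encoder,
    one row per class prompt such as "a photo of a dog".
    """
    # Normalize so dot products are cosine similarities
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = class_prompt_embeddings / np.linalg.norm(class_prompt_embeddings, axis=1, keepdims=True)
    similarities = img @ txt.T          # (n_images, n_classes)
    return similarities.argmax(axis=1)  # predicted class index per image
```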
We have released two versions of Nomic Embed Vision, v1 and v1.5, which are compatible with Nomic Embed Text v1 and v1.5, respectively. All Nomic Embed models with the same version have compatible latent spaces and can be used for multimodal tasks.
We are releasing Nomic-Embed-Vision under a CC-BY-NC-4.0 license. This will enable researchers and hackers to continue experimenting with our models, as well as enable Nomic to continue releasing great models in the future. As Nomic releases future models, we intend to re-license less recent models in our catalogue under the Apache-2.0 license. For inquiries regarding commercial or production use, please reach out to sales@nomic.ai.
In addition to using our hosted inference API, you can purchase dedicated inference endpoints on the AWS Marketplace. Please contact sales@nomic.ai with any questions.
Nomic Embed can be used through our production-ready Nomic Embedding API.
You can access the API via HTTP and your Nomic API Key:
```bash
curl -X POST \
  -H "Authorization: Bearer $NOMIC_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F "model=nomic-embed-vision-v1.5" \
  -F "images=@<path to image>" \
  https://api-atlas.nomic.ai/v1/embedding/image
```
Additionally, you can embed via static URLs!
```bash
curl -X POST \
  -H "Authorization: Bearer $NOMIC_API_KEY" \
  -d "model=nomic-embed-vision-v1.5" \
  -d "urls=https://static.nomic.ai/secret-model.png" \
  -d "urls=https://static.nomic.ai/secret-model.png" \
  https://api-atlas.nomic.ai/v1/embedding/image
```
In the official Nomic Python client, after you `pip install nomic`, embedding images is as simple as:
```python
from nomic import embed
import numpy as np

output = embed.image(
    images=[
        "image_path_1.jpeg",
        "image_path_2.png",
    ],
    model='nomic-embed-vision-v1.5',
)

print(output['usage'])
embeddings = np.array(output['embeddings'])
print(embeddings.shape)
```
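Because the vision and text models share a latent space, you can mix modalities directly. The sketch below assumes the client's `embed.text` endpoint with `task_type='search_query'` and reuses the `embeddings` array from the example above to rank images against a text query:

```python
from nomic import embed
import numpy as np

# Embed a text query into the same latent space as the image embeddings
query_output = embed.text(
    texts=["cute animals to cuddle with"],
    model='nomic-embed-text-v1.5',
    task_type='search_query',
)
query = np.array(query_output['embeddings'])  # shape (1, dim)

# Cosine similarity between the text query and every image embedded above
image_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query, axis=1, keepdims=True)
scores = (image_norm @ query_norm.T).ravel()

# Image indices ranked from most to least relevant to the query
print(scores.argsort()[::-1])
```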
...
“Out beyond our ideas of tasks and modalities there is a latent space. I’ll meet you there.”