
Reduce Embedding Costs by 80%: OpenAI vs Cohere vs Local Models


Embeddings are a hidden cost driver in RAG applications. Many teams default to text-embedding-3-large even though text-embedding-3-small costs over 80% less ($0.02 vs. $0.13 per 1M tokens) and scores only about 2 MTEB points lower. This guide shows which model fits each scenario.

Why Embedding Costs Sneak Up On You

A RAG application embeds text at two points:

The query itself (at query time)

New documents being added to your index (at ingestion time)

At scale, ingestion costs dominate. A legal tech company ingesting 1M documents/day at 500 tokens each spends:

text-embedding-3-large: 500M tokens × $0.13/1M = $65/day

text-embedding-3-small: 500M tokens × $0.02/1M = $10/day

Similar quality for most use cases, at a difference of $55/day, or about $20K/year.
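The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `embedding_cost_usd` is made up for illustration, and the prices are the per-1M-token figures quoted in this guide, so check them against current pricing before relying on the result.

```python
def embedding_cost_usd(docs_per_day: int, tokens_per_doc: int,
                       price_per_million: float) -> float:
    """Daily embedding spend in USD for a given ingestion volume."""
    tokens_per_day = docs_per_day * tokens_per_doc
    return tokens_per_day / 1_000_000 * price_per_million


# Figures from the example: 1M documents/day at 500 tokens each
large = embedding_cost_usd(1_000_000, 500, 0.13)
small = embedding_cost_usd(1_000_000, 500, 0.02)
daily_saving = large - small
annual_saving = daily_saving * 365

print(f"large: ${large:.2f}/day, small: ${small:.2f}/day")
print(f"saving: ${daily_saving:.2f}/day, ${annual_saving:,.0f}/year")
```

Plugging in your own document count and average chunk size shows quickly whether the model switch is worth a re-indexing job.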


The Embedding Model Landscape (2026)

Model	Cost per 1M tokens	Dimensions	MTEB Score	Best For
text-embedding-3-large	$0.13	3,072	64.6	High-precision retrieval
text-embedding-3-small	$0.02	1,536	62.3	General RAG
Cohere embed-multilingual-v3.0	$0.10	1,024	64.5	Multilingual
Cohere embed-english-v3.0	$0.10	1,024	64.5	Production RAG
nomic-embed-text (local)	$0	768	62.0	High volume / private data
sentence-transformers/all-MiniLM	$0	384	56.2	Low-resource scenarios

The Decision Tree

Is your data multilingual?
├── Yes → Cohere embed-multilingual-v3.0 ($0.10/1M)
└── No:
    Is your data sensitive / can't leave your servers?
    ├── Yes → nomic-embed-text (local, free)
    └── No:
        Is MTEB accuracy critical (legal, medical)?
        ├── Yes → text-embedding-3-large ($0.13/1M)
        └── No → text-embedding-3-small ($0.02/1M) ✓

For most RAG use cases (English, non-sensitive data, no strict accuracy requirement), text-embedding-3-small is the right answer.
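The same decision tree can be written as a small function, which is handy if the choice has to live in configuration code. This is only a sketch: the function and parameter names are hypothetical, and the branches are checked in the same order as the tree above.

```python
def choose_embedding_model(multilingual: bool,
                           data_must_stay_local: bool,
                           accuracy_critical: bool) -> str:
    """Mirror the decision tree: the first matching branch wins."""
    if multilingual:
        return "embed-multilingual-v3.0"   # Cohere, $0.10/1M
    if data_must_stay_local:
        return "nomic-embed-text"          # local, no API fee
    if accuracy_critical:
        return "text-embedding-3-large"    # $0.13/1M
    return "text-embedding-3-small"        # $0.02/1M, the default


print(choose_embedding_model(False, False, False))
```

Note that the tree checks multilingual data first, so a multilingual dataset that also cannot leave your servers would need a different answer than this function gives.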

Switching from text-embedding-3-large to small

python

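from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
texts = ["first document chunk", "second document chunk"]  # placeholder input

# BEFORE
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts
)

# AFTER: 6.5x cheaper, about 2.3 MTEB points lower (62.3 vs. 64.6)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts
)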

Important: Switching models requires re-embedding your entire index. Plan for this:

python

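from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def re_embed_index(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Re-embed every text with text-embedding-3-small, one batch per API call."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        # Sort by index so embeddings line up with the input order
        ordered = sorted(response.data, key=lambda e: e.index)
        all_embeddings.extend([e.embedding for e in ordered])
    return all_embeddings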
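Batching reduces the number of requests, but a large re-indexing job can still hit rate limits. One common way to handle that is to retry with exponential backoff. The helper below is a generic sketch, not part of any SDK: `with_backoff` and its parameters are hypothetical names, and it retries whatever exception types you pass in.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T],
                 retry_on: tuple[type[BaseException], ...] = (Exception,),
                 max_attempts: int = 5,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 10% jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            sleep(delay)
    raise AssertionError("loop always returns or raises")
```

Assuming the OpenAI Python SDK v1 or later, a batch call could be wrapped as `with_backoff(lambda: client.embeddings.create(model="text-embedding-3-small", input=batch), retry_on=(openai.RateLimitError,))`. The SDK client also has its own `max_retries` setting, which may be enough for smaller jobs.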

Going Local: nomic-embed-text

For teams with sensitive data or high volume, local embeddings eliminate per-token API costs entirely (you still pay for the hardware that runs the model):

bash

# Install Ollama (runs models locally)
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull nomic-embed-text

python

import requests

def embed_local(text: str) -> list[float]:
    """Embed one text via the local Ollama REST API."""
    response = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    response.raise_for_status()  # fail loudly if Ollama is down or the model is missing
    return response.json()["embedding"]

# Or use the Ollama Python client
from ollama import Client

ollama_client = Client()  # defaults to http://localhost:11434
response = ollama_client.embeddings(model="nomic-embed-text", prompt="Your text here")
embedding = response["embedding"]

Benchmarks on a MacBook M3 Pro:

nomic-embed-text: ~2,000 tokens/second

For 1M tokens: ~8 minutes, $0 cost
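Before committing to a local setup, it helps to check whether one machine can keep up with your ingestion volume. The sketch below only does the arithmetic, using the ~2,000 tokens/second figure quoted above as an assumed default. `local_embedding_hours` is a hypothetical helper name, and real throughput will vary with hardware, batch size, and text length.

```python
def local_embedding_hours(total_tokens: int,
                          tokens_per_second: float = 2_000) -> float:
    """Wall-clock hours to embed `total_tokens` at a steady throughput."""
    return total_tokens / tokens_per_second / 3600


one_million_minutes = local_embedding_hours(1_000_000) * 60
legal_tech_hours = local_embedding_hours(500_000_000)

print(f"1M tokens:   {one_million_minutes:.1f} minutes")
print(f"500M tokens: {legal_tech_hours:.1f} hours")
```

At this throughput, the 500M tokens/day workload from the earlier legal tech example would take about 69 hours per day of input, so a single laptop-class machine would not keep up. Volumes like that need several workers or a GPU server.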

Dimension Reduction: Another Cost Lever

OpenAI's text-embedding-3 models accept a dimensions parameter that shortens the returned vectors:

python

# Reduce from 1536 to 512 dimensions — 3x cheaper storage
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
    dimensions=512  # loses ~1% accuracy, saves 66% on vector storage
)

For Pinecone or pgvector, smaller vectors mean:

3x more vectors per GB of storage

Roughly 2x faster similarity search, depending on index type and hardware
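The storage effect is easy to estimate. The sketch below assumes vectors are stored as 4-byte float32 values and counts only the raw vector data, ignoring index structures and metadata, which add overhead that differs between Pinecone and pgvector. `index_size_gb` is a hypothetical helper name.

```python
def index_size_gb(num_vectors: int, dimensions: int,
                  bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (1 GB = 1e9 bytes), assuming float32 values."""
    return num_vectors * dimensions * bytes_per_value / 1e9


full = index_size_gb(10_000_000, 1536)    # text-embedding-3-small default
reduced = index_size_gb(10_000_000, 512)  # with dimensions=512

print(f"1536 dims: {full:.2f} GB, 512 dims: {reduced:.2f} GB")
print(f"ratio: {full / reduced:.1f}x")
```

For 10M vectors, that is about 61 GB versus about 20 GB of raw vector data.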

Migration Checklist

[ ] Benchmark your top-100 queries with text-embedding-3-small before switching

[ ] Plan re-indexing job (off-peak, batched, with rate limit handling)

[ ] Keep the old vectors until the migration is verified, so you can roll back by pointing queries at the old model and index

[ ] Update the model name in all code paths (ingestion + query)

[ ] Recalculate your similarity threshold — different models have different score distributions

[ ] Monitor retrieval quality with user feedback for 2 weeks post-migration
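For the benchmarking step, one simple option is to compare recall@k for both models on a hand-labeled query set. This is a sketch under assumptions: the query IDs, document IDs, and labels below are made-up sample data, and `recall_at_k` / `mean_recall_at_k` are hypothetical helper names. In practice the ranked lists would come from your vector store.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(results: dict[str, list[str]],
                     labels: dict[str, set[str]], k: int = 5) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(results[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Made-up sample data: relevant doc IDs per query, and each model's ranking
labels = {"q1": {"d1", "d2"}, "q2": {"d3"}}
large_results = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d4", "d5"]}
small_results = {"q1": ["d1", "d7", "d2"], "q2": ["d8", "d3", "d4"]}

large_recall = mean_recall_at_k(large_results, labels, k=2)
small_recall = mean_recall_at_k(small_results, labels, k=2)
print(f"recall@2  large: {large_recall:.2f}  small: {small_recall:.2f}")
```

If the gap on your own queries is small at the k you actually use, the cheaper model is probably safe to adopt. If it is large, that is a signal to keep text-embedding-3-large or test a different chunking strategy first.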
