
Reduce Embedding Costs by 80%: OpenAI vs Cohere vs Local Models


Embeddings are a hidden cost driver in RAG applications. Many teams default to text-embedding-3-large when text-embedding-3-small is roughly 85% cheaper ($0.02 vs. $0.13 per 1M tokens) and close in quality for most use cases. This guide shows you the right model for each scenario.

Why Embedding Costs Sneak Up On You

A RAG application embeds text at two points:

  • The query itself (at query time)
  • New documents being added to your index (at ingestion time)

At scale, ingestion costs dominate. A legal tech company ingesting 1M documents/day at 500 tokens each (500M tokens/day) spends:

  • text-embedding-3-large: 500M tokens × $0.13/1M = $65/day
  • text-embedding-3-small: 500M tokens × $0.02/1M = $10/day

Quality is similar for most use cases, so that is a $55/day difference, or about $20K/year.
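The arithmetic above can be reproduced with a short script. This is a minimal sketch: the function name daily_embedding_cost is hypothetical, and the prices are the per-1M-token figures quoted in this guide, which may change.

```python
def daily_embedding_cost(
    docs_per_day: int,
    tokens_per_doc: int,
    usd_per_million_tokens: float,
) -> float:
    """Return the daily embedding spend in USD for an ingestion workload."""
    tokens_per_day = docs_per_day * tokens_per_doc
    return tokens_per_day / 1_000_000 * usd_per_million_tokens


# The legal tech example: 1M documents/day at 500 tokens each.
large = daily_embedding_cost(1_000_000, 500, 0.13)  # text-embedding-3-large
small = daily_embedding_cost(1_000_000, 500, 0.02)  # text-embedding-3-small

print(f"large: ${large:.2f}/day, small: ${small:.2f}/day")
print(f"savings: ${large - small:.2f}/day, ${(large - small) * 365:,.0f}/year")
```

Plugging in your own document volume and average chunk length shows quickly whether the model choice matters for your budget.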



    The Embedding Model Landscape (2026)

    Model                              Cost per 1M tokens   Dimensions   MTEB Score   Best For
    text-embedding-3-large             $0.13                3,072        64.6         High-precision retrieval
    text-embedding-3-small             $0.02                1,536        62.3         General RAG
    Cohere embed-v3                    $0.10                1,024        64.5         Multilingual
    Cohere embed-v3 (English)          $0.10                1,024        64.5         Production RAG
    nomic-embed-text (local)           $0                   768          62.0         High volume / private data
    sentence-transformers/all-MiniLM   $0                   384          56.2         Low-resource scenarios

    The Decision Tree

    Is your data multilingual?
    ├── Yes → Cohere embed-multilingual-v3.0 ($0.10/1M)
    └── No:
        Is your data sensitive / can't leave your servers?
        ├── Yes → nomic-embed-text (local, free)
        └── No:
            Is MTEB accuracy critical (legal, medical)?
            ├── Yes → text-embedding-3-large ($0.13/1M)
            └── No → text-embedding-3-small ($0.02/1M) ✓

    For 85% of RAG use cases: text-embedding-3-small is the right answer.
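The decision tree translates directly into a small function, which can be handy as a default in configuration code. This is only a sketch: the function and parameter names are hypothetical, and the returned strings are the model names used in this guide.

```python
def choose_embedding_model(
    multilingual: bool,
    must_stay_on_prem: bool,
    accuracy_critical: bool,
) -> str:
    """Walk the decision tree top to bottom and return a model name."""
    if multilingual:
        return "embed-multilingual-v3.0"   # Cohere, $0.10/1M
    if must_stay_on_prem:
        return "nomic-embed-text"          # local, no per-token cost
    if accuracy_critical:
        return "text-embedding-3-large"    # $0.13/1M
    return "text-embedding-3-small"        # $0.02/1M, the default


# A typical English-language RAG app with no special constraints:
print(choose_embedding_model(False, False, False))
```

Note that the order of the checks matters: as in the tree, the multilingual question is asked first, so multilingual data that must stay on your own servers would need a different branch than this sketch provides.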


    Switching from text-embedding-3-large to small

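    python
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    texts = ["First document chunk", "Second document chunk"]  # example input

    # BEFORE
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts
    )

    # AFTER: 6.5x cheaper per token, about 2.3 MTEB points lower (64.6 vs. 62.3)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )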

    Important: Switching models requires re-embedding your entire index. Plan for this:

    python
    def re_embed_index(texts: list[str], batch_size: int = 100):
        # Re-embed in batches (default 100) to cut the number of requests;
        # batching alone does not prevent rate-limit errors
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = client.embeddings.create(
                model="text-embedding-3-small",
                input=batch
            )
            all_embeddings.extend([e.embedding for e in response.data])
        return all_embeddings
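For a large index, a re-embedding job will likely hit rate limits or transient errors at some point. One common approach is to retry each batch with exponential backoff. The sketch below is an illustration, not the OpenAI client's built-in behavior: embed_with_retry and its parameters are hypothetical names, and the embedding call is passed in as a function so the retry logic stays independent of any one provider.

```python
import random
import time
from typing import Callable, Sequence


def embed_with_retry(
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    texts: Sequence[str],
    batch_size: int = 100,
    max_retries: int = 5,
    base_delay: float = 1.0,
) -> list[list[float]]:
    """Embed texts in batches, retrying each failed batch with backoff."""
    all_embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries + 1):
            try:
                all_embeddings.extend(embed_batch(batch))
                break  # batch succeeded, move on to the next one
            except Exception:
                # In production, catch only the provider's rate-limit and
                # transient error types instead of every Exception.
                if attempt == max_retries:
                    raise
                # Wait 1x, 2x, 4x, ... base_delay, plus random jitter.
                delay = base_delay * 2 ** attempt
                time.sleep(delay + random.uniform(0, base_delay))
    return all_embeddings
```

With the OpenAI client from the earlier snippet, embed_batch could be a small wrapper such as lambda batch: [e.embedding for e in client.embeddings.create(model="text-embedding-3-small", input=batch).data].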

    Going Local: nomic-embed-text

    For teams with sensitive data or high volume, local embeddings eliminate per-token API costs entirely; you pay only for the hardware and power to run the model:

    bash
    # Install Ollama (runs models locally)
    curl -fsSL https://ollama.ai/install.sh | sh
    ollama pull nomic-embed-text

    python
    import requests

    def embed_local(text: str) -> list[float]:
        # Call the local Ollama server's embeddings endpoint
        response = requests.post(
            "http://localhost:11434/api/embeddings",
            json={"model": "nomic-embed-text", "prompt": text},
            timeout=60,
        )
        response.raise_for_status()  # fail loudly on HTTP errors
        return response.json()["embedding"]

    # Or use the Ollama Python client
    from ollama import Client
    ollama_client = Client()  # separate name so it does not shadow the OpenAI client
    response = ollama_client.embeddings(model="nomic-embed-text", prompt="Your text here")
    embedding = response["embedding"]

    Benchmarks on a MacBook M3 Pro:

  • nomic-embed-text: ~2,000 tokens/second
  • For 1M tokens: ~8 minutes, $0 cost
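Throughput matters as much as price once volume grows. The sketch below extrapolates from the ~2,000 tokens/second figure above; it assumes throughput stays constant and scales linearly across machines, which real hardware will only approximate. The function names are hypothetical.

```python
import math


def hours_to_embed(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock hours to embed total_tokens on a single machine."""
    return total_tokens / tokens_per_second / 3600


def machines_needed(tokens_per_day: int, tokens_per_second: float) -> int:
    """Machines required to keep up with a daily volume, running 24 hours."""
    tokens_per_machine_per_day = tokens_per_second * 86_400
    return math.ceil(tokens_per_day / tokens_per_machine_per_day)


# 1M tokens at ~2,000 tokens/second, expressed in minutes
print(f"{hours_to_embed(1_000_000, 2_000) * 60:.1f} minutes")

# The 500M tokens/day ingestion example from earlier in this guide
print(f"{hours_to_embed(500_000_000, 2_000):.1f} hours on one machine")
print(f"{machines_needed(500_000_000, 2_000)} machines to keep up")
```

At this rate a single laptop cannot keep pace with the 500M tokens/day workload, so the "$0" figure really means trading API spend for hardware.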

    Dimension Reduction: Another Cost Lever

    OpenAI's text-embedding-3 models accept a dimensions parameter that shortens the returned vectors:

    python
    # Reduce from 1536 to 512 dimensions: one third of the vector storage
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=512  # loses ~1% accuracy, saves ~67% on vector storage
    )

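    The API price per token does not change; the savings come from the vector store. For Pinecone or pgvector, smaller vectors mean:

  • 3x as many vectors per GB of storage
  • Roughly 2x faster similarity search, depending on index type and hardware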
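The storage saving is easy to estimate up front. This sketch assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores index overhead and metadata, so actual usage in a given vector store will be higher. The function name is hypothetical.

```python
def vector_storage_gb(
    num_vectors: int,
    dimensions: int,
    bytes_per_value: int = 4,  # float32, an assumption about the store
) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9


full = vector_storage_gb(1_000_000, 1536)
reduced = vector_storage_gb(1_000_000, 512)

print(f"1M vectors at 1536 dims: {full:.3f} GB")
print(f"1M vectors at 512 dims:  {reduced:.3f} GB")
print(f"ratio: {full / reduced:.1f}x")
```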

    Migration Checklist

  • [ ] Benchmark your top-100 queries with text-embedding-3-small before switching
  • [ ] Plan re-indexing job (off-peak, batched, with rate limit handling)
  • [ ] Keep the old vectors temporarily so you can roll back by re-enabling the old embedding model
  • [ ] Update the model name in all code paths (ingestion + query)
  • [ ] Recalculate your similarity threshold — different models have different score distributions
  • [ ] Monitor retrieval quality with user feedback for 2 weeks post-migration
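One way to run the first checklist item is to compare which documents each model retrieves for the same queries. The sketch below measures mean top-k overlap between the old and new embeddings; it is one possible metric, not a standard benchmark, and all function names are hypothetical. It uses brute-force cosine similarity in pure Python, which is fine for a top-100 query set against a sample of the index but too slow for a full production index.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Vector, docs: Sequence[Vector], k: int) -> list[int]:
    """Indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: cosine(query, docs[i]),
        reverse=True,
    )
    return ranked[:k]


def mean_top_k_overlap(
    old_queries: Sequence[Vector],
    old_docs: Sequence[Vector],
    new_queries: Sequence[Vector],
    new_docs: Sequence[Vector],
    k: int,
) -> float:
    """Average fraction of top-k results shared by the two models.

    Queries and documents must be in the same order for both models;
    the vector dimensions may differ between old and new.
    """
    overlaps = []
    for old_q, new_q in zip(old_queries, new_queries):
        old_hits = set(top_k(old_q, old_docs, k))
        new_hits = set(top_k(new_q, new_docs, k))
        overlaps.append(len(old_hits & new_hits) / k)
    return sum(overlaps) / len(overlaps)
```

A low overlap does not by itself mean the new model is worse, only that it retrieves different documents, so review the differing results by hand. The same cosine scores can be used to recalculate the similarity threshold for the new model.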
Content refreshed June 2026.
