LiteLLM: One Proxy for 140+ LLMs — Setup & Cost Guide


LiteLLM unifies 100+ LLM APIs behind one OpenAI-compatible endpoint. Learn to self-host, control costs, and set provider fallbacks in 2026.

Effloow Content Factory
#litellm #llm-gateway #ai-infrastructure #self-hosting #cost-management #openai-compatible #llm-proxy

Why Every Production AI Team Needs an LLM Gateway

Enterprise spending on LLM APIs surpassed $8.4 billion in 2026. The teams burning the most aren't the ones with the largest models — they're the ones without a routing layer between their code and the provider.

The pattern plays out like this: you start on OpenAI, get a good prototype, then want to test Anthropic's Claude for cost reasons or Google Gemini for speed. Before long your codebase has three separate SDK imports, four different authentication flows, and zero visibility into which model is spending what. When OpenAI has an outage, the entire product goes down.

An LLM gateway breaks this coupling. You point your code at one endpoint, configure your providers once, and the gateway handles routing, authentication, retries, cost tracking, and fallbacks — invisibly.

LiteLLM is the most widely deployed open-source solution for this problem. With 43,500+ GitHub stars, 240M+ Docker pulls, and over 1 billion production requests processed, it has become the de facto starting point for teams standardizing their LLM infrastructure. This guide covers everything from five-minute local setup to production-grade cost enforcement.


What LiteLLM Actually Is

LiteLLM ships as two distinct tools that work together or independently:

1. Python SDK — a library you pip install and call directly in code. It normalizes the API surface of 100+ providers to match OpenAI's chat.completions interface. Swap openai.chat.completions.create(model="gpt-4o") for litellm.completion(model="anthropic/claude-opus-4-6") and the rest of your code is unchanged.

2. Proxy Server (AI Gateway) — a self-hosted HTTP server your entire team or organization points to. It exposes an OpenAI-compatible /v1/chat/completions endpoint, handles all provider credentials centrally, and adds production features: virtual API keys, per-team budgets, semantic caching, rate limiting, guardrails, and a web dashboard.

Most teams start with the SDK for development, then graduate to the proxy when they have multiple developers, multiple models, or real cost concerns.

Key stats (April 2026):

  • Latest version: v1.83.8 (released April 15, 2026)
  • License: MIT (open core — some enterprise features paywalled)
  • 1,397+ contributors
  • Supports: OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure, Mistral, Ollama, vLLM, Cohere, HuggingFace, NVIDIA NIM, and 90+ more

Core Concepts Before You Deploy

Understanding four concepts will save you from the most common configuration mistakes.

Unified API Translation

Every LLM provider has a slightly different API shape. Anthropic uses messages with a system parameter at the top level. Bedrock uses its own request format. Gemini has contents arrays with parts. LiteLLM's core job is translating your single OpenAI-format request into whatever the downstream provider expects — and translating the response back.

This translation is transparent. Your client code never changes; only the model string changes.

import litellm

# Same code, different provider — just swap the model string
response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",   # or "gpt-4o", "gemini/gemini-3-pro", etc.
    messages=[{"role": "user", "content": "Explain vector embeddings briefly."}]
)
print(response.choices[0].message.content)
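Under the hood, the translation is a request-shape rewrite. The sketch below illustrates the idea for Anthropic's format (system prompt hoisted to a top-level parameter) — it is a simplified illustration, not LiteLLM's actual internals:

```python
def to_anthropic_format(openai_request: dict) -> dict:
    """Rewrite an OpenAI-style chat request into Anthropic's shape.

    Illustrative sketch only: Anthropic takes the system prompt as a
    top-level parameter rather than as a message in the list.
    """
    messages = openai_request["messages"]
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    return {
        "model": openai_request["model"].removeprefix("anthropic/"),
        "system": system,
        "messages": [m for m in messages if m["role"] != "system"],
        "max_tokens": openai_request.get("max_tokens", 1024),
    }

req = {
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Explain vector embeddings."},
    ],
}
translated = to_anthropic_format(req)
```

The reverse direction (Anthropic response back into an OpenAI-style `choices` object) follows the same pattern, which is why the client never notices which provider actually answered.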

Virtual Keys

Instead of distributing your actual provider API keys to every team member or service, LiteLLM creates virtual keys — proxy-issued tokens that look like API keys but map to real credentials stored server-side. You issue a virtual key to a team, set a monthly spend cap, and revoke it without touching the real credentials.
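The mechanism can be pictured as a keyed lookup table with spend tracking. This is a toy model of the idea only — LiteLLM persists this state in Postgres, encrypted with the salt key — and all names here are illustrative:

```python
import secrets

class KeyStore:
    """Toy model of virtual keys: proxy-issued tokens that map to real
    provider credentials and carry their own spend limits."""

    def __init__(self, real_credentials: dict):
        self._real = real_credentials      # provider -> actual API key
        self._keys = {}                    # virtual key -> metadata

    def generate(self, team: str, max_budget: float) -> str:
        vk = "sk-" + secrets.token_hex(16)
        self._keys[vk] = {"team": team, "max_budget": max_budget, "spend": 0.0}
        return vk

    def authorize(self, vk: str, provider: str, cost: float) -> str:
        meta = self._keys.get(vk)
        if meta is None:
            raise PermissionError("unknown or revoked key")
        if meta["spend"] + cost > meta["max_budget"]:
            raise PermissionError("budget exceeded")
        meta["spend"] += cost
        return self._real[provider]  # used server-side; never shown to the caller

store = KeyStore({"openai": "sk-real-openai-key"})
team_key = store.generate("backend-eng", max_budget=50.0)
real = store.authorize(team_key, "openai", cost=0.02)
```

Revocation is just deleting the virtual key's entry — the real credential never rotates.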

Provider Fallbacks

You define a primary model and one or more fallbacks. If the primary returns a 429 (rate limit), 503, or context-length error, LiteLLM automatically retries with the next provider in the chain. The client sees a single successful response and never knows a failover happened.
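The control flow is a try-in-order loop. The sketch below shows the shape of it under stated assumptions (the provider callables are stand-ins; LiteLLM's real logic also distinguishes 429s, timeouts, and context-length errors, and applies per-error retry policies):

```python
def complete_with_fallbacks(providers, prompt):
    """Try each (name, callable) provider in order; return the first success.
    Minimal sketch of the fallback idea, not LiteLLM's retry logic."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in practice: match specific error types
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

# Hypothetical provider callables standing in for real API clients
def flaky_primary(prompt):
    raise TimeoutError("503 from provider")

def healthy_backup(prompt):
    return f"answer to: {prompt}"

used, answer = complete_with_fallbacks(
    [("gpt-4o", flaky_primary), ("claude-sonnet", healthy_backup)],
    "hello",
)
```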

Semantic Caching

Rather than exact-match caching (where "Hello" and "hello" are different keys), semantic caching uses vector similarity. Two prompts that mean the same thing return the same cached response. In workloads with repetitive queries — customer support, RAG retrieval, code generation templates — this can cut token spend by 30–60%.
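The lookup reduces to a nearest-neighbor check against stored prompt embeddings. A minimal sketch of the idea, using tiny hand-made vectors in place of real embeddings (production systems embed prompts with a model such as text-embedding-ada-002 and search with RediSearch):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cache = {}  # prompt embedding (tuple) -> cached response

def lookup(embedding, threshold=0.9):
    """Return a cached response if any stored prompt is similar enough."""
    for cached_emb, response in cache.items():
        if cosine(embedding, cached_emb) >= threshold:
            return response
    return None

cache[(0.9, 0.1, 0.0)] = "Embeddings map text to vectors."
hit = lookup((0.88, 0.12, 0.01))   # near-duplicate prompt -> cache hit
miss = lookup((0.0, 0.1, 0.95))    # unrelated prompt -> cache miss
```

The threshold is the whole game: too low and unrelated prompts share answers, too high and you pay for near-duplicates. The "Common Mistakes" section below returns to this trade-off.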


Quick Start: Docker Compose in 5 Minutes

The fastest path to a running LiteLLM proxy is Docker Compose. You need Docker and your provider API keys.

1. Create the project folder and config file:

# litellm-config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-3-pro
      api_key: os.environ/GEMINI_API_KEY

litellm_settings:
  fallbacks:
    - {"gpt-4o": ["claude-sonnet", "gemini-pro"]}
  num_retries: 3
  request_timeout: 60

2. Create the .env file:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
LITELLM_MASTER_KEY=sk-litellm-master-key-change-this
LITELLM_SALT_KEY=sk-salt-change-this-too

3. Launch with Docker Compose:

# docker-compose.yml
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    env_file:
      - .env
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    restart: unless-stopped

docker compose up -d

4. Test it:

curl http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer sk-litellm-master-key-change-this" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello from LiteLLM!"}]
  }'

Your proxy is now running. Any existing OpenAI SDK client can point to http://localhost:4000 and work without code changes.


Controlling Costs with Virtual Keys and Budgets

This is where LiteLLM earns its place in production. The budget system has six layers of granularity: organization → team → project → user → key → end-user. Constraints at higher levels act as global ceilings.

Creating a team and setting a monthly budget:

# Create a team
curl http://localhost:4000/team/new \
  -H "Authorization: Bearer sk-litellm-master-key-change-this" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "backend-eng",
    "max_budget": 200,
    "budget_duration": "1mo",
    "models": ["gpt-4o", "claude-sonnet"]
  }'

# Create a virtual key for that team
curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-litellm-master-key-change-this" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "<team_id_from_above>",
    "key_alias": "backend-service-prod",
    "max_budget": 50,
    "budget_duration": "1mo",
    "max_parallel_requests": 10,
    "tpm_limit": 100000
  }'

The team gets a sk-... virtual key. They never touch your actual OpenAI or Anthropic credentials. If they hit the $50 limit, requests fail with a budget-exceeded error — not a real provider auth error.

Adding Redis for semantic caching:

# Add to litellm-config.yaml
litellm_settings:
  cache: true
  cache_params:
    type: redis            # exact-match caching
    host: redis
    port: 6379
    ttl: 3600

# For semantic caching, switch the cache type to redis-semantic
# (requires Redis Stack with the RediSearch module):
#   cache_params:
#     type: redis-semantic
#     similarity_threshold: 0.9
#     redis_semantic_cache_embedding_model: text-embedding-ada-002

# Add to docker-compose.yml
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"

With similarity threshold at 0.9, two prompts that are 90%+ semantically similar share a cached response. Tune this value based on your tolerance for approximate answers.


Provider Fallbacks and Load Balancing

For high-availability setups, LiteLLM supports both fallbacks (different providers) and load balancing (same model across multiple deployments).

Multi-provider fallback config:

model_list:
  - model_name: fast-llm
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500

  - model_name: fast-llm
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
      rpm: 300

  - model_name: fast-llm
    litellm_params:
      model: gemini/gemini-3-flash
      api_key: os.environ/GEMINI_API_KEY
      rpm: 400

router_settings:
  routing_strategy: least-busy          # or: simple-shuffle, usage-based-routing
  # Fallback targets must be model_name groups defined in model_list —
  # here, the gpt-4o and claude-sonnet groups from the earlier config
  fallbacks: [{"fast-llm": ["claude-sonnet"]}]
  context_window_fallbacks: [{"fast-llm": ["gpt-4o"]}]
  num_retries: 2
  retry_after: 5

When you call model: fast-llm, LiteLLM distributes requests across all three deployments based on current load. If one hits its RPM limit, traffic automatically shifts. The context_window_fallbacks field handles the specific case where a prompt is too long for the primary model — it escalates to a model with a larger context window rather than returning an error.
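The least-busy strategy can be pictured as "pick the deployment with the fewest in-flight requests, skipping any over their RPM budget." This is a sketch of the routing idea only — LiteLLM's router additionally tracks TPM, cooldown windows, and latency — and the deployment names mirror the config above:

```python
from collections import Counter

class LeastBusyRouter:
    """Pick the deployment with the fewest in-flight requests,
    skipping any that have exhausted their per-minute request budget."""

    def __init__(self, deployments):
        self.rpm = dict(deployments)          # deployment -> rpm limit
        self.in_flight = Counter()
        self.used_this_minute = Counter()

    def pick(self):
        available = [d for d in self.rpm
                     if self.used_this_minute[d] < self.rpm[d]]
        if not available:
            raise RuntimeError("all deployments at their RPM limit")
        choice = min(available, key=lambda d: self.in_flight[d])
        self.in_flight[choice] += 1
        self.used_this_minute[choice] += 1
        return choice

    def finish(self, deployment):
        self.in_flight[deployment] -= 1

router = LeastBusyRouter([("gpt-4o-mini", 500), ("claude-haiku-4-5", 300)])
first = router.pick()    # both idle, so ties go to the first deployment
second = router.pick()   # first is now busy, so the other deployment wins
```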


Production Observability

LiteLLM integrates with observability platforms directly from config. For self-hosted observability, pairing it with Langfuse is a common pattern — LiteLLM handles routing and cost enforcement while Langfuse provides trace visualization and evaluation.

# Add to litellm-config.yaml
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: pk-lf-...
  LANGFUSE_SECRET_KEY: sk-lf-...
  LANGFUSE_HOST: https://cloud.langfuse.com  # or your self-hosted URL

LiteLLM also supports Prometheus metrics natively (Enterprise Basic tier, $250/month), which integrates with standard Grafana dashboards.


LiteLLM vs. Alternatives

Gateway     License          Language  Best For                                Latency Overhead
LiteLLM     MIT (open core)  Python    Python teams, max provider coverage     ~50 µs @ 5K RPS
Portkey     Cloud SaaS       Any       Production observability + governance   Managed
Bifrost     Apache 2.0       Go        High-throughput (500+ RPS) enterprise   ~11 µs @ 5K RPS
OpenRouter  Hosted only      Any       Zero-ops, pay-per-use marketplace       Managed
Helicone    MIT (core)       Any       Observability-first teams               ~20 µs

The honest performance note: At 1,000 concurrent users in synthetic load tests, LiteLLM's slowest responses hit 28 seconds while Go-based Bifrost stayed under 50 milliseconds. If you're running 500+ RPS with strict P99 latency SLAs, LiteLLM's Python runtime is a genuine constraint. For the vast majority of internal tools, APIs, and dev-facing products, this is a non-issue.


Common Mistakes to Avoid

1. Running without LITELLM_SALT_KEY set. The salt key encrypts stored provider credentials in the database; without it, credentials are stored in plaintext. Always set both LITELLM_MASTER_KEY and LITELLM_SALT_KEY before your first production deployment.

2. Using the master key as a team key. The master key is for admin operations only. Distribute virtual keys to teams and services so you can revoke access granularly without rotating your master credentials.

3. Setting the semantic cache similarity threshold too low. A threshold of 0.7 means two prompts that are only 70% similar get the same answer. For factual or code-generation tasks, this silently produces wrong answers. Start at 0.95 and lower it only after measuring cache hit rates against correctness.

4. Not configuring context_window_fallbacks. When a user sends a 50,000-token prompt to a model with a 32K limit, you get an error. Adding a context window fallback (pointing to a 128K+ model) handles this case automatically without surfacing it to users.

5. Ignoring the March 2026 supply chain incident. LiteLLM disclosed a suspected supply chain security incident in March 2026 and published a full security update. If you were running the proxy between early and mid-March, audit your deployment. Post-incident, pin your Docker image to a specific version rather than main-latest.


FAQ

Q: Can LiteLLM work with locally-hosted models like Ollama or vLLM?

Yes. Both are supported natively. For Ollama, set model: ollama/llama3.2 and point api_base to your Ollama server. For vLLM, use model: openai/<your-model> and set api_base to your vLLM endpoint. LiteLLM treats them identically to cloud providers, meaning the same fallback and budget logic applies to local models.
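As a config sketch (the model_name, model tag, and host below are example values — adjust them to your local setup), a local Ollama model sits in model_list like any cloud provider:

```yaml
model_list:
  - model_name: local-llama          # example alias; pick your own
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434   # your Ollama server
```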

Q: Does the free open-source version include the admin dashboard?

The admin UI (/ui) is included in the open-source version for basic key management and spend overview. Advanced features like SSO/SAML, RBAC, detailed audit logs, and Prometheus metrics are in the Enterprise Basic tier ($250/month). If your team uses Okta or other identity providers for SSO, that requires Enterprise.

Q: How do I migrate from direct OpenAI SDK calls to LiteLLM proxy?

Point your existing client at the proxy. Update your OpenAI client's base_url to your LiteLLM proxy and pass a virtual key as the api_key:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-virtual-key",        # LiteLLM virtual key
    base_url="http://your-litellm-host:4000"  # LiteLLM proxy URL
)

# All existing code works unchanged
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

No other code changes required. Your existing OpenAI SDK calls route through the proxy.

Q: What's the difference between LiteLLM's fallback and load balancing?

Fallbacks activate on failure — if the primary model returns an error or hits rate limits, LiteLLM tries the next model in the list. Load balancing distributes traffic across multiple deployments of the same logical model — useful when you have multiple Azure OpenAI deployments or want to spread load across regions. You can use both simultaneously: load balance across your primary deployments, with a fallback to a different provider if all primaries are unavailable.

Q: Is LiteLLM suitable for multi-tenant SaaS products?

Yes, with caveats. The virtual key system lets you create isolated spend buckets per customer. However, you should be aware that at high concurrency LiteLLM's Python-based proxy adds measurable overhead, and the open-source tier lacks the per-tenant SSO and fine-grained RBAC that enterprise SaaS products typically need. For high-scale SaaS, evaluate whether Enterprise Premium ($30,000/year) or a Go-based alternative like Bifrost better fits your architecture.


Key Takeaways

  • LiteLLM (v1.83.8, MIT license, 43.5K+ GitHub stars) is the most widely deployed open-source LLM gateway, normalizing 100+ providers behind a single OpenAI-compatible API.
  • The Proxy Server mode is the production deployment path — it centralizes credentials, enforces budgets via virtual keys, and provides fallback routing without changing client code.
  • Semantic caching with Redis can meaningfully cut token costs on workloads with similar or repeated queries — tune the similarity threshold carefully to avoid stale responses.
  • Provider fallbacks and context window fallbacks handle the two most common failure modes: rate limiting and oversized prompts.
  • For teams under 500 RPS on Python stacks, LiteLLM is the default recommendation. For high-throughput or Go-native environments, evaluate Bifrost.
  • Always pin your Docker image to a specific version and rotate credentials after the March 2026 supply chain incident.

Bottom Line

LiteLLM is the fastest path from "we use one LLM provider" to "we have a production-grade AI infrastructure layer." The free tier covers everything most teams need. The Python performance ceiling is real but rarely a blocker — start here, and migrate if and when you actually hit the limits.
