Choosing an AI API
Three providers dominate the market: OpenAI, Anthropic, and Google. All are capable. The right choice depends on your use case, budget, and latency requirements.
| Provider | Best Model | Context Window | Strengths | Price (input/output per M tokens) |
|---|---|---|---|---|
| OpenAI | GPT-4o | 128K tokens | Function calling, ecosystem, speed | $2.50 / $10.00 |
| OpenAI (budget) | GPT-4o-mini | 128K tokens | Fast, cheap, good quality | $0.15 / $0.60 |
| Anthropic | Claude 3.5 Sonnet | 200K tokens | Instruction following, large docs | $3.00 / $15.00 |
| Google | Gemini 1.5 Pro | 1M tokens | Multimodal, massive context | $1.25 / $5.00 |
Scale fact: The OpenAI API processes over 1 billion requests daily. The infrastructure is mature and reliable - 99.9%+ uptime in most months. For production applications, all three major providers are enterprise-ready.
Source: OpenAI developer reports, 2025
For most new integrations, start with GPT-4o-mini. It is fast, cheap, and handles the majority of tasks well. Upgrade specific features to GPT-4o or Claude when quality clearly matters - like for final report generation or complex reasoning tasks.
Authentication and Setup
All three APIs use API key authentication. The key rule: never put API keys in client-side code or commit them to git.
- Get your API key - Create an account at platform.openai.com, console.anthropic.com, or ai.google.dev, then generate an API key from the dashboard.
- Store it securely - Add it to your environment variables: put `OPENAI_API_KEY=sk-...` in a `.env` file (and add `.env` to `.gitignore`). Never hardcode it.
- Install the SDK - Python: `pip install openai anthropic`. JavaScript: `npm install openai @anthropic-ai/sdk`.
- Set spending limits - Before writing any code, set a hard monthly spending limit in the API dashboard. This prevents surprise bills from bugs or loops.
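One easy way to catch a missing key is to check the environment at startup instead of failing on the first API call. A minimal sketch, using only the standard library (`require_api_key` is an illustrative helper, not part of any SDK; if you use a `.env` file, the python-dotenv package's `load_dotenv()` can populate the environment first):

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fail fast at startup if the key is missing, rather than on the first API call."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return key
```

Calling this once at application startup turns a confusing mid-request authentication error into an immediate, clearly worded failure.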
API Key Security is Non-Negotiable
Leaked API keys get found and used within hours by automated scanners. Set up GitHub secret scanning alerts. Rotate your key immediately if you think it has been exposed. A $1,000 bill from a leaked key is not uncommon.
Making Your First API Call
Every AI API follows the same basic pattern: send a messages array with roles (system, user, assistant) and get a completion back.
Python Example - OpenAI
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

your_text = "..."  # the document you want summarized

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize this in 3 bullet points: " + your_text},
    ],
)

print(response.choices[0].message.content)
```
The Anthropic Claude API is similar:
Python Example - Anthropic Claude
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Anthropic API
    messages=[{"role": "user", "content": "Your prompt here"}],
)

print(message.content[0].text)
```
Prompt Engineering for APIs
Prompts in production code behave differently than in a chat interface. You are engineering instructions that will run thousands of times, so precision matters more than conversational flow.
Key principles for API prompt engineering:
- System prompts define behavior: Use the system message to set the persona, format requirements, and constraints. This is where you tell the model to always respond in JSON, to be concise, or to refuse certain topics.
- Be specific about output format: "Respond with a JSON object with keys: 'summary' (string), 'sentiment' ('positive'|'negative'|'neutral'), 'confidence' (0-1 float)." Vague format instructions produce inconsistent output.
- Few-shot examples work: Include 2-3 examples of your desired input/output pair in the prompt. Quality improves significantly for structured extraction tasks.
- Temperature controls creativity: For factual extraction and classification, use temperature 0. For creative writing, use 0.7-1.0. Most business tasks benefit from 0-0.3.
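The principles above can be combined in a single request builder. A minimal sketch (`build_messages` is an illustrative helper; the JSON format spec matches the example given earlier, and the few-shot pair is invented for demonstration):

```python
def build_messages(text: str) -> list[dict]:
    """Assemble a sentiment-analysis request: a strict format spec in the
    system prompt plus one few-shot input/output pair."""
    system = (
        "You are a sentiment classifier. Respond ONLY with a JSON object with keys: "
        "'summary' (string), 'sentiment' ('positive'|'negative'|'neutral'), "
        "'confidence' (0-1 float)."
    )
    return [
        {"role": "system", "content": system},
        # Few-shot example: show the model exactly what a good answer looks like.
        {"role": "user", "content": "The product arrived broken and support ignored me."},
        {"role": "assistant", "content":
            '{"summary": "Broken product, unresponsive support", '
            '"sentiment": "negative", "confidence": 0.95}'},
        {"role": "user", "content": text},
    ]
```

You would pass this list as the `messages` argument with `temperature=0`, since classification is a factual task.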
Streaming reduces perceived latency: Streaming responses (receiving tokens as they generate) reduces perceived latency by 60% compared to waiting for the full response. Users see the first word in under a second rather than waiting 3-10 seconds for a complete response.
Source: OpenAI developer documentation, 2025
Streaming Responses
For any user-facing feature, use streaming. It dramatically improves the perceived responsiveness of your application. The user sees text appearing in real time rather than staring at a loading spinner.
Streaming is one extra parameter and a loop to handle chunks:
Python Streaming Example
```python
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a short story"}],
    stream=True,  # the one extra parameter
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g. the final one)
        print(delta, end="", flush=True)
```
In web applications, use Server-Sent Events (SSE) to stream from your backend to the browser. Your server calls the API with streaming, receives chunks, and immediately forwards them to the browser. The user sees instant responses.
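The forwarding step is mostly a matter of framing each chunk for the SSE protocol. A framework-agnostic sketch (`sse_format` is an illustrative helper; a Flask, FastAPI, or Express route would return this generator with `Content-Type: text/event-stream`, and the `[DONE]` sentinel follows the convention OpenAI's own SSE stream uses):

```python
def sse_format(chunks):
    """Wrap text chunks in Server-Sent Events framing for the browser."""
    for chunk in chunks:
        yield f"data: {chunk}\n\n"   # each SSE event is "data: ...\n\n"
    yield "data: [DONE]\n\n"         # sentinel so the client knows to stop listening
```

On the browser side, an `EventSource` (or a `fetch` reader) consumes these events and appends each chunk to the page as it arrives.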
Error Handling and Rate Limits
Production AI applications need robust error handling. The most common issues are rate limit errors (429), context length exceeded (400), and server errors (500). All require different responses.
Rate limits (429): Implement exponential backoff with jitter. Wait 1 second, retry. If it fails again, wait 2 seconds. Then 4. Then 8. Cap the delay at a sensible maximum (30-60 seconds). Most rate limit issues resolve within 30 seconds.
Context too long (400): Your input exceeds the model's context window. Truncate or summarize the input before retrying. Use tiktoken (OpenAI's library) to count tokens before sending.
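For an exact count, use tiktoken; as a cheap pre-flight safeguard, a rough heuristic also works. A sketch under that assumption (`truncate_to_budget` is an illustrative helper; the ~4 characters per token figure is a common rule of thumb for English prose, not a guarantee):

```python
def truncate_to_budget(text: str, max_tokens: int, chars_per_token: float = 4.0) -> str:
    """Rough pre-flight truncation by character budget. English prose averages
    roughly 4 characters per token; rely on tiktoken for exact counts."""
    budget = int(max_tokens * chars_per_token)
    return text if len(text) <= budget else text[:budget]
```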
Server errors (500): Retry once after 5 seconds. If it fails again, fail gracefully and queue for retry later. Do not hammer a server that is having issues.
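The backoff logic described above can be sketched as a small wrapper. This is illustrative, not a definitive implementation: in real code you would pass `retry_on=(openai.RateLimitError,)` or your SDK's equivalent, and the injectable `sleep` exists only to make the logic testable:

```python
import random
import time

def call_with_backoff(fn, retry_on=(Exception,), max_retries=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry fn() with exponential backoff and jitter on the given exceptions."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(base_delay * 2 ** attempt, max_delay)  # 1, 2, 4, 8, ... capped
            sleep(delay + random.uniform(0, 0.5))  # jitter avoids synchronized retries
```

Wrapping every API call this way turns transient 429s into short delays instead of user-visible failures.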
Always Set a Timeout
Set an explicit timeout on every API call. Without a timeout, a slow or hung request can block your application indefinitely. 30 seconds is reasonable for most requests. For streaming, set a longer timeout or use a token-streaming approach with per-token timeouts.
Cost Optimization
AI API costs can grow fast at scale. Here are the highest-impact optimizations:
- Use smaller models where quality is sufficient: GPT-4o-mini costs roughly one-seventeenth as much as GPT-4o ($0.15 vs $2.50 per million input tokens). For classification, summarization, and simple extraction tasks, the smaller model is usually good enough.
- Cache identical prompts: If you are sending the same system prompt thousands of times, enable prompt caching. This can cut costs by 40-60% on cached tokens.
- Trim your system prompt: Every token costs money. Audit your system prompt - every redundant sentence is wasted spend across millions of calls.
- Batch non-urgent requests: OpenAI's Batch API costs 50% less than real-time calls. For overnight processing jobs, use it.
- Log and monitor spend: Set up alerts when daily spend exceeds a threshold. A bug that loops API calls can rack up hundreds in minutes.
Caching impact: Caching identical prompts can cut API costs by 40-60%. If your system prompt is 2,000 tokens and you send 10,000 requests per day, caching saves you 20 million input tokens daily.
Source: Anthropic prompt caching documentation, 2025
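The arithmetic above is worth reproducing, since it drives the sizing decision (numbers from the text, the $0.15/M input price from the comparison table; the variable names are illustrative):

```python
# 2,000-token system prompt sent on every one of 10,000 daily requests.
system_prompt_tokens = 2_000
requests_per_day = 10_000
cached_tokens_per_day = system_prompt_tokens * requests_per_day
assert cached_tokens_per_day == 20_000_000  # 20M cacheable input tokens per day

# At GPT-4o-mini's $0.15 per million input tokens, those tokens cost:
daily_cost = cached_tokens_per_day / 1_000_000 * 0.15
print(f"${daily_cost:.2f}/day before any caching discount")  # $3.00/day
```

The absolute number scales linearly: the same prompt on a model at $2.50/M input would cost $50/day before caching, which is where the discount starts to matter.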
Production Best Practices
- Never call AI APIs from the client - Always proxy through your backend. This protects your API key and lets you add rate limiting, logging, and validation.
- Log inputs and outputs - Store prompt/completion pairs for debugging and cost analysis. Do not log sensitive user data, but do log enough to reproduce issues.
- Validate outputs - If you expect JSON, validate it. If you expect a specific format, check it. LLMs occasionally produce malformed output even with explicit instructions.
- Add fallback behavior - What happens if the API is down or times out? Show a helpful error message and offer alternatives. Never let an AI API failure bring down core functionality.
- Implement per-user rate limiting - Prevent any single user from generating excessive API costs. Even if the API is per-token, a user in a loop can spike your bill.
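Output validation in particular is cheap to add. A minimal sketch for the JSON format described in the prompt-engineering section (`parse_sentiment_response` is an illustrative helper, not a library function):

```python
import json

def parse_sentiment_response(raw: str) -> dict:
    """Validate an expected-JSON completion before trusting it downstream."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned malformed JSON: {exc}") from exc
    if data.get("sentiment") not in {"positive", "negative", "neutral"}:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError(f"confidence out of range: {conf!r}")
    return data
```

On a `ValueError` you can retry the request once (often the model gets it right the second time) before falling back to your error path.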