Overview
Fireworks AI provides fast, reliable inference for open-source and proprietary language models with a developer-first approach and competitive pricing.
Key Features:
- Sub-second latency with optimized inference stack
- 100+ models including Llama, Mixtral, and proprietary options
- Fully OpenAI-compatible API
- Fine-tuning and model deployment capabilities
Authentication
Fireworks AI uses Bearer token authentication in the OpenAI-compatible format.
Header: Authorization: Bearer <FIREWORKS_API_KEY>
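A minimal sketch of setting this header in Python; the environment variable name is illustrative:

```python
import os

# OpenAI-compatible Bearer auth for Fireworks AI.
# FIREWORKS_API_KEY is an illustrative environment variable name.
headers = {
    "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
    "Content-Type": "application/json",
}
```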
Popular Models (October 2025)
| Model | Context | Description | Use Case |
|---|---|---|---|
| accounts/fireworks/models/llama-v3p3-70b-instruct | 128K | Meta’s Llama 3.3 flagship | General reasoning, coding |
| accounts/fireworks/models/mixtral-8x7b-instruct | 32K | Mistral’s MoE model | Fast, balanced performance |
| accounts/fireworks/models/qwen2p5-72b-instruct | 128K | Alibaba’s Qwen 2.5 | Multilingual, math |
Quick Start Example
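A minimal sketch using the official `openai` Python SDK pointed at Fireworks' OpenAI-compatible base URL (`https://api.fireworks.ai` plus the `/inference/v1` paths listed under Available Endpoints); the environment variable name is illustrative:

```python
import os
from openai import OpenAI

# Point the OpenAI SDK at Fireworks' OpenAI-compatible API.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # illustrative env var name
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",
    messages=[
        {"role": "user", "content": "Explain mixture-of-experts in one paragraph."}
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)
```

Because the API is OpenAI-compatible, existing OpenAI SDK code typically needs only the `base_url` and `api_key` changed.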
Available Endpoints
Fireworks AI supports the following OpenAI-compatible endpoints (an example request follows the table):
| Endpoint | Method | Description |
|---|---|---|
| /inference/v1/chat/completions | POST | Text generation with conversation context |
| /inference/v1/models | GET | List available models |
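A sketch of listing models with `requests`, assuming the same `api.fireworks.ai` host as the quick start:

```python
import os
import requests

resp = requests.get(
    "https://api.fireworks.ai/inference/v1/models",
    headers={"Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()

# OpenAI list format: {"object": "list", "data": [{"id": "...", ...}, ...]}
for model in resp.json()["data"]:
    print(model["id"])
```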
Usage Tracking
Usage data is returned in the response body (OpenAI format):
Location: data.usage
Format: Standard OpenAI usage object
Lava Tracking: Automatically tracked via x-lava-request-id header
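Continuing from the quick start's `response` object, a sketch of reading the usage fields:

```python
# data.usage in the raw JSON; response.usage via the openai SDK.
usage = response.usage
print("prompt_tokens:", usage.prompt_tokens)
print("completion_tokens:", usage.completion_tokens)
print("total_tokens:", usage.total_tokens)
```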
Features & Capabilities
Streaming: ✅ Supported via the standard OpenAI stream parameter
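A streaming sketch with the same client setup as the quick start (environment variable name illustrative):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],  # illustrative env var name
)

# stream=True returns an iterator of chunks in the standard OpenAI format.
stream = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)

for chunk in stream:
    # Guard: some chunks (e.g., the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```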
BYOK Support
Status: ✅ Supported (managed keys + BYOK)
BYOK Implementation:
- Append your Fireworks AI API key to the forward token: ${TOKEN}.${YOUR_FIREWORKS_KEY} (see the sketch after the setup steps)
- Lava tracks usage and billing while you maintain key control
- No additional Lava API key costs (metering-only mode available)
Getting Your API Key:
- Sign up at Fireworks AI Console
- Navigate to API Keys section
- Create a new API key
- Use in Lava forward token (4th segment)
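A sketch of the forward-token construction described above; the environment variable names and the notion of a Lava gateway URL are illustrative assumptions, not documented names:

```python
import os

# Append your Fireworks key to the Lava forward token (4th segment).
# Both environment variable names are illustrative.
forward_token = (
    f"{os.environ['LAVA_FORWARD_TOKEN']}.{os.environ['FIREWORKS_API_KEY']}"
)

headers = {
    "Authorization": f"Bearer {forward_token}",
    "Content-Type": "application/json",
}
# Send requests through your Lava gateway with these headers; Lava meters
# usage (x-lava-request-id) while the Fireworks key remains under your control.
```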
Best Practices
- Model Selection: Use Llama 3.3 for reasoning, Mixtral for speed/cost, Qwen for multilingual
- Model Names: Use full account paths (e.g., accounts/fireworks/models/llama-v3p3-70b-instruct)
- Temperature: 0.7 for creative tasks, 0.1-0.3 for factual outputs (example after this list)
- Context Management: Leverage 128K context models for long-form content
- Error Handling: Fireworks returns standard OpenAI error formats
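For example, reusing the `client` from the quick start, a low-temperature call for factual output (a sketch; the model choice follows the multilingual guidance above):

```python
# Factual task: keep temperature low (0.1-0.3) per the guidance above.
response = client.chat.completions.create(
    model="accounts/fireworks/models/qwen2p5-72b-instruct",
    messages=[
        {"role": "user", "content": "Translate 'good morning' into Japanese and German."}
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```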
Performance Characteristics
Latency: Sub-second first-token latency for most models
Throughput: Optimized for high-concurrency workloads
Reliability: 99.9% uptime SLA for production deployments
Use Cases:
- Production chatbots
- Content generation pipelines
- Code assistance tools
- Multi-modal applications