
Overview

Groq provides lightning-fast AI inference powered by its custom Language Processing Unit (LPU™) architecture, delivering industry-leading speed for open-source models.
Key Features:
  • Ultra-low latency inference (up to 10x faster than GPUs)
  • Fully OpenAI-compatible API
  • Support for top open-source models (Llama, Mixtral, Gemma)
  • Competitive pricing with generous free tier
Official Documentation: Groq Documentation

Authentication

Groq uses Bearer token authentication with the OpenAI-compatible format.
Header:
Authorization: Bearer YOUR_GROQ_API_KEY
Lava Forward Token:
${LAVA_SECRET_KEY}.${CONNECTION_SECRET}.${PRODUCT_SECRET}
For BYOK (Bring Your Own Key):
${LAVA_SECRET_KEY}.${CONNECTION_SECRET}.${PRODUCT_SECRET}.${YOUR_GROQ_API_KEY}
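A minimal sketch of assembling the managed-key (three-segment) forward token in JavaScript; the environment variable names are illustrative, not prescribed by Lava:

// Assemble the Lava forward token from its three segments (managed-key mode).
// Environment variable names here are illustrative.
const forwardToken = [
  process.env.LAVA_SECRET_KEY,
  process.env.CONNECTION_SECRET,
  process.env.PRODUCT_SECRET
].join('.');

// Use it as a standard Bearer token on requests sent through the Lava proxy.
const headers = {
  'Authorization': `Bearer ${forwardToken}`,
  'Content-Type': 'application/json'
};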

Supported Models

Model | Context | Description | Speed
--- | --- | --- | ---
llama-3.3-70b-versatile | 128K | Meta’s Llama 3.3 flagship | ~300 tokens/sec
mixtral-8x7b-32768 | 32K | Mistral’s mixture-of-experts | ~500 tokens/sec
gemma2-9b-it | 8K | Google’s efficient instruction model | ~800 tokens/sec
Pricing: See Groq Pricing for current rates.
Speed Advantage: Groq’s LPU™ architecture delivers 5-10x faster inference than traditional GPU deployments.

Quick Start Example

// 1. Set up your environment variables
const LAVA_FORWARD_TOKEN = process.env.LAVA_FORWARD_TOKEN;

// 2. Define the Groq endpoint
const GROQ_ENDPOINT = 'https://api.groq.com/openai/v1/chat/completions';

// 3. Make the request through Lava
const response = await fetch(
  `https://api.lavapayments.com/v1/forward?u=${encodeURIComponent(GROQ_ENDPOINT)}`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${LAVA_FORWARD_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'llama-3.3-70b-versatile',
      messages: [
        {
          role: 'user',
          content: 'Write a haiku about speed.'
        }
      ],
      temperature: 0.7,
      max_tokens: 100
    })
  }
);

// 4. Parse response and extract usage
const data = await response.json();
console.log('Response:', data.choices[0].message.content);

// 5. Track usage (from response body)
const usage = data.usage;
console.log('Tokens used:', usage.total_tokens);

// 6. Get Lava request ID (from headers)
const requestId = response.headers.get('x-lava-request-id');
console.log('Lava Request ID:', requestId);

Available Endpoints

Groq supports standard OpenAI-compatible endpoints:
Endpoint | Method | Description
--- | --- | ---
/openai/v1/chat/completions | POST | Text generation with conversation context
/openai/v1/models | GET | List available models
/openai/v1/audio/transcriptions | POST | Whisper audio transcription
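For example, listing the available models through the Lava proxy is a plain GET request; this sketch reuses LAVA_FORWARD_TOKEN from the quick start:

// List Groq models through the Lava forward proxy.
const MODELS_ENDPOINT = 'https://api.groq.com/openai/v1/models';

const modelsResponse = await fetch(
  `https://api.lavapayments.com/v1/forward?u=${encodeURIComponent(MODELS_ENDPOINT)}`,
  {
    method: 'GET',
    headers: { 'Authorization': `Bearer ${LAVA_FORWARD_TOKEN}` }
  }
);

const models = await modelsResponse.json();
console.log(models.data.map((m) => m.id)); // OpenAI-style list shape: { data: [{ id, ... }] }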

Usage Tracking

Usage data is returned in the response body (OpenAI format):
{
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 25,
    "total_tokens": 40,
    "queue_time": 0.002,
    "prompt_time": 0.005,
    "completion_time": 0.050,
    "total_time": 0.057
  }
}
Location: data.usage
Format: Standard OpenAI usage object + Groq-specific timing metrics
Lava Tracking: Automatically tracked via the x-lava-request-id header
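As a sketch of how the Groq-specific timing fields can be used, the helper below derives effective generation speed from a usage object like the one above (the function name is illustrative; data and response come from the quick start):

// Derive throughput from Groq's timing metrics.
function logUsage(usage, requestId) {
  const tokensPerSecond = usage.completion_time > 0
    ? usage.completion_tokens / usage.completion_time
    : 0;
  console.log(
    `Lava request ${requestId}: ${usage.total_tokens} tokens, ` +
    `~${tokensPerSecond.toFixed(0)} tokens/sec generation`
  );
}

logUsage(data.usage, response.headers.get('x-lava-request-id'));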

Features & Capabilities

JSON Mode:
{
  "response_format": { "type": "json_object" }
}
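A hedged example of a full request body with JSON mode enabled; as with the OpenAI API, the prompt itself should explicitly ask for JSON output (the prompt text here is illustrative):

// JSON mode request body (sketch): instruct the model to emit JSON and set response_format.
const jsonModeBody = {
  model: 'llama-3.3-70b-versatile',
  messages: [
    { role: 'system', content: 'Reply only with valid JSON.' },
    { role: 'user', content: 'List three fast animals as a JSON array of objects with "name" and "top_speed_kmh".' }
  ],
  response_format: { type: 'json_object' }
};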
Streaming:
{
  "stream": true
}
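With stream: true the response arrives as OpenAI-style server-sent events. A minimal sketch of reading them with fetch, reusing LAVA_FORWARD_TOKEN and GROQ_ENDPOINT from the quick start (chunk parsing is simplified and partial-line buffering is omitted):

// Stream a chat completion through Lava and print tokens as they arrive.
const streamResponse = await fetch(
  `https://api.lavapayments.com/v1/forward?u=${encodeURIComponent(GROQ_ENDPOINT)}`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${LAVA_FORWARD_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'llama-3.3-70b-versatile',
      messages: [{ role: 'user', content: 'Write a haiku about speed.' }],
      stream: true
    })
  }
);

// Each SSE line looks like `data: {...}`; the stream ends with `data: [DONE]`.
const reader = streamResponse.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (!line.startsWith('data: ') || line.includes('[DONE]')) continue;
    const chunk = JSON.parse(line.slice('data: '.length));
    process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
  }
}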
Audio Transcription (Whisper):
// Endpoint: https://api.groq.com/openai/v1/audio/transcriptions
const formData = new FormData();
formData.append('file', audioFile);
formData.append('model', 'whisper-large-v3');
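Completing that fragment, a sketch of sending the multipart upload through the Lava forward endpoint; audioFile is assumed to be a File or Blob you supply, and Content-Type is left unset so the runtime can add the multipart boundary:

// Send the Whisper transcription request through Lava, reusing formData from above.
const TRANSCRIBE_ENDPOINT = 'https://api.groq.com/openai/v1/audio/transcriptions';

const transcriptionResponse = await fetch(
  `https://api.lavapayments.com/v1/forward?u=${encodeURIComponent(TRANSCRIBE_ENDPOINT)}`,
  {
    method: 'POST',
    // No explicit Content-Type: fetch sets the multipart boundary for FormData bodies.
    headers: { 'Authorization': `Bearer ${LAVA_FORWARD_TOKEN}` },
    body: formData
  }
);

const transcription = await transcriptionResponse.json();
console.log(transcription.text); // OpenAI-style transcription response: { text: "..." }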

BYOK Support

Status: ✅ Supported (managed keys + BYOK)
BYOK Implementation:
  • Append your Groq API key to the forward token: ${TOKEN}.${YOUR_GROQ_KEY}
  • Lava tracks usage and billing while you maintain key control
  • No additional Lava API key costs (metering-only mode available)
Getting a Groq API Key:
  1. Sign up at Groq Console
  2. Navigate to API Keys section
  3. Create a new API key
  4. Use in Lava forward token (4th segment), as sketched below
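A sketch of assembling the four-segment BYOK token, again with illustrative environment variable names:

// BYOK forward token: the managed three-segment token plus your own Groq API key as a fourth segment.
const byokForwardToken = [
  process.env.LAVA_SECRET_KEY,
  process.env.CONNECTION_SECRET,
  process.env.PRODUCT_SECRET,
  process.env.GROQ_API_KEY
].join('.');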

Best Practices

  1. Model Selection: Use Llama 3.3 for reasoning, Gemma2 for speed, Mixtral for balanced performance
  2. Speed Optimization: Groq excels at streaming - use stream: true for real-time UX
  3. Temperature: Keep between 0.5 and 0.9 for open models, which can turn repetitive at very low temperatures (applied in the sketch after this list)
  4. Context Management: Llama 3.3 supports 128K context - ideal for long documents
  5. Rate Limits: Groq has generous limits - check console for current tier
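The first three practices map directly onto request parameters; a minimal sketch of a request body applying them (model choice and values are illustrative):

// Request body reflecting the practices above (illustrative values).
const chatBody = {
  model: 'llama-3.3-70b-versatile', // strong reasoning; swap in gemma2-9b-it when speed matters most
  messages: [{ role: 'user', content: 'Summarize the following document.' }],
  temperature: 0.7,                 // within the suggested 0.5-0.9 range
  stream: true                      // streaming for real-time UX
};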

Speed Benchmarks

Groq LPU™ vs Traditional GPU:
  • Llama 3.3 70B: ~300 tokens/sec (vs ~30 tokens/sec on GPU)
  • Mixtral 8x7B: ~500 tokens/sec (vs ~50 tokens/sec on GPU)
  • Gemma2 9B: ~800 tokens/sec (vs ~80 tokens/sec on GPU)
Use Cases:
  • Real-time chat applications
  • Low-latency voice assistants
  • Streaming content generation
  • High-throughput batch processing

Additional Resources