How It Works
InfraPrism uses an SDK-only architecture that enables cost tracking without ever seeing your prompts or responses. This page explains how it works under the hood.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ Your Application                                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────┐                                           │
│   │  Your Code  │                                           │
│   └──────┬──────┘                                           │
│          │                                                  │
│          ▼                                                  │
│   ┌─────────────┐   Prompts & Responses    ┌─────────────┐  │
│   │ InfraPrism  │─────────────────────────▶│  OpenAI /   │  │
│   │    SDK      │◀─────────────────────────│  Anthropic  │  │
│   └──────┬──────┘                          └─────────────┘  │
│          │                                                  │
│          │ Metadata only                                    │
│          │ (async, batched)                                 │
│          ▼                                                  │
└──────────┼──────────────────────────────────────────────────┘
           │
           │ Token counts, model, latency, cost, tags
           ▼
    ┌─────────────┐
    │ InfraPrism  │
    │    Cloud    │
    └─────────────┘
The SDK-Only Approach
Traditional observability tools use a proxy architecture:
Your App ──▶ Proxy Server ──▶ LLM Provider
             (sees all data)
InfraPrism uses an SDK-only approach:
Your App (with SDK) ──▶ LLM Provider
    │
    └──▶ InfraPrism (metadata only)
Why This Matters
- Privacy - Your prompts never leave your infrastructure
- Compliance - HIPAA, PCI, and other regulations are easier to meet
- Performance - No proxy latency
- Reliability - No dependency on a third-party proxy
Data Flow
Step 1: You Make an API Call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    entity_type="customer",
    entity_id="acme-corp",
)
Step 2: SDK Intercepts and Forwards
The SDK:
- Captures the request metadata (model, timestamp)
- Forwards the full request to OpenAI/Anthropic
- Receives the full response
- Returns the response to your code
Your prompts and responses go directly to the LLM provider.
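The intercept-and-forward pattern can be sketched as follows. `fake_provider_call`, `tracked_call`, and `collected_metadata` are illustrative stand-ins for this page, not the SDK's real internals:

```python
import time

def fake_provider_call(**request):
    """Stand-in for the real OpenAI/Anthropic API call."""
    return {
        "model": request["model"],
        "usage": {"prompt_tokens": 150, "completion_tokens": 500},
    }

collected_metadata = []

def tracked_call(**request):
    start = time.monotonic()
    # The full request goes straight to the provider -- nothing is proxied.
    response = fake_provider_call(**request)
    latency_ms = (time.monotonic() - start) * 1000
    # Only metadata is recorded; prompt and response content are never stored.
    collected_metadata.append({
        "model": response["model"],
        "input_tokens": response["usage"]["prompt_tokens"],
        "output_tokens": response["usage"]["completion_tokens"],
        "latency_ms": latency_ms,
    })
    # The unmodified response is returned to your code.
    return response

response = tracked_call(model="gpt-4o", messages=[{"role": "user", "content": "hi"}])
```

Note that the recorded dictionary never includes the `messages` payload, only values derived from the response's usage data.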
Step 3: Metadata Extraction
After the response, the SDK extracts:
- Input token count
- Output token count
- Model identifier
- Request latency
- Your entity tags
- Your custom tags
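The extracted fields can be pictured as a single record. The field names below mirror the uploaded event shown in Step 5; the SDK's actual internal types may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UsageEvent:
    """Illustrative shape of one extracted metadata record."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    entity_type: Optional[str] = None
    entity_id: Optional[str] = None
    tags: dict = field(default_factory=dict)

event = UsageEvent(
    model="gpt-4o",
    input_tokens=150,
    output_tokens=500,
    latency_ms=1200.0,
    entity_type="customer",
    entity_id="acme-corp",
    tags={"feature": "chatbot"},
)
```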
Step 4: Cost Calculation
The SDK calculates cost locally using current pricing:
Cost = (input_tokens × input_price) + (output_tokens × output_price)
Prices are updated regularly and cached in the SDK.
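In code, the calculation is a simple per-token multiplication. The rates below are placeholders for illustration, not InfraPrism's cached pricing table:

```python
# Placeholder per-token rates (USD) -- not the SDK's actual cached prices.
PRICES_PER_TOKEN = {
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = input_tokens * input_price + output_tokens * output_price."""
    p = PRICES_PER_TOKEN[model]
    return input_tokens * p["input"] + output_tokens * p["output"]

cost = calculate_cost("gpt-4o", 150, 500)  # 0.005375 with the placeholder rates
```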
Step 5: Async Upload
Metadata is added to a batch queue and uploaded asynchronously:
# This happens in a background thread
{
    "timestamp": "2025-01-15T10:30:00Z",
    "model": "gpt-4o",
    "input_tokens": 150,
    "output_tokens": 500,
    "latency_ms": 1200,
    "cost_usd": 0.0065,
    "entity_type": "customer",
    "entity_id": "acme-corp",
    "tags": {"feature": "chatbot"},
    "success": true
}
Note: No prompt or response content is included.
Batching and Efficiency
To minimize overhead, the SDK batches metadata:
- Events are queued in memory
- Batches are uploaded every 5 seconds (configurable)
- Or when the batch reaches 100 events
- Or on graceful shutdown
This means:
- Minimal network overhead
- No latency impact on your calls
- Efficient use of resources
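The flush policy above (size threshold, time interval, or shutdown) can be sketched as a small queue. The thresholds match the documented defaults, but the class itself is illustrative; the real SDK also runs this off the main thread:

```python
import time

class BatchQueue:
    """Illustrative batch queue: flush at 100 events, every 5 s, or on shutdown."""

    def __init__(self, max_batch=100, flush_interval=5.0):
        self.events = []
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()
        self.uploaded = []  # stand-in for batches sent over HTTP

    def add(self, event):
        self.events.append(event)
        # Flush when the batch is full or the interval has elapsed.
        if (len(self.events) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.events:
            self.uploaded.append(self.events)  # one upload per batch
            self.events = []
        self.last_flush = time.monotonic()

q = BatchQueue()
for i in range(250):
    q.add({"event": i})
q.flush()  # graceful shutdown flushes the remainder
```

With 250 events, this uploads two full batches of 100 and a final batch of 50 at shutdown.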
Failure Handling
If InfraPrism is unreachable:
- Your LLM calls continue working - We never block your application
- Events are queued - Up to 1000 events buffered locally
- Automatic retry - Failed batches retry with exponential backoff
- Graceful degradation - Events are dropped only if the buffer is full
# This always works, even if InfraPrism is down
response = client.chat.completions.create(...)
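The degradation behavior can be sketched with a bounded buffer and exponential backoff. The drop-oldest policy and the backoff constants here are illustrative assumptions, not confirmed SDK internals (the documented guarantees are only the 1000-event buffer and exponential retry):

```python
from collections import deque

# Bounded local buffer: a deque with maxlen drops the oldest entry
# when a new one is appended to a full buffer (assumed policy).
buffer = deque(maxlen=1000)

def backoff_delays(base=1.0, cap=60.0, attempts=6):
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return [min(base * 2 ** n, cap) for n in range(attempts)]

# Simulate 1200 events while InfraPrism is unreachable:
for i in range(1200):
    buffer.append(i)  # oldest 200 events are dropped; LLM calls are unaffected
```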
Token Counting
OpenAI
For OpenAI, token counts come from the API response:
response.usage.prompt_tokens # Input tokens
response.usage.completion_tokens # Output tokens
Anthropic
For Anthropic, token counts also come from the response:
response.usage.input_tokens # Input tokens
response.usage.output_tokens # Output tokens
Streaming
For streaming responses, tokens are counted after the stream completes. The SDK buffers the stream, counts tokens, and reports a single event.
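A stream wrapper of this kind can be sketched as a generator that yields chunks to the caller while tallying, then reports one event when the stream ends. The chunk shape and per-chunk counting here are illustrative; in practice token counts come from the provider's usage data:

```python
events = []

def tracked_stream(chunks):
    """Yield chunks to the caller while accumulating a token tally."""
    output_tokens = 0
    for chunk in chunks:
        output_tokens += chunk["tokens"]  # accumulate as chunks pass through
        yield chunk                       # the caller still streams normally
    # One event is reported only after the stream completes.
    events.append({"output_tokens": output_tokens})

consumed = list(tracked_stream([{"tokens": 3}, {"tokens": 5}, {"tokens": 2}]))
```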
Security Model
What We Receive
| Data | Included |
|---|---|
| Token counts | ✅ Yes |
| Model identifier | ✅ Yes |
| Latency | ✅ Yes |
| Calculated cost | ✅ Yes |
| Entity tags | ✅ Yes |
| Custom tags | ✅ Yes |
| Timestamp | ✅ Yes |
| Success/failure | ✅ Yes |
What We Never Receive
| Data | Included |
|---|---|
| Prompt content | ❌ Never |
| Response content | ❌ Never |
| System messages | ❌ Never |
| Function/tool definitions | ❌ Never |
| Function/tool results | ❌ Never |
| Images | ❌ Never |
| Audio | ❌ Never |
| API keys | ❌ Never |
Open Source SDK
Our SDK is open source, so you can inspect exactly what data is collected.
Next Steps
- Privacy Architecture - Deep dive on privacy
- Data We Collect - Complete data inventory
- Configuration - Customize SDK behavior