How It Works

InfraPrism uses a unique SDK-only architecture that enables cost tracking without ever seeing your prompts or responses. This page explains how it works under the hood.

Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                        Your Application                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┐                                                │
│   │ Your Code   │                                                │
│   └──────┬──────┘                                                │
│          │                                                       │
│          ▼                                                       │
│   ┌─────────────┐     Prompts & Responses      ┌─────────────┐  │
│   │ InfraPrism  │─────────────────────────────▶│  OpenAI /   │  │
│   │    SDK      │◀─────────────────────────────│  Anthropic  │  │
│   └──────┬──────┘                              └─────────────┘  │
│          │                                                       │
│          │ Metadata only                                         │
│          │ (async, batched)                                      │
│          ▼                                                       │
└──────────┼───────────────────────────────────────────────────────┘

           │ Token counts, model, latency, cost, tags

    ┌─────────────┐
    │ InfraPrism  │
    │   Cloud     │
    └─────────────┘

The SDK-Only Approach

Traditional observability tools use a proxy architecture:

Your App ──▶ Proxy Server ──▶ LLM Provider
              (sees all data)

InfraPrism uses an SDK-only approach:

Your App (with SDK) ──▶ LLM Provider

       └──▶ InfraPrism (metadata only)

Why This Matters

  1. Privacy - Your prompts never leave your infrastructure
  2. Compliance - HIPAA, PCI, and other regulations are easier to meet
  3. Performance - No proxy latency
  4. Reliability - No dependency on a third-party proxy

Data Flow

Step 1: You Make an API Call

# client is the InfraPrism-wrapped OpenAI client; entity_type and
# entity_id are InfraPrism tags, not OpenAI parameters
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    entity_type="customer",
    entity_id="acme-corp",
)

Step 2: SDK Intercepts and Forwards

The SDK:

  1. Captures the request metadata (model, timestamp)
  2. Forwards the full request to OpenAI/Anthropic
  3. Receives the full response
  4. Returns the response to your code

Your prompts and responses go directly to the LLM provider.
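One way to picture the intercept-and-forward step is a thin wrapper around the provider client. This is a minimal sketch, not the actual SDK internals; `TrackedCompletions` and the `record` callback are illustrative names:

```python
import time

class TrackedCompletions:
    """Illustrative wrapper: forwards calls unchanged, records metadata only."""

    def __init__(self, inner, record):
        self._inner = inner      # the real provider call (e.g. OpenAI's create)
        self._record = record    # callback that enqueues metadata for upload

    def create(self, entity_type=None, entity_id=None, **kwargs):
        # The tracking tags are consumed here and never forwarded,
        # so the provider sees exactly the request you wrote.
        start = time.monotonic()
        response = self._inner(**kwargs)
        latency_ms = int((time.monotonic() - start) * 1000)
        self._record({
            "model": kwargs.get("model"),
            "latency_ms": latency_ms,
            "entity_type": entity_type,
            "entity_id": entity_id,
        })
        return response
```

Note that the wrapper returns the provider's response untouched; only the metadata dict goes to the recording callback.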

Step 3: Metadata Extraction

After the response, the SDK extracts:

  • Input token count
  • Output token count
  • Model identifier
  • Request latency
  • Your entity tags
  • Your custom tags

Step 4: Cost Calculation

The SDK calculates cost locally using current pricing:

Cost = (input_tokens × input_price) + (output_tokens × output_price)

Prices are updated regularly and cached in the SDK.
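Sketched in code, with an illustrative price cache (USD per 1M tokens; example figures, not a live price list):

```python
# Illustrative pricing cache, keyed by model identifier.
# Example figures only -- the SDK refreshes real prices regularly.
PRICES_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = (input_tokens x input_price) + (output_tokens x output_price)."""
    p = PRICES_PER_1M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Because the calculation happens locally against the cached table, no extra network round trip is needed per call.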

Step 5: Async Upload

Metadata is added to a batch queue and uploaded asynchronously:

# This happens in a background thread
{
    "timestamp": "2025-01-15T10:30:00Z",
    "model": "gpt-4o",
    "input_tokens": 150,
    "output_tokens": 500,
    "latency_ms": 1200,
    "cost_usd": 0.0065,
    "entity_type": "customer",
    "entity_id": "acme-corp",
    "tags": {"feature": "chatbot"},
    "success": true
}

Note: No prompt or response content is included.

Batching and Efficiency

To minimize overhead, the SDK batches metadata:

  • Events are queued in memory
  • Batches are uploaded every 5 seconds (configurable)
  • Or when the batch reaches 100 events
  • Or on graceful shutdown

This means:

  • Minimal network overhead
  • No latency impact on your calls
  • Efficient use of resources
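The batching rules above can be sketched as follows (illustrative; `BatchQueue` is not the SDK's actual class, and a real implementation would flush from a background thread rather than checking on each add):

```python
import threading
import time

class BatchQueue:
    """Illustrative batcher: flushes on an interval, at max size, or on shutdown."""

    def __init__(self, upload, interval_s=5.0, max_batch=100):
        self._upload = upload          # callback that ships a batch upstream
        self._interval = interval_s    # configurable flush interval
        self._max = max_batch          # size-based flush threshold
        self._events = []
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def add(self, event):
        with self._lock:
            self._events.append(event)
            due = (len(self._events) >= self._max
                   or time.monotonic() - self._last_flush >= self._interval)
        if due:
            self.flush()

    def flush(self):
        """Also called on graceful shutdown so no queued events are lost."""
        with self._lock:
            batch, self._events = self._events, []
            self._last_flush = time.monotonic()
        if batch:
            self._upload(batch)
```

Swapping the in-memory list atomically under the lock keeps `add` cheap on the caller's thread; the upload itself happens outside the lock.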

Failure Handling

If InfraPrism is unreachable:

  1. Your LLM calls continue working - We never block your application
  2. Events are queued - Up to 1000 events buffered locally
  3. Automatic retry - Failed batches retry with exponential backoff
  4. Graceful degradation - Events are dropped only if the buffer is full

# This always works, even if InfraPrism is down
response = client.chat.completions.create(...)
Token Counting

OpenAI

For OpenAI, token counts come from the API response:

response.usage.prompt_tokens      # Input tokens
response.usage.completion_tokens  # Output tokens

Anthropic

For Anthropic, token counts also come from the response:

response.usage.input_tokens   # Input tokens
response.usage.output_tokens  # Output tokens

Streaming

For streaming responses, tokens are counted after the stream completes. The SDK forwards each chunk to your code as it arrives while accumulating the stream internally; once the stream ends, it counts tokens and reports a single event.
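One possible shape for such a streaming wrapper, assuming the provider attaches a usage object to the final chunk (as OpenAI does when `stream_options={"include_usage": True}` is set); the class name is illustrative and the field names here follow the Anthropic convention:

```python
class TrackedStream:
    """Illustrative streaming wrapper: yields chunks to your code as they
    arrive, then reports a single usage event when the stream ends."""

    def __init__(self, chunks, report):
        self._chunks = chunks    # the provider's chunk iterator
        self._report = report    # callback that enqueues the metadata event

    def __iter__(self):
        usage = None
        for chunk in self._chunks:
            # Providers can attach token usage to the final chunk.
            if getattr(chunk, "usage", None) is not None:
                usage = chunk.usage
            yield chunk          # pass through immediately -- no added latency
        if usage is not None:
            self._report({"input_tokens": usage.input_tokens,
                          "output_tokens": usage.output_tokens})
```

Chunks are yielded as soon as they arrive, so streaming latency is unaffected; only the final accounting waits for the stream to finish.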

Security Model

What We Receive

Data                Included
Token counts        ✅ Yes
Model identifier    ✅ Yes
Latency             ✅ Yes
Calculated cost     ✅ Yes
Entity tags         ✅ Yes
Custom tags         ✅ Yes
Timestamp           ✅ Yes
Success/failure     ✅ Yes

What We Never Receive

Data                       Included
Prompt content             ❌ Never
Response content           ❌ Never
System messages            ❌ Never
Function/tool definitions  ❌ Never
Function/tool results      ❌ Never
Images                     ❌ Never
Audio                      ❌ Never
API keys                   ❌ Never

Open Source SDK

Our SDK is open source, so you can inspect exactly what data is collected and verify that no prompt or response content ever leaves your application.

Next Steps