How It Works

InfraPrism uses a unique SDK-only architecture that enables cost tracking without ever seeing your prompts or responses. This page explains how it works under the hood.

Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                        Your Application                          │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌─────────────┐                                                │
│   │ Your Code   │                                                │
│   └──────┬──────┘                                                │
│          │                                                       │
│          ▼                                                       │
│   ┌─────────────┐     Prompts & Responses      ┌─────────────┐  │
│   │ InfraPrism  │─────────────────────────────▶│  OpenAI /   │  │
│   │    SDK      │◀─────────────────────────────│  Anthropic  │  │
│   └──────┬──────┘                              └─────────────┘  │
│          │                                                       │
│          │ Metadata only                                         │
│          │ (async, batched)                                      │
│          ▼                                                       │
└──────────┼───────────────────────────────────────────────────────┘

           │ Token counts, model, latency, cost, tags

    ┌─────────────┐
    │ InfraPrism  │
    │   Cloud     │
    └─────────────┘

The SDK-Only Approach

Traditional observability tools use a proxy architecture:

Your App ──▶ Proxy Server ──▶ LLM Provider
              (sees all data)

InfraPrism uses an SDK-only approach:

Your App (with SDK) ──▶ LLM Provider

       └──▶ InfraPrism (metadata only)

Why This Matters

  1. Privacy - Your prompts never leave your infrastructure
  2. Compliance - HIPAA, PCI, and other regulations are easier to meet
  3. Performance - No proxy latency
  4. Reliability - No dependency on a third-party proxy

Data Flow

Step 1: You Make an API Call

# client is the InfraPrism-wrapped OpenAI client; entity_type and
# entity_id are InfraPrism tags, not OpenAI parameters
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    entity_type="customer",
    entity_id="acme-corp",
)

Step 2: SDK Intercepts and Forwards

The SDK:

  1. Captures the request metadata (model, timestamp)
  2. Forwards the full request to OpenAI/Anthropic
  3. Receives the full response
  4. Returns the response to your code

Your prompts and responses go directly to the LLM provider.
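One way to picture the intercept-and-forward step is a thin wrapper around the provider client. This is a minimal sketch, not the actual SDK internals; `TrackedCompletions` and the `record` callback are illustrative names:

```python
import time

class TrackedCompletions:
    """Illustrative wrapper: forwards calls unchanged, records metadata only."""

    def __init__(self, inner, record):
        self._inner = inner      # the real provider call (e.g. OpenAI's create)
        self._record = record    # callback that enqueues metadata for upload

    def create(self, entity_type=None, entity_id=None, **kwargs):
        # The tracking tags are consumed here and never forwarded,
        # so the provider sees exactly the request you wrote.
        start = time.monotonic()
        response = self._inner(**kwargs)
        latency_ms = int((time.monotonic() - start) * 1000)
        self._record({
            "model": kwargs.get("model"),
            "latency_ms": latency_ms,
            "entity_type": entity_type,
            "entity_id": entity_id,
        })
        return response
```

Note that the wrapper returns the provider's response untouched; only the metadata dict goes to the recording callback.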

Step 3: Metadata Extraction

After the response, the SDK extracts:

  • Input token count
  • Output token count
  • Model identifier
  • Request latency
  • Your entity tags
  • Your custom tags

Step 4: Cost Calculation

The SDK calculates cost locally using current pricing:

Cost = (input_tokens × input_price) + (output_tokens × output_price)

Prices are updated regularly and cached in the SDK.
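Sketched in code, with an illustrative price cache (USD per 1M tokens; example figures, not a live price list):

```python
# Illustrative pricing cache, keyed by model identifier.
# Example figures only -- the SDK refreshes real prices regularly.
PRICES_PER_1M = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = (input_tokens x input_price) + (output_tokens x output_price)."""
    p = PRICES_PER_1M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

Because the calculation happens locally against the cached table, no extra network round trip is needed per call.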

Step 5: Async Upload

Metadata is added to a batch queue and uploaded asynchronously:

# This happens in a background thread
{
    "timestamp": "2025-01-15T10:30:00Z",
    "model": "gpt-4o",
    "input_tokens": 150,
    "output_tokens": 500,
    "latency_ms": 1200,
    "cost_usd": 0.0065,
    "entity_type": "customer",
    "entity_id": "acme-corp",
    "tags": {"feature": "chatbot"},
    "success": true
}

Note: No prompt or response content is included.

Batching and Efficiency

To minimize overhead, the SDK batches metadata:

  • Events are queued in memory
  • Batches are uploaded every 5 seconds (configurable)
  • Or when the batch reaches 100 events
  • Or on graceful shutdown

This means:

  • Minimal network overhead
  • No latency impact on your calls
  • Efficient use of resources
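The batching rules above can be sketched as follows (illustrative; `BatchQueue` is not the SDK's actual class, and a real implementation would flush from a background thread rather than checking on each add):

```python
import threading
import time

class BatchQueue:
    """Illustrative batcher: flushes on an interval, at max size, or on shutdown."""

    def __init__(self, upload, interval_s=5.0, max_batch=100):
        self._upload = upload          # callback that ships a batch upstream
        self._interval = interval_s    # configurable flush interval
        self._max = max_batch          # size-based flush threshold
        self._events = []
        self._lock = threading.Lock()
        self._last_flush = time.monotonic()

    def add(self, event):
        with self._lock:
            self._events.append(event)
            due = (len(self._events) >= self._max
                   or time.monotonic() - self._last_flush >= self._interval)
        if due:
            self.flush()

    def flush(self):
        """Also called on graceful shutdown so no queued events are lost."""
        with self._lock:
            batch, self._events = self._events, []
            self._last_flush = time.monotonic()
        if batch:
            self._upload(batch)
```

Swapping the in-memory list atomically under the lock keeps `add` cheap on the caller's thread; the upload itself happens outside the lock.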

Failure Handling

If InfraPrism is unreachable:

  1. Your LLM calls continue working - We never block your application
  2. Events are queued - Up to 1000 events buffered locally
  3. Automatic retry - Failed batches retry with exponential backoff
  4. Graceful degradation - Events are dropped only if the buffer is full

# This always works, even if InfraPrism is down
response = client.chat.completions.create(...)
Token Counting

OpenAI

For OpenAI, token counts come from the API response:

response.usage.prompt_tokens      # Input tokens
response.usage.completion_tokens  # Output tokens

Anthropic

For Anthropic, token counts also come from the response:

response.usage.input_tokens   # Input tokens
response.usage.output_tokens  # Output tokens

Streaming

For streaming responses, tokens are counted after the stream completes. The SDK forwards each chunk to your code as it arrives while accumulating the stream internally; once the stream ends, it counts tokens and reports a single event.
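One possible shape for such a streaming wrapper, assuming the provider attaches a usage object to the final chunk (as OpenAI does when `stream_options={"include_usage": True}` is set); the class name is illustrative and the field names here follow the Anthropic convention:

```python
class TrackedStream:
    """Illustrative streaming wrapper: yields chunks to your code as they
    arrive, then reports a single usage event when the stream ends."""

    def __init__(self, chunks, report):
        self._chunks = chunks    # the provider's chunk iterator
        self._report = report    # callback that enqueues the metadata event

    def __iter__(self):
        usage = None
        for chunk in self._chunks:
            # Providers can attach token usage to the final chunk.
            if getattr(chunk, "usage", None) is not None:
                usage = chunk.usage
            yield chunk          # pass through immediately -- no added latency
        if usage is not None:
            self._report({"input_tokens": usage.input_tokens,
                          "output_tokens": usage.output_tokens})
```

Chunks are yielded as soon as they arrive, so streaming latency is unaffected; only the final accounting waits for the stream to finish.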

Security Model

What We Receive

Data                Included
Token counts        ✅ Yes
Model identifier    ✅ Yes
Latency             ✅ Yes
Calculated cost     ✅ Yes
Entity tags         ✅ Yes
Custom tags         ✅ Yes
Timestamp           ✅ Yes
Success/failure     ✅ Yes

What We Never Receive

Data                       Included
Prompt content             ❌ Never
Response content           ❌ Never
System messages            ❌ Never
Function/tool definitions  ❌ Never
Function/tool results      ❌ Never
Images                     ❌ Never
Audio                      ❌ Never
API keys                   ❌ Never

Open Source SDK

Our SDK is open source, so you can inspect exactly what data is collected and verify that no prompt or response content ever leaves your application.

Next Steps