Python SDK
The Python SDK is published as the redraven package and supports Python 3.10+.
Install
Use uv:
uv add redraven
Or with pip:
pip install redraven
Configure
Set your Redraven credentials:
export REDRAVEN_API_KEY="rr_..."
export REDRAVEN_ORGANIZATION_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
export REDRAVEN_BASE_URL="https://app.redraven.fireraven.ai"
Pass only the host URL. The SDK automatically adds the /api/v1 prefix.
The organization ID is the UUID of your Redraven organization (the same scope as the API key). The public API requires X-Organization-Id together with X-API-Key so keys are verified per organization without scanning all keys in the database.
Get the API key and organization ID from your Redraven organization settings.
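A missing variable or a base URL that already carries the /api/v1 prefix is easy to overlook, so it can help to validate the environment before building a client. The following is a minimal sketch, not part of the SDK: `load_settings` and `REQUIRED_VARS` are hypothetical names introduced here for illustration, and the checks only reflect the rules stated above.

```python
import os

REQUIRED_VARS = (
    "REDRAVEN_API_KEY",
    "REDRAVEN_ORGANIZATION_ID",
    "REDRAVEN_BASE_URL",
)


def load_settings(env=None):
    """Hypothetical helper: read the three Redraven variables from a mapping."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))

    base_url = env["REDRAVEN_BASE_URL"].rstrip("/")
    # The SDK adds /api/v1 itself, so the configured value should be host-only.
    if base_url.endswith("/api/v1"):
        raise ValueError("REDRAVEN_BASE_URL should be the host URL without /api/v1")

    return {
        "api_key": env["REDRAVEN_API_KEY"],
        "organization_id": env["REDRAVEN_ORGANIZATION_ID"],
        "base_url": base_url,
    }
```

The returned keys match the keyword arguments api_key, organization_id, and base_url, so the dict could be unpacked into the client constructor if you prefer explicit credentials.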
If you pass credentials directly, include organization_id:
client = redraven.Client(
    api_key="rr_...",
    organization_id="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    base_url="https://app.redraven.fireraven.ai",
)
Core methods
Typical sequence
Redraven runs in three phases. Do not call wait_for_evaluation_ready or expect a finished eval until call_agent has returned — that return means every case response was submitted; only then does the backend start (or finish) scoring.
- Dataset ready: create/generate the test and wait until dataset artifacts exist: generate_test(..., wait_for_dataset=...) and/or wait_for_dataset_ready(test_id).
- Agent phase: run your LLM on each case and submit responses: call_agent(test_id, llm, ...) -> RunAgainstClient. When this await completes, the client-response phase is done and evaluation can proceed on the server.
- Evaluation phase: scoring runs asynchronously. Either wait then read, or block in one call:
  - Wait: wait_for_evaluation_ready(test_id). Call it after the agent phase; it blocks until eval artifacts reach a terminal state.
  - Read: get_eval_summary(test_id, ...). Call it after waiting if you used wait_for_evaluation_ready; use wait_for_completion=True if you want a single call that waits and then returns the summary.
By default, get_eval_summary without waiting performs a single GET (like generate_test without wait_for_dataset). If eval is not ready yet, you get state="pending" (or another non-terminal state) and summary=None.
All-at-once convenience: generate_and_run_test(generate_kwargs, llm, ...) chains dataset generation, call_agent, waiting for eval, and returning a terminal EvalSummary.
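In most cases wait_for_evaluation_ready or wait_for_completion=True is the simpler choice. If you would rather poll the non-blocking get_eval_summary yourself, for instance to interleave other work, the loop could look like the sketch below. `poll_eval_summary` is a hypothetical helper, not an SDK method, and the default `terminal_states` are an assumption taken from the completed / failed manifest states mentioned on this page; adjust them to the states your SDK version actually reports.

```python
import asyncio
import time


async def poll_eval_summary(fetch_summary, *, interval=2.0, timeout=300.0,
                            terminal_states=("completed", "failed")):
    """Call fetch_summary() until its result reaches a terminal state.

    fetch_summary is any zero-argument callable returning an awaitable,
    for example: lambda: client.get_eval_summary(test_id=test_id).
    terminal_states is an assumption; see the note above.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = await fetch_summary()
        if result.state in terminal_states:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"eval still {result.state!r} after {timeout}s")
        # Non-terminal (for example "pending"): wait and ask again.
        await asyncio.sleep(interval)
```

Only start this loop after call_agent has returned; polling earlier just observes a non-terminal state.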
Method reference
| Method | Role |
|---|---|
| generate_test(generate_kwargs, wait_for_dataset=False, ...) -> str | Returns test_id; optionally waits for dataset materialization. |
| wait_for_dataset_ready(test_id, ...) -> None | Blocks until dataset manifest is ready. |
| call_agent(test_id, llm, *, resume=True, ...) -> RunAgainstClient | Submits all case responses; must complete before eval can finish. With resume=True (default) cases that already have an ok or failed response on the server are skipped; pass resume=False to re-run every case. In image-mode tests, cases can include messages payloads; declare messages in your LLM signature to receive multimodal inputs. |
| wait_for_evaluation_ready(test_id, ...) -> None | Waits for server-side scoring to finish (eval manifest completed / failed), not for call_agent. “Agent done” = await call_agent has returned; this is the next phase. On entry it calls ensure_evaluation_from_client_responses once so a stuck run can schedule repair after resume skipped all POSTs. |
| ensure_evaluation_from_client_responses(test_id, *, force=False) | Asks the backend to insert or repair the pipeline job that builds eval artifacts from submitted client responses. Rarely needed directly; used internally before waiting for eval. |
| get_eval_summary(test_id, wait_for_completion=False, ...) -> EvalSummary | Reads eval summary; non-blocking by default, or pass wait_for_completion=True to wait inside this call. |
| generate_and_run_test(generate_kwargs, llm, ...) -> EvalSummary | End-to-end: generate → dataset → agent → wait for eval → summary. |
Reading results via HTTP (without SDK helpers)
After evaluation, you can read metrics, recommendations, and the PDF report with plain HTTP (curl or httpx), using the same credentials as above:
| HTTP route | Role |
|---|---|
| GET /tests/{id}/results | Dashboard-style pass rates by certification and policy (pass_rate as 0.0–1.0) |
| GET /tests/{id}/recommendations | Stored policy recommendations; limit=0 exports all (pass_rate typically 0–100) |
| GET /tests/{id}/report/download | PDF report bytes (after generate + status ready) |
Do not confuse these with the SDK eval pipeline:
- get_eval_summary and GET /tests/{id}/results/summary?kind=eval: pipeline manifest / worker materialization
- GET /tests/{id}/results: stable metrics overview for integrations and exports
The Python SDK does not yet expose typed methods for the three read routes above. Example with httpx:
import os

import httpx

base = os.environ["REDRAVEN_BASE_URL"].rstrip("/")
headers = {
    "X-API-Key": os.environ["REDRAVEN_API_KEY"],
    "X-Organization-Id": os.environ["REDRAVEN_ORGANIZATION_ID"],
}
test_id = "..."
overview = httpx.get(f"{base}/api/v1/tests/{test_id}/results", headers=headers, timeout=30)
overview.raise_for_status()
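The three read routes share the same URL shape and headers, and two of them report pass_rate on different scales. A small sketch of helpers that capture this is shown below. The names `auth_headers`, `read_route`, and `as_percent` are hypothetical and exist only for this example; the scale values follow the table above, where the recommendations scale is described as "typically" 0–100, so verify it against your own responses.

```python
from urllib.parse import urlencode

API_PREFIX = "/api/v1"


def auth_headers(api_key, organization_id):
    """Both headers are sent together on every public API request."""
    return {"X-API-Key": api_key, "X-Organization-Id": organization_id}


def read_route(base_url, test_id, route, **params):
    """Build the full URL for a read route such as 'results' or 'recommendations'."""
    url = f"{base_url.rstrip('/')}{API_PREFIX}/tests/{test_id}/{route}"
    return f"{url}?{urlencode(params)}" if params else url


def as_percent(pass_rate, *, scale):
    """Normalize a pass rate to 0-100.

    Use scale=1 for /results (0.0-1.0) and scale=100 for /recommendations
    (assumed 0-100, per the table above).
    """
    return pass_rate * 100.0 / scale
```

For example, `read_route(base, test_id, "recommendations", limit=0)` yields the export-all URL, which can be passed to httpx.get together with `auth_headers(...)`.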
Pipeline worker and local development
Materializing eval results (GET /tests/{id}/results/summary?kind=eval) is done by the redraven-pipeline-worker process against Postgres (pipeline_jobs). The API and worker must use the same DATABASE_URL; otherwise jobs enqueue in one database while the worker polls another and nothing appears to run.
If every client response is already in object storage and the client manifest is completed, but evaluation never starts, run the worker and retry: wait_for_evaluation_ready triggers POST /tests/{id}/evaluate/ensure-from-client-responses first so the evaluate job can be created or moved back to queued when appropriate. Until the worker processes that job, GET …/summary?kind=eval returns HTTP 404 — that is normal and does not mean the ensure call failed.
Run the worker from the same virtualenv / env as the API, for example: uv run redraven-pipeline-worker.
Image mode and multimodal LLMs
When a test is created with image mode (metadata.modes.image=true), dataset cases can include OpenAI-style messages content (text and/or image blocks). The SDK supports both text-only and multimodal callables:
- Text-only callable (works for all tests):
def llm(prompt: str) -> str
- Multimodal callable (recommended for image understanding):
def llm(prompt: str, messages: list[dict] | None = None) -> str
If your callable does not accept messages, the SDK forwards only prompt and logs a warning when multimodal payloads are present. For image-only rows with no text, the SDK now uses a safe placeholder prompt so runs can still complete and evaluation can be enqueued.
Example multimodal signature:
def my_llm(prompt: str, messages: list[dict] | None = None) -> str:
    payload = messages or [{"role": "user", "content": prompt}]
    # call your provider with payload
    return "..."
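To make the forwarding rule concrete, here is one way a caller could decide whether a callable receives messages. This is an illustrative sketch based on signature inspection and is not taken from the SDK's implementation; `accepts_messages` and `invoke_llm` are hypothetical names, and the sketch handles only synchronous callables.

```python
import inspect
import logging

logger = logging.getLogger("multimodal_sketch")


def accepts_messages(llm):
    """True if the callable has a `messages` parameter or takes **kwargs."""
    params = inspect.signature(llm).parameters.values()
    return any(
        p.name == "messages" or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )


def invoke_llm(llm, prompt, messages=None):
    """Forward `messages` only to callables that declare it."""
    if accepts_messages(llm):
        return llm(prompt, messages=messages)
    if messages:
        # Mirrors the documented behavior: warn, then fall back to prompt only.
        logger.warning("multimodal payload dropped: callable takes only prompt")
    return llm(prompt)
```

With this rule a text-only callable keeps working on image-mode tests, but it never sees the image blocks, which is why declaring messages is recommended for image understanding.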
Quickstart
import asyncio

import redraven


async def my_llm(prompt: str, messages: list[dict] | None = None) -> str:
    # Call your own LLM provider here.
    _ = messages
    return f"echo: {prompt}"


async def main():
    async with redraven.Client() as client:
        handshake = await client.call_agent(
            test_id="<your-existing-test-id>",
            llm=my_llm,
            concurrency=4,
            retries=2,
        )
        await client.wait_for_evaluation_ready(test_id="<your-existing-test-id>")
        result = await client.get_eval_summary(
            test_id="<your-existing-test-id>",
            expected_cases=handshake.expected_cases,
            allow_partial=True,
        )
        print(f"state={result.state} received={result.received} failed={result.failed}")


asyncio.run(main())
Prepare a test first
Create a test in the Redraven app (or via the SDK), then use its test_id with call_agent(...) and get_eval_summary(...).
Generate tests with the SDK
test_id = await client.generate_test(
    generate_kwargs={
        "project_id": "11111111-1111-1111-1111-111111111111",
        "test_name": "SDK generated test",
        "business_context": "Healthcare SaaS for clinicians.",
        "use_case": "Symptom triage assistant.",
        "certifications": ["HIPAA"],
        "max_policies": 5,
        "max_prompts_per_policy": 2,
    },
    wait_for_dataset=True,
)
generate_test(...) returns the created test_id (string).
You can also run an explicit wait step:
await client.wait_for_dataset_ready(test_id)
Run an existing test
handshake = await client.call_agent(
    test_id=test_id,
    llm=my_llm,
    concurrency=8,
)
await client.wait_for_evaluation_ready(test_id=test_id)
result = await client.get_eval_summary(
    test_id=test_id,
    expected_cases=handshake.expected_cases,
)
Alternatively, without a separate wait call:
result = await client.get_eval_summary(
    test_id=test_id,
    expected_cases=handshake.expected_cases,
    wait_for_completion=True,
)
Resuming an interrupted agent run
call_agent is resumable by default: cases that already have a terminal (ok or failed) response on the server are skipped and the user LLM is not invoked for them. Just call it again with the same test_id:
handshake = await client.call_agent(test_id=test_id, llm=my_llm)
Force a full re-run with resume=False:
handshake = await client.call_agent(test_id=test_id, llm=my_llm, resume=False)
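The skip rule can be stated as a small pure function. The sketch below only models the documented semantics (ok and failed are terminal and therefore skipped); it is not the SDK's code, and `cases_to_run` and `server_status` are hypothetical names.

```python
TERMINAL_STATUSES = {"ok", "failed"}


def cases_to_run(case_ids, server_status, *, resume=True):
    """Return the case ids for which the user LLM would still be invoked.

    server_status maps case id -> status already recorded on the server;
    ids missing from the mapping have no response yet.
    """
    if not resume:
        # resume=False re-runs every case regardless of prior responses.
        return list(case_ids)
    return [cid for cid in case_ids if server_status.get(cid) not in TERMINAL_STATUSES]
```

Note the consequence for failed cases: because failed counts as terminal, a plain resume does not retry them, so retrying failures requires resume=False.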
One-call flow (generate + run)
result = await client.generate_and_run_test(
    generate_kwargs={
        "project_id": "11111111-1111-1111-1111-111111111111",
        "test_name": "SDK generated test",
        "business_context": "Healthcare SaaS for clinicians.",
        "use_case": "Symptom triage assistant.",
        "certifications": ["HIPAA"],
        "max_policies": 5,
        "max_prompts_per_policy": 2,
    },
    llm=my_llm,
    concurrency=8,
)
Result fields
After the eval has reached a terminal state, get_eval_summary(...) returns an EvalSummary with:
- state, expected_cases, received, failed, failed_case_ids
- summary (aggregated evaluation output)
- manifest (evaluation trace metadata)
If you call get_eval_summary without waiting and the eval is not ready yet, summary may be None and state may be non-terminal (for example pending).
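Because summary can be None on a non-blocking read, code that consumes the result should check for it before reporting counts. A minimal sketch, assuming only the field names listed in this section; `describe_result` is a hypothetical helper that works on any object exposing those attributes.

```python
def describe_result(result):
    """Return a one-line status string for an EvalSummary-like object."""
    if result.summary is None:
        # Non-blocking read before the eval finished: nothing to report yet.
        return f"state={result.state}: summary not available yet"
    return (
        f"state={result.state} received={result.received}/{result.expected_cases} "
        f"failed={result.failed} failed_case_ids={result.failed_case_ids}"
    )
```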
Common errors
- RedravenConfigError: missing API key, organization ID, or base URL
- RedravenHTTPError: backend returned non-2xx response
- RedravenTimeoutError: raised when wait_for_evaluation_ready(...) or get_eval_summary(..., wait_for_completion=True) does not see a terminal eval state within the timeout
- RedravenPartialRunError: raised when allow_partial=False and the eval summary reports failed cases (only applies once the eval has reached a terminal state)
Resumability
If a run is interrupted, call call_agent(...) again with the same test_id.
Previously submitted cases are safely reused by the backend.
Notes
- The LLM call runs in your process, so your model API key stays local.
- my_llm can be sync or async.
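Accepting both kinds of callable usually comes down to a dispatch like the one below. This is a generic asyncio pattern shown for illustration, not a description of how the SDK does it; `call_maybe_async` is a hypothetical name.

```python
import asyncio
import inspect


async def call_maybe_async(llm, prompt):
    """Await coroutine functions directly; run plain functions in a thread."""
    if inspect.iscoroutinefunction(llm):
        return await llm(prompt)
    # A blocking callable runs in a worker thread so the event loop stays free.
    return await asyncio.to_thread(llm, prompt)
```

If your provider client is blocking (for example a synchronous HTTP call), writing my_llm as a plain function is fine; if it is already async, declare it with async def so it is awaited directly.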