file: ./content/docs/changelog.mdx
meta: {
"title": "Changelog"
}
# Changelog
## Week of 2025-05-19
* Improved playground prompt editor stability and performance.
* Capture cached tokens from OpenAI and Anthropic models in a unified format and surface them in the UI.
* Create experiments from the experiments list page using saved prompts/agents.
* New BTQL sandbox page and editor with autocomplete

* Added a 'Copy page' button to the top of every docs page.
* Brainstore now supports vacuuming data from object storage to reclaim space. If you are self-hosted, please
reach out to Braintrust support to learn more about the feature.
* Organization owners can manage API keys for all users in their organization in the UI.
### JS SDK version 0.0.205
* Make the `_xact_id` field in `origin` optional.
* Added `span.link()` as a synchronous means of generating permalinks.
### Python SDK version 0.1.1
* Update cached token accounting in `wrap_anthropic` to correctly capture cached tokens.
* Pull additional metadata in `braintrust pull` for prompts and functions to improve tracing.
### JS SDK version 0.0.204
* Update cached token accounting in `wrapAnthropic` to correctly capture cached tokens.
## Week of 2025-05-12
* Collapsible sidebar navigation
* Command bar (CMD/CTRL+K) to quickly navigate between pages and projects
* View monitor page logs across all projects in an organization
### SDK (version 0.1.0)
* Allow custom model descriptions in Braintrust.
* Improve support for PDF attachments to multimodal OpenAI models.
The Python library no longer has a dependency on `braintrust_core`.
`braintrust_core` will be deprecated in the near future, but the package will
remain on PyPI. If you wrote code that directly imports from `braintrust_core`, you can
either:
* Change your imports to `from braintrust.score import Score, Scorer` (preferred)
* or, add `braintrust_core` to your project's dependencies.
The TypeScript SDK also includes a small packaging bugfix.
### SDK (version 0.0.203)
* Add support for new reasoning content in OpenAI messages
## Week of 2025-05-05
* Added Mistral Medium 3 and Gemini 2.5 Pro Preview to the AI proxy and playground.
* Self-hosted builds now log in a structured JSON format that is easier to parse.
### SDK (version 0.0.202)
* Gracefully handle experiment summarization failures in Eval()
* Fix a bug where `wrap_openai` broke the `pydantic_ai` `run_stream` function.
* Add tracing to the `client.beta.messages` calls in the TypeScript Anthropic
library.
* Fix some deprecation warnings in the Python SDK.
## Week of 2025-04-28
* Permission groups settings page now allows admins to set group-level permissions
(i.e. which users can read, delete, and add/remove members from a particular group)
* Automations alpha: trigger webhooks based on log events
## Week of 2025-04-21
* Preview attachments in playground input cells.
* The playground now supports list mode, which includes score and metric summaries.
* Handle structured outputs from OpenAI's responses API in the "Try prompt" experience.
### SDK (version 0.0.201)
* Support OpenAI `client.beta.chat.completions.parse` in the Python wrapper.
### SDK (version 0.0.200)
* Ensure the prompt cache properly handles any manner of prompt names.
* Ensure the output of `anthropic.messages.create` is properly traced when called with `stream=True` in an async program.
## Week of 2025-04-14
* Allow users to remove themselves from any organization they are part of using
the `/v1/organization/members` REST endpoint.
* Group monitor page charts by metadata path.
* Download playground contents as CSV.
* Add pending and streaming state indicators to playground cells.
* Distinguish per-row and global playground progress.
* Added GPT-4.1, o4-mini and o3 to the AI proxy and playground.
* On the monitor page, add aggregate values to chart legends.
* Add Gemini 2.5 Flash Preview model to the AI proxy and playground.
* Add support for audio and video inputs for Gemini models in the AI proxy and playground.
* Add support for PDF files for OpenAI models.
* Native tracing support in the proxy has finally arrived! Read more in [the docs](/docs/guides/proxy#tracing).
* Upload attachments directly in the UI in datasets, playgrounds, and prompts (requires a stack update to 0.0.67).
### SDK (version 0.0.199)
* Fix a bug that broke async calls to the Python version of
`anthropic.messages.create`.
* Store detailed metrics from OpenAI's `chat.completion` TypeScript API.
### SDK (version 0.0.198)
* Trace the `openai.responses` endpoint in the TypeScript SDK.
* Store the `token_details` metrics returned by the `openai/responses` API.
## Week of 2025-04-07
* Playground option to append messages from a dataset to the end of a prompt
* A new toggle that lets you skip tracing scoring info for online scoring. This is useful when you are scoring
old logs and don't want to hurt search performance as a result.
* GIF and image support in comments
* Add embedded view and download action for inline attachments of supported file types
### API (version 0.0.65)
* Improve error messages when trying to insert invalid unicode
* Backend support for appending messages.
### SDK (version 0.0.197)
* Fix a bug in `init_function` in the Python SDK which prevented the `input` argument from being passed to the function correctly when it was used as a scorer.
* Support setting `description` and `summarizeScores`/`summarize_scores` in `Eval(...)`.
## Week of 2025-03-31
* Many improvements to the playground experience:
* Fixed many crashes and infinite loading spinner states
* Improved performance across large datasets
* Better support for running single rows for the first time
* Fixed re-ordering prompts
* Fixed adding and removing dataset rows
* You can now re-run specific prompts for individual cells and columns
* You can now do "does not contain" filters for tags in experiments and datasets. Coming soon to logs!
* When you `invoke()` a function, inline base64 payloads will be automatically logged as attachments.
* Add a strict mode to evals and functions which allows you to fail test cases when a variable is not present in a prompt. Without strict mode,
prompts will always render (and sometimes miss variables). With strict mode on, these variables show clearly as errors in the playground and experiments.
* Add Fireworks' DeepSeek V3 03-24 and DeepSeek R1 (Basic), along with Qwen QwQ 32B in Fireworks and Together.ai, to the playground and AI proxy.
* Fix bug that prevented Databricks custom provider form from being submitted without toggling authentication types.
* Unify Vertex AI, Azure, and Databricks custom provider authentication inputs.
* Add Llama 4 Maverick and Llama 4 Scout models to Together.ai, Fireworks, and Groq providers in the playground and AI proxy.
* Add Mistral Saba and Qwen QwQ 32B models to the Groq provider in the playground and AI proxy.
* Add Gemini 2.5 Pro Experimental and Gemini 2.0 Flash Thinking Mode models to the Vertex provider in the playground and AI proxy.
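The strict mode described above fails fast when a template references a variable that isn't supplied. The idea can be sketched with a small mustache-style check (an illustration only, not Braintrust's implementation):

```python
import re

def missing_variables(template: str, variables: dict) -> set:
    """Return mustache-style placeholders with no corresponding variable."""
    referenced = set(re.findall(r"\{\{\s*([\w.]+)\s*\}\}", template))
    return {name for name in referenced if name.split(".")[0] not in variables}

template = "Summarize {{input}} in the style of {{metadata.tone}}."
# Non-strict rendering would silently leave `metadata.tone` unfilled;
# strict mode surfaces it as an error instead.
print(missing_variables(template, {"input": "a news article"}))
```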
### API (version 0.0.64)
* Brainstore is now set as the default storage option
* Improved backfilling performance and overall database load
* Enabled relaxed search mode for ClickHouse to improve query flexibility
* Added strict mode option to prompts that fails when required template arguments are missing
* Enhanced error reporting for missing functions and eval failures
* Fixed streaming errors that previously resulted in missing cells instead of visible error states
* Abort evaluations on server when stopped from playground
* Added support for external bucket attachments
* Improved handling of large base64 images by converting them to attachments
* Fixed proper handling of UTF-8 characters in attachment filenames
* Added the ability to set telemetry URL through admin settings
### SDK (version 0.0.196)
* Add Anthropic tracing to the TypeScript SDK. See `braintrust.wrapAnthropic`.
* The SDK now paginates datasets and experiments, which should improve performance for large datasets and experiments.
* Add `strict` flag to `invoke` which implements the strict mode described above.
* Raise if a Python tool is pushed without defined parameters,
instead of silently not showing the tool in the UI.
* Fix Python OpenAI wrapper to work for older versions of the OpenAI library without `responses`.
* Set `time_to_first_token` correctly from the AI SDK wrapper.
## Week of 2025-03-24
* Add OpenAI's [o1-pro](https://platform.openai.com/docs/models/o1-pro) model to the playground and AI proxy.
* Support OpenAI Responses API in the AI proxy.
* Add support for the Gemini 2.5 Pro Experimental model in the playground and AI proxy.
* Option to disable the experiment comparison auto-select behavior
* Add support for Databricks custom provider as a default cloud provider in the playground and AI proxy.
* Allow supplying a base API URL for Mistral custom providers in the playground and AI proxy.
* Support pushed code bundles larger than 50MB.
### SDK (version 0.0.195)
* Improve the metadata collected by the Anthropic client.
* The Anthropic client can now be wrapped with `braintrust.wrap_anthropic`.
* Fix a bug when `messages.create` was called with `stream=True`.
### SDK (version 0.0.194)
* Add Anthropic tracing to the Python SDK with `wrap_anthropic_client`
* Fix a bug calling `braintrust.permalink` with `NoopSpan`
### SDK (version 0.0.193)
* Fix retry bug when downloading large datasets/experiments from the SDK
* Background logger will load environment variables upon first use rather than
when module is imported.
## Week of 2025-03-17
* The OTEL endpoint now understands structured output calls from the Vercel AI
SDK. Logging via `generateObject` and `streamObject` will populate the schema
in Braintrust, allowing the full prompt to be run.
* Added support for `concat`, `lower`, and `upper` string functions in BTQL.
* Correctly propagate Bedrock streaming errors through the AI proxy and playground.
* Online scoring supports sampling rates with decimal precision.
### SDK (version 0.0.192)
* Improve default retry handler in the python SDK to cover more network-related
exceptions.
### Autoevals (version 0.0.124)
* Added `init` to set a global default client for all evaluators (Python and Node.js).
* Added `client` argument to all evaluators to specify the client to use.
* Improved the Autoevals docs with more examples, and Python reference docs now include moderation, ragas, and other evaluators that were missing from the initial release.
## Week of 2025-03-10
* Added support for OpenAI GPT-4o Search Preview and GPT-4o mini Search Preview
in the playground and AI proxy.
* Add support for making Anthropic and Google-format requests to corresponding models in the AI proxy.
* Fix bug in model provider key modal that prevents submitting a Vertex provider with an empty base URL.
* Add column menu in grid layout with sort and visibility options.
* Enable logging the `origin` field through the REST API
### Autoevals (version 0.0.123)
* Swapped `polyleven` for `levenshtein` for faster string matching.
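For reference, the Levenshtein (edit) distance these libraries compute can be written in a few lines of pure Python; the C-backed packages exist because this is far too slow for hot paths:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```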
### SDK Integrations: LangChain (Python) (version 0.0.2)
* Add a new `braintrust-langchain` integration with an improved `BraintrustCallbackHandler` and `set_global_handler` to set the handler globally for all LangChain components.
### SDK Integrations: LangChain.js (version 0.0.6)
* Small improvement to avoid logging unhelpful LangGraph spans.
* Updated peer dependencies with LangChain core that fixes the global handler for LangGraph runs.
### SDK Integrations: Val Town
* New `val.town` integration with example vals to quickly get started with Braintrust.
### SDK (version 0.0.190)
* Fix `prompt pull` for long prompts.
* Fix a bug in the Python SDK which would not retry requests that were severed after a connection timeout.
### SDK (version 0.0.189)
* Added integration with [OpenAI Agents SDK](/docs/guides/traces/integrations#openai-agents-sdk).
### SDK (version 0.0.188)
* Deprecated `braintrust.wrapper.langchain` in favor of the new `braintrust-langchain` package.
## Week of 2025-03-03
* Add support for "image" pdfs in the AI proxy.
See the [proxy docs](/docs/guides/proxy#pdf-input) for more details.
* Fix issue in which code function executions could hang indefinitely.
* Add support for custom base URLs for Vertex AI providers.
* Add dataset column to experiments table.
* Add Python 3.13 support to user-defined functions.
* Fix bug that prevented calling Python functions from the new unified playground.
### SDK (version 0.0.187)
* Always bundle default python packages when pushing code with `braintrust push`.
* Fix bug in the TypeScript SDK where `asyncFlush` was not correctly defaulted to false.
* Fix a bug where `span_attributes` failed to propagate to child spans through propagated events.
## Week of 2025-02-24
* Add support for removing all permissions for a group/user on an object with a single click.
* Add support for Claude 3.7 Sonnet model.
* Add [llms.txt](/docs/llms.txt) for docs content.
* Enable spellcheck for prompt message editors.
* Add support for Anthropic Claude models in Vertex AI.
* Add support for Claude 3.7 Sonnet in Bedrock and Vertex AI.
* Add support for Perplexity R1 1776, Mistral Saba, Gemini LearnLM, and more Groq models.
* Support system instructions in Gemini models.
* Add support for Gemini 2.0 Flash-Lite, and remove the preview model,
which no longer serves requests.
* Add support for default Bedrock cross-region inference profiles in the playground and AI proxy.
* Move score distribution charts to the experiment sidebar.
* Add support for OpenAI GPT-4.5 model in the playground and AI proxy.
* Add deprecation warning for the `_parent_id` field in the REST API
([docs](/docs/reference/api/Logs#request-body)). This field will be removed in a
future release.
### API (version 0.0.63)
* Support for Claude 3.7 Sonnet, Gemini 2.0 Flash-Lite, and several other models in the proxy.
* Stability and performance improvements for ETL processes.
* A new `/status` endpoint to check the health of Braintrust services.
### SDK (version 0.0.187)
* Added support for handling score values when an Eval has errored.
## Week of 2025-02-17
* Add support for stop sequences in Anthropic, Bedrock, and Google models.
* Resolve JSON Schema references when translating structured outputs
to Gemini format.
* Add button to copy table cell contents to clipboard.
* Add support for basic Cache-Control headers in the AI proxy.
* Add support for selecting all or none in the categories of permission dialogs.
* Respect Bedrock providers not supporting streaming in the AI proxy.
### SDK (version 0.0.187)
* Improve support for binary packages in `npx braintrust eval`.
* Support templated structured outputs.
* Fix dataset summary types in TypeScript.
## Week of 2025-02-10
* Store table grouping, row height, and layout options in the view configuration.
* Add the ability to set a default table view.
* Add support for Google Cloud Vertex AI in the playground and proxy.
Google Cloud auth is supported for principals and service accounts
via either OAuth 2.0 token or service account key.
* Add default cloud providers section to the organization AI providers page.
* Support streaming responses from OpenAI o1 models in the playground and AI proxy.
## Week of 2025-02-03
* Add complete support for Bedrock models in the playground and AI proxy;
this includes support for system prompts, tool calls, and multimodal inputs.
* Fix model provider configuration issues in which custom models could clobber
default models, and different providers of the same type could clobber each other.
* Fix bug in streaming JSON responses from non-OpenAI providers.
* Support templated structured outputs in experiments run from the playground.
* Support structured outputs in the playground and AI proxy for Anthropic models, Bedrock models,
and any OpenAI-flavored models that support tool calls, e.g. Llama on Together.ai.
* Support templated custom headers for custom AI providers.
See the [proxy docs](/docs/guides/proxy#custom-models) for more details.
* Added and updated models across all providers in the playground and AI proxy.
* Support tool usage and structured outputs for Gemini models in the playground and AI proxy.
* Simplify playground model dropdown by showing model variations in a nested dropdown.
## Week of 2025-01-27
* Add support for duplicating prompts, scorers, and tools.
* Fix pagination for the `/v1/prompt` REST API endpoint.
* "Unreviewed" default view on experiment and logs tables to filter out rows that have been human reviewed.
* Add o3-mini to the AI proxy and playground.
* Scorer dropdown now supports using custom scoring functions across projects.
### SDK Integrations: LangChain.js (version 0.0.5)
* Less noisy logging from the LangChain.js integration.
* You can now pass a `NOOP_SPAN` to the `BraintrustCallbackHandler` to disable logging.
* Fixes a bug where the LangChain.js integration could not handle null/undefined values in chain inputs/outputs.
### SDK (version 0.0.184)
* `span.export()` will no longer throw if braintrust is down
* Improvement to the Python prompt rendering to correctly render formatted messages, LLM tool calls, and other structured outputs.
## Week of 2025-01-20
* Drag and drop to reorder span fields in experiment/log traces and dataset rows. On wider screens,
fields can also be arranged side-by-side.
* Small convenience improvement to the BTQL sandbox to avoid having to include `filter:` in an advanced filter clause.
* Add an attachments browser to view all attachments for a span in a sidebar. To open the attachments browser, expand the
trace and click the arrow icon in the attachments section. It will only be visible when the trace panel is wide enough.

### SDK (version 0.0.183)
* Fix a bug related to `initDataset()` in the TypeScript SDK creating links in `Eval()` calls.
* Fix a few type checking issues in the Python SDK.
## Week of 2025-01-13
* Add support for setting a baseline experiment for experiment comparisons. If a baseline experiment is set, it will be chosen by default as the comparison when clicking on an experiment.
* UI updates to experiment and log tables.
* Trace audit log now displays granular changes to span data.
* Start/end columns shown as dates/times.
* Non-existent trace records display an error message instead of loading indefinitely.
### SDK Integrations: LangChain.js (version 0.0.4)
* Support logging spans from inside evals in the LangChain.js integration.
### SDK (version 0.0.182)
* Improved logging for moderation models from the SDK wrappers.
## Week of 2025-01-06
* Creating an experiment from a playground now correctly renders prompts with `input`, `metadata`, `expected`, and `output` mapped fields.
* Fix a small bug where `input.output` data could pollute the dataset's `output` when rendering the prompts.
* The [AI proxy](/docs/guides/proxy) now includes `x-bt-used-endpoint` as a response header. It specifies which of your configured AI providers was used to complete the request.
* Add support for deeplinking to comments within spans, allowing users to easily copy and share links to comments.
* In Human Review mode, display all scores in a form.
* Experiment table rows can now be sorted based on score changes and regressions for each group, relative to a selected comparison experiment.
* The OTEL endpoint now converts attributes under the `braintrust` namespace directly to the corresponding Braintrust fields. For example, `braintrust.input` will appear as `input` in Braintrust. See the [tracing guide](/docs/guides/tracing/integrations#manual-tracing) for more details.
* New OTEL attributes that accept JSON-serialized values have been added for convenience:
* `gen_ai.prompt_json`
* `gen_ai.completion_json`
* `braintrust.input_json`
* `braintrust.output_json`
For more details, see the [tracing guide](/docs/guides/tracing/integrations#manual-tracing).
* Experiment tables and individual traces now support comparing trial data between experiments.
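The JSON attributes listed above carry serialized payloads; a span attribute map using them might look like this (the attribute names come from the list above, while the span object and values are hypothetical):

```python
import json

# Hypothetical input/output for one LLM call.
attributes = {
    "braintrust.input_json": json.dumps({"question": "What is the capital of France?"}),
    "braintrust.output_json": json.dumps({"answer": "Paris"}),
}

# With any OpenTelemetry-compatible span object:
# span.set_attributes(attributes)
print(attributes["braintrust.output_json"])
```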
### SDK (version 0.0.181)
* Add `ReadonlyAttachment.metadata` helper method to fetch a signed URL for
downloading the attachment metadata.
### SDK (version 0.0.179)
* New `hook.expected` for reading and updating expected values in the Eval framework.
* Small type improvements for `hook` objects.
* Fixed a bug to enable support for `init_function` with LLM scorers in Python.
* Support nested attachments in Python.
## Week of 2024-12-30
* Add support for free-form human review scores (written to the `metadata` field).
### SDK (version 0.0.179) (unreleased)
* Add support for imports in Python functions pushed to Braintrust via `braintrust push`.
### SDK (version 0.0.178)
* Cache prompts locally in a two-layered memory/disk cache,
and attempt to use this cache if the prompt cannot be fetched from the Braintrust server.
* Support for using custom functions that are stored in Braintrust in evals.
See the [docs](/docs/guides/evals/write#using-custom-promptsfunctions-from-braintrust) for more details.
* Add support for running traced functions in a `ThreadPoolExecutor`
in the Python SDK. See the [customize traces guide](/docs/guides/traces/customize)
for more information.
* Improved formatting of spans logged from the Vercel AI SDK's `generateObject` method.
The logged output now matches the format of OpenAI's structured outputs.
* Default to `asyncFlush: true` in the TypeScript SDK.
This is usually safe since Vercel and Cloudflare both have `waitUntil`,
and async flushes mean that clients will not be blocked if Braintrust is down.
### SDK integrations: LangChain.js (version 0.0.2)
* Add support for initializing global LangChain callback handler to avoid manually passing the handler to each LangChain object.
## Week of 2024-12-16
### API (version 0.0.61)
* Upgraded to Node.js 22 in Docker containers.
### SDK (version 0.0.177)
* Support for creating and pushing custom scorers from your codebase
with `braintrust push`.
Read the guides to [scorers](/docs/guides/functions/scorers)
for more information.
## Week of 2024-12-09
* Add support for structured outputs in the playground.

* Sparkline charts added to the project home page.
* Better handling of missing data points in monitor charts.
* Clicking on monitor charts now opens a link to traces filtered to the selected time range.
* Add `Endpoint supports streaming` flag to custom provider configuration. The [AI proxy](/docs/guides/proxy) will convert non-streaming endpoints to streaming format, allowing the provider's models to be used in the playground.
* Experiments chart can be resized vertically by dragging the bottom of the chart.
* BTQL sandbox to explore project data using [Braintrust Query Language](/docs/reference/btql).
* Add support for updating span data from custom span iframes.
### Autoevals (version 0.0.110)
* Python Autoevals now support custom clients when calling evaluators. See [docs](https://pypi.org/project/autoevals/) for more details.
### SDK (version 0.0.176)
* New `hook.metadata` for reading and updating Eval metadata when using the `Eval` framework. Previous `hook.meta` is now deprecated.
### SDK integrations: LangChain.js (version 0.0.1)
* New LangChain.js integration to export traces from `langchainjs` runs.
## Week of 2024-12-02
* Significantly speed up loading performance for experiments and logs, especially with lots of spans.
This speed up comes with a few changes in behavior:
* Searches inside experiments will only work over content in the tabular view, rather than over the full trace.
* While searching on the logs page, realtime updates are disabled.
* Starring rows in experiment and dataset tables now supported.
* "Order by regression" option in experiment column menu can now be toggled on and off without losing previous order.
* Add expanded timeline view for traces.
* Added a 'Request count' chart to the monitor page.
* Add headers to custom provider configuration which the [AI proxy](/docs/guides/proxy) will include in the request to the custom endpoint.
* The logs viewer now supports exporting the currently loaded rows as a CSV or JSON file.
### API (version 0.0.60)
* Make `PG_URL` configuration more uniform between the Node.js and Python clients.
### SDK (version 0.0.175)
* Fix bug with serializing ReadonlyAttachment in logs
## Week of 2024-11-25
* Experiment columns can now be reordered from the column menu.
* You can now customize legends in monitor charts. Select a legend item to highlight its data, Shift (⇧) + Click to select multiple items, or Command (⌘) / Ctrl (⌃) + Click to deselect.
### SDK (version 0.0.174)
* AI SDK fixes: support for image URLs and properly formatted tool calls so "Try prompt" works in the UI.
### SDK (version 0.0.173)
* Attachments can now be loaded when iterating an experiment or dataset.
### SDK (version 0.0.172)
* Fix a bug where `braintrust eval` did not respect certain configuration options, like `base_experiment_id`.
* Fix a bug where `invoke` in the Python SDK did not properly stream responses.
## Week of 2024-11-18
* The Traceloop OTEL integration now uses the input and output attributes to populate the corresponding fields in Braintrust.
* The monitor page now supports querying experiment metrics.
* Removed the `filters` param from the REST API fetch endpoint. For complex
queries, we recommend using the `/btql` endpoint ([docs](/docs/reference/btql)).
* New experiment summary layout option, a url-friendly view for experiment summaries that respects all filters.
* Add a default limit of 10 to all fetch and `/btql` requests for `project_logs`.
* You can now export your prompts from the playground as code snippets and run them through the [AI proxy](/docs/guides/proxy).
* Add a fallback for the "add prompt" dropdown button in the playground, which
will search for prompts within the current project if the cross-org prompts
query fails.
### SDK (version 0.0.171)
* Add a `.data` method to the `Attachment` class, which lets you inspect the
loaded attachment data.
## Week of 2024-11-12
* Support for creating and pushing custom Python tools and prompts from your codebase with `braintrust push`. Read the guides to [tools](/docs/guides/functions/tools) and [prompts](/docs/guides/functions/prompts) for more information.
* You can now view grouped summary data for all experiments by selecting **Include comparisons in group** from the **Group by** dropdown inside an experiment.
* The experiments page now supports downloading as CSV/JSON.
* Downloading or duplicating a dataset in the UI now properly copies all dataset rows.
* You can now view score data as a bar chart for your experiment data by selecting **Score comparison** from the X axis selector.
* Trials information is now shown as a separate column in diff mode in the experiment table.
* Cmd/Ctrl + S hotkey to save from prompts in the playground and function dialogs.
### SDK (version 0.0.170)
* Support uploading [file attachments in the Python SDK](/docs/reference/libs/python#attachment-objects).
* Log, feedback, and dataset inputs to the Python SDK are now synchronously deep-copied for more consistent logging.
### SDK (version 0.0.169)
* The Python SDK `Eval()` function has been split into `Eval()` and `EvalAsync()` to make it clear which one should be called in an asynchronous context. The behavior of `Eval()` remains unchanged. However, `Eval()` callers running in an asynchronous context are strongly recommended to switch to `EvalAsync()` to improve type safety.
* Improved type annotations in the Python SDK.
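The `Eval()`/`EvalAsync()` split mirrors the standard sync/async entry-point pattern in Python; a sketch with hypothetical names, not the SDK's signatures:

```python
import asyncio

async def _run_eval(data):
    # ... score each case; trivially doubled here for illustration
    return [x * 2 for x in data]

def eval_sync(data):
    """For synchronous callers: owns its own event loop."""
    return asyncio.run(_run_eval(data))

async def eval_async(data):
    """For callers already inside an event loop (e.g. notebooks, servers)."""
    return await _run_eval(data)

print(eval_sync([1, 2, 3]))  # → [2, 4, 6]
```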
### SDK (version 0.0.168)
* A new `Span.permalink()` method allows you to format a permalink for the current span. See [TypeScript docs](/docs/reference/libs/nodejs/interfaces/Span#permalink) or [Python docs](/docs/reference/libs/python#permalink) for details.
* `braintrust push` support for Python tools and prompts.
## Week of 2024-11-04
* The Braintrust [AI Proxy](/docs/guides/proxy) now supports the [OpenAI Realtime API](https://platform.openai.com/docs/guides/realtime), providing observability for voice-to-voice model sessions and simplifying backend infrastructure.
* Add "Group by" functionality to the monitor page.
* The experiment table can now be visualized in a [grid layout](/docs/guides/evals/interpret#grid-layout), where each column represents an experiment to compare long-form outputs side-by-side.
* 'Select all' button in permission dialogs
* Create custom columns on dataset, experiment and logs tables from `JSON` values in `input`, `output`, `expected`, or `metadata` fields.
### API (version 0.0.59)
* Fix permissions bug with updating org-scoped env vars
## Week of 2024-10-28
* The Braintrust [AI Proxy](/docs/guides/proxy) can now [issue temporary credentials](/docs/guides/proxy#api-key-management) to access the proxy for a limited time. This can be used to make AI requests directly from frontends and mobile apps, minimizing latency without exposing your API keys.
* Move experiment score summaries to the table column headers. To view improvements and regressions per metadata or input group, first group the table by the relevant field. Sooo much room for \[table] activities!
* You now receive a clear error message if you run out of free tier capacity while running an experiment from the playground.
* Filters on JSON fields now support array indexing, e.g. `metadata.foo[0] = 'bar'`. See [docs](/docs/reference/btql#Expressions).
### SDK (version 0.0.168)
* `initDataset()`/`init_dataset()` used in `Eval()` now tracks the dataset ID and links to each row in the dataset properly.
## Week of 2024-10-21
* Preview [file attachments](/docs/guides/tracing#uploading-attachments) in the trace view.
* View and filter by comments in the experiment table.
* Add table row numbers to experiments, logs, and datasets.
### SDK (version 0.0.167)
* Support uploading [file attachments in the TypeScript SDK](/docs/reference/libs/nodejs/classes/Attachment).
* Log, feedback, and dataset inputs to the TypeScript SDK are now synchronously deep-copied for more consistent logging.
* Address an issue where the TypeScript SDK could not make connections when running in a Cloudflare Worker.
### API (version 0.0.59)
* Support uploading [file attachments](/docs/reference/libs/nodejs/classes/Attachment).
* You can now export [OpenTelemetry (OTel)](https://opentelemetry.io/docs/specs/otel/) traces to Braintrust. See
the [tracing guide](/docs/guides/tracing/integrations#opentelemetry-otel) for more details.
## Week of 2024-10-14
* The Monitor page now shows an aggregate view of log scores over time.
* Improvement/Regression filters between experiments are now saved to the URL.
* Add `max_concurrency` and `trial_count` to the playground when kicking off evals. `max_concurrency` is useful to
avoid hitting LLM rate limits, and `trial_count` is useful for evaluating applications that have
non-deterministic behavior.
* Show a button to scroll to a single search result in a span field when using trace search.
* Indicate spans with errors in the trace span list.
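Capping in-flight requests the way `max_concurrency` does is typically implemented with a semaphore; a generic sketch, not the playground's code:

```python
import asyncio

async def run_with_limit(tasks, max_concurrency: int):
    """Run coroutine factories with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(t) for t in tasks))

async def fake_llm_call(i):
    await asyncio.sleep(0.01)  # stand-in for a rate-limited model request
    return i * i

results = asyncio.run(
    run_with_limit([lambda i=i: fake_llm_call(i) for i in range(5)], max_concurrency=2)
)
print(results)  # → [0, 1, 4, 9, 16]
```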
### SDK (version 0.0.166)
* Allow explicitly specifying git metadata info in the Eval framework.
### SDK (version 0.0.165)
* Support specifying dataset-level metadata in `initDataset/init_dataset`.
### SDK (version 0.0.164)
* Add `braintrust.permalink` function to create deep links pointing to
particular spans in the Braintrust UI.
## Week of 2024-10-07
* After using "Copy to Dataset" to create a new dataset row, the audit log of the new row now links back to the original experiment, log, or other dataset.
* Tools now stream their `stdout` and `stderr` to the UI. This is helpful for debugging.
* Fix prompt, scorer, and tool dropdowns to only show the correct function types.
### SDK (version 0.0.163)
* Fix Python SDK compatibility with Python 3.8.
### SDK (version 0.0.162)
* Fix Python SDK compatibility with Python 3.9 and older.
### SDK (version 0.0.161)
* Add utility function `spanComponentsToObjectId` for resolving the object ID
from an exported span slug.
## Week of 2024-09-30
* The [GitHub Action](/docs/guides/evals/run#github-action) now supports Python runtimes.
* Add support for [Cerebras](https://cerebras.ai/) models in the proxy, playground, and saved prompts.
* You can now create [span iframe viewers](/docs/guides/tracing#custom-span-iframes) to visualize span data in a custom iframe.
In this example, the "Table" section is a custom span iframe.

* `NOT LIKE`, `NOT ILIKE`, `NOT INCLUDES`, and `NOT CONTAINS` are now supported in BTQL.
* Add "Upload Rows" button to insert rows into an existing dataset from CSV or JSON.
* Add "Maximum" aggregate score type.
* The experiment table now supports grouping by input (for trials) or by a metadata field.
* The Name and Input columns are now pinned
* Gemini models now support multimodal inputs.
## Week of 2024-09-23
* Basic monitor page that shows aggregate values for latency, token count, time to first token, and cost for logs.
* Create custom tools to use in your prompts and in the playground. See the [docs](/docs/guides/prompts#calling-external-tools) for more details.
* Set org-wide environment variables to use in these tools
* Pull your prompts to your codebase using the `braintrust pull` command.
* Select and compare multiple experiments in the experiment view using the `compared with` dropdown.
* The playground now displays aggregate scores (avg/max/min) for each prompt and supports sorting rows by a score.
* Compare span field values side-by-side in the trace viewer when fullscreen and diff mode is enabled.
### SDK (version 0.0.160)
* Fix a bug with `setFetch()` in the TypeScript SDK.
### SDK (version 0.0.159)
* In Python, running the CLI with `--verbose` now uses the `INFO` log level, while still printing full stack traces. Pass the flag twice (`-vv`) to use the `DEBUG` log level.
* Create and push custom tools from your codebase with `braintrust push`. See [docs](/docs/guides/prompts#calling-external-tools) for more details. TypeScript only for now.
* A long-awaited feature: you can now pull prompts to your codebase using the `braintrust pull` command. TypeScript only for now.
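The `--verbose`/`-vv` behavior maps the number of flags to a log level. A minimal sketch of that mapping (`level_for_verbosity` is a hypothetical helper, not the CLI's actual code):

```python
import logging

def level_for_verbosity(count: int) -> int:
    # No flags: default WARNING; -v (or --verbose): INFO; -vv or more: DEBUG.
    if count >= 2:
        return logging.DEBUG
    if count == 1:
        return logging.INFO
    return logging.WARNING
```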
### API (version 0.0.56)
* Hosted tools are now available in the API.
* Environment variables are now supported in the API (not yet in the standard REST API). See the [docker compose file](https://github.com/braintrustdata/braintrust-deployment/blob/main/docker/docker-compose.api.yml#L65)
for information on how to configure the secret used to encrypt them if you are using Docker.
* Automatically backfill `function_data` for prompts created via the API.
## Week of 2024-09-16
* The tag picker now includes tags that were added dynamically via API, in addition to the tags configured for your project.
* Added a REST API for managing AI secrets. See [docs](/docs/reference/api/AiSecrets).
### SDK (version 0.0.158)
* A dedicated `update` method is now available for datasets.
* Fixed a Python-specific error causing experiments to fail to initialize when `git diff --cached` encounters invalid or inaccessible Git repositories.
* Token counts have the correct units when printing `ExperimentSummary` objects.
* In Python, `MetricSummary.metric` could have an `int` value. The type annotation has been updated.
## Week of 2024-09-09
* You can now create server-side online evaluations for your logs. Online evals support both [autoevals](/docs/reference/autoevals) and
[custom scorers](/docs/guides/playground) you define as LLM-as-a-judge, TypeScript, or Python functions. See
[docs](/docs/guides/evals/write#online-evaluation) for more details.
* New member invitations now support being added to multiple permission groups.
* Move datasets and prompts to a new Library navigation tab, and include a list of custom scorers.
* Clean up tree view by truncating the root preview and showing a preview of a node only if collapsed.

* Automatically save changes to table views.
## Week of 2024-09-02
* You can now upload TypeScript evals from the command line as functions, and then use them in the playground.
* Click a span field line to highlight it and pin it to the URL.
* Copilot tab autocomplete for prompts and data in the Braintrust UI.
```bash
# This will bundle and upload the task and scorer functions to Braintrust
npx braintrust eval --bundle
```
### API (version 0.0.54)
* Support for bundled eval uploads.
* The `PATCH` endpoint for prompts now supports updating the `slug` field.
### SDK (version 0.0.157)
* Enable the `--bundle` flag for `braintrust eval` in the TypeScript SDK.
## Week of 2024-08-26
* Basic filter UI (no BTQL necessary)
* Add to dataset dropdown now supports adding to datasets across projects.
* Add REST endpoint for batch-updating ACLs: `/v1/acl/batch_update`.
* Cmd/Ctrl click on a table row to open it in a new tab.
* Show the last 5 basic filters in the filter editor.
* You can now explicitly set and edit prompt slugs.
### Autoevals (version 0.0.86)
* Add support for Azure OpenAI in Node.js.
### SDK (version 0.0.155)
* The client wrappers `wrapOpenAI()`/`wrap_openai()` now support [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs).
### API (version 0.0.54)
* Don't fail insertion requests if realtime broadcast fails
## Week of 2024-08-19
* Fixed comment deletion.
* You can now use `%` in BTQL queries to represent percent values. E.g. `50%` will be interpreted as `0.5`.
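The percent-literal semantics can be sketched as follows (`parse_number` is a hypothetical helper, not the actual BTQL parser):

```python
def parse_number(token: str) -> float:
    # A trailing % divides the value by 100, so "50%" means 0.5.
    if token.endswith("%"):
        return float(token[:-1]) / 100
    return float(token)
```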
### API (version 0.0.54)
* Performance optimizations to filters on `scores`, `metrics`, and `created` fields.
* Performance optimizations to filter subfields of `metadata` and `span_attributes`.
## Week of 2024-08-12
* You can now create custom LLM and code (TypeScript and Python) evaluators in the playground.
* Fullscreen trace toggle
* Datasets now accept JSON file uploads
* When uploading a CSV/JSON file to a dataset, columns/fields named `input`, `expected`, and `metadata`
are now auto-assigned to the corresponding dataset fields
* Fix bug in logs/dataset viewer when changing the search params.
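The CSV/JSON column auto-assignment behaves roughly like this sketch (`assign_row` is an illustrative helper; the actual upload logic may differ):

```python
DATASET_FIELDS = ("input", "expected", "metadata")

def assign_row(row: dict) -> dict:
    # Columns/fields named input, expected, or metadata map directly to
    # the corresponding dataset fields; other columns are left alone.
    return {k: row[k] for k in DATASET_FIELDS if k in row}
```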
### API (version 0.0.53)
* The API now supports running custom LLM and code (TypeScript and Python) functions. To enable this in the:
* AWS Cloudformation stack: turn on the `EnableQuarantine` parameter
* Docker deployment: set the `ALLOW_CODE_FUNCTION_EXECUTION` environment variable to `true`
## Week of 2024-08-05
* Full text search UI for all span contents in a trace
* New metrics in the UI and summary API: prompt tokens, completion tokens, total tokens, and LLM duration
* These metrics, along with cost, now exclude LLM calls used in autoevals (as of 0.0.85)
* Switching organizations via the header navigates to the same-named project in the selected organization
* Added `MarkAsyncWrapper` to the Python SDK to allow explicitly marking
functions which return awaitable objects as async
### Autoevals (version 0.0.85)
* LLM calls used in autoevals are now marked with `span_attributes.purpose = "scorer"` so they can be excluded from
metric and cost calculations.
### Autoevals (version 0.0.84)
* Fix a bug where `rationale` was incorrectly formatted in Python.
* Update the `full` docker deployment configuration to bundle the metadata DB
(supabase) inside the main docker compose file. Thus no separate supabase
cluster is required. See
[docs](/docs/guides/self-hosting/docker#full-configuration) for details. If
you are upgrading an existing full deployment, you will likely want to mark
the supabase db volumes `external` to continue using your existing data (see
comments in the `docker-compose.full.yml` file for more details).
### SDK (version 0.0.151)
* `Eval()` can now take a base experiment. Provide either `baseExperimentName`/`base_experiment_name` or
`baseExperimentId`/`base_experiment_id`.
## Week of 2024-07-29
* Errors now show up in the trace viewer.
* New cookbook recipe on [benchmarking LLM providers](/docs/cookbook/recipes/ProviderBenchmark).
* Viewer mode selections no longer automatically switch to a non-editable view when the field is editable, and they now persist across trace/span changes.
* Show `%` in diffs instead of `pp`.
* Add rename, delete and copy current project id actions to the project dropdown.
* Playgrounds can now be shared publicly.
* Duration now reflects the "task" duration, not the overall test case duration (which also includes scores).
* Duration is now also displayed in the experiment overview table.
* Add support for Fireworks and Lepton inference providers.
* "Jump to" menu to quickly navigate between span sections.
* Speed up queries involving metadata fields, e.g. `metadata.foo ILIKE '%bar%'`, using the columnstore backend if it is available.
* Added `project_id` query param to REST API queries which already accept
`project_name`. E.g. [GET
experiments](/docs/reference/api/Experiments#list-experiments).
* Update to include the latest Mistral models in the proxy/playground.
### SDK (version 0.0.148)
* While tracing, if your code errors, the error will be logged to the span. You can also manually log the `error` field through the API
or the logging SDK.
### SDK (version 0.0.147)
* `project_name` is now `projectName`, etc. in the `invoke(...)` function in TypeScript
* `Eval()` return values are printed in a nicer format (e.g. in Notebooks)
* [`updateSpan()`/`update_span()`](/docs/guides/tracing#updating-spans) allows you to update a span's fields after it has been created.
## Week of 2024-07-22
* Categorical human review scores can now be re-ordered via drag-and-drop.

* Human review row selection is now a free text field, enabling a quick jump to a specific row.

* Added REST endpoint for managing org membership. See
[docs](/docs/reference/api/Organizations#modify-organization-membership).
### API (version 0.0.51)
* The proxy is now a first-class citizen in the API service, which simplifies deployment and sets the groundwork for some
exciting new features. Here is what you need to know:
* The updates are available as of API version 0.0.51.
* The proxy is now accessible at `https://api.braintrust.dev/v1/proxy`. You can use this as a base URL in your OpenAI client,
instead of `https://braintrustproxy.com/v1`. \[NOTE: The latter is still supported, but will be deprecated in the future.]
* If you are self-hosting, the proxy is now bundled into the API service. That means you no longer need to deploy the proxy as
a separate service.
* If you have deployed through AWS, after updating the Cloudformation, you'll need to grab the "Universal API URL" from the
"Outputs" tab.

* Then, replace that in your settings page.

* If you have a Docker-based deployment, you can just update your containers.
* Once you see the "Universal API" indicator, you can remove the proxy URL from your settings page, if you have it set.
### SDK (version 0.0.146)
* Add support for `max_concurrency` in the Python SDK
* Hill climbing evals that use a `BaseExperiment` as data will use that as the default base experiment.
## Week of 2024-07-15
* In preparation for auth changes, we are making a series of updates that may affect self-deployed instances:
* Preview URLs will now be subdomains of `*.preview.braintrust.dev` instead of `vercel.app`. Please add this domain to your
allow list.
* To continue viewing preview URLs, you will need to update your stack (to update the allow list to include the new domain pattern).
* The data plane may make requests back to `*.preview.braintrust.dev` URLs. This allows you to test previews that include control plane
changes. You may need to whitelist traffic from the data plane to `*.preview.braintrust.dev` domains.
* Requests will optionally send an additional `x-bt-auth-token` header. You may need to whitelist this header.
* User impersonation through the `x-bt-impersonate-user` header now accepts
either the user's id or email. Previously only user id was accepted.
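Conceptually, the header value is resolved like this (`resolve_user` and the in-memory user list are illustrative, not server code):

```python
def resolve_user(header_value, users):
    # The x-bt-impersonate-user header may carry either a user id or an email.
    key = "email" if "@" in header_value else "id"
    return next((u for u in users if u[key] == header_value), None)
```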
### Autoevals (version 0.0.80)
* New `ExactMatch` scorer for comparing two values for exact equality.
### Autoevals (version 0.0.77)
* Officially switch the default model to be `gpt-4o`. Our testing showed that it performed on average 10% more accurately than `gpt-3.5-turbo`!
* Support Claude models (e.g. `claude-3-5-sonnet-20240620`). You can use them by simply specifying the `model` param in any LLM based evaluator.
* Under the hood, this will use the proxy, so make sure to configure your Anthropic API keys in your settings.
## Week of 2024-07-08
* Human review scores are now sortable from the project configuration page.

* Streaming support for tool calls in Anthropic models through the proxy and playground.
* The playground now supports different "parsing" modes:
* `auto`: (same as before) the completion text and the first tool call arguments, if any
* `parallel`: the completion text and a list of all tool calls
* `raw`: the completion in the OpenAI non-streaming format
* `raw_stream`: the completion in the OpenAI streaming format
* Cleaned up environment variables in the public [docker
deployment](https://github.com/braintrustdata/braintrust-deployment/tree/main/docker). Functionally, nothing has changed.
### Autoevals (version 0.0.76)
* New `.partial(...)` syntax to initialize a scorer with partial arguments like `criteria` in `ClosedQA`.
* Allow messages to be inserted in the middle of a prompt.
## Week of 2024-07-01
* Table views [can now be saved](/docs/reference/views), persisting the BTQL filters, sorts, and column state.
* Add support for the new `window.ai` model in the playground.

* Use push history when navigating table rows to allow for back button navigation.
* In the experiments list, grouping by a metadata field will group rows in the table as well.
* Allow the trace tree panel to be resized.
* Port the log summary query to BTQL. This should speed up the query, especially
if you have Clickhouse configured in your cloud environment. This
functionality requires upgrading your data backend to version 0.0.50.
### SDK (version 0.0.140)
* New `wrapTraced` function allows you to trace javascript functions in a more ergonomic way.
```typescript #skip-compile
import { wrapTraced } from "braintrust";

const foo = wrapTraced(async function foo(input) {
  const resp = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: input }],
  });
  return resp.choices[0].message.content ?? "unknown";
});
```
### SDK (version 0.0.138)
* The TypeScript SDK's `Eval()` function now takes a `maxConcurrency` parameter, which bounds the
number of concurrent tasks that run.
* `braintrust install api` now sets up your API and Proxy URL in your environment.
* You can now specify a custom `fetch` implementation in the TypeScript SDK.
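The `maxConcurrency` bound behaves like a semaphore over the eval tasks. A self-contained sketch of the idea (not the SDK's implementation; `run_bounded` is a hypothetical helper):

```python
import asyncio

async def run_bounded(task_factories, max_concurrency):
    # Run async task factories, allowing at most max_concurrency in flight.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(factory):
        async with sem:
            return await factory()

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(f) for f in task_factories))
```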
## Week of 2024-06-24
* Update the experiment progress and experiment score distribution chart layouts
* Format table column headers with icons
* Move active filters to the table toolbar
* Enable RBAC for all users. When inviting a new member, prompt to add that member to an RBAC Permission group.
* Use BTQL to power the datasets list, making it significantly faster if you have multiple large datasets.
* Experiments list chart supports click interactions. Left click to select an experiment, right click to add an annotation.
* Jump into comparison view between 2 experiments by selecting them in the table and clicking "Compare"
### Deployment
* The proxy service now supports more advanced functionality which requires setting the `PG_URL` and `REDIS_URL` parameters. If you do not
set them, the proxy will still run without caching credentials or requests.
## Week of 2024-06-17
* Add support for labeling [expected fields using human review](/docs/guides/human-review#writing-categorical-scores-to-expected-field).
* Create and edit descriptions for datasets.
* Create and edit metadata for prompts.
* Click scores and attributes (tree view only) in the trace view to filter by them.
* Highlight the experiments graph to filter down the set of experiments.
* Add support for new models including Claude 3.5 Sonnet.
## Week of 2024-06-10
* Improved empty state and instructions for custom evaluators in the playground.
* Show query examples when filtering/sorting.
* [Custom comparison keys](/docs/guides/evals/interpret#customizing-the-comparison-key) for experiments.
* New model dropdown in the playground/prompt editor that is organized by provider and model type.
## Week of 2024-06-03
* You can now collapse the trace tree. It's auto collapsed if you have a single span.

* Improvements to the experiment chart including greyed out lines for inactive scores and improved legend.
* Show diffs when you save a new prompt version.

## Week of 2024-05-27
* You can now see which users are viewing the same traces as you are in real-time.
* Improve whitespace and presentation of diffs in the trace view.
* Show markdown previews in score editor.
* Show cost in spans and display the average cost on experiment summaries and diff views.
* Published a new [Text2SQL eval recipe](/docs/cookbook/recipes/Text2SQL-Data)
* Add groups view for RBAC.
## Week of 2024-05-20
* Deprecate the legacy dataset format (`output` in place of `expected`) in a new version of the SDK (0.0.130). For now, data can still be fetched in the legacy format by setting the `useOutput` / `use_output` flag to false when using `initDataset()` / `init_dataset()`. We recommend updating your code to use datasets with `expected` instead of `output` as soon as possible.
* Improve the UX for saving and updating prompts from the playground.
* New hide/show column controls on all tables.
* New [model comparison](/docs/cookbook/recipes/ModelComparison) cookbook recipe.
* Add support for model / metadata comparison on the experiments view.
* New experiment picker dropdown.
* Markdown support in the LLM message viewer.
## Week of 2024-05-13
* Support copying to clipboard from `input`, `output`, etc. views
* Improve the empty-state experience for datasets.
* New multi-dimensional charts on the experiment page for comparing models and model parameters.
* Support `HTTPS_PROXY`, `HTTP_PROXY`, and `NO_PROXY` environment variables in the API containers.
* Support infinite scroll in the logs viewer and remove dataset size limitations.
## Week of 2024-05-06
* Denser trace view with span durations built in.
* Rework pagination and fix scrolling across multiple pages in the logs viewer.
* Make BTQL the default search method.
* Add support for Bedrock models in the playground and the proxy.
* Add "copy code" buttons throughout the docs.
* Automatically overflow large objects (e.g. experiments) to S3 for faster loading and better performance.
## Week of 2024-04-29
* Show images in the LLM view of the trace viewer.

* Send an invite email when you invite a new user to your organization.
* Support selecting/deselecting scores in the experiment view.
* Roll out [Braintrust Query Language](/docs/reference/btql) (BTQL) for querying logs and traces.
## Week of 2024-04-22
* Smart relative time labels for dates (`1h ago`, `3d ago`, etc.)
* Added support for double-quoted string literals, e.g., `tags contains "foo"`.
* Jump to top button in trace details for easier navigation.
* Fix a race condition in distributed tracing, in which subspans could hit the
backend before their parent span, resulting in an inaccurate trace structure.
As part of this change, we removed the `parent_id` argument from the latest SDK,
which was previously deprecated in favor of `parent`. `parent_id` is only able
to use the race-condition-prone form of distributed tracing, so we felt it would
be best for folks to upgrade any of their usages from `parent_id` to `parent`.
Before upgrading your SDK, if you are currently using `parent_id`, you can port
over to using `parent` by changing any exported IDs from `span.id` to
`span.export()` and then changing any instances of `parent_id=[span_id]` to
`parent=[exported_span]`.
For example, if you had distributed tracing code like the following:
```javascript #skip-compile
import { initLogger } from "braintrust";

const logger = initLogger({
  projectName: "My Project",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

export async function POST(req: Request) {
  return logger.traced(async (span) => {
    const { body } = req;
    const result = await someLLMFunction(body);
    span.log({ input: body, output: result });
    return {
      result,
      requestId: span.id,
    };
  });
}

export async function POSTFeedback(req: Request) {
  logger.traced(
    async (span) => {
      logger.logFeedback({
        id: span.id, // Use the newly created span's id, instead of the original request's id
        comment: req.body.comment,
        scores: {
          correctness: req.body.score,
        },
        metadata: {
          user_id: req.user.id,
        },
      });
    },
    {
      parentId: req.body.requestId,
      name: "feedback",
    },
  );
}
}
```
```python
from braintrust import init_logger

logger = init_logger(project="My Project")

def my_route_handler(req):
    with logger.start_span() as span:
        body = req.body
        result = some_llm_function(body)
        span.log(input=body, output=result)
        return {
            "result": result,
            "request_id": span.id,
        }

def my_feedback_handler(req):
    with logger.start_span("feedback", parent_id=req.body.request_id) as span:
        logger.log_feedback(
            id=span.id,  # Use the newly created span's id, instead of the original request's id
            scores={
                "correctness": req.body.score,
            },
            comment=req.body.comment,
            metadata={
                "user_id": req.user.id,
            },
        )
```
It would now look like this:
```javascript #skip-compile
import { initLogger } from "braintrust";

const logger = initLogger({
  projectName: "My Project",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

export async function POST(req: Request) {
  return logger.traced(async (span) => {
    const { body } = req;
    const result = await someLLMFunction(body);
    span.log({ input: body, output: result });
    return {
      result,
      requestId: span.export(),
    };
  });
}

export async function POSTFeedback(req: Request) {
  logger.traced(
    async (span) => {
      logger.logFeedback({
        id: span.id, // Use the newly created span's id, instead of the original request's id
        comment: req.body.comment,
        scores: {
          correctness: req.body.score,
        },
        metadata: {
          user_id: req.user.id,
        },
      });
    },
    {
      parent: req.body.requestId,
      name: "feedback",
    },
  );
}
```
```python
from braintrust import init_logger

logger = init_logger(project="My Project")

def my_route_handler(req):
    with logger.start_span() as span:
        body = req.body
        result = some_llm_function(body)
        span.log(input=body, output=result)
        return {
            "result": result,
            "request_id": span.export(),
        }

def my_feedback_handler(req):
    with logger.start_span("feedback", parent=req.body.request_id) as span:
        logger.log_feedback(
            id=span.id,  # Use the newly created span's id, instead of the original request's id
            scores={
                "correctness": req.body.score,
            },
            comment=req.body.comment,
            metadata={
                "user_id": req.user.id,
            },
        )
```
## Week of 2024-04-15
* Incremental support for roles-based access control (RBAC) logic within the API
server backend.
As part of this change, we removed certain API endpoints which are no longer in
use. In particular, the `/crud/{object_type}` endpoint. For the handful of
usages of these endpoints in old versions of the SDK libraries, we added
backwards-compatibility routes, but it is possible we may have missed a few.
Please let us know if your code is trying to use an endpoint that no longer
exists and we can remediate.
* Changed the semantics of experiment initialization with `update=True`.
Previously, we required the experiment to already exist; now we create the
experiment if it doesn't already exist and otherwise return the existing one.
This change affects the semantics of the `PUT /v1/experiment` operation, so that
it will not replace the contents of an existing experiment with a new one, but
instead just return the existing one, meaning it behaves the same as `POST
/v1/experiment`. Eventually we plan to revise the update semantics for other
object types as well. Therefore, we have deprecated the `PUT` endpoint across
the board and plan to remove it in a future revision of the API.
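The new `PUT` semantics amount to get-or-create. A minimal sketch with an illustrative in-memory store (not the API implementation):

```python
def put_experiment(store: dict, name: str) -> dict:
    # PUT no longer replaces an existing experiment; like POST, it creates
    # the experiment only if it doesn't exist, and otherwise returns the
    # existing one untouched.
    if name not in store:
        store[name] = {"name": name, "records": []}
    return store[name]
```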
## Week of 2024-04-08
* Added support for new multimodal models (`gpt-4-turbo`, `gpt-4-vision-preview`, `gpt-4-1106-vision-preview`,
`gpt-4-turbo-2024-04-09`, `claude-3-opus-20240229`, `claude-3-sonnet-20240229`, `claude-3-haiku-20240307`).
* Introduced [REST API for RBAC](/docs/api/spec#roles) (Role-Based Access Control) objects including CRUD operations on roles, groups, and permissions, and added a read-only API for users.
* Improved AI search and added positive/negative tag filtering in AI search. To positively filter, prefix the tag with `+`, and to negatively filter, prefix the tag with `-`.
We are making some systematic changes to the search experience, and the search
syntax is subject to change.
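The `+`/`-` prefix convention for tag filtering can be sketched as (`split_tag_filters` is a hypothetical helper, not the actual search implementation):

```python
def split_tag_filters(tags):
    # "+tag" means the tag must be present; "-tag" means it must be absent.
    positive = [t[1:] for t in tags if t.startswith("+")]
    negative = [t[1:] for t in tags if t.startswith("-")]
    return positive, negative
```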
## Week of 2024-04-01
* Added functionality for distributed tracing. See the
[docs](/docs/guides/tracing#distributed-tracing) for more details.
As part of this change, we had to rework the core logging implementation in the
SDKs to rely on some newer backend API features. Therefore, if you are hosting
Braintrust on-prem, before upgrading your SDK to any version `>= 0.0.115`, make
sure your API version is `>= 0.0.35`. You can query the version of the on-prem
server with `curl [api-url]/version`, where the API URL can be found on the settings page.
## Week of 2024-03-25
* Introduce multimodal support for OpenAI and Anthropic models in the prompt playground and proxy. You can now pass image URLs, base64-encoded image strings, or mustache template variables to models that support multimodal inputs.

* The REST API now gzips responses.
* You can now return dynamic arrays of scores in `Eval()` functions ([docs](/docs/guides/evals#dynamic-scoring)).
* Launched [Reporters](/docs/guides/evals#custom-reporters), a way to summarize and report eval results in a custom format.
* New coat of paint in the trace view.
* Added support for Clickhouse as an additional storage backend, offering a more scalable solution for handling large datasets and performance improvements for certain query types. You can enable it by
setting the `UseManagedClickhouse` parameter to `true` in the CloudFormation template or by installing the Docker container.
* Implemented realtime checks using a WebSocket connection and updated proxy configurations to include CORS support.
* Introduced an API version checker tool so you know when your API version is outdated.
## Week of 2024-03-18
* Add new database parameters for external databases in the CloudFormation template.
* Faster optimistic updates for large writes in the UI.
* "Open in playground" now opens a lighter-weight modal instead of the full playground.
* Can create a new prompt playground from the prompt viewer.
## Week of 2024-03-11
* Shipped support for [prompt management](/docs/guides/prompts).
* Moved playground sessions to be within projects. All existing sessions are now in the "Playground Sessions" project.
* Allowed customizing proxy and real-time URLs through the web application, adding flexibility for different deployment scenarios.
* Improved documentation for Docker deployments.
* Improved folding behavior in data editors.
## Week of 2024-03-04
* Support custom models and endpoint configuration for all providers.
* New add team modal with support for multiple users.
* New information architecture to enable faster project navigation.
* Experiment metadata now visible in the experiments table.
* Improve UI write performance with batching.
* Log filters now apply to *any* span.
* Share button for traces
* Images now supported in the tree view (see [tracing docs](/docs/guides/tracing#multimodal-content) for more).
## Week of 2024-02-26
* Show auto scores before manual scores in the table, matching the trace order
* New logo is live!
* Any span can now submit scores, which automatically average in the trace. This makes it easier to label
scores in the spans where they originate.
* Improve sidebar scrolling behavior.
* Add AI search for datasets and logs.
* Add tags to the SDK.
* Support viewing and updating metadata on the experiment page.
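Conceptually, span-level scores average up into the trace-level score; a sketch of that aggregation (illustrative only, not the actual implementation):

```python
def trace_score(span_scores):
    # Each span may log a value for the same score name; the trace-level
    # value is the average over the spans that logged one.
    logged = [s for s in span_scores if s is not None]
    return sum(logged) / len(logged) if logged else None
```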
## Week of 2024-02-19
We rolled out a breaking change to the REST API that renames the
`output` field to `expected` on dataset records. This change brings
the API in line with [last week's update](#week-of-2024-02-12) to
the Braintrust SDK. For more information, refer to the REST API docs
for dataset records ([insert](/docs/api/spec#insert-dataset-events)
and [fetch](/docs/api/spec#fetch-dataset-get-form)).
* Add support for [tags](/docs/guides/logging#tags-and-queues).
* Score fields are now sorted alphabetically.
* Add support for Groq models.
* Improve tree viewer and XML parser.
* New experiment page redesign
## Week of 2024-02-12
We are rolling out a change to dataset records that renames the `output`
field to `expected`. If you are using the SDK, datasets will still fetch
records using the old format for now, but we recommend future-proofing
your code by setting the `useOutput` / `use_output` flag to false when
calling `initDataset()` / `init_dataset()`, which will become the default
in a future version of Braintrust.
When you set `useOutput` to false, your dataset records will contain
`expected` instead of `output`. This makes it easy to use them with
`Eval(...)` to provide expected outputs for scoring, since you'll
no longer have to manually rename `output` to `expected` when passing
data to the evaluator:
```typescript
import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("My Eval", {
  data: initDataset("Existing Dataset", { useOutput: false }), // Records will contain `expected` instead of `output`
  task: (input) => "foo",
  scores: [Levenshtein],
});
```
```python
from braintrust import Eval, init_dataset
from autoevals import Levenshtein

Eval(
    "My Eval",
    data=init_dataset("Existing Dataset", use_output=False),  # Records will contain `expected` instead of `output`
    task=lambda input: "foo",
    scores=[Levenshtein],
)
```
Here's an example of how to insert and fetch dataset records using the new format:
```typescript #skip-compile
import { initDataset } from "braintrust";

// Currently `useOutput` defaults to true, but this will change in a future version of Braintrust.
const dataset = initDataset("My Dataset", { useOutput: false });

dataset.insert({
  input: "foo",
  expected: { result: 42, error: null }, // Instead of `output`
  metadata: { model: "gpt-3.5-turbo" },
});
await dataset.flush();

for await (const record of dataset) {
  console.log(record.expected); // Instead of `record.output`
}
```
```python
from braintrust import init_dataset

# Currently `use_output` defaults to True, but this will change in a future version of Braintrust.
dataset = init_dataset("My Dataset", use_output=False)

dataset.insert(
    input="foo",
    expected=dict(result=42, error=None),  # Instead of `output`
    metadata=dict(model="gpt-3.5-turbo"),
)
dataset.flush()

for record in dataset:
    print(record["expected"])  # Instead of `record["output"]`
```
* Support duplicate `Eval` names.
* Fallback to `BRAINTRUST_API_KEY` if `OPENAI_API_KEY` is not set.
* Throw an error if you use `experiment.log` and `experiment.start_span` together.
* Add keyboard shortcuts (j/k/p/n) for navigation.
* Increased tooltip size and delay for better usability.
* Support more viewing modes: HTML, Markdown, and Text.
## Week of 2024-02-05

* Tons of improvements to the prompt playground:
* A new "compact" view, that shows just one line per row, so you can quickly scan across rows. You can toggle between the two modes.
* Loading indicators per cell
* The run button transforms into a "Stop" button while you are streaming data
* Prompt variables are now syntax highlighted in purple and use a monospace font
* Tab now autocompletes
* We no longer auto-create variables as you're typing (was causing more trouble than helping)
* Slider params like `max_tokens` are now optional
* Cloudformation now supports more granular RDS configuration (instance type, storage, etc)
* **Support optional slider params**
* Made certain parameters like `max_tokens` optional.
* Accompanies pull request [https://github.com/braintrustdata/braintrust-proxy/pull/23](https://github.com/braintrustdata/braintrust-proxy/pull/23).
* Lots of style improvements for tables.
* Fixed filter bar styles.
* Rendered JSON cell values using monospace type.
* Adjusted margins for horizontally scrollable tables.
* Implemented a smaller size for avatars in tables.
* Deleting a prompt takes you back to the prompts tab
## Week of 2024-01-29
* New [REST API](/docs/api/spec).
* [Cookbook](/docs/cookbook) of common use cases and examples.
* Support for [custom models](/docs/guides/playground#custom-models) in the playground.
* Search now works across spans, not just top-level traces.
* Show creator avatars in the prompt playground
* Improved UI breadcrumbs and sticky table headers
## Week of 2024-01-22
* UI improvements to the playground.
* Added an example of [closed QA / extra fields](/docs/guides/evals#additional-fields).
* New YAML parser and new syntax highlighting colors for data editor.
* Added support for enabling/disabling certain git fields from collection (in org settings and the SDK).
* Added new GPT-3.5 and 4 models to the playground.
* Fixed scrolling jitter issue in the playground.
* Made table fields in the prompt playground sticky.
## Week of 2024-01-15
* Added ability to download dataset as CSV
* Added YAML support for logging and visualizing traces
* Added JSON mode in the playground
* Added span icons and improved readability
* Enabled shift modifier for selecting multiple rows in Tables
* Improved tables to allow editing expected fields and moved datasets to trace view
## Week of 2024-01-08
* Released new [Docker deployment method for self hosting](https://www.braintrustdata.com/docs/self-hosting/docker)
* Added ability to manually score results in the experiment UI
* Added comments and audit log in the experiment UI
## Week of 2024-01-01
* Added ability to upload dataset CSV files in prompt playgrounds
* Published new [guide for tracing and logging your code](https://www.braintrustdata.com/docs/guides/tracing)
* Added support to download experiment results as CSVs
## Week of 2023-12-25
* API keys are now scoped to organizations, so if you are part of multiple orgs, new API keys will only permit
access to the org they belong to.
* You can now search for experiments by any metadata, including their name, author, or even git metadata.
* Filters are now saved in URL state so you can share a link to a filtered view of your experiments or logs.
* Improve performance of project page by optimizing API calls.
We made several cleanups and improvements to the low-level TypeScript and Python
SDKs (0.0.86). If you use the Eval framework, nothing should change for you, but
keep in mind the following differences if you use the manual logging
functionality:
* Simplified the low-level tracing API (updated docs coming soon!)
* The current experiment and current logger are now maintained globally
rather than as async-task-local variables. This makes it much simpler to
start tracing with minimal code modification. Note that creating
experiments/loggers with `withExperiment`/`withLogger` will now set the
current experiment globally (visible across all async tasks) rather than
local to a specific task. You may pass `setCurrent: false/set_current=False`
to avoid setting the global current experiment/logger.
* In Python, the `@traced` decorator now logs the function input/output by
default. This might interfere with code that already logs input/output
inside the `traced` function. You may pass `notrace_io=True` as an argument
to `@traced` to turn this logging off.
* In TypeScript, the `traced` method can start spans under the global
logger, and is thus async by default. You may pass `asyncFlush: true` to
these functions to make the traced function synchronous. Note that if the
function tries to trace under the global logger, it must also have
`asyncFlush: true`.
* Removed the `withCurrent`/`with_current` functions
* In TypeScript, the `Span.traced` method now accepts `name` as an optional
argument instead of a required positional param. This matches the behavior
of all other instances of `traced`. `name` is also now optional in python,
but this doesn't change the function signature.
* `Experiments` and `Datasets` are now lazily-initialized, similar to `Loggers`.
This means all write operations are immediate and synchronous. But any metadata
accessor methods (`[Experiment|Logger].[id|name|project]`) are now async.
* Undo auto-inference of `force_login` if `login` is invoked with different
params than last time. Now `login` will only re-login if `forceLogin: true/force_login=True` is provided.
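The `@traced` default can be pictured with a toy decorator (an illustrative sketch in plain Python, not the SDK's implementation; `SPANS` here is a stand-in for real span logging):

```python
import functools

SPANS = []  # stand-in for the span log a real tracer would keep


def traced(func=None, *, notrace_io=False):
    # Illustrative stand-in for the SDK's @traced: record input/output by
    # default; pass notrace_io=True to opt out of I/O capture.
    def wrap(f):
        @functools.wraps(f)
        def inner(*args, **kwargs):
            result = f(*args, **kwargs)
            span = {"name": f.__name__}
            if not notrace_io:
                span["input"], span["output"] = args, result
            SPANS.append(span)
            return result

        return inner

    return wrap if func is None else wrap(func)


@traced
def greet(name):
    return "Hi " + name


@traced(notrace_io=True)
def handle_secret(data):
    return "ok"


greet("Foo")  # span includes input/output
handle_secret("s3cr3t")  # span omits input/output
```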
## Week of 2023-12-18
* Dropped the official 2023 Year-in-Review dashboard. Check out yours [here](/app/year-in-review)!

* Improved ergonomics for the Python SDK:
* The `@traced` decorator will automatically log inputs/outputs
* You no longer need to use context managers to scope experiments or loggers.
* Enable skew protection in frontend deploys, so hopefully no more hard refreshes.
* Added syntax highlighting in the sidepanel to improve readability.
* Add `jsonl` mode to the eval CLI to log experiment summaries in an easy-to-parse format.
## Week of 2023-12-11
* Released new [trials](https://www.braintrustdata.com/docs/guides/evals#trials) feature to rerun each input multiple times and collect aggregate results for a more robust score.
* Added ability to run evals in the prompt playground. Use your existing dataset and the autoevals functions to score playground outputs.
* Released new version of SDK (0.0.81) including a small breaking change. When setting the experiment name in the `Eval` function, the `experimentName` key should now be passed as a top-level argument instead of inside `metadata`.
before:
```js
Eval([eval_name], {
  ...,
  metadata: {
    experimentName: [experimentName]
  }
})
```
after:
```js
Eval([eval_name], {
  ...,
  experimentName: [experimentName]
})
```
* Added support for Gemini and Mistral Platform in AI proxy and playground
## Week of 2023-12-04
* Enabled the prompt playground and datasets for free users
* Added Together.ai models including Mixtral to AI Proxy
* Turned prompts tab on organization view into a list
* Removed data row limit for the prompt playground
* Enabled configuration for dark mode and light mode in settings
* Added automatic logging of a diff if an experiment is run on a repo with uncommitted changes
## Week of 2023-11-27
* Added experiment search on project view to filter by experiment name

* Upgraded AI Proxy to support [tracking Prometheus metrics](https://github.com/braintrustdata/braintrust-proxy/blob/a31a82e6d46ff442a3c478773e6eec21f3d0ba69/apis/cloudflare/wrangler-template.toml#L19C1-L19C1)
* Modified Autoevals library to use the [AI proxy](/docs/guides/proxy)
* Upgraded Python braintrust library to parallelize evals
* Optimized experiment diff view for performance improvements
## Week of 2023-11-20
* Added support for new Perplexity models (ex: pplx-7b-online) to playground
* Released [AI proxy](/docs/guides/proxy): access many LLMs using one API w/ caching
* Added [load balancing endpoints](/docs/guides/proxy#load-balancing) to AI proxy
* Updated org-level view to show projects and prompt playground sessions
* Added ability to batch delete experiments
* Added support for Claude 2.1 in playground
## Week of 2023-11-13
* Made experiment column resized widths persistent
* Fixed our libraries including Autoevals to work with OpenAI’s new libraries

* Added support for function calling and tools in our prompt playground
* Added tabs on a project page for datasets, experiments, etc.
## Week of 2023-11-06
* Improved selectors for diffing and comparison modes on experiment view
* Added support for new OpenAI models (GPT4 preview, 3.5turbo-1106) in playground
* Added support for open-source models (Mistral, Codellama, Llama2, etc.) in playground using Perplexity's APIs
## Week of 2023-10-30
* Improved experiment sidebar to be fully responsive and resizable
* Improved tooltips within the web UI
* Multiple performance optimizations and bug fixes
## Week of 2023-10-23
* [Improved prompt playground variable handling and visualization](/docs/release-notes/ReleaseNotes-2023-10-PromptPlaygroundVar.mp4)
* Added time duration statistics per row to experiment summaries

* Multiple performance optimizations and bug fixes
## Week of 2023-10-16
* [Launched new tracing feature: log and visualize complex LLM chains and executions.](/docs/guides/evals#tracing)
* Added a new “text-block” prompt type in the playground that just returns a string or variable back without an LLM call (useful for chaining prompts and debugging)
* Increased default # of rows per page from 10 to 100 for experiments
* UI fixes and improvements for the side panel and tooltips
* The experiment dashboard can be customized to show the most relevant charts
## Week of 2023-10-09
* Performance improvements related to user sessions
## Week of 2023-10-02
* All experiment loading HTTP requests are 100-200ms faster
* The prompt playground now supports autocomplete
* Dataset versions are now displayed on the datasets page

* Projects in the summary page are now sorted alphabetically
* Long text fields in logged data can be expanded into scrollable blocks
* [We evaluated the Alpaca evals leaderboard in Braintrust](https://www.braintrustdata.com/app/braintrustdata.com/p/Alpaca-Evals)
* [New tutorial for finetuning GPT3.5 and evaluating with Braintrust](https://colab.research.google.com/drive/10KIXBHjZ0VUc-zN79_cuVeKy9ZiUQy4M?usp=sharing)
## Week of 2023-09-18
* The Eval framework is now supported in Python! See the updated [evals guide](/docs/guides/evals) for more information:
```python
from braintrust import Eval
from autoevals import LevenshteinScorer

Eval(
    "Say Hi Bot",
    data=lambda: [
        {
            "input": "Foo",
            "expected": "Hi Foo",
        },
        {
            "input": "Bar",
            "expected": "Hello Bar",
        },
    ],  # Replace with your eval dataset
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[LevenshteinScorer],
)
```
* Onboarding and signup flow for new users
* Switch product font to Inter
## Week of 2023-09-11
* Big performance improvements for registering experiments (down from \~5s to \<1s). Update the SDK to take advantage of these improvements.
* New graph shows aggregate accuracy between experiments for each score.

* Throw errors in the prompt playground if you reference an invalid variable.
* A significant backend database change which significantly improves performance while reducing costs. Please contact us if you have not already heard from us about upgrading your deployment.
* No more record size constraints (previously, strings could be at most 64kb long).
* New autoevals for numeric diff and JSON diff
## Week of 2023-09-05
* You can duplicate prompt sessions, prompts, and dataset rows in the prompt playground.
* You can download prompt sessions as JSON files (including the prompt templates, prompts, and completions).
* You can adjust model parameters (e.g. temperature) in the prompt playground.
* You can publicly share experiments (e.g. [Alpaca Evals](https://www.braintrustdata.com/app/braintrustdata.com/p/Alpaca-Evals/GPT4-w-metadata-claudegraded?c=llama2-70b-w-metadata-claudegraded)).
* Datasets now support editing, deleting, adding, and copying rows in the UI.
* There is no longer a 64KB limit on strings.
## Week of 2023-08-28
* The prompt playground is now live! We're excited to get your feedback as we continue to build
this feature out. See [the docs](/docs/guides/playground) for more information.

## Week of 2023-08-21
* A new chart shows experiment progress per score over time.

* The [eval CLI](/docs/guides/evals) now supports `--watch`, which will automatically re-run your evaluation when you make
changes to your code.
* You can now edit datasets in the UI.

## Week of 2023-08-14
* Introducing datasets! You can now upload datasets to Braintrust and use them in your experiments. Datasets are
versioned, and you can use them in multiple experiments. You can also use datasets to compare your model's
performance against a baseline. Learn more about [how to create and use datasets in the docs](/docs/guides/datasets).
* Fix several performance issues in the SDK and UI.
## Week of 2023-08-07
* Complex data is now substantially more performant in the UI. Prior to this change, we ran schema
inference over the entire `input`, `output`, `expected`, and `metadata` fields, which could result
in complex structures that were slow and difficult to work with. Now, we simply treat these fields
as `JSON` types.
* The UI updates in real-time as new records are logged to experiments.
* Ergonomic improvements to the SDK and CLI:
* The JS library is now isomorphic and supports both Node.js and the browser.
* The Evals CLI warns you when no files match the `.eval.[ts|js]` pattern.
## Week of 2023-07-31
* You can now break down scores by metadata fields:

* Improve performance for experiment loading (especially complex experiments). Prior to this change,
you may have seen experiments take 30s+ occasionally or even fail. To enable this, you'll need to
update your CloudFormation.
* Support for renaming and deleting experiments:

* When you expand a cell in detail view, the row is now highlighted:

## Week of 2023-07-24
* A new [framework](/docs/guides/evals) for expressing evaluations in a much simpler way:
```js #skip-compile
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("My Evaluation", {
  data: () => [
    {
      input: "Which country has the highest population?",
      expected: "China",
      meta: { type: "question" },
    },
  ],
  task: (input) => callModel(input),
  scores: [Factuality],
});
```
Besides being much easier than the logging SDK, this framework sets the foundation for evaluations
that can be run automatically as your code changes, built and run in the cloud, and more. We are
very excited about the use cases it will open up!
* `inputs` is now `input` in the SDK (>= 0.0.23) and UI. You do not need to make any code changes, although you should gradually start
using the `input` field instead of `inputs` in your SDK calls, as `inputs` is now deprecated and will eventually be removed.
* Improved diffing behavior for nested arrays.
## Week of 2023-07-17
* A couple of SDK updates (>= v0.0.21) that allow you to update an existing experiment `init(..., update=True)` and specify an
id in `log(..., id='my-custom-id')`. These tools are useful for running an experiment across multiple processes,
tasks, or machines, and idempotently logging the same record (identified by its `id`).
* Note: If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation (available at
[https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml](https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml)).
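The idempotent-logging behavior can be pictured with a small sketch (plain Python, not the SDK): because records are identified by `id`, logging the same `id` twice updates one record rather than creating two.

```python
# Illustrative sketch: keying log records by id makes re-logging from
# multiple processes, tasks, or retries idempotent.
records = {}


def log(id, **fields):
    # Logging an existing id merges into the same record instead of duplicating it
    records[id] = {**records.get(id, {}), **fields}


log("my-custom-id", input="Foo", output="Hi Foo")
log("my-custom-id", output="Hi Foo", scores={"exact": 1.0})  # retried elsewhere
assert len(records) == 1  # still a single record for that id
```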
* Tables with lots and lots of columns are now visually more compact in the UI:
*Before:*

*After:*

## Week of 2023-07-10
* A new [Node.js SDK](/docs/libs/nodejs) ([npm](https://www.npmjs.com/package/braintrust)) which mirrors the [Python SDK](/docs/reference/libs/python). As this SDK is new, please let us know
if you run into any issues or have any feedback.
If you have Braintrust installed in your own cloud environment, make sure to update the CloudFormation (available at
[https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml](https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml))
to include some functionality the Node.js SDK relies on.
You can do this in the AWS console, or by running the following command (with the `braintrust` command included in the Python SDK).
```bash
braintrust install api --update-template
```
* You can now swap the primary and comparison experiment with a single click.

* You can now compare `output` vs. `expected` within an experiment.

* Version 0.0.19 is out for the SDK. It is an important update that throws an error if your payload is larger than 64KB in size.
## Week of 2023-07-03
* Support for real-time updates, using Redis. Prior to this, Braintrust would wait for your data warehouse to sync up with
Kafka before you could view an experiment, often leading to a minute or two of time before a page loads. Now, we cache experiment
records as your experiment is running, making experiments load instantly. To enable this, you'll need to update your CloudFormation.
* New settings page that consolidates team, installation, and API key settings. You can now invite team members
to your Braintrust account from the "Team" page.

* The experiment page now shows commit information for experiments run inside of a git repository.

## Week of 2023-06-26
* Experiments track their git metadata and automatically find a "base" experiment to compare against, using
your repository's base branch.
* The Python SDK's [`summarize()`](/docs/libs/python#summarize) method now returns an [`ExperimentSummary`](/docs/libs/python#experimentsummary-objects) object with score
differences against the base experiment (v0.0.10).
* Organizations can now be "multi-tenant", i.e. you do not need to install in your cloud account. If you start with
a multi-tenant account to try out Braintrust, and decide to move it into your own account, Braintrust can migrate it for you.
## Week of 2023-06-19
* New scatter plot and histogram insights to quickly analyze scores and filter down examples.

* API keys that can be set in the SDK (explicitly or through an environment variable) and do not require user login.
Visit the settings page to create an API key.
* Update the braintrust Python SDK to [version 0.0.6](https://pypi.org/project/braintrust/0.0.6/) and the CloudFormation template ([https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml](https://braintrust-cf.s3.amazonaws.com/braintrust-latest.yaml)) to use the new API key feature.
## Week of 2023-06-12
* New `braintrust install` CLI for installing the CloudFormation
* Improved performance for event logging in the SDK
* Auto-merge experiment fields with different types (e.g. `number` and `string`)
## Week of 2023-06-05
* [Tutorial guide + notebook](/docs/start)
* Automatically refresh cognito tokens in the Python client
* New filter and sort operators on the experiments table:
* Filter experiments by changes to scores (e.g. only examples with a lower score than another experiment)
* Custom SQL filters
* Filter and sort bubbles to visualize/clear current operations
* \[Alpha] SQL query explorer to run arbitrary queries against one or more experiments
---
file: ./content/docs/cookbook/index.mdx
meta: {
"title": "Cookbook"
}
# Cookbook
This cookbook, inspired by [OpenAI's cookbook](https://cookbook.openai.com/), is a collection of recipes for common
use cases of [Braintrust](/). Each recipe is an open source self-contained example, hosted on
[GitHub](https://github.com/braintrustdata/braintrust-cookbook). We welcome community contributions
and aspire for the cookbook to be a collaborative, living, breathing collection of best practices for
building high quality AI products.
---
file: ./content/docs/pricing/faq.mdx
meta: {
"title": "FAQ",
"order": 2
}
# FAQ
### Which plan is right for me?
* **Free**: Ideal for individuals or small teams getting started with Braintrust. It includes enough data ingestion, scoring, and data retention to explore and build small projects.
* **Pro**: Best suited for small teams of up to 5 people who are regularly running experiments or evaluations that require increased usage limits and longer data retention. Additional usage beyond included limits is billed flexibly, making it great for teams with growing or varying workloads.
* **Enterprise**: Recommended for larger organizations or teams with custom needs such as high volumes of data, special security requirements, on-premises deployment, or dedicated support.
If you're unsure which option fits your needs or would like to discuss custom requirements, please [contact our team](/contact) for personalized guidance.
### What does processed data mean?
Processed data refers to the data ingested by Braintrust when you create [logs](/docs/guides/logs) or [experiments](/docs/guides/evals). This includes inputs, outputs, prompts, metadata, datasets, traces, and any related information. The cumulative size of this data (measured on disk) counts toward your monthly total, calculated from the first day to the last day of each calendar month.
### What are scores?
[Scores](/docs/guides/functions/scorers) are used to measure the results of offline or online evaluations run in Braintrust. Each time you record a score, including [custom metrics](/docs/guides/functions/scorers#custom-scorers), the total number of scores counted towards your monthly usage increases by one. Your monthly total is calculated cumulatively from the first to the last day of each calendar month.
### What are trace spans?
Spans are the fundamental units of observability in your traces. Each span represents a discrete operation in your application, like an LLM API call, prompt rendering, or evaluation step. Spans are automatically created when you use Braintrust's instrumentation and contribute to your monthly usage quota, which is calculated per calendar month.
### How do I track my usage?
If you are on the Pro plan, you can track your usage by selecting **View usage details** in **Settings** > **Billing**. This will open your detailed usage report in the Orb usage portal, where you can view your current usage and monitor costs throughout the billing period.
### How does billing work?
The Free plan does not require a credit card to get started. You can upgrade to the Pro plan at any time via the **Upgrade** button in the top-right of your workspace.
When you sign up for the Pro plan, you'll immediately be charged a prorated amount of the monthly $249 platform fee. For example, if you sign up on the 15th of the month, you'll pay about half of the monthly fee. On the 1st of the following month, you'll be charged the full $249 fee plus any additional usage-based charges incurred during the previous month. Charges will be processed automatically using the credit card provided at sign-up.
---
file: ./content/docs/start/eval-sdk.mdx
meta: {
"title": "Eval via SDK"
}
# Evaluate via SDK
When you join a new organization, you'll see the following steps, which walk you through running your first experiment:
### Install Braintrust libraries
First, install the Braintrust SDK (TypeScript, Python and API wrappers in [other languages](/docs/reference/api#api-wrappers)).
```bash
npm install braintrust autoevals
```
or
```bash
yarn add braintrust autoevals
```
Node version >= 18 is required
```bash
pip install braintrust autoevals
```
### Create a simple evaluation script
The eval framework allows you to declaratively define evaluations in your code. Inspired by tools like Jest, it discovers a set of evaluations in files named `*.eval.ts` or `*.eval.js` (Node.js), or `eval_*.py` (Python).
Create a file named `tutorial.eval.ts` or `eval_tutorial.py` with the following code.
```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "Foo",
          expected: "Hi Foo",
        },
        {
          input: "Bar",
          expected: "Hello Bar",
        },
      ]; // Replace with your eval dataset
    },
    task: async (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);
```
```python
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "Say Hi Bot",  # Replace with your project name
    data=lambda: [
        {
            "input": "Foo",
            "expected": "Hi Foo",
        },
        {
            "input": "Bar",
            "expected": "Hello Bar",
        },
    ],  # Replace with your eval dataset
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[Levenshtein],
)
```
This script sets up the basic scaffolding of an evaluation:
* `data` is an array or iterator of data you'll evaluate
* `task` is a function that takes in an input and returns an output
* `scores` is an array of scoring functions that will be used to score the task's output
In addition to adding each data point inline when you call the `Eval()` function, you can also [pass an existing or new dataset directly](/docs/guides/datasets#using-a-dataset-in-an-evaluation).
(You can also write your own code. Make sure to follow the naming conventions for your language. TypeScript
files should be named `*.eval.ts` and Python files should be named `eval_*.py`.)
### Create an API key
Next, create an API key to authenticate your evaluation script. You can create an API key in the [settings page](/app/settings?subroute=api-keys).
Run this command to add your API key to your environment:
```bash
export BRAINTRUST_API_KEY="YOUR_API_KEY"
```
### Run your evaluation script
Run your evaluation script with the following command:
```bash
npx braintrust eval tutorial.eval.ts
```
```bash
braintrust eval eval_tutorial.py
```
This will create an experiment in Braintrust. Once the command runs, you'll see a link to your experiment.
### View your results
Congrats, you just ran an eval! You should see a dashboard like this when you load your experiment.
This view is called the *experiment view*, and as you use Braintrust, we hope it becomes your trusty companion
each time you change your code and want to run an eval.
The experiment view allows you to look at high level metrics for performance, dig
into individual examples, and compare your LLM app's performance over time.

### Run another experiment
After running your first evaluation, you’ll see that we achieved a 77.8% score. Can you adjust the evaluation to improve this score? Make your changes and re-run the evaluation to track your progress.
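That 77.8% is simply the mean of the per-row Levenshtein scores: "Hi Foo" matches exactly (1.0), while "Hi Bar" is four edits away from "Hello Bar" (1 - 4/9 ≈ 0.556), and (1.0 + 0.556) / 2 ≈ 0.778. Here is a minimal reimplementation of a normalized edit-distance scorer (an illustrative sketch assuming normalization by the longer string's length, not the autoevals source):

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def levenshtein_score(output: str, expected: str) -> float:
    # Normalize distance into a 0..1 score by the longer string's length
    longest = max(len(output), len(expected))
    return 1.0 if longest == 0 else 1 - levenshtein(output, expected) / longest


scores = [levenshtein_score("Hi Foo", "Hi Foo"), levenshtein_score("Hi Bar", "Hello Bar")]
print(round(sum(scores) / len(scores), 3))  # 0.778
```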

## Next steps
* Dig into our [evals guide](/docs/guides/evals) to learn more about how to run evals.
* Look at our [cookbook](/docs/cookbook) to learn how to evaluate RAG, summarization, text-to-sql, and other popular use cases.
* Learn how to [log traces](/docs/guides/logging) to Braintrust.
* Read about Braintrust's [platform and architecture](/docs/platform/architecture).
---
file: ./content/docs/start/eval-ui.mdx
meta: {
"title": "Eval via UI"
}
# Evaluate via UI
The following steps require access to a Braintrust organization, which represents a company or a team. [Sign up](https://www.braintrust.dev/signup) to create an organization for free.
### Configure your API keys
Navigate to the [AI providers](/app/settings?subroute=secrets) page in your settings and configure at least one API key. For this quickstart, be sure to add your OpenAI API key. After completing this initial setup, you can access models from many providers through a single, unified API.
For more advanced use cases where you want to use custom models or avoid plugging your API key into Braintrust, you may want to check out the [SDK](/docs/start/eval-sdk) quickstart.
### Create a new project
For every AI feature your organization is building, the first thing you'll do is create a project.
### Create a new prompt
Navigate to **Library** in the top menu bar, then select **Prompts**. Create a new prompt in your project called "movie matcher". A prompt is the input you provide to the model to generate a response. Choose `GPT 4o` for your model, and type this for your system prompt:
```
Based on the following description, identify the movie title. In your response, simply provide the name of the movie.
```
Select the **+ Message** button below the system prompt, and enter a user message:
```
{{input}}
```
Prompts can use [mustache](https://mustache.github.io/mustache.5.html) templating syntax to refer to variables. In this case, the input corresponds to the movie description given by the user.
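Under the hood, rendering is just variable substitution. A minimal sketch of mustache-style interpolation (illustrative only, not Braintrust's renderer, and ignoring mustache features like sections and HTML escaping):

```python
import re


def render(template: str, variables: dict) -> str:
    # Replace each {{name}} placeholder with the matching variable's value
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), "")),
        template,
    )


print(render("{{input}}", {"input": "A great white shark terrorizes a beach town."}))
```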

Select **Save as custom prompt** to save your prompt.
### Explore the prompt playground
Scroll to the bottom of the prompt viewer, and select **Create playground with prompt**. This will open the prompt you just created in the [prompt playground](https://www.braintrust.dev/docs/guides/playground), a tool for exploring, comparing, and evaluating prompts. In the prompt playground, you can evaluate prompts with data from your [datasets](https://www.braintrust.dev/docs/guides/datasets).

### Importing a dataset
Open this [sample dataset](https://gist.githubusercontent.com/ornellaaltunyan/28972d2566ddf64bc171922d0f0564e2/raw/838d220eea620a2390427fe1ec35d347f2b798bd/gistfile1.csv), and right-click to select **Save as...** and download it. It is a `.csv` file with two columns, **Movie Title** and **Original Description**. Inside your playground, select **Dataset**, then **Upload dataset**, and upload the CSV file. Using drag and drop, assign the CSV columns to dataset fields. The input column corresponds to Original Description, and the expected column should be Movie Title. Then, select **Import**.

### Choosing a scorer
A scoring function allows you to compare the expected output of a task to the actual output and produce a score between 0 and 1. Inside your playground, select **Scorers** to choose from several types of scoring functions. There are two main types of scoring functions: heuristics are great for well-defined criteria, while LLM-as-a-judge is better for handling more complex, subjective evaluations. You can also create a custom scorer. For this example, since there is a clear correct answer, we can choose **ExactMatch**.
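In spirit, an exact-match scorer is as simple as it sounds; a sketch of the idea (illustrative, not the autoevals implementation):

```python
def exact_match(output: str, expected: str) -> float:
    # 1.0 when the trimmed output equals the expected value, else 0.0
    return 1.0 if output.strip() == expected.strip() else 0.0


print(exact_match("Finding Nemo", "Finding Nemo"))  # 1.0
```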
### Running your first evaluation
From within the playground, select **+ Experiment** to set up your first evaluation. To run an eval, you need three things:
* **Data**: a set of examples to test your application on
* **Task**: the AI function you want to test (any function that takes in an input and returns an output)
* **Scores**: a set of scoring functions that take an input, output, and optional expected value and compute a score
In this example, the Data is the dataset you uploaded, the Task is the prompt you created, and Scores is the scoring function we selected.

Creating an experiment from the playground will automatically log your results to Braintrust.
### Interpreting your results
Navigate to the **Experiments** page to view your evaluation. Examine the exact match scores and other feedback generated by your evals. If you notice that some of your outputs did not match what was expected, you can tweak your prompt directly in the UI until it consistently produces high-quality outputs. If changing the prompt doesn't yield the desired results, consider experimenting with different models.

As you iterate on your prompt, you can run more experiments and compare results.
## Next steps
* Now that you've run your first evaluation, learn how to [write your own eval script](/docs/start/eval-sdk).
* Check out more examples and sample projects in the [Braintrust Cookbook](/docs/cookbook).
* Explore the [guides](/docs/guides) to read more about evals, logging, and datasets.
---
file: ./content/docs/start/index.mdx
meta: {
"title": "Get started"
}
# Get started with Braintrust
Braintrust is an end-to-end platform for building AI applications. It makes software development with large language models (LLMs) robust and iterative.
### Iterative experimentation
Rapidly prototype with different prompts and models in the [playground](/docs/guides/playground)
### Performance insights
Built-in tools to [evaluate](/docs/guides/evals) how models and prompts are performing in production, and dig into specific examples
### Real-time monitoring
[Log](/docs/guides/logging), monitor, and take action on real-world interactions with robust and flexible monitoring
### Data management
[Manage](/docs/guides/datasets) and [review](/docs/guides/human-review) data to store and version your test sets centrally

What makes Braintrust powerful is how these tools work together. With Braintrust, developers can move faster, run more experiments, and ultimately build better AI products.
---
file: ./content/docs/guides/access-control.mdx
meta: {
"title": "Access control"
}
# Access control
Braintrust has a robust and flexible access control system.
You can grant user permissions at the organization level as well
as scoped to individual objects within Braintrust (projects, experiments, logs, datasets, prompts, and playgrounds).
## Permission groups
The core concept of Braintrust's access control system is the permission group. Permission groups are collections of users that can be granted specific permissions.
Braintrust has three pre-configured Permission Groups that are scoped to the organization.
1. **Owners** - Unrestricted access to the organization, its data, and its settings. Can add, modify, and delete projects and all other resources. Can invite and remove members and can manage group membership.
2. **Engineers** - Can access, create, update, and delete projects and all resources within projects. Cannot invite or remove members or manage access to resources.
3. **Viewers** - Can access projects and all resources within projects. Cannot create, update, or delete any resources. Cannot invite or remove members or manage access to resources.
If your access control needs are simple and you do not need to restrict access to individual projects, these ready-made permission groups may be all that you need.
A new user can be added to one of these three groups when you invite them to your organization.

## Creating custom permission groups
In addition to the built-in permission groups, it's possible to create your own groups as well.
To do so, go to the 'Permission groups' page of Settings and click on the 'Create permission group' button.
Give your group a name and a description and then click 'Create'.

To set organization-level permissions for your new group, find the group in the groups list and click on the Permissions button.

The 'Manage Access' permission should be granted judiciously as it is a super-user permission.
It gives the user the ability to add and remove permissions, thus any user with 'Manage Access' gains the ability to grant all other permissions to themselves.
The 'Manage Settings' permission grants users the ability to change organization-level settings like the API URL.
To set group-level permissions for your new group, i.e. who can read, delete, and add members to this group, find the group in the groups list and click on the Group access button.
## Project scoped permissions
To limit access to a specific project, create a new permission group from the Settings page.

Navigate to the Configuration page of that project, and click on the Permissions link in the context menu.

Search for your group by typing in the text input at the top of the page, and then click the pencil icon next to the group to set permissions.

Set the project-level permissions for your group and click Save.

## Object scoped permissions
To limit access to a particular object (experiment, dataset, or playground) within a project, first create a permission group for those users on the 'Permission groups' section of Settings.

Next, navigate to the Configuration page of the project that holds that object and grant the group 'Read' permission at the project level.
This will allow users in that group to navigate to the project in the Braintrust UI.


Finally, navigate to your object and select Permissions from the context menu in the top-right of that object's page.

Find the permission group via the search input, and click the pencil icon to set permissions for the group.

Set the desired permissions for the group scoped to this specific object.

## API support
To automate the creation of permission groups and their access control rules, you can use the Braintrust API.
For more information on using the API to manage permission groups, check out the [API reference for groups](/docs/reference/api/Groups#list-groups) and for [permissions](/docs/reference/api#list-acls).
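As a sketch of what that automation can look like, the following Python helpers build the request payloads for creating a group and granting it a project-scoped permission. The exact field names (`object_type`, `object_id`, `group_id`, `permission`) are taken from the API reference; treat this as a starting point rather than a complete client.

```python
import os

import requests

# Use your stack's Universal API URL if you are self-hosting
API_URL = "https://api.braintrust.dev"
HEADERS = {"Authorization": "Bearer " + os.environ.get("BRAINTRUST_API_KEY", "")}


def group_payload(name: str, description: str = "") -> dict:
    # Payload for POST /v1/group
    return {"name": name, "description": description}


def acl_payload(group_id: str, project_id: str, permission: str) -> dict:
    # Payload for POST /v1/acl, granting `permission` on a project to a group
    return {
        "object_type": "project",
        "object_id": project_id,
        "group_id": group_id,
        "permission": permission,
    }


def create_readonly_group(name: str, project_id: str) -> dict:
    """Create a permission group and grant it read access to one project."""
    group = requests.post(
        f"{API_URL}/v1/group", headers=HEADERS, json=group_payload(name)
    ).json()
    acl_resp = requests.post(
        f"{API_URL}/v1/acl",
        headers=HEADERS,
        json=acl_payload(group["id"], project_id, "read"),
    )
    acl_resp.raise_for_status()
    return group
```
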
---
file: ./content/docs/guides/api.mdx
meta: {
"title": "API walkthrough"
}
# API walkthrough
The Braintrust REST API is available via an OpenAPI spec published at
[https://github.com/braintrustdata/braintrust-openapi](https://github.com/braintrustdata/braintrust-openapi).
This guide walks through a few common use cases, and should help you get started
with using the API. Each example is implemented in a particular language, for
legibility, but the API itself is language-agnostic.
To learn more about the API, see the full [API spec](/docs/api/spec). If you are
looking for a language-specific wrapper over the bare REST API, we support
several different [languages](/docs/reference/api#api-wrappers).
## Running an experiment
```python #skip-test
import os
from uuid import uuid4
import requests
API_URL = "https://api.braintrust.dev/v1"
headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]}
if __name__ == "__main__":
# Create a project, if it does not already exist
project = requests.post(f"{API_URL}/project", headers=headers, json={"name": "rest_test"}).json()
print(project)
# Create an experiment. This should always be new
experiment = requests.post(
f"{API_URL}/experiment", headers=headers, json={"name": "rest_test", "project_id": project["id"]}
).json()
print(experiment)
# Log some stuff
for i in range(10):
resp = requests.post(
f"{API_URL}/experiment/{experiment['id']}/insert",
headers=headers,
json={"events": [{"id": uuid4().hex, "input": 1, "output": 2, "scores": {"accuracy": 0.5}}]},
)
if not resp.ok:
raise Exception(f"Error: {resp.status_code} {resp.text}: {resp.content}")
```
## Fetching experiment results
Let's say you have a [human review](/docs/guides/human-review) workflow and you want to determine if an experiment
has been fully reviewed. You can do this by running a [Braintrust query language (BTQL)](/docs/reference/btql) query:
```sql
from: experiment('')
measures: sum("My review score" IS NOT NULL) AS reviewed, count(1) AS total
filter: is_root -- Only count traces, not spans
```
To do this in Python, you can use the `btql` endpoint:
```python
import os
import requests
API_URL = "https://api.braintrust.dev/"
headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]}
def make_query(experiment_id: str) -> str:
# Replace "response quality" with the name of your review score column
return f"""
from: experiment('{experiment_id}')
measures: sum(scores."response quality" IS NOT NULL) AS reviewed, sum(is_root) AS total
"""
def fetch_experiment_review_status(experiment_id: str) -> dict:
return requests.post(
f"{API_URL}/btql",
headers=headers,
json={"query": make_query(experiment_id), "fmt": "json"},
).json()
EXPERIMENT_ID = "bdec1c5e-8c00-4033-84f0-4e3aa522ecaf" # Replace with your experiment ID
print(fetch_experiment_review_status(EXPERIMENT_ID))
```
## Paginating a large dataset
```typescript
// If you're self-hosting Braintrust, then use your stack's Universal API URL, e.g.
// https://dfwhllz61x709.cloudfront.net
export const BRAINTRUST_API_URL = "https://api.braintrust.dev";
export const API_KEY = process.env.BRAINTRUST_API_KEY;
export async function* paginateDataset(args: {
project: string;
dataset: string;
version?: string;
// Number of rows to fetch per request. You can adjust this to be a lower number
// if your rows are very large (e.g. several MB each).
perRequestLimit?: number;
}) {
const { project, dataset, version, perRequestLimit } = args;
const headers = {
Accept: "application/json",
"Accept-Encoding": "gzip",
Authorization: `Bearer ${API_KEY}`,
};
const fullURL = `${BRAINTRUST_API_URL}/v1/dataset?project_name=${encodeURIComponent(
project,
)}&dataset_name=${encodeURIComponent(dataset)}`;
const ds = await fetch(fullURL, {
method: "GET",
headers,
});
if (!ds.ok) {
throw new Error(
`Error fetching dataset metadata: ${ds.status}: ${await ds.text()}`,
);
}
const dsJSON = await ds.json();
const dsMetadata = dsJSON.objects[0];
if (!dsMetadata?.id) {
throw new Error(`Dataset not found: ${project}/${dataset}`);
}
let cursor: string | null = null;
while (true) {
const body: string = JSON.stringify({
query: {
from: {
op: "function",
name: { op: "ident", name: ["dataset"] },
args: [{ op: "literal", value: dsMetadata.id }],
},
select: [{ op: "star" }],
limit: perRequestLimit,
cursor,
},
fmt: "jsonl",
version,
});
const response = await fetch(`${BRAINTRUST_API_URL}/btql`, {
method: "POST",
headers,
body,
});
if (!response.ok) {
throw new Error(
`Error fetching rows for ${dataset}: ${
response.status
}: ${await response.text()}`,
);
}
cursor =
response.headers.get("x-bt-cursor") ??
response.headers.get("x-amz-meta-bt-cursor");
// Parse jsonl line-by-line
const allRows = await response.text();
const rows = allRows.split("\n");
let rowCount = 0;
for (const row of rows) {
if (!row.trim()) {
continue;
}
yield JSON.parse(row);
rowCount++;
}
if (rowCount === 0) {
break;
}
}
}
async function main() {
for await (const row of paginateDataset({
project: "Your project name", // Replace with your project name
dataset: "Your dataset name", // Replace with your dataset name
perRequestLimit: 100,
})) {
console.log(row);
}
}
main();
```
## Deleting logs
To delete logs, you have to issue log requests with the `_object_delete` flag set to `true`.
For example, to find all logs matching specific criteria, and then delete them, you can
run a script like the following:
```python
import argparse
import os
from uuid import uuid4
import requests
# Make sure to replace this with your stack's Universal API URL if you are self-hosting
API_URL = "https://api.braintrust.dev/"
headers = {"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]}
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--project-id", type=str, required=True)
# Update this logic to match the rows you'd like to delete
parser.add_argument("--user-id", type=str, required=True)
args = parser.parse_args()
# Find all rows matching a certain metadata value.
query = f"""
select: id
from: project_logs('{args.project_id}') traces
filter: metadata.user_id = '{args.user_id}'
"""
response = requests.post(f"{API_URL}/btql", headers=headers, json={"query": query}).json()
ids = [x["id"] for x in response["data"]]
print("Deleting", len(ids), "rows")
delete_requests = [{"id": id, "_object_delete": True} for id in ids]
response = requests.post(
f"{API_URL}/v1/project_logs/{args.project_id}/insert", headers=headers, json={"events": delete_requests}
).json()
row_ids = response["row_ids"]
print("Deleted", len(row_ids), "rows")
```
## Impersonating a user for a request
User impersonation allows a privileged user to perform an operation on behalf of
another user, using the impersonated user's identity and permissions. For
example, a proxy service may wish to forward requests coming in from individual
users to Braintrust without requiring each user to directly specify Braintrust
credentials. The privileged service can initiate the request with its own
credentials and impersonate the user so that Braintrust runs the operation with
the user's permissions.
To this end, all API requests accept a header `x-bt-impersonate-user`, which you
can set to the ID or email of the user to impersonate. Currently impersonating
another user requires that the requesting user has specifically been granted the
`owner` role over all organizations that the impersonated user belongs to. This
check guarantees the requesting user has at least the set of permissions that
the impersonated user has.
Consider the following code example for configuring ACLs and running a request
with user impersonation.
```javascript
// If you're self-hosting Braintrust, then use your stack's Universal API URL, e.g.
// https://dfwhllz61x709.cloudfront.net
export const BRAINTRUST_API_URL = "https://api.braintrust.dev";
export const API_KEY = process.env.BRAINTRUST_API_KEY;
async function getOwnerRoleId() {
const roleResp = await fetch(
`${BRAINTRUST_API_URL}/v1/role?${new URLSearchParams({ role_name: "owner" })}`,
{
method: "GET",
headers: {
Authorization: `Bearer ${API_KEY}`,
},
},
);
if (!roleResp.ok) {
throw new Error(await roleResp.text());
}
const roles = await roleResp.json();
return roles.objects[0].id;
}
async function getUserOrgInfo(orgName: string): Promise<{
user_id: string;
org_id: string;
}> {
const meResp = await fetch(`${BRAINTRUST_API_URL}/api/self/me`, {
method: "POST",
headers: {
Authorization: `Bearer ${API_KEY}`,
},
});
if (!meResp.ok) {
throw new Error(await meResp.text());
}
const meInfo = await meResp.json();
const orgInfo = meInfo.organizations.find(
(x: { name: string }) => x.name === orgName,
);
if (!orgInfo) {
throw new Error(`No organization found with name ${orgName}`);
}
return { user_id: meInfo.id, org_id: orgInfo.id };
}
async function grantOwnershipRole(orgName: string) {
const ownerRoleId = await getOwnerRoleId();
const { user_id, org_id } = await getUserOrgInfo(orgName);
// Grant an 'owner' ACL to the requesting user on the organization. Granting
// this ACL requires the user to have `create_acls` permission on the org, which
// means they must already be an owner of the org indirectly.
const aclResp = await fetch(`${BRAINTRUST_API_URL}/v1/acl`, {
method: "POST",
headers: {
Authorization: `Bearer ${API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
object_type: "organization",
object_id: org_id,
user_id,
role_id: ownerRoleId,
}),
});
if (!aclResp.ok) {
throw new Error(await aclResp.text());
}
}
async function main() {
if (!process.env.ORG_NAME || !process.env.USER_EMAIL) {
throw new Error("Must specify ORG_NAME and USER_EMAIL");
}
// This only needs to be done once.
await grantOwnershipRole(process.env.ORG_NAME);
// This will only succeed if the user being impersonated has permissions to
// create a project within the org.
const projectResp = await fetch(`${BRAINTRUST_API_URL}/v1/project`, {
method: "POST",
headers: {
Authorization: `Bearer ${API_KEY}`,
"Content-Type": "application/json",
"x-bt-impersonate-user": process.env.USER_EMAIL,
},
body: JSON.stringify({
name: "my-project",
org_name: process.env.ORG_NAME,
}),
});
if (!projectResp.ok) {
throw new Error(await projectResp.text());
}
console.log(await projectResp.json());
}
main();
```
```python
import os
import requests
# If you're self-hosting Braintrust, then use your stack's Universal API URL, e.g.
# https://dfwhllz61x709.cloudfront.net
BRAINTRUST_API_URL = "https://api.braintrust.dev"
API_KEY = os.environ["BRAINTRUST_API_KEY"]
def get_owner_role_id():
resp = requests.get(
f"{BRAINTRUST_API_URL}/v1/role",
headers={"Authorization": f"Bearer {API_KEY}"},
params=dict(role_name="owner"),
)
resp.raise_for_status()
return resp.json()["objects"][0]["id"]
def get_user_org_info(org_name):
resp = requests.post(
f"{BRAINTRUST_API_URL}/api/self/me",
headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
me_info = resp.json()
org_info = [x for x in me_info["organizations"] if x["name"] == org_name]
if not org_info:
raise Exception(f"No organization found with name {org_name}")
return dict(user_id=me_info["id"], org_id=org_info[0]["id"])
def grant_ownership_role(org_name):
owner_role_id = get_owner_role_id()
user_org_info = get_user_org_info(org_name)
# Grant an 'owner' ACL to the requesting user on the organization. Granting
# this ACL requires the user to have `create_acls` permission on the org,
# which means they must already be an owner of the org indirectly.
resp = requests.post(
f"{BRAINTRUST_API_URL}/v1/acl",
headers={"Authorization": f"Bearer {API_KEY}"},
json=dict(
object_type="organization",
object_id=user_org_info["org_id"],
user_id=user_org_info["user_id"],
role_id=owner_role_id,
),
)
resp.raise_for_status()
def main():
# This only needs to be done once.
grant_ownership_role(os.environ["ORG_NAME"])
# This will only succeed if the user being impersonated has permissions to
# create a project within the org.
resp = requests.post(
f"{BRAINTRUST_API_URL}/v1/project",
headers={
"Authorization": f"Bearer {API_KEY}",
"x-bt-impersonate-user": os.environ["USER_EMAIL"],
},
json=dict(
name="my-project",
org_name=os.environ["ORG_NAME"],
),
)
resp.raise_for_status()
print(resp.json())
```
## Postman
[Postman](https://www.postman.com/) is a popular tool for interacting with HTTP APIs. You can
load Braintrust's API spec into Postman by importing the OpenAPI spec's URL:
```
https://raw.githubusercontent.com/braintrustdata/braintrust-openapi/main/openapi/spec.json
```

## Tracing with the REST API SDKs
In this section, we demonstrate the basics of logging with tracing using the
language-specific REST API SDKs. The end result of running each example should
be a single log entry in a project called `tracing_test`, which looks like the
following:

```go
package main
import (
"context"
"github.com/braintrustdata/braintrust-go"
"github.com/braintrustdata/braintrust-go/shared"
"github.com/google/uuid"
"time"
)
type LLMInteraction struct {
input interface{}
output interface{}
}
func runInteraction0(input interface{}) LLMInteraction {
return LLMInteraction{
input: input,
output: "output0",
}
}
func runInteraction1(input interface{}) LLMInteraction {
return LLMInteraction{
input: input,
output: "output1",
}
}
func getCurrentTime() float64 {
return float64(time.Now().UnixMilli()) / 1000.
}
func main() {
client := braintrust.NewClient()
// Create a project, if it does not already exist
project, err := client.Projects.New(context.TODO(), braintrust.ProjectNewParams{
Name: braintrust.F("tracing_test"),
})
if err != nil {
panic(err)
}
rootSpanId := uuid.NewString()
client.Projects.Logs.Insert(
context.TODO(),
project.ID,
braintrust.ProjectLogInsertParams{
Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{
shared.InsertProjectLogsEventReplaceParam{
ID: braintrust.F(rootSpanId),
Metadata: braintrust.F(map[string]interface{}{
"user_id": "user123",
}),
SpanAttributes: braintrust.F(braintrust.InsertProjectLogsEventReplaceSpanAttributesParam{
Name: braintrust.F("User Interaction"),
}),
Metrics: braintrust.F(braintrust.InsertProjectLogsEventReplaceMetricsParam{
Start: braintrust.F(getCurrentTime()),
}),
},
}),
},
)
interaction0Id := uuid.NewString()
client.Projects.Logs.Insert(
context.TODO(),
project.ID,
braintrust.ProjectLogInsertParams{
Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{
shared.InsertProjectLogsEventReplaceParam{
ID: braintrust.F(interaction0Id),
ParentID: braintrust.F(rootSpanId),
SpanAttributes: braintrust.F(braintrust.InsertProjectLogsEventReplaceSpanAttributesParam{
Name: braintrust.F("Interaction 0"),
}),
Metrics: braintrust.F(braintrust.InsertProjectLogsEventReplaceMetricsParam{
Start: braintrust.F(getCurrentTime()),
}),
},
}),
},
)
interaction0 := runInteraction0("hello world")
client.Projects.Logs.Insert(
context.TODO(),
project.ID,
braintrust.ProjectLogInsertParams{
Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{
braintrust.InsertProjectLogsEventMergeParam{
ID: braintrust.F(interaction0Id),
IsMerge: braintrust.F(true),
Input: braintrust.F(interaction0.input),
Output: braintrust.F(interaction0.output),
Metrics: braintrust.F(braintrust.InsertProjectLogsEventMergeMetricsParam{
End: braintrust.F(getCurrentTime()),
}),
},
}),
},
)
interaction1Id := uuid.NewString()
client.Projects.Logs.Insert(
context.TODO(),
project.ID,
braintrust.ProjectLogInsertParams{
Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{
braintrust.InsertProjectLogsEventReplaceParam{
ID: braintrust.F(interaction1Id),
ParentID: braintrust.F(rootSpanId),
SpanAttributes: braintrust.F(braintrust.InsertProjectLogsEventReplaceSpanAttributesParam{
Name: braintrust.F("Interaction 1"),
}),
Metrics: braintrust.F(braintrust.InsertProjectLogsEventReplaceMetricsParam{
Start: braintrust.F(getCurrentTime()),
}),
},
}),
},
)
interaction1 := runInteraction1(interaction0.output)
client.Projects.Logs.Insert(
context.TODO(),
project.ID,
braintrust.ProjectLogInsertParams{
Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{
braintrust.InsertProjectLogsEventMergeParam{
ID: braintrust.F(interaction1Id),
IsMerge: braintrust.F(true),
Input: braintrust.F(interaction1.input),
Output: braintrust.F(interaction1.output),
Metrics: braintrust.F(braintrust.InsertProjectLogsEventMergeMetricsParam{
End: braintrust.F(getCurrentTime()),
}),
},
}),
},
)
client.Projects.Logs.Insert(
context.TODO(),
project.ID,
braintrust.ProjectLogInsertParams{
Events: braintrust.F([]braintrust.ProjectLogInsertParamsEventUnion{
braintrust.InsertProjectLogsEventMergeParam{
ID: braintrust.F(rootSpanId),
IsMerge: braintrust.F(true),
Input: braintrust.F(interaction0.input),
Output: braintrust.F(interaction1.output),
Metrics: braintrust.F(braintrust.InsertProjectLogsEventMergeMetricsParam{
End: braintrust.F(getCurrentTime()),
}),
},
}),
},
)
}
```
---
file: ./content/docs/guides/attachments.mdx
meta: {
"title": "Attachments"
}
# Attachments
You can log arbitrary binary data, like images, audio, video, and PDFs, as attachments.
Attachments are useful for building multimodal evaluations, and can enable advanced scenarios like summarizing visual content or analyzing document metadata.
## Uploading attachments
You can upload attachments from either your code or the UI. Your files are securely stored in an object store and associated with the uploading user’s organization. Only you can access your attachments.
### Via code
To [upload an attachment](/docs/guides/tracing#uploading-attachments), create a new `Attachment` object to represent the file path or in-memory buffer that you want to upload:
```typescript
import { Attachment, initLogger } from "braintrust";
const logger = initLogger();
logger.log({
input: {
question: "What is this?",
context: new Attachment({
data: "path/to/input_image.jpg",
filename: "user_input.jpg",
contentType: "image/jpeg",
}),
},
output: "Example response.",
});
```
```python
from braintrust import Attachment, init_logger
logger = init_logger()
logger.log(
{
"input": {
"question": "What is this?",
"context": Attachment(
data="path/to/input_image.jpg",
filename="user_input.jpg",
content_type="image/jpeg",
),
},
"output": "Example response.",
}
)
```
You can place the `Attachment` anywhere in a log, dataset, or feedback log.
Behind the scenes, the [Braintrust SDK](/docs/reference/libs/nodejs/classes/Attachment) automatically detects and uploads attachments in the background, in parallel to the original logs. This ensures that the latency of your logs isn’t affected by any additional processing.
### Using external files as attachments
Braintrust also supports references to files in external object stores with the `ExternalAttachment` object. You can use this anywhere you would use an `Attachment`. Currently S3 is the only supported option for external files.
[attach-ts]: /docs/reference/libs/nodejs/classes/ExternalAttachment
[attach-py]: /docs/reference/libs/python#externalattachment-objects
```typescript
import { ExternalAttachment, initLogger } from "braintrust";
const logger = initLogger({ projectName: "ExternalAttachment Example" });
logger.log({
input: {
question: "What is this?",
additional_context: new ExternalAttachment({
url: "s3://an_existing_bucket/path/to/file.pdf",
filename: "file.pdf",
contentType: "application/pdf",
}),
},
output: "Example response.",
});
```
```python
from braintrust import ExternalAttachment, init_logger
logger = init_logger("ExternalAttachment Example")
logger.log(
input={
"question": "What is this?",
"additional_context": ExternalAttachment(
url="s3://an_existing_bucket/path/to/file.pdf",
filename="file.pdf",
content_type="application/pdf",
),
},
output="Example response.",
)
```
Just like attachments uploaded to Braintrust, external attachments can be previewed and downloaded for local viewing.
### In the UI
You can upload attachments directly through the UI for any editable span field. This includes:
* Any dataset fields, including datasets in playgrounds
* Log span fields
* Experiment span fields
You can also include attachments in prompt messages when using models that support multimodal inputs.
## Viewing attachments
You can preview most images, audio files, videos, or PDFs in the Braintrust UI. You can also download any file to view it locally.
We provide built-in support to preview attachments directly in playground input cells and traces.
In the playground, you can preview attachments in an inline embedded view for easy visual verification during experimentation:
In the trace pane, attachments appear as an additional list under the data viewer:
---
file: ./content/docs/guides/automations.mdx
meta: {
"title": "Automations"
}
# Automations
Automations let you trigger actions based on specific events in Braintrust. This makes it easier for you to execute common actions and integrate Braintrust with your existing tools and workflows.
Automations are in beta. If you are on a hybrid deployment, automations are available starting with `v0.0.72`.
## How automations work
Automations work by monitoring events in your project and executing actions when specified conditions are met. At a high level the automation runtime will:
1. Monitor events in your project
2. Filter events using [BTQL](/docs/reference/btql)
3. Limit execution to once per time interval
4. Execute actions on matching data
### Limitations
Currently, automations only support [log events](/docs/guides/logs) and webhook actions. More event types and actions are coming soon.
## Creating automations
To create an automation, select **Add automation** from the automations tab in your project configuration and input the automation name, a BTQL filter, time interval, and webhook URL.

## Automation settings
* **Name**: A descriptive name for your automation
* **Description** (optional): Additional context about the automation's purpose
* **Event type**: Currently supports "Log event"
* **BTQL filter**: Filter logs using BTQL syntax (if empty, matches all logs)
* **Interval**: How frequently the automation should check for matching events
* **Webhook URL**: The endpoint that will receive the automation data
## Testing automations
Before saving or updating an automation, you can test it to verify that it works as expected. The test triggers the automation as if the initiating event had occurred: it applies the BTQL filter over the configured interval and executes the action on any matching rows.
If no matching logs are found, you may need to adjust your BTQL filter or interval. Note that your project must have recent matching logs within the automation interval in order for the test call to succeed.
## Webhook payload
When an automation is triggered it sends a JSON payload to your webhook URL with the following structure:
```json
{
"organization": {
"id": "org_123",
"name": "your-organization"
},
"project": {
"id": "proj_456",
"name": "your-project"
},
"automation": {
"id": "c5b32408-8568-4bff-9299-8cdd56979b67",
"name": "High-Priority Factuality",
"description": "Alert on factuality scores for logs with priority 0 in metadata",
"event_type": "logs",
"btql_filter": "metadata.priority = 0 AND scores.Factuality < 0.9",
"interval_seconds": 3600,
"url": "https://braintrust.dev/app/your-organization/p/your-project/configuration/automations?aid=c5b32408-8568-4bff-9299-8cdd56979b67"
},
"details": {
"is_test": false,
"message": "High-Priority Factuality: 5 logs triggered automation in the last 1 hour",
"time_start": "2025-05-12T10:00:00.000Z",
"time_end": "2025-05-12T11:00:00.000Z",
"count": 5,
"related_logs_url": "https://braintrust.dev/app/your-organization/p/your-project/logs?search=..."
}
}
```
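A webhook receiver can use the `automation` and `details` fields to construct an alert for your own tooling. Here is a minimal sketch of that parsing step; the function name and message format are illustrative, not part of the Braintrust API.

```python
def summarize_automation_event(payload: dict) -> str:
    """Build a one-line alert message from an automation webhook payload."""
    automation = payload["automation"]
    details = payload["details"]
    # Mark test invocations so they are easy to distinguish downstream
    prefix = "[TEST] " if details.get("is_test") else ""
    return (
        f"{prefix}{automation['name']}: {details['count']} matching logs "
        f"between {details['time_start']} and {details['time_end']}"
    )


# A trimmed-down payload with only the fields this helper reads
example = {
    "automation": {"name": "High-Priority Factuality"},
    "details": {
        "is_test": False,
        "count": 5,
        "time_start": "2025-05-12T10:00:00.000Z",
        "time_end": "2025-05-12T11:00:00.000Z",
    },
}
print(summarize_automation_event(example))
```

In a real receiver you would call this from your HTTP handler and forward the message to a channel like Slack or PagerDuty, using `details["related_logs_url"]` as the link target.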
---
file: ./content/docs/guides/datasets.mdx
meta: {
"title": "Datasets"
}
# Datasets
Datasets allow you to collect data from production, staging, evaluations, and even manually, and then
use that data to run evaluations and track improvements over time.
For example, you can use Datasets to:
* Store evaluation test cases for your eval script instead of managing large JSONL or CSV files
* Log all production generations to assess quality manually or using model graded evals
* Store user-reviewed generations to find new test cases
In Braintrust, datasets have a few key properties:
* **Integrated**. Datasets are integrated with the rest of the Braintrust platform, so you can use them in
evaluations, explore them in the playground, and log to them from your staging/production environments.
* **Versioned**. Every insert, update, and delete is versioned, so you can pin evaluations to a specific version
of the dataset, rewind to a previous version, and track changes over time.
* **Scalable**. Datasets are stored in a modern cloud data warehouse, so you can collect as much data as you want without worrying about
storage or performance limits.
* **Secure**. If you run Braintrust [in your cloud environment](/docs/guides/self-hosting), datasets are stored in your warehouse and
never touch our infrastructure.
## Creating a dataset
Records in a dataset are stored as JSON objects, and each record has three top-level fields:
* `input` is a set of inputs that you could use to recreate the example in your application. For example, if you're logging
examples from a question answering model, the input might be the question.
* `expected` (optional) is the ideal or expected output. For example, if you're logging
examples from a question answering model, this might be the answer. You can access `expected` when running evaluations as the `expected` field; however, `expected` does not need to be
the ground truth.
* `metadata` (optional) is a set of key-value pairs that you can use to filter and group your data. For example, if you're logging
examples from a question answering model, the metadata might include the knowledge source that the question came from.
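Putting the three fields together, a single record for a question answering application might look like the following (the values are illustrative):

```json
{
  "input": { "question": "Which team won the 2020 World Series?" },
  "expected": { "answer": "The Los Angeles Dodgers" },
  "metadata": { "source": "sports-faq" }
}
```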
Datasets are created automatically when you initialize them in the SDK.
### Inserting records
You can use the SDK to initialize and insert into a dataset:
```javascript
import { initDataset } from "braintrust";
async function main() {
const dataset = initDataset("My App", { dataset: "My Dataset" });
for (let i = 0; i < 10; i++) {
const id = dataset.insert({
input: i,
expected: { result: i + 1, error: null },
metadata: { foo: i % 2 },
});
console.log("Inserted record with id", id);
}
console.log(await dataset.summarize());
}
main();
```
```python
import braintrust
dataset = braintrust.init_dataset(project="My App", name="My Dataset")
for i in range(10):
id = dataset.insert(input=i, expected={"result": i + 1, "error": None}, metadata={"foo": i % 2})
print("Inserted record with id", id)
print(dataset.summarize())
```
### Updating records
In the above example, each `insert()` statement returns an `id`. You can use this `id` to update the record using `update()`:
```javascript #skip-compile
dataset.update({
id,
input: i,
expected: { result: i + 1, error: "Timeout" },
});
```
```python
dataset.update(input=i, expected={"result": i + 1, "error": "Timeout"}, id=id)
```
The `update()` method applies a merge strategy: only the fields you provide will be updated, and all other existing fields in the record will remain unchanged.
### Deleting records
You can delete records via code by `id`:
```javascript #skip-compile
await dataset.delete(id);
```
```python
dataset.delete(id)
```
To delete an entire dataset, use the [API command](/docs/reference/api/Datasets#delete-dataset).
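If you know the dataset's ID, the delete is a single REST call. A sketch in Python (the `DELETE /v1/dataset/{dataset_id}` endpoint is from the API reference linked above):

```python
import os

import requests

# Use your stack's Universal API URL if you are self-hosting
API_URL = "https://api.braintrust.dev"


def dataset_delete_url(dataset_id: str) -> str:
    # DELETE /v1/dataset/{dataset_id} removes the dataset object itself
    return f"{API_URL}/v1/dataset/{dataset_id}"


def delete_dataset(dataset_id: str) -> None:
    resp = requests.delete(
        dataset_delete_url(dataset_id),
        headers={"Authorization": "Bearer " + os.environ["BRAINTRUST_API_KEY"]},
    )
    resp.raise_for_status()
```
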
### Flushing
In both TypeScript and Python, the Braintrust SDK flushes records as fast as possible and installs an exit handler that tries
to flush records, but these hooks are not always respected (e.g. by certain runtimes, or if you `exit` a process yourself). If
you need to ensure that records are flushed, you can call `flush()` on the dataset.
```javascript #skip-compile
await dataset.flush();
```
```python
dataset.flush()
```
### Multimodal datasets
You may want to store or process images in your datasets. There are currently four ways to use images in Braintrust:
* Image URLs (most performant)
* Base64 (least performant)
* Attachments (easiest to manage, stored in Braintrust)
* External attachments (access files in your own object stores)
If you're building a dataset of large images in Braintrust, we recommend using image URLs. This keeps your dataset lightweight and allows you to preview or process them without storing heavy binary data directly.
If you prefer to keep all data within Braintrust, create a dataset of attachments instead. In addition to images, you can create datasets of attachments that have any arbitrary data type, including audio and PDFs. You can then [use these datasets in evaluations](/docs/guides/evals/write#attachments).
```typescript title="attachment_dataset.ts"
import { Attachment, initDataset } from "braintrust";
import path from "node:path";
async function createPdfDataset(): Promise<void> {
  const dataset = initDataset({
    project: "Project with PDFs",
    dataset: "My PDF Dataset",
  });
  for (const filename of ["example.pdf"]) {
    dataset.insert({
      input: {
        file: new Attachment({
          filename,
          contentType: "application/pdf",
          // The file on your filesystem or the file's bytes.
          data: path.join("files", filename),
        }),
      },
    });
  }
  await dataset.flush();
}
// Create a dataset with attachments.
createPdfDataset();
```
To invoke this script, run this in your terminal:
```bash
npx tsx attachment_dataset.ts
```
```python title="attachment_dataset.py"
import os

from braintrust import Attachment, init_dataset

def create_pdf_dataset() -> None:
    """Create a dataset with attachments."""
    dataset = init_dataset("Project with PDFs", "My PDF Dataset")
    for filename in ["example.pdf"]:
        dataset.insert(
            input={
                "file": Attachment(
                    filename=filename,
                    content_type="application/pdf",
                    # The file on your filesystem or the file's bytes.
                    data=os.path.join("files", filename),
                )
            },
            # This is a toy example where we check that the file size is what we expect.
            expected=469513,
        )
    dataset.flush()

# Create a dataset with attachments.
create_pdf_dataset()
```
To invoke this script, run this in your terminal:
```bash
python attachment_dataset.py
```
Attachments are not yet supported in the playground. To explore images in the playground, we recommend using image URLs.
## Managing datasets in the UI
In addition to managing datasets through the API, you can also manage them in the Braintrust UI.
### Viewing a dataset
You can view a dataset in the Braintrust UI by navigating to the project and then clicking on the dataset.

From the UI, you can filter records, create new ones, edit values, and delete records. You can also copy records
between datasets and from experiments into datasets. This feature is commonly used to collect interesting or
anomalous examples into a golden dataset.
#### Create custom columns
When viewing a dataset, create [custom columns](/docs/guides/evals/interpret#create-custom-columns) to extract specific values from `input`, `expected`, or `metadata` fields.
### Creating a dataset
The easiest way to create a dataset is to upload a CSV file.

### Updating records
Once you've uploaded a dataset, you can update records or add new ones directly in the UI.

### Labeling records
In addition to updating datasets through the API, you can edit and label them in the UI. Like experiments and logs, you can
configure [categorical fields](/docs/guides/human-review#writing-to-expected-fields) to allow human reviewers
to rapidly label records.
This requires you to first [configure human review](/docs/guides/human-review#configuring-human-review) in the **Configuration** tab of your project.

### Deleting records
To delete a record, navigate to **Library → Datasets** and select the dataset. Select the check box next to the individual record you'd like to delete, and then select the **Trash** icon.
You can follow the same steps to delete an entire dataset from the **Library → Datasets** page.
## Using a dataset in an evaluation
You can use a dataset in an evaluation by passing it directly to the `Eval()` function.
```typescript
import { initDataset, Eval } from "braintrust";
import { Levenshtein } from "autoevals";
Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: initDataset("My App", { dataset: "My Dataset" }),
    task: async (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);
```
```python
from braintrust import Eval, init_dataset
from autoevals import Levenshtein
Eval(
    "Say Hi Bot",  # Replace with your project name
    data=init_dataset(project="My App", name="My Dataset"),
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[Levenshtein],
)
```
You can also manually iterate through a dataset's records and run your tasks,
then log the results to an experiment. Log the `id`s to link each dataset record
to the corresponding result.
```typescript
import { initDataset, init, Dataset, Experiment } from "braintrust";
function myApp(input: any) {
  return `output of input ${input}`;
}

function myScore(output: any, rowExpected: any) {
  return Math.random();
}

async function main() {
  const dataset = initDataset("My App", { dataset: "My Dataset" });
  const experiment = init("My App", {
    experiment: "My Experiment",
    dataset: dataset,
  });
  for await (const row of dataset) {
    const output = myApp(row.input);
    const closeness = myScore(output, row.expected);
    experiment.log({
      input: row.input,
      output,
      expected: row.expected,
      scores: { closeness },
      datasetRecordId: row.id,
    });
  }
  console.log(await experiment.summarize());
}

main();
```
```python
import random
import braintrust
def my_app(input):
    return f"output of input {input}"

def my_score(output, row_expected):
    return random.random()

dataset = braintrust.init_dataset(project="My App", name="My Dataset")
experiment = braintrust.init(project="My App", experiment="My Experiment", dataset=dataset)
for row in dataset:
    output = my_app(row["input"])
    closeness = my_score(output, row["expected"])
    experiment.log(
        input=row["input"],
        output=output,
        expected=row["expected"],
        scores=dict(closeness=closeness),
        dataset_record_id=row["id"],
    )
print(experiment.summarize())
print(experiment.summarize())
```
You can also use the results of an experiment as baseline data for future experiments by calling the `asDataset()`/`as_dataset()` function, which converts the experiment into dataset format (`input`, `expected`, and `metadata`).
```typescript
import { init, Eval } from "braintrust";
import { Levenshtein } from "autoevals";
const experiment = init("My App", {
  experiment: "my-experiment",
  open: true,
});

Eval("My App", {
  data: experiment.asDataset(),
  task: async (input) => {
    return `hello ${input}`;
  },
  scores: [Levenshtein],
});
```
```python
from braintrust import Eval, init
from autoevals import Levenshtein
experiment = init(
    project="My App",
    experiment="my-experiment",
    open=True,
)

Eval(
    "My App",
    data=experiment.as_dataset(),
    task=lambda input: f"hello {input}",  # Replace with your LLM call
    scores=[Levenshtein],
)
```
For a more advanced overview of how to use an experiment as a baseline for other experiments, see [hill climbing](/docs/guides/evals/write#hill-climbing).
## Logging from your application
To log to a dataset from your application, you can simply use the SDK and call `insert()`. Braintrust logs
are queued and sent asynchronously, so you don't need to worry about critical path performance.
Since the SDK uses API keys, it's recommended that you log from a privileged environment (e.g. backend server),
instead of client applications directly.
This example walks through how to track thumbs up/thumbs down feedback from users:
```javascript
import { initDataset, Dataset } from "braintrust";
class MyApplication {
  private dataset: Dataset | undefined = undefined;

  async initApp() {
    this.dataset = await initDataset("My App", { dataset: "logs" });
  }

  async logUserExample(
    input: any,
    expected: any,
    userId: string,
    orgId: string,
    thumbsUp: boolean,
  ) {
    if (this.dataset) {
      this.dataset.insert({
        input,
        expected,
        metadata: { userId, orgId, thumbsUp },
      });
    } else {
      console.warn("Must initialize application before logging");
    }
  }
}
```
```python
from typing import Any
import braintrust
class MyApplication:
    dataset = None

    def init_app(self):
        self.dataset = braintrust.init_dataset(project="My App", name="logs")

    def log_user_example(self, input: Any, expected: Any, user_id: str, org_id: str, thumbs_up: bool):
        if self.dataset:
            self.dataset.insert(
                input=input,
                expected=expected,
                metadata=dict(user_id=user_id, org_id=org_id, thumbs_up=thumbs_up),
            )
        else:
            print("Must initialize application before logging")
```
## Troubleshooting
### Downloading large datasets
If you are trying to load a very large dataset, you may run into timeout errors while using the SDK. If so, you
can [paginate](/docs/guides/api#downloading-a-dataset-using-pagination) through the dataset to download it in smaller chunks.
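A cursor-based download loop looks roughly like the following sketch; `fetch_page` is a hypothetical stand-in for the paginated API call described in the linked guide:

```python
def fetch_page(cursor=None, limit=2):
    """Hypothetical stand-in for one paginated API request."""
    data = [{"id": i} for i in range(5)]  # pretend server-side dataset
    start = cursor or 0
    page = data[start:start + limit]
    next_cursor = start + limit if start + limit < len(data) else None
    return page, next_cursor

records, cursor = [], None
while True:
    page, cursor = fetch_page(cursor)
    records.extend(page)
    if cursor is None:  # no more pages
        break
```

Each request stays small enough to avoid timeouts, and the loop stops when the server signals there are no more pages.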
---
file: ./content/docs/guides/human-review.mdx
meta: {
"title": "Human review"
}
# Human review
Although Braintrust helps you automatically evaluate AI software, human
review is a critical part of the process. Braintrust seamlessly integrates human
feedback from end users, subject matter experts, and product teams in one place. You can
use human review to evaluate/compare experiments, assess the efficacy of your automated scoring
methods, and curate log events to use in your evals. As you add human review scores, your logs will update in real time.

## Configuring human review
To set up human review, define the scores you want to collect in your
project's **Configuration** tab.

Select **Add human review score** to configure a new score. A score can be one of:
* A continuous number value between `0%` and `100%`, with a slider input control.
* A categorical value where you can define the possible options and their scores. Each categorical option
is also assigned a unique percentage value between `0%` and `100%` (stored as 0 to 1).
* Free-form text where you can write a string value to the `metadata` field at a specified path.

Created human review scores will appear in the **Human review** section in every experiment and log trace in the project. Categorical scores configured to "write to expected" and free-form scores will also appear on dataset rows.
### Writing to expected fields
You may choose to write categorical scores to the `expected` field of a span instead of a score.
To enable this, check the **Write to expected field instead of score** option. There is also
an option to **Allow multiple choice** when writing to the expected field.
A numeric score will not be assigned to the categorical options when writing to the expected
field. If there is an existing object in the expected field, the categorical value will be
appended to the object.

In addition to categorical scores, you can always directly edit the structured output for the `expected` field of any span through the UI.
## Reviewing logs and experiments
To manually review results from your logs or experiment, select a row to open trace view. There, you can edit the human review scores you previously configured.
As you set scores, they will be automatically saved and reflected in the summary metrics. The process is the same whether you're reviewing logs or experiments.
### Leaving comments
In addition to setting scores, you can also add comments to spans and update their `expected` values. These updates
are tracked alongside score updates to form an audit trail of edits to a span.
If you leave a comment that you want to share with a teammate, you can copy a link that will deeplink to the comment.
## Focused review mode
If you or a subject matter expert is reviewing a large number of logs or experiments, you can use **Review** mode to enter
a UI that's optimized specifically for review. To enter review mode, hit the "r" key or the expand ()
icon next to the **Human review** header in a span.
In review mode, you can set scores, leave comments, and edit expected values. Review mode is optimized for keyboard
navigation, so you can quickly move between scores and rows with keyboard shortcuts. You can also share a link to the
review mode view with other team members, and they'll drop directly into review mode.
### Reviewing data that matches a specific criteria
To easily review a subset of your logs or experiments that match a given criteria, you can filter using English or [BTQL](/docs/reference/btql#btql-query-syntax), then enter review mode.
In addition to filters, you can use [tags](/docs/guides/logging#tags-and-queues) to mark items for `Triage`, and then review them all at once.
You can also save any filters, sorts, or column configurations as views. Views give you a standardized place to see any current or future logs that match a given criteria, for example, logs with a Factuality score less than 50%. Once you create your view, you can enter review mode right from there.
Reviewing is a common task, and therefore you can enter review mode from any experiment or log view. You can also re-enter review mode from any view to audit
past reviews or update scores.
### Benefits over an annotation queue
* Designed for optimal productivity: The combination of views and human review mode simplifies the review process with intuitive filters, reusable configurations, and keyboard navigation, enabling faster, more efficient log evaluation and feedback.
* Dynamic and flexible views: Views dynamically update with new logs matching saved criteria, eliminating the need to set up and maintain complex automation rules.
* Easy collaboration: Sharing review mode links allows for team collaboration without requiring intricate permissions or setup overhead.
## Filtering using feedback
In the UI, you can filter on log events with specific scores by adding a filter using the filter button, like "Preference is greater than 75%",
and then add the matching rows to a dataset for further investigation.
You can also programmatically filter log events using the API using a query and the project ID:
```typescript #skip-compile
await braintrust.projects.logs.fetch(projectId, { query });
```
```python
braintrust.projects.logs.fetch("", "scores.Preference > 0.75")
```
This is a powerful way to utilize human feedback
to improve your evals.
## Capturing end-user feedback
The same set of updates — scores, comments, and expected values — can be captured from end-users as well. See the
[Logging guide](/docs/guides/logs/write#user-feedback) for more details.
---
file: ./content/docs/guides/index.mdx
meta: {
"title": "Guides",
"description": "Step-by-step walkthroughs to help you accomplish a specific goal"
}
# Guides
Guides are step-by-step walkthroughs to help you accomplish a specific goal in
Braintrust.
## Core functionality
## Features
## Advanced use cases
---
file: ./content/docs/guides/monitor.mdx
meta: {
"title": "Monitor",
"metaTitle": "Monitor logs and experiments"
}
# Monitor page
The **Monitor** page shows aggregate metrics data for both the logs and experiments in a given project. The included charts show values related to the selected time period for latency, token count, time to first token, cost, request count, and scores.

## Group by metadata
Select the **Group** dropdown menu to group the data by specific metadata fields, including custom fields.

## Filter series
Select the filter dropdown menu on any individual chart to apply filters.
## Select a timeframe
Select a timeframe from the given options to see the data associated with that time period.
## Select to view traces
Select a datapoint node in any of the charts to view the corresponding traces for that time period.

---
file: ./content/docs/guides/playground.mdx
meta: {
"title": "Playgrounds",
"description": "Explore, compare, and evaluate prompts"
}
# Eval playgrounds
Playgrounds are a powerful workspace for rapidly iterating on AI engineering primitives. Tune prompts, models, scorers and datasets in an editor-like interface, and run full evaluations in real-time, side by side.
Use playgrounds to build and test hypotheses and evaluation configurations in a flexible environment. Playgrounds leverage the same underlying `Eval` structure as experiments, with support for running thousands of dataset rows directly in the browser. Collaborating with teammates is also simple with a shared URL.
Playgrounds are designed for quick prototyping of ideas. When a playground is run, its previous generations are overwritten. You can create [experiments](/docs/evals) from playgrounds when you need to capture an immutable snapshot of your evaluations for long-term reference or point-in-time comparison.
## Creating a playground
A playground includes one or more evaluation tasks, one or more scorers, and optionally, a dataset.
You can create a playground by navigating to **Evaluations** > **Playgrounds**, or by selecting **Create playground with prompt** at the bottom of a prompt dialog.

### Tasks
Tasks define LLM instructions. There are three types of tasks:
* [Prompts](/docs/guides/functions/prompts): AI model, prompt messages, parameters, and tools.
* [Agents](/docs/guides/functions/agents): A chain of prompts.
* [Remote evals](/docs/guides/remote-evals): Prompts and scorers from external sources.
[AI providers](/docs/reference/organizations#ai-providers) must be configured before playgrounds can be run.
An empty playground will prompt you to create a base task, and optional comparison tests. The base task is used as the source when diffing output traces.

When you select **Run** (or the keyboard shortcut Cmd/Ctrl+Enter), each task runs in parallel and the results stream into the grid below. You can also choose to view in list or summary layout.
For multimodal workflows, supported [attachments](/docs/guides/attachments#viewing-attachments) will have a preview shown in the inline embedded view.
### Scorers
Scorers quantify the quality of evaluation outputs using an LLM judge or code. You can use built-in [autoevals](/docs/reference/autoevals) for common evaluation scenarios to help you get started quickly, or write [custom scorers](/docs/guides/functions/scorers) tailored to your use case.
To add a scorer, select **+ Scorer** and choose from the list or create a custom scorer.

### Datasets
[Datasets](/docs/guides/datasets) provide structured inputs, expected values, and metadata for evaluations.
A playground can be run without a dataset to view a single set of task outputs, or with a dataset to view a matrix of outputs for many inputs.
Datasets can be linked to a playground by selecting existing library datasets, or creating/importing a new one.
Once you link a dataset, you will see a new row in the grid for each record in the dataset. You can reference the
data from each record in your prompt using the `input`, `expected`, and `metadata` variables. The playground uses
[mustache](https://mustache.github.io/) syntax for templating:

Each value can be arbitrarily complex JSON, for example, `{{input.formula}}`. If you want to preserve double curly brackets `{{` and `}}` as plain text in your prompts, you can change the delimiter tags to any custom
string of your choosing. For example, if you want to change the tags to `<%` and `%>`, insert `{{=<% %>=}}` into the message,
and all strings below in the message block will respect these delimiters:
```
{{=<% %>=}}
Return the number in the following format: {{ number }}
<% input.formula %>
```
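To see how delimiter switching behaves, here is a toy renderer (the playground uses a full mustache implementation; this regex-based version is for illustration only):

```python
import re

def render(template: str, values: dict, delims=("{{", "}}")) -> str:
    """Toy template renderer; supports dotted paths and custom delimiters."""
    open_d, close_d = (re.escape(d) for d in delims)

    def lookup(match):
        obj = values
        for part in match.group(1).split("."):
            obj = obj[part]
        return str(obj)

    return re.sub(open_d + r"\s*([\w.]+)\s*" + close_d, lookup, template)

# With delimiters switched to <% %>, literal {{ number }} passes through as text:
out = render(
    "Return the number in the following format: {{ number }}\n<% input.formula %>",
    {"input": {"formula": "2x + 1"}},
    delims=("<%", "%>"),
)
```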
Dataset edits in playgrounds edit the original dataset.
## Running a playground
To run a playground, select the **Run** button at the top of the playground to run all tasks and all dataset rows. You can also run a single task individually, or run a single dataset row.
### Viewing traces
Select a row in the results table to compare evaluation traces side-by-side. This allows you to identify differences in outputs, scores, metrics, and input data.

From this view, you can also run a single row by selecting **Run row**.
### Diffing
Diffing allows you to visually compare variations across models, prompts, or agents to quickly understand differences in outputs.
To turn on diff mode, select the diff toggle.
## Creating experiment snapshots
Experiments formalize evaluation results for comparison and historical reference. While playgrounds are better for fast, iterative exploration, experiments are immutable, point-in-time evaluation snapshots ideal for detailed analysis and reporting.
To create an experiment from a playground, select **+ Experiment**. Each playground task will map to its own experiment.
## Advanced options
### Appended dataset messages
You may sometimes have additional messages in a dataset that you want to append to a prompt. This option lets you specify a path to a messages array in the dataset. For example, if `input` is specified as the appended messages path and a dataset row has the following input, all prompts in the playground will run with additional messages.
```json
[
  {
    "role": "assistant",
    "content": "Is there anything else I can help you with?"
  },
  {
    "role": "user",
    "content": "Yes, I have another question."
  }
]
### Max concurrency
The maximum number of tasks/scorers that will be run concurrently in the playground. This is useful for avoiding rate limits (429 - Too many requests) from AI providers.
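Conceptually, a concurrency cap works like a semaphore around each task. This sketch uses `asyncio` with a `fake_llm_call` stand-in for a real provider request:

```python
import asyncio

async def run_all(tasks, max_concurrency: int = 10):
    """Run callables with at most `max_concurrency` in flight at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(task):
        async with sem:  # wait for a free slot before calling the provider
            return await task()

    return await asyncio.gather(*(bounded(t) for t in tasks))

async def fake_llm_call():
    await asyncio.sleep(0)  # stand-in for a model request
    return "ok"

results = asyncio.run(run_all([fake_llm_call] * 5, max_concurrency=2))
```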
### Strict variables
When this option is enabled, evaluations will fail if the dataset row does not include all of the variables referenced in prompts.
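The check is conceptually similar to this sketch, which extracts the top-level variables referenced in a prompt and compares them against a dataset row (illustrative only; `missing_variables` is not a Braintrust API):

```python
import re

def missing_variables(prompt: str, row: dict) -> set:
    """Top-level template variables referenced in `prompt` but absent from `row`."""
    referenced = {m.split(".")[0] for m in re.findall(r"\{\{\s*([\w.]+)\s*\}\}", prompt)}
    return referenced - set(row)

missing = missing_variables(
    "Summarize {{input.text}} as {{expected.format}}",
    {"input": {"text": "hi"}},
)
# With strict variables enabled, this row would fail because `expected` is missing.
```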
## Sharing playgrounds
Playgrounds are designed for collaboration and automatically synchronize in real-time.
To share a playground, copy the URL and send it to your collaborators. Your collaborators
must be members of your organization to see the session. You can invite users from the settings page.
---
file: ./content/docs/guides/projects.mdx
meta: {
"title": "Projects",
"description": "Create and configure projects"
}
# Projects
A project is analogous to an AI feature in your application. Some customers create separate projects for development and production to help track workflows. Projects contain all [experiments](/docs/guides/evals), [logs](/docs/guides/logging), [datasets](/docs/guides/datasets) and [playgrounds](/docs/guides/playground) for the feature.
For example, a project might contain:
* An experiment that tests the performance of a new version of a chatbot
* A dataset of customer support conversations
* A prompt that guides the chatbot's responses
* A tool that helps the chatbot answer customer questions
* A scorer that evaluates the chatbot's responses
* Logs that capture the chatbot's interactions with customers
## Project configuration
Projects can also house configuration settings that are shared across the project.
### Tags
Braintrust supports tags that you can use throughout your project to curate logs, datasets, and even experiments. You can filter based on tags in the UI to track various kinds of data across your application, and how they change over time. Tags can be created in the **Configuration** tab by selecting **Add tag** and entering a tag name, selecting a color, and adding an optional description.
For more information about using tags to curate logs, see the [logging guide](/docs/guides/logging#tags-and-queues).
### Human review
You can define scores and labels for manual human review, either as feedback from your users (through the API) or directly through the UI. Scores you define on the **Configuration** page will be available in every experiment and log in your project.
To create a new score, select **Add human review score** and enter a name and score type. You can add multiple options and decide if you want to allow writing to the expected field instead of the score, or multiple choice.
To learn more about human review, check out the [full guide](/docs/guides/human-review).
### Aggregate scores
Aggregate scores are formulas that combine multiple scores into a single value. This can be useful for creating a single score that represents the overall experiment.
To create an aggregate score, select **Add aggregate score** and enter a name, formula, and description. Braintrust currently supports three types of aggregate scores:
* **Weighted average** - A weighted average of selected scores.
* **Minimum** - The minimum value among the selected scores.
* **Maximum** - The maximum value among the selected scores.
To learn more about aggregate scores, check out the [experiments guide](/docs/guides/evals/interpret#aggregate-weighted-scores).
### Online scoring
Braintrust supports server-side online evaluations that are automatically run asynchronously as you upload logs. To create an online evaluation, select **Add rule** and input the rule name, description, and which scorers and sampling rate you'd like to use. You can choose from custom scorers available in this project and others in your organization, or built-in scorers. Decide if you'd like to apply the rule to the root span or any other spans in your traces.

For more information about online evaluations, check out the [logging guide](/docs/guides/logging#online-evaluation).
### Span iframes
You can configure span iframes from your project settings. For more information, check out the [extend traces](/docs/guides/traces/extend/#custom-rendering-for-span-fields) guide.
### Comparison key
When comparing multiple experiments, you can customize the expression you're using to evaluate test cases by changing the comparison key. It defaults to "input," but you can change it in your project's **Configuration** tab.
For more information about the comparison key, check out the [evaluation guide](/docs/guides/evals/interpret#customizing-the-comparison-key).
### Rename project
You can rename your project at any time in the **Configuration** tab.
---
file: ./content/docs/guides/proxy.mdx
meta: {
"title": "AI proxy",
"description": "Access models from OpenAI, Anthropic, Google, AWS, Mistral, and more"
}
# AI proxy
The Braintrust AI Proxy is a powerful tool that enables you to access models from [OpenAI](https://platform.openai.com/docs/models),
[Anthropic](https://docs.anthropic.com/claude/reference/getting-started-with-the-api), [Google](https://ai.google.dev/gemini-api/docs),
[AWS](https://aws.amazon.com/bedrock), [Mistral](https://mistral.ai/), and third-party inference providers like [Together](https://www.together.ai/) which offer
open source models like [LLaMa 3](https://ai.meta.com/llama/) — all through a single, unified API.
With the AI proxy, you can:
* **Simplify your code** by accessing many AI providers through a single API.
* **Reduce your costs** by automatically caching results when possible.
* **Increase observability** by optionally logging your requests to Braintrust.
Best of all, the AI proxy is free to use, even if you don't have a Braintrust account.
To read more about why we launched the AI proxy, check out our [blog post](/blog/ai-proxy) announcing the feature.
The AI proxy is free for all users. You can access it without a Braintrust
account by using your API key from any of the supported providers. With a
Braintrust account, you can use a single Braintrust API key to access all AI
providers.
## Quickstart
The Braintrust Proxy is fully compatible with applications written using the
[OpenAI SDK]. You can get started without making any code changes. Just set the
API URL to `https://api.braintrust.dev/v1/proxy`.
Try running the following script in your favorite language, twice:
```typescript
import { OpenAI } from "openai";
const client = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy",
  apiKey: process.env.OPENAI_API_KEY, // Can use Braintrust, Anthropic, etc. API keys here
});

async function main() {
  const start = performance.now();
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
    messages: [{ role: "user", content: "What is a proxy?" }],
    seed: 1, // A seed activates the proxy's cache
  });
  console.log(response.choices[0].message.content);
  console.log(`Took ${(performance.now() - start) / 1000}s`);
}

main();
```
```python
import os
import time
from openai import OpenAI
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["OPENAI_API_KEY"],  # Can use Braintrust, Anthropic, etc. API keys here
)

start = time.time()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
    messages=[{"role": "user", "content": "What is a proxy?"}],
    seed=1,  # A seed activates the proxy's cache
)
print(response.choices[0].message.content)
print(f"Took {time.time() - start}s")
```
```bash
time curl -i https://api.braintrust.dev/v1/proxy/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "What is a proxy?"
      }
    ],
    "seed": 1
  }' \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  --compress
```
Anthropic users can pass their Anthropic API key with a model such as
`claude-3-5-sonnet-20240620`.
The second run will be significantly faster because the proxy served your
request from its cache, rather than rerunning the AI provider's model. Under the
hood, your request is served from a [Cloudflare Worker] that caches your request
with end-to-end encryption.
[OpenAI SDK]: https://platform.openai.com/docs/libraries
[Cloudflare Worker]: https://workers.cloudflare.com/
## Key features
The proxy is a drop-in replacement for the OpenAI API, with a few killer features:
* Automatic caching of results, with configurable semantics
* Interoperability with other providers, including a wide range of open source models
* API key management
The proxy also supports the Anthropic and Gemini APIs
for making requests to Anthropic and Gemini models.
### Caching
The proxy automatically caches results, and reuses them when possible. Because the proxy runs on the edge,
you can expect cached requests to be returned in under 100ms. This is especially useful when you're developing
and frequently re-running or evaluating the same prompts many times.
#### Cache modes
There are three caching modes: `auto` (default), `always`, and `never`:
* In `auto` mode, requests are cached if they have `temperature=0` or the
[`seed` parameter](https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter) set and they are one of the supported paths.
* In `always` mode, requests are cached as long as they are one of the supported paths.
* In `never` mode, the cache is never read or written to.
The supported paths are:
* `/auto`
* `/embeddings`
* `/chat/completions`
* `/completions`
* `/moderations`
You can set the cache mode by passing the `x-bt-use-cache` header to your request.
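The documented rules can be summarized in a small decision function (a sketch of the behavior above, not the proxy's actual code):

```python
# The cacheable proxy paths listed above.
CACHEABLE_PATHS = {"/auto", "/embeddings", "/chat/completions", "/completions", "/moderations"}

def should_cache(mode: str, path: str, temperature=None, seed=None) -> bool:
    """Return whether a request would be cached under the given x-bt-use-cache mode."""
    if mode == "never" or path not in CACHEABLE_PATHS:
        return False
    if mode == "always":
        return True
    # "auto": only cache requests that are likely deterministic
    return temperature == 0 or seed is not None
```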
#### Cache TTL
By default, cached results expire after 1 week. The TTL for individual requests can be set by passing the `x-bt-cache-ttl` header to your request. The TTL is specified in seconds and must be between 1 and 604800 (7 days).
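A small helper makes the bounds concrete (illustrative; `validate_cache_ttl` is not part of any SDK):

```python
# Constraint from the docs: TTL in seconds, between 1 second and 7 days.
MAX_TTL_SECONDS = 7 * 24 * 60 * 60  # 604800

def validate_cache_ttl(ttl_seconds: int) -> int:
    """Validate a value for the x-bt-cache-ttl header before sending it."""
    if not 1 <= ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError("x-bt-cache-ttl must be between 1 and 604800 seconds")
    return ttl_seconds

# e.g. cache for two days:
two_days = validate_cache_ttl(172800)
```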
#### Cache control
The proxy supports a limited set of [Cache-Control](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control) directives:
* To bypass the cache, set the `Cache-Control` header to `no-cache, no-store`. Note that this is semantically equivalent to setting the `x-bt-use-cache` header to `never`.
* To force a fresh request, set the `Cache-Control` header to `no-cache`. Note that without the `no-store` directive the response will be cached for subsequent requests.
* To request a cached response with a maximum age, set the `Cache-Control` header to `max-age=`. If the cached data is older than the specified age, the cache will be bypassed and a new response will be generated. Combine this with `no-store` to bypass the cache for a request without overwriting the currently cached response.
When cache control directives conflict with the `x-bt-use-cache` header, the cache control directives take precedence.
The proxy will return the `x-bt-cached` header in the response with `HIT` or `MISS` to indicate whether the response was served from the cache, the `Age` header to indicate the age of the cached response, and the `Cache-Control` header with the `max-age` directive to return the TTL/max age of the cached response.
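The precedence rules above can be sketched as a function from the two headers to read/write behavior (illustrative only, not the proxy's actual code):

```python
def cache_behavior(use_cache_header=None, cache_control=""):
    """Return (read_cache, write_cache) given the two caching headers."""
    directives = {d.strip() for d in cache_control.split(",") if d.strip()}
    # Start from the x-bt-use-cache header, then let Cache-Control override it.
    read = write = use_cache_header != "never"
    if "no-cache" in directives:
        read = False  # force a fresh response
    if "no-store" in directives:
        write = False  # don't overwrite the cached entry
    return read, write
```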
For example, to set the cache mode to `always` with a TTL of 2 days,
```javascript
import { OpenAI } from "openai";
const client = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy",
  defaultHeaders: {
    "x-bt-use-cache": "always",
    "Cache-Control": "max-age=172800",
  },
  apiKey: process.env.OPENAI_API_KEY, // Can use Braintrust, Anthropic, etc. API keys here
});

async function main() {
  const response = await client.chat.completions.create({
    model: "gpt-4o", // Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
    messages: [{ role: "user", content: "What is a proxy?" }],
  });
  console.log(response.choices[0].message.content);
}

main();
```
```python
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.braintrust.dev/v1/proxy",
default_headers={"x-bt-use-cache": "always", "Cache-Control": "max-age=172800"},
api_key=os.environ["OPENAI_API_KEY"], # Can use Braintrust, Anthropic, etc. API keys here
)
response = client.chat.completions.create(
model="gpt-4o", # Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
messages=[{"role": "user", "content": "What is a proxy?"}],
)
print(response.choices[0].message.content)
```
```bash
time curl -i https://api.braintrust.dev/v1/proxy/chat/completions \
-H "Content-Type: application/json" \
-H "x-bt-use-cache: always" \
-H "Cache-Control: max-age=172800" \
-d '{
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": "What is a proxy?"
}
]
}' \
-H "Authorization: Bearer $OPENAI_API_KEY" \
--compress
```
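You can confirm caching behavior by reading the response headers described above. Here is a pure parsing sketch (no HTTP involved; the helper is illustrative, not part of any SDK):

```python
def summarize_cache_headers(headers: dict) -> dict:
    """Summarize the proxy's cache-related response headers.

    Accepts a mapping of response headers; keys are compared
    case-insensitively, as HTTP header names are.
    """
    h = {k.lower(): v for k, v in headers.items()}
    max_age = None
    for directive in h.get("cache-control", "").split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
    return {
        "cached": h.get("x-bt-cached") == "HIT",
        "age_seconds": int(h["age"]) if "age" in h else None,
        "ttl_seconds": max_age,
    }

print(summarize_cache_headers({
    "x-bt-cached": "HIT",
    "Age": "120",
    "Cache-Control": "max-age=604800",
}))
```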
#### Encryption
We use [AES-GCM](https://en.wikipedia.org/wiki/Galois/Counter_Mode) to encrypt the cache, using a key derived from your
API key. Results are cached for 1 week unless otherwise specified in request headers.
This design ensures that the cache is only accessible to you, and that we cannot see your data. We also do not store
or log API keys.
Because the cache's encryption key is your API key, cached results are scoped
to an individual user. However, Braintrust customers can opt-into sharing
cached results across users within their organization.
### Tracing
To log requests that you make through the proxy, you can specify an `x-bt-parent` header with the project or
experiment you'd like to log to. While tracing, you must also use a `BRAINTRUST_API_KEY` rather than a provider's
key. Behind the scenes, the proxy will derive your provider's key and facilitate tracing using the `BRAINTRUST_API_KEY`.
For example,
```javascript
import { OpenAI } from "openai";
const client = new OpenAI({
baseURL: "https://api.braintrust.dev/v1/proxy",
defaultHeaders: {
"x-bt-parent": "project_id:",
},
apiKey: process.env.BRAINTRUST_API_KEY, // Must use Braintrust API key
});
async function main() {
const response = await client.chat.completions.create({
model: "gpt-4o", // Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
messages: [{ role: "user", content: "What is a proxy?" }],
});
console.log(response.choices[0].message.content);
}
main();
```
```python
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.braintrust.dev/v1/proxy",
default_headers={"x-bt-parent": "project_id:"},
api_key=os.environ["BRAINTRUST_API_KEY"], # Must use Braintrust API key
)
response = client.chat.completions.create(
model="gpt-4o", # Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
messages=[{"role": "user", "content": "What is a proxy?"}],
)
print(response.choices[0].message.content)
```
```bash
time curl -i https://api.braintrust.dev/v1/proxy/chat/completions \
-H "Content-Type: application/json" \
-H "x-bt-parent: project_id:" \
-d '{
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": "What is a proxy?"
}
]
}' \
-H "Authorization: Bearer $BRAINTRUST_API_KEY" \
--compress
```
The `x-bt-parent` header sets the trace's parent project or experiment. You can use
a prefix like `project_id:`, `project_name:`, or `experiment_id:` here, or pass in
a [span slug](/docs/guides/tracing#distributed-tracing)
(`span.export()`) to nest the trace under a span within the parent object.
To find your project ID, navigate to your project's configuration page and find the **Copy Project ID** button at the bottom of the page.
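Since the parent value is a plain prefixed string, a tiny helper can keep it consistent across a codebase. A sketch (the function is ours, not part of the SDK):

```python
# Prefixes documented for the x-bt-parent header.
VALID_PARENT_PREFIXES = ("project_id", "project_name", "experiment_id")

def parent_header(kind: str, value: str) -> dict:
    """Build the x-bt-parent header from a prefix kind and its value."""
    if kind not in VALID_PARENT_PREFIXES:
        raise ValueError(f"kind must be one of {VALID_PARENT_PREFIXES}")
    return {"x-bt-parent": f"{kind}:{value}"}

print(parent_header("project_name", "My project"))
```

Span slugs from `span.export()` are passed through as-is in the header value, so the helper only covers the prefixed forms.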
### Supported models
The proxy supports over 100 models, including popular models like GPT-4o, Claude
3.5 Sonnet, Llama 2, and Gemini Pro. It also supports third-party inference
providers, including the [Azure OpenAI Service], [Amazon Bedrock], and
[Together AI]. See the [full list of models and providers](#appendix) at the
bottom of this page.
We are constantly adding new models. If you have a model you'd like to see
supported, please [let us know](mailto:support@braintrust.dev)!
[Azure OpenAI Service]: https://azure.microsoft.com/en-us/products/ai-services/openai-service
[Amazon Bedrock]: https://aws.amazon.com/bedrock/
[Together AI]: https://www.together.ai/
### Supported protocols
#### HTTP-based models
On the `/auto` and `/chat/completions` endpoints,
the proxy receives HTTP requests in the [OpenAI API schema] and automatically
translates OpenAI requests into various providers' APIs. That means you can
interact with other providers like Anthropic by using OpenAI client libraries
and API calls.
For example,
```bash
curl -X POST https://api.braintrust.dev/v1/proxy/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $BRAINTRUST_API_KEY" \
-d '{
"model": "gpt-4o-mini",
"messages": [{"role": "user", "content": "What is a proxy?"}]
}'
```
The proxy can also receive requests in the Anthropic and Gemini API schemas
for making requests to those respective models.
For example, you can make an Anthropic request with the following curl command:
```bash
curl -X POST https://api.braintrust.dev/v1/proxy/anthropic/messages \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $BRAINTRUST_API_KEY" \
-d '{
"model": "claude-3-5-sonnet-20240620",
"messages": [{"role": "user", "content": "What is a proxy?"}]
}'
```
Note that the `anthropic-version` and `x-api-key` headers do not need to be set.
Similarly, you can make a Gemini request with the following curl command:
```bash
curl -X POST https://api.braintrust.dev/v1/proxy/google/models/gemini-2.0-flash:generateContent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $BRAINTRUST_API_KEY" \
-d '{
"contents": [
{
"role": "user",
"parts": [
{
"text": "What is a proxy?"
}
]
}
]
}'
```
[OpenAI API schema]: https://platform.openai.com/docs/api-reference/introduction
#### WebSocket-based models
The proxy supports the [OpenAI Realtime API][realtime-api-beta] at the
`/realtime` endpoint. To use the proxy with the [OpenAI Reference
Client][realtime-api-beta], set the `url` to
`https://braintrustproxy.com/v1/realtime` when constructing the
[`RealtimeClient`][realtime-client-class] or [`RealtimeAPI`][realtime-api-class]
classes:
```typescript
import { RealtimeClient } from "@openai/realtime-api-beta";
const client = new RealtimeClient({
url: "https://braintrustproxy.com/v1/realtime",
apiKey: process.env.OPENAI_API_KEY,
});
```
For developers trying out the [OpenAI Realtime Console] sample app, we maintain
a [fork] that demonstrates how to modify the sample code to use the proxy.
You can continue to use your OpenAI API key as usual if you are creating the
`RealtimeClient` in your backend. If you would like to run the `RealtimeClient`
in your frontend or in a mobile app, we recommend passing [temporary
credentials](#temporary-credentials-for-end-user-access) to your frontend to
avoid exposing your API key.
[realtime-api-beta]: https://github.com/openai/openai-realtime-api-beta
[realtime-client-class]: https://github.com/openai/openai-realtime-api-beta/blob/de01e1083834c4c3bc495d190e2f6f5b5785e264/lib/client.js
[realtime-api-class]: https://github.com/openai/openai-realtime-api-beta/blob/main/lib/api.js
[OpenAI Realtime Console]: https://github.com/openai/openai-realtime-console
[fork]: https://github.com/braintrustdata/openai-realtime-console/pull/1/files#diff-e6b2fd9b81ea8124e30e74c39a86f3f177c342beb485d375dc759f7274c64b27
### API key management
The proxy allows you to use either a provider's API key or your Braintrust
API key. If you use a provider's API key, you can use the proxy without a
Braintrust account to take advantage of low-latency edge caching (scoped to your
API key).
If you use a Braintrust API key, you can access multiple model providers through
the proxy and manage all your API keys in one place. To do so,
[sign up for an account](/signup) and add each provider's API key on the
[AI providers](/app/settings?subroute=secrets) page in your settings.
The proxy response will return the `x-bt-used-endpoint` header, which specifies
which of your configured providers was used to complete the request.

#### Custom models
If you have custom models as part of your OpenAI or other accounts, you can use
them with the proxy by adding a custom provider. For example, if you have a
custom model called `gpt-3.5-acme`, you can add it to your
[organization settings](/docs/reference/organizations#custom-ai-providers) by navigating to
**Settings** > **Organization** > **AI providers**:
Any headers you add to the configuration will be passed through in the request to the custom endpoint.
The values of the headers can also be templated using Mustache syntax.
Currently, the supported template variables are `{{email}}` and `{{model}}`,
which will be replaced with the email of the user to whom the Braintrust API key belongs and the model name, respectively.
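As an illustration of how the substitution behaves, here is a minimal Mustache-style renderer in plain Python (the proxy's actual implementation may differ; only `{{email}}` and `{{model}}` are supported as variables):

```python
import re

def render_header_template(template: str, email: str, model: str) -> str:
    """Illustrative Mustache-style substitution for custom-provider headers."""
    values = {"email": email, "model": model}
    # Replace known {{variable}} tokens; leave unknown tokens untouched.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values.get(m.group(1), m.group(0)), template)

print(render_header_template("user={{email}}; model={{model}}", "dev@example.com", "gpt-3.5-acme"))
```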
If the endpoint is non-streaming, set the `Endpoint supports streaming` flag to false. The proxy will
convert the response to streaming format, allowing the models to work in the playground.
Each custom model must have a flavor (`chat` or `completion`) and a format (`openai`, `anthropic`, `google`, `window`, or `js`). Optionally, you can mark the model as multimodal and specify input and output costs, which are only used to calculate and display estimated
prices for experiment runs.
#### Specifying an org
If you are part of multiple organizations, you can specify which organization to use by passing the `x-bt-org-name`
header in the SDK:
```javascript
import { OpenAI } from "openai";
const client = new OpenAI({
baseURL: "https://api.braintrust.dev/v1/proxy",
defaultHeaders: {
"x-bt-org-name": "Acme Inc",
},
apiKey: process.env.OPENAI_API_KEY, // Can use Braintrust, Anthropic, etc. API keys here
});
async function main() {
const response = await client.chat.completions.create({
model: "gpt-4o", // Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
messages: [{ role: "user", content: "What is a proxy?" }],
});
console.log(response.choices[0].message.content);
}
main();
```
```python
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.braintrust.dev/v1/proxy",
default_headers={"x-bt-org-name": "Acme Inc"},
api_key=os.environ["OPENAI_API_KEY"], # Can use Braintrust, Anthropic, etc. API keys here
)
response = client.chat.completions.create(
model="gpt-4o", # Can use claude-3-5-sonnet-latest, gemini-2.0-flash, etc. here
messages=[{"role": "user", "content": "What is a proxy?"}],
)
print(response.choices[0].message.content)
```
```bash
time curl -i https://api.braintrust.dev/v1/proxy/chat/completions \
-H "Content-Type: application/json" \
-H "x-bt-org-name: Acme Inc" \
-d '{
"model": "gpt-4o",
"messages": [
{
"role": "user",
"content": "What is a proxy?"
}
]
}' \
-H "Authorization: Bearer $OPENAI_API_KEY" \
--compress
```
### Temporary credentials for end user access
A **temporary credential** converts your Braintrust API key (or model provider
API key) to a time-limited credential that can be safely shared with end users.
* Temporary credentials can also carry additional information to limit access to
a particular model and/or enable logging to Braintrust.
* They can be used in the `Authorization` header anywhere you'd use a Braintrust
API key or a model provider API key.
Use temporary credentials if you'd like your frontend or mobile app to send AI
requests to the proxy directly, minimizing latency without exposing your API
keys to end users.
#### Issue temporary credential in code
You can call the [`/credentials` endpoint][cred-api-doc] from a privileged
location, such as your app's backend, to issue temporary credentials. The
temporary credential will be allowed to make requests on behalf of the
Braintrust API key (or model provider API key) provided in the `Authorization`
header.
The body should specify the restrictions to be applied to the temporary
credentials as a JSON object. Additionally, if the `logging` key is present, the
proxy will log to Braintrust any requests made with this temporary credential.
See the [`/credentials` API spec][cred-api-doc] for details.
The following example grants access to `gpt-4o-realtime-preview-2024-10-01` on
behalf of the key stored in the `BRAINTRUST_API_KEY` environment variable for 10
minutes, logging the requests to the project named "My project."
[cred-api-doc]: /docs/reference/api/Proxy#create-temporary-credential
```typescript
const PROXY_URL =
process.env.BRAINTRUST_PROXY_URL || "https://braintrustproxy.com/v1";
// Braintrust API key starting with `sk-...`.
const BRAINTRUST_API_KEY = process.env.BRAINTRUST_API_KEY;
async function main() {
const response = await fetch(`${PROXY_URL}/credentials`, {
method: "POST",
headers: {
"Content-Type": "application/json",
Authorization: `Bearer ${BRAINTRUST_API_KEY}`,
},
body: JSON.stringify({
// Leave undefined to allow all models.
model: "gpt-4o-realtime-preview-2024-10-01",
// TTL for starting the request. Once started, the request can stream
// for as long as needed.
ttl_seconds: 60 * 10, // 10 minutes.
logging: {
project_name: "My project",
},
}),
cache: "no-store",
});
if (!response.ok) {
const error = await response.text();
throw new Error(`Failed to request temporary credentials: ${error}`);
}
const { key: tempCredential } = await response.json();
console.log(`Authorization: Bearer ${tempCredential}`);
}
main();
```
```python
import os
import requests
PROXY_URL = os.getenv("BRAINTRUST_PROXY_URL", "https://braintrustproxy.com/v1")
# Braintrust API key starting with `sk-...`.
BRAINTRUST_API_KEY = os.getenv("BRAINTRUST_API_KEY")
def main():
response = requests.post(
f"{PROXY_URL}/credentials",
headers={
"Authorization": f"Bearer {BRAINTRUST_API_KEY}",
},
json={
# Leave unset to allow all models.
"model": "gpt-4o-realtime-preview-2024-10-01",
# TTL for starting the request. Once started, the request can stream
# for as long as needed.
"ttl_seconds": 60 * 10, # 10 minutes.
"logging": {
"project_name": "My project",
},
},
)
if response.status_code != 200:
raise Exception(f"Failed to request temporary credentials: {response.text}")
temp_credential = response.json().get("key")
print(f"Authorization: Bearer {temp_credential}")
if __name__ == "__main__":
main()
```
```bash
curl -X POST "${BRAINTRUST_PROXY_URL:-https://api.braintrust.dev/v1/proxy}/credentials" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${BRAINTRUST_API_KEY}" \
--data '{
"model": "gpt-4o-realtime-preview-2024-10-01",
"ttl_seconds": 600,
"logging": {
"project_name": "My project"
}
}'
```
#### Inspect temporary credential grants
The temporary credential is formatted as a [JSON Web Token (JWT)][jwt-intro].
You can inspect the JWT's payload using a library such as
[`jsonwebtoken`][jwt-lib] or a web-based tool like [JWT.io](https://jwt.io/) to
determine the expiration time and granted models.
```typescript
import { decode as jwtDecode } from "jsonwebtoken";
const tempCredential = "";
const payload = jwtDecode(tempCredential, { complete: false, json: true });
// Example output:
// {
// "aud": "braintrust_proxy",
// "bt": {
// "model": "gpt-4o",
// "secret": "nCCxgkBoyy/zyOJlikuHILBMoK78bHFosEzy03SjJF0=",
// "logging": {
// "project_name": "My project"
// }
// },
// "exp": 1729928077,
// "iat": 1729927977,
// "iss": "braintrust_proxy",
// "jti": "bt_tmp:331278af-937c-4f97-9d42-42c83631001a"
// }
console.log(JSON.stringify(payload, null, 2));
```
Do not modify the JWT payload. This will invalidate the signature. Instead,
issue a new temporary credential using the `/credentials` endpoint.
[jwt-intro]: https://jwt.io/introduction
[jwt-lib]: https://www.npmjs.com/package/jsonwebtoken
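If you prefer Python, the payload can be read with the standard library alone, since a JWT is three base64url-encoded segments joined by dots (remember the caution above: inspect only, never modify):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode a JWT's payload segment without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # base64url decoding requires padding to a multiple of 4 characters.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a sample (unsigned) token purely for demonstration; real temporary
# credentials come from the /credentials endpoint.
sample_payload = {"aud": "braintrust_proxy", "exp": 1729928077}
sample_token = ".".join([
    base64.urlsafe_b64encode(b'{"alg":"none"}').decode().rstrip("="),
    base64.urlsafe_b64encode(json.dumps(sample_payload).encode()).decode().rstrip("="),
    "",
])
print(decode_jwt_payload(sample_token))
```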
### Load balancing
If you have multiple API keys for a given model type, e.g. OpenAI and Azure for `gpt-4o`, the proxy will
automatically load balance across them. This is a useful way to work around per-account rate limits and provide
resiliency in case one provider is down.
You can set up multiple endpoints directly on the [secrets page](/app/settings?subroute=secrets) in your Braintrust
account:

### PDF input
The proxy extends the OpenAI API to support PDF input.
To use it, pass the PDF's URL or base64-encoded PDF data with MIME type `application/pdf` in the request body.
For example,
```bash
curl https://api.braintrust.dev/v1/proxy/auto \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $BRAINTRUST_API_KEY" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": [
{
"type": "text",
"text": "Extract the text from the PDF."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/my-pdf.pdf"
}
}
]}
]
}'
```
or
```bash
curl https://api.braintrust.dev/v1/proxy/auto \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $BRAINTRUST_API_KEY" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": [
{
"type": "text",
"text": "Extract the text from the PDF."
},
{
"type": "image_url",
"image_url": {
"url": "data:application/pdf;base64,$PDF_BASE64_DATA"
}
}
]}
]
}'
```
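When base64-encoding a PDF yourself, the message content follows the same shape as the curl examples above. A sketch in Python (the PDF bytes here are a placeholder, not a real document):

```python
import base64

def pdf_message(text: str, pdf_bytes: bytes) -> dict:
    """Build a user message embedding a PDF as a base64 data URL."""
    data = base64.b64encode(pdf_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:application/pdf;base64,{data}"},
            },
        ],
    }

msg = pdf_message("Extract the text from the PDF.", b"%PDF-1.4 placeholder")
print(msg["content"][1]["image_url"]["url"][:28])
```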
## Advanced configuration
The following headers allow you to configure the proxy's behavior:
* `x-bt-use-cache`: `auto | always | never`. See [Caching](#caching)
* `x-bt-use-creds-cache`: `auto | always | never`. Similar to `x-bt-use-cache`, but controls whether to cache the
credentials used to access the provider's API. This is useful if you are rapidly tweaking credentials and don't
want to wait \~60 seconds for the credentials cache to expire.
* `x-bt-org-name`: Specify if you are part of multiple organizations and want to use API keys/log to a specific org.
* `x-bt-endpoint-name`: Specify to use a particular endpoint (by its name).
## Integration with Braintrust platform
Several features in Braintrust are powered by the proxy. For example, when you create a [playground](/docs/guides/playground),
the proxy handles running the LLM calls. Similarly, if you [create a prompt](/docs/guides/prompts), when you preview the
prompt's results, the proxy is used to run the LLM. However, the proxy is *not* required when you:
* Run evals in your code
* Load prompts to run in your code
* Log traces to Braintrust
If you'd like to use it in your code to help with caching, secrets management, and other features, follow the [instructions
above](#quickstart) to set it as the base URL in your OpenAI client.
### Self-hosting
If you're self-hosting Braintrust, your API service (serverless functions or containers) contains a built-in proxy that runs
within your own environment. See the [self-hosting](/docs/guides/self-hosting) docs for more information on how to set up
self-hosting.
## Open source
The AI proxy is open source. You can find the code on
[GitHub](https://github.com/braintrustdata/braintrust-proxy).
## Appendix
### List of supported models and providers
We are constantly adding new models. If you have a model you'd like to see
supported, please [let us know](/contact)!
---
file: ./content/docs/guides/remote-evals.mdx
meta: {
"title": "Remote evals"
}
# Remote evals
If you have existing infrastructure for running evaluations that isn't easily adaptable to the Braintrust Playground, you can use remote evals to expose a remote endpoint. This lets you run evaluations directly in the playground, iterate quickly across datasets, run scorers, and compare results with other tasks. You can also run multiple instances of your remote eval side-by-side with different parameters and compare results. Parameters defined in the remote eval will be exposed in the playground UI.
Remote evals are in beta. If you are on a hybrid deployment, remote evals are available starting with `v0.0.66`.
## Expose remote `Eval`
To expose an `Eval` running at a remote URL or on your local machine, pass the `--dev` flag. For example, given the following file, run `npx braintrust eval parameters.eval.ts --dev` to start the dev server and expose `http://localhost:8300`. The dev host and port can also be configured:
* `--dev-host DEV_HOST`: The host to bind the dev server to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces.
* `--dev-port DEV_PORT`: The port to bind the dev server to. Defaults to 8300.
```ts parameters.eval.ts
import { Levenshtein } from "autoevals";
import { Eval, initDataset, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";
const client = wrapOpenAI(new OpenAI());
Eval("Simple eval", {
data: initDataset("local dev", { dataset: "sanity" }), // Datasets are currently ignored
task: async (input, { parameters }) => {
const completion = await client.chat.completions.create(
parameters.main.build({
input: `${parameters.prefix}:${input}`,
}),
);
return completion.choices[0].message.content ?? "";
},
// These scores will be used along with any that you configure in the UI
scores: [Levenshtein],
parameters: {
main: {
type: "prompt",
name: "Main prompt",
description: "This is the main prompt",
default: {
messages: [
{
role: "user",
content: "{{input}}",
},
],
model: "gpt-4o",
},
},
another: {
type: "prompt",
name: "Another prompt",
description: "This is another prompt",
default: {
messages: [
{
role: "user",
content: "{{input}}",
},
],
model: "gpt-4o",
},
},
include_prefix: z
.boolean()
.default(false)
.describe("Include a contextual prefix"),
prefix: z
.string()
.describe("The prefix to include")
.default("this is a math problem"),
array_of_objects: z
.array(
z.object({
name: z.string(),
age: z.number(),
}),
)
.default([
{ name: "John", age: 30 },
{ name: "Jane", age: 25 },
]),
},
});
```
## Running a remote eval from a playground
To run a remote eval from a playground, select **+ Remote** from the Task pane and choose from the evals exposed in localhost or remote sources.

## Configure remote eval sources
To configure remote eval source URLs for a project, navigate to **Configuration** > **Remote evals**. Then, select **+ Remote eval source** to configure a new remote eval source for your project.

## Limitations
* The dataset defined in your remote eval will be ignored. Scorers defined in remote evals run alongside any scorers you configure in the playground.
* Remote evals are limited to TypeScript only. Python support is coming soon.
---
file: ./content/docs/guides/views.mdx
meta: {
"title": "Views"
}
# Views
You'll often want to create views that organize and visualize the same underlying data in different ways. Views are saved table configurations that preserve filters, sorts, column order, and column visibility. All table-based layouts, including logs, experiments, datasets, and projects, support configured views.

## Default locked views
Some table layouts include default views for convenience. These views are locked and cannot be modified or deleted.
* **All rows** corresponds to all of the records in a given table. This is the default, unfiltered view.
On experiment and logs pages:
* **Non-errors** corresponds to all of the records in a given table that do not contain errors.
* **Errors** corresponds to all of the records in a given table that contain errors.
On experiment pages:
* **Unreviewed** hides items that have already been human-reviewed.
## Creating and managing custom views
### In the UI
To create a custom view, start by applying the filters, sorts, and columns that you would like to have visible in your view. Then, navigate to the **Views** dropdown and select **Create view**.
After entering a view, any changes you make to the filters, sorts, and columns will be auto-saved.
To rename, duplicate, delete, or set as default, use the **...** menu next to the view name.

### In code
Views can also be created and managed programmatically [via the API](/docs/reference/api/Views).
## Access
Views are accessible and configurable by any member of the organization.
## Best practices
Use views when:
* You frequently reapply the same filters.
* You want to standardize what your team sees.
* You want to review only a subset of records.
Make sure to use clear, descriptive names so your team can quickly understand the purpose of each view. Some example views might be:
* "Logs with Factuality \< 50%"
* "Unreviewed high-priority traces"
* "Failing test cases"
* "Tagged with 'Customer Support'"
* "Lisa's test cases"
---
file: ./content/docs/reference/btql.mdx
meta: {
"title": "BTQL query syntax"
}
# BTQL query syntax
Braintrust Query Language (BTQL) is a precise, SQL-like syntax for querying your experiments, logs, and datasets. You can use BTQL to filter and run more complex queries to analyze your data.
## Why use BTQL?
BTQL gives you precise control over your AI application data. You can:
* Filter and search for relevant logs and experiments
* Create consistent, reusable queries for monitoring
* Build automated reporting and analysis pipelines
* Write complex queries to analyze model performance
## Query structure
BTQL queries follow a familiar SQL-like structure that lets you define what data you want and how to analyze it:
```sql #btql
select: * -- Fields to retrieve
from: project_logs('<project id>') -- Data source (identifier or function call)
filter: scores.Factuality > 0.8 -- Filter conditions
sort: created desc -- Sort order
limit: 100 -- Result size limit
cursor: '<cursor>' -- Pagination token
```
Each clause serves a specific purpose:
* `select`: choose which fields to retrieve
* `from`: specify the data source - can be an identifier (like `project_logs`) or a function call (like `experiment("id")`)
* `filter`: define conditions to filter the data
* `sort`: set the order of results (`asc` or `desc`)
* `limit` and `cursor`: control result size and enable pagination
You can also use `dimensions`, `measures`, and `pivot` instead of `select` for aggregation queries.
**Understanding traces and spans**
When you query trace-shaped data (experiments and logs) with BTQL, you can choose whether to return matching spans
or all spans from matching traces. To choose explicitly, specify the "shape" you'd like after the data source:
```sql #btql
select: *
from: project_logs('my-project-id') spans
limit: 10
```
or
```sql #btql
select: *
from: project_logs('my-project-id') traces
limit: 10
```
Historically, BTQL returned full traces by default, but [we are changing this](/blog/brainstore-default#a-breaking-api-change) to return spans, as users have consistently
expressed this as their preferred default. For now:
* If you specify `"use_brainstore": true` as a parameter to the `btql` endpoint, you will get the new default (`spans`)
* If you do not specify `"use_brainstore"`, you will get the old default (`traces`). This will change as early as April 28, 2025.
* If you use a legacy backend, e.g. via `use_columnstore: "true"`, only `traces` is supported.
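If you call the `btql` endpoint directly, opting into the new default is a matter of adding the parameter to the request body. A sketch of the payload only (we assume the BTQL string travels in a `query` field; check the API reference for the exact shape):

```python
import json

# Assumed request-body shape for the btql endpoint; verify against the
# API reference before relying on it.
query = """
select: *
from: project_logs('my-project-id') spans
limit: 10
"""
body = json.dumps({"query": query, "use_brainstore": True})
print(body)
```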
### Available operators
Here are the operators you can use in your queries:
```sql
-- Comparison operators
= -- Equal to (alias for 'eq')
!= -- Not equal to (alias for 'ne', can also use '<>')
> -- Greater than (alias for 'gt')
< -- Less than (alias for 'lt')
>= -- Greater than or equal (alias for 'ge')
<= -- Less than or equal (alias for 'le')
-- Null operators
IS NULL -- Check if value is null
IS NOT NULL -- Check if value is not null
ISNULL -- Unary operator to check if null
ISNOTNULL -- Unary operator to check if not null
-- Text matching
LIKE -- Case-sensitive pattern matching with SQL wildcards
NOT LIKE -- Negated case-sensitive pattern matching
ILIKE -- Case-insensitive pattern matching with SQL wildcards
NOT ILIKE -- Negated case-insensitive pattern matching
MATCH -- Full-word semantic search (faster but requires exact word matches, e.g. 'apple' won't match 'app')
NOT MATCH -- Negated full-word semantic search
-- Array operators
INCLUDES -- Check if array/object contains value (alias: CONTAINS)
NOT INCLUDES -- Check if array/object does not contain value
-- Logical operators
AND -- Both conditions must be true
OR -- Either condition must be true
NOT -- Unary operator to negate condition
-- Arithmetic operators
+ -- Addition (alias: add)
- -- Subtraction (alias: sub)
* -- Multiplication (alias: mul)
/ -- Division (alias: div)
% -- Modulo (alias: mod)
-x -- Unary negation (alias: neg)
```
### Available functions
Here are all the functions you can use in any context (select, filter, dimensions, measures):
```sql
-- Date/time functions
second(timestamp) -- Extract second from timestamp
minute(timestamp) -- Extract minute from timestamp
hour(timestamp) -- Extract hour from timestamp
day(timestamp) -- Extract day from timestamp
week(timestamp) -- Extract week from timestamp
month(timestamp) -- Extract month from timestamp
year(timestamp) -- Extract year from timestamp
current_timestamp() -- Get current timestamp (alias: now())
current_date() -- Get current date
-- String functions
lower(text) -- Convert text to lowercase
upper(text) -- Convert text to uppercase
concat(text1, text2, ...) -- Concatenate strings
-- Array functions
len(array) -- Get length of array
contains(array, value) -- Check if array contains value (alias: includes)
-- Null handling functions
coalesce(val1, val2, ...) -- Return first non-null value
nullif(val1, val2) -- Return null if val1 equals val2
least(val1, val2, ...) -- Return smallest non-null value
greatest(val1, val2, ...) -- Return largest non-null value
-- Type conversion
round(number, precision) -- Round to specified precision
-- Aggregate functions (only in measures)
count(expr) -- Count number of rows
sum(expr) -- Sum numeric values
avg(expr) -- Calculate mean of numeric values
min(expr) -- Find minimum value
max(expr) -- Find maximum value
percentile(expr, p) -- Calculate percentile (p between 0 and 1)
```
### Field access
BTQL provides flexible ways to access nested data in arrays and objects:
```sql
-- Object field access
metadata.model -- Access nested object field
metadata."field name" -- Access field with spaces
metadata.'field-name' -- Access field with special characters
-- Array access (0-based indexing)
tags[0] -- First element
tags[-1] -- Last element
-- Combined array and object access
metadata.models[0].name -- Field in first array element
responses[-1].tokens -- Field in last array element
spans[0].children[-1].id -- Nested array traversal
```
Array indices are 0-based, and negative indices count from the end (-1 is the last element).
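BTQL's index semantics match Python's list indexing, which is a quick way to build intuition:

```python
# The same convention BTQL uses: 0-based indices, negative counts from the end.
tags = ["quality", "triage", "resolved"]

print(tags[0])   # first element, as in BTQL tags[0]
print(tags[-1])  # last element, as in BTQL tags[-1]
```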
## Select clause
The `select` clause determines which fields appear in your results. You can select specific fields, compute values, or use `*` to get everything:
```sql #btql
-- Get specific fields
select:
metadata.model as model,
scores.Factuality as score,
created as timestamp
from: project_logs('my-project-id')
```
### Working with expressions
Transform your data directly in the select clause:
```sql #btql
select:
-- Simple field access
metadata.model,
-- Computed values
metrics.tokens > 1000 as is_long_response,
-- Conditional logic
(scores.Factuality > 0.8 ? "high" : "low") as quality
from: project_logs('my-project-id')
```
### Using functions
Transform values and create meaningful aliases for your results:
```sql #btql
select:
-- Date/time functions
day(created) as date,
hour(created) as hour,
-- Numeric calculations
round(scores.Factuality, 2) as rounded_score
from: project_logs('my-project-id')
```
## Dimensions and measures
Instead of `select`, you can use `dimensions` and `measures` to group and aggregate data:
```sql #btql
-- Analyze model performance over time
dimensions:
metadata.model as model,
day(created) as date
measures:
count(1) as total_calls,
avg(scores.Factuality) as avg_score,
percentile(latency, 0.95) as p95_latency
from: project_logs('my-project-id')
```
### Aggregate functions
Common aggregate functions for measures:
```sql #btql
-- Example using various aggregates
dimensions: metadata.model as model
measures:
count(1) as total_rows, -- Count rows
sum(metrics.tokens) as total_tokens, -- Sum values
avg(scores.Factuality) as avg_score, -- Calculate mean
min(latency) as min_latency, -- Find minimum
max(latency) as max_latency, -- Find maximum
percentile(latency, 0.95) as p95 -- Calculate percentiles
from: project_logs('my-project-id')
```
### Pivot results
The `pivot` clause transforms your results to make comparisons easier by converting rows into columns. This is especially useful when comparing metrics across different categories or time periods.
Syntax:
```sql
pivot: measure_1, measure_2, ...
```
Here are some examples:
```sql #btql
-- Compare model performance metrics across models
dimensions: day(created) as date
measures:
avg(scores.Factuality) as avg_factuality,
avg(metrics.tokens) as avg_tokens,
count(1) as call_count
from: project_logs('my-project-id')
pivot: avg_factuality, avg_tokens, call_count
-- Results will look like:
-- {
-- "date": "2024-01-01",
-- "gpt-4_avg_factuality": 0.92,
-- "gpt-4_avg_tokens": 150,
-- "gpt-4_call_count": 1000,
-- "gpt-3.5-turbo_avg_factuality": 0.85,
-- "gpt-3.5-turbo_avg_tokens": 120,
-- "gpt-3.5-turbo_call_count": 2000
-- }
```
```sql #btql
-- Compare metrics across time periods
dimensions: metadata.model as model
measures:
avg(scores.Factuality) as avg_score,
percentile(latency, 0.95) as p95_latency
from: project_logs('my-project-id')
pivot: avg_score, p95_latency
-- Results will look like:
-- {
-- "model": "gpt-4",
-- "0_avg_score": 0.91,
-- "0_p95_latency": 2.5,
-- "1_avg_score": 0.89,
-- "1_p95_latency": 2.8,
-- ...
-- }
```
```sql #btql
-- Compare tag distributions across models
dimensions: tags[0] as primary_tag
measures: count(1) as tag_count
from: project_logs('my-project-id')
pivot: tag_count
-- Results will look like:
-- {
-- "primary_tag": "quality",
-- "gpt-4_tag_count": 500,
-- "gpt-3.5-turbo_tag_count": 300
-- }
```
Pivot columns are automatically named by combining the dimension value and the measure name. For example, if you pivot on `metadata.model` and a row has the model "gpt-4" with the measure `avg_score`, the resulting column is named `gpt-4_avg_score`.
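The naming rule is simple enough to sketch (a Python illustration of the convention described above, not Braintrust code):

```python
def pivot_column_name(dimension_value: str, measure: str) -> str:
    """Join a pivot dimension value and a measure alias with an underscore."""
    return f"{dimension_value}_{measure}"
```

For instance, `pivot_column_name("gpt-4", "avg_score")` produces `"gpt-4_avg_score"`, matching the column names in the example results above.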
### Unpivot
The `unpivot` clause transforms columns into rows, which is useful when you need to analyze arbitrary scores and metrics without specifying each score name. This is particularly helpful when working with dynamic sets of metrics, or when you don't know all possible score names in advance.
```sql #btql
-- Convert wide format to long format for arbitrary scores
dimensions: created as date
measures: count(1) as count
from: project_logs('my-project-id')
unpivot: count as (score_name, score_value)
-- Results will look like:
-- {
-- "date": "2024-01-01",
-- "score_name": "Factuality",
-- "score_value": 0.92
-- },
-- {
-- "date": "2024-01-01",
-- "score_name": "Coherence",
-- "score_value": 0.88
-- }
```
### Conditional expressions
BTQL supports conditional logic using the ternary operator (`? :`):
```sql #btql
-- Basic conditions
select:
(scores.Factuality > 0.8 ? "high" : "low") as quality,
(error IS NOT NULL ? -1 : metrics.tokens) as valid_tokens
from: project_logs('my-project-id')
```
```sql #btql
-- Nested conditions
select:
(scores.Factuality > 0.9 ? "excellent" :
scores.Factuality > 0.7 ? "good" :
scores.Factuality > 0.5 ? "fair" : "poor") as rating
from: project_logs('my-project-id')
```
```sql #btql
-- Use in calculations
select:
(metadata.model = "gpt-4" ? metrics.tokens * 2 : metrics.tokens) as adjusted_tokens,
(error IS NULL ? metrics.latency : 0) as valid_latency
from: project_logs('my-project-id')
```
### Time intervals
BTQL supports intervals for time-based operations:
```sql #btql
-- Basic intervals
select: *
from: project_logs('my-project-id')
filter: created > now() - interval 1 day
```
```sql #btql
-- Multiple time conditions
select: *
from: project_logs('my-project-id')
filter:
created > now() - interval 1 hour and
created < now()
```
```sql #btql
-- Examples with different units
select: *
from: project_logs('my-project-id')
filter:
created > now() - interval 7 day and -- Last week
created > now() - interval 1 month -- Last month
```
## Filter clause
The `filter` clause lets you specify conditions to narrow down results. It supports a wide range of operators and functions:
```sql
filter:
-- Simple comparisons
scores.Factuality > 0.8 and
metadata.model = "gpt-4" and
-- Array operations
tags includes "triage" and
-- Text search
input ILIKE '%question%' and
-- Date ranges
created > '2024-01-01' and
-- Complex conditions
(
metrics.tokens > 1000 or
metadata.is_production = true
)
```
Note: Negative filters on tags (e.g., `NOT tags includes "resolved"`) may not work as expected. Since tags are only applied to the root span of a trace, and queries return complete traces, negative tag filters will match child spans (which don't have tags) and return the entire trace. We recommend using positive tag filters instead.
## Sort clause
The `sort` clause determines the order of results:
```sql
-- Sort by single field
sort: created desc
-- Sort by multiple fields
sort: scores.Factuality desc, created asc
-- Sort by computed values
sort: len(tags) desc
```
## Limit and cursor
Control result size and implement pagination:
```sql
-- Basic limit
limit: 100
```
```sql #btql
-- Pagination using cursor (only works without sort)
select: *
from: project_logs('my-project-id')
limit: 100
cursor: 'YOUR_CURSOR_TOKEN' -- From previous query response
```
Cursors are automatically returned in BTQL responses. If a query has no `limit` clause, a default limit is applied; specify an explicit `limit` to override the number of returned results. To implement pagination, run an initial query, then pass the cursor token returned in the results to the `cursor` clause of each follow-on query. When a cursor has reached the end of the result set, the `data` array will be empty and no cursor token will be returned.
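A pagination loop might look like the following Python sketch. It assumes a response body shaped like `{"data": [...], "cursor": "..."}` and passes the cursor back as a top-level parameter next to `query`; both details are assumptions to verify against your API version. The injectable `fetch` parameter exists only so the loop can be exercised without a network call.

```python
import json
import urllib.request

API_URL = "https://api.braintrust.dev/btql"

def fetch_all(api_key, query, fetch=None):
    """Page through BTQL results until the cursor is exhausted (a sketch)."""
    if fetch is None:
        def fetch(body):
            # Default fetcher: POST the query to the BTQL endpoint.
            req = urllib.request.Request(
                API_URL,
                data=json.dumps(body).encode(),
                headers={
                    "Authorization": f"Bearer {api_key}",
                    "Content-Type": "application/json",
                },
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)

    rows, cursor = [], None
    while True:
        body = {"query": query}
        if cursor is not None:
            # Assumed: the API accepts a top-level cursor parameter; the
            # cursor can alternatively be embedded as a `cursor:` clause.
            body["cursor"] = cursor
        page = fetch(body)
        if not page.get("data"):
            break  # empty data array: end of the result set
        rows.extend(page["data"])
        cursor = page.get("cursor")
        if cursor is None:
            break  # no cursor token returned: nothing left to fetch
    return rows
```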
Cursors can only be used for pagination when no `sort` clause is specified. If you need sorted results, implement offset-based pagination by using the last value from your sort field as a filter in the next query, as shown in the examples below.
```sql #btql
-- Offset-based pagination with sorting
-- Page 1 (first 100 results)
select: *
from: project_logs('my-project-id')
sort: created desc
limit: 100
```
```sql #btql
-- Page 2 (next 100 results)
select: *
from: project_logs('my-project-id')
filter: created < '2024-01-15T10:30:00Z' -- Last created timestamp from previous page
sort: created desc
limit: 100
```
## API access
Access BTQL programmatically through our API:
```bash
curl -X POST https://api.braintrust.dev/btql \
  -H "Authorization: Bearer $BRAINTRUST_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"query\": \"select: * | from: project_logs('my-project-id') | filter: tags includes 'triage'\"}"
```
The API accepts these parameters:
* `query` (required): your BTQL query string
* `fmt`: response format (`json` or `parquet`, defaults to `json`)
* `tz_offset`: timezone offset in minutes for time-based operations
* `use_columnstore`: enable columnstore for faster large queries
* `audit_log`: include audit log data
For correct day boundaries, set `tz_offset` to match your timezone. For example, use `480` for US Pacific Standard Time.
## Examples
Let's look at some real-world examples:
### Tracking token usage
This query helps you monitor token consumption across your application:
```sql #btql
from: project_logs('my-project-id')
filter: created > '2024-11-01' -- Start of the window
dimensions: day(created) as time
measures:
sum(metrics.total_tokens) as total_tokens,
sum(metrics.prompt_tokens) as input_tokens,
sum(metrics.completion_tokens) as output_tokens
sort: time asc
```
The response shows daily token usage:
```json
{
"time": "2024-11-09T00:00:00Z",
"total_tokens": 100000,
"input_tokens": 50000,
"output_tokens": 50000
}
```
### Model quality monitoring
Track model performance across different versions and configurations:
```sql #btql
-- Compare factuality scores across models
dimensions:
metadata.model as model,
day(created) as date
measures:
avg(scores.Factuality) as avg_factuality,
percentile(scores.Factuality, 0.05) as p05_factuality,
percentile(scores.Factuality, 0.95) as p95_factuality,
count(1) as total_calls
filter: created > '2024-01-01'
sort: date desc, model asc
```
```sql #btql
-- Find potentially problematic responses
select: *
from: project_logs('my-project-id')
filter:
scores.Factuality < 0.5 and
metadata.is_production = true and
created > now() - interval 1 day
sort: scores.Factuality asc
limit: 100
```
### Error analysis
Identify and investigate errors in your application:
```sql #btql
-- Error rate by model
dimensions:
metadata.model as model,
hour(created) as hour
measures:
count(1) as total,
sum(error IS NOT NULL ? 1 : 0) as errors,
sum(error IS NOT NULL ? 1 : 0) / count(1) as error_rate
filter: created > now() - interval 1 day
sort: error_rate desc
```
```sql #btql
-- Find common error patterns
dimensions:
error.type as error_type,
metadata.model as model
measures:
count(1) as error_count,
avg(metrics.latency) as avg_latency
filter:
error IS NOT NULL and
created > now() - interval 7 day
sort: error_count desc
```
### Latency analysis
Monitor and optimize response times:
```sql #btql
-- Track p95 latency by endpoint
dimensions:
metadata.endpoint as endpoint,
hour(created) as hour
measures:
percentile(metrics.latency, 0.95) as p95_latency,
percentile(metrics.latency, 0.50) as median_latency,
count(1) as requests
filter: created > now() - interval 1 day
sort: hour desc, p95_latency desc
```
```sql #btql
-- Find slow requests
select:
metadata.endpoint,
metrics.latency,
metrics.tokens,
input,
created
from: project_logs('my-project-id')
filter:
metrics.latency > 5000 and -- Requests over 5 seconds
created > now() - interval 1 hour
sort: metrics.latency desc
limit: 20
```
### Prompt analysis
Analyze prompt effectiveness and patterns:
```sql #btql
-- Track prompt token efficiency
dimensions:
metadata.prompt_template as template,
day(created) as date
measures:
avg(metrics.prompt_tokens) as avg_prompt_tokens,
avg(metrics.completion_tokens) as avg_completion_tokens,
avg(metrics.completion_tokens) / avg(metrics.prompt_tokens) as token_efficiency,
avg(scores.Factuality) as avg_factuality
filter: created > now() - interval 7 day
sort: date desc, token_efficiency desc
```
```sql #btql
-- Find similar prompts
select: *
from: project_logs('my-project-id')
filter:
input MATCH 'explain the concept of recursion' and
scores.Factuality > 0.8
sort: created desc
limit: 10
```
### Tag-based analysis
Use tags to track and analyze specific behaviors:
```sql #btql
-- Monitor feedback patterns
dimensions:
tags[0] as primary_tag,
metadata.model as model
measures:
count(1) as feedback_count,
avg(scores.Factuality > 0.8 ? 1 : 0) as high_quality_rate
filter:
tags includes 'feedback' and
created > now() - interval 30 day
sort: feedback_count desc
```
```sql #btql
-- Track issue resolution
select:
created,
tags,
metadata.model,
scores.Factuality,
response
from: project_logs('my-project-id')
filter:
tags includes 'needs-review' and
NOT tags includes 'resolved' and
created > now() - interval 1 day
sort: scores.Factuality asc
```
## BTQL sandbox
To test BTQL with autocomplete, validation, and a table of results, try the **BTQL sandbox** in the dashboard.
---
file: ./content/docs/reference/functions.mdx
meta: {
"title": "Functions"
}
# Functions
Many of the advanced capabilities of Braintrust involve defining and calling custom code functions. Currently,
Braintrust supports defining functions in JavaScript/TypeScript and Python, which you can use as custom scorers
or callable tools.
This guide serves as a reference for functions, how they work, and some security considerations when working with them.
## Accessing functions
Several places in the UI, for example the custom scorer menu in the playground, allow you to define functions. You can also
bundle them in your code and push them to Braintrust with `braintrust push` and `braintrust eval --push`. Functions
are a generalization of prompts and code functions, so when you define a custom prompt, you are technically defining
a "prompt function".
Every function supports a number of common features:
* Well-defined parameters and return types
* Streaming and non-streaming invocation
* Automatic tracing and logging in Braintrust
* Prompts can be loaded into your code in the OpenAI argument format
* Prompts and code can be easily saved and uploaded from your codebase
See the [API docs](/docs/reference/api/Functions) for more information on how to create and invoke functions.
## Sandbox
Functions are executed in a secure sandbox environment. If you are self-hosting Braintrust, then you must:
* Set `EnableQuarantine` to `true` in the [Cloudformation stack](/docs/guides/self-hosting/aws)
* Set `ALLOW_CODE_FUNCTION_EXECUTION` to `1` in the [Docker configuration](/docs/guides/self-hosting/docker)
If you use our managed AWS stack, custom code runs in Lambda functions inside a quarantined VPC, sandboxed and
isolated from your other AWS resources. If you run via Docker, the code runs in a sandbox but not a virtual machine,
so it is your responsibility to ensure that malicious code is not uploaded to Braintrust.
For more information on the security architecture underlying code execution, please [reach out to us](mailto:support@braintrust.dev).
---
file: ./content/docs/reference/mcp.mdx
meta: {
"title": "Model Context Protocol (MCP)"
}
# Model Context Protocol (MCP)
Use this guide to enable your IDE to interact with the Braintrust API using Model Context Protocol.
## What is MCP?
The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction) is a standardized framework that enables AI models to interact with your development environment. It allows for real-time exchange of experiment results, code context, and debugging information between your IDE and AI systems like Braintrust.
MCP is supported in many AI coding tools, including:
* [Cursor](https://www.cursor.com/)
* [Windsurf](https://docs.codeium.com/windsurf)
* VS Code via [Cline extension](https://github.com/cline/cline)
* [Claude for Desktop](https://claude.ai/download)
## Installation
Braintrust has a native MCP server which can read experiment results to help you automatically
debug and improve your app. To install it, add the following to your `mcp.json` file (for example, `.cursor/mcp.json`):
```json
{
"mcpServers": {
"server-name": {
"command": "npx",
"args": ["-y", "@braintrust/mcp-server@latest", "--api-key", "YOUR_API_KEY"]
}
}
}
```
## Usage
Once you've set up the MCP server, you can interact with your Braintrust projects directly in your IDE through natural language commands.
Try asking about Braintrust experiment results, code context, and debugging information!
---
file: ./content/docs/reference/organizations.mdx
meta: {
"title": "Organizations",
"description": "Organizations overview and settings"
}
# Organizations
Organizations in Braintrust represent a collection of projects and users. Most commonly, an organization is a business or team. You can create multiple organizations to organize your projects and collaborators in different ways, and a user can be a member of multiple organizations.
Each organization has settings that can be customized by navigating to **Settings** > **Organization**. You can also customize organization settings using the [API](./api/Organizations).
## Members
In the **Members** section, you can see all members of your organization and manage their roles and permissions. You can also invite new members by selecting **Invite member** and inputting their email address(es). Each member must be assigned a permission group.
## Permission groups
Permission groups are the core of Braintrust's access control system, and are collections of users that can be granted specific permissions. In the **Permission groups** section, you can find existing and create new permission groups. For more information about permission groups, see the [access control guide](/docs/guides/access-control).
## AI providers
Braintrust supports most AI providers through the [AI proxy](/docs/guides/proxy), which allows you to use any of the [supported models](/docs/guides/proxy#supported-models). In the **AI providers** section, you can configure API keys for the AI providers on behalf of your organization, or add custom providers.
### Custom AI providers
You can also add custom AI providers. Braintrust supports custom models and endpoint configuration for all providers.
## Environment variables
Environment variables are secrets that are scoped to all functions (prompts, scorers, and tools) in a specific organization. You can set environment variables in the **Env variables** section by saving the key-value pairs.
## API URL
If you are self-hosting Braintrust, you can set the API URL, proxy URL, and real-time URL in your organization settings. You can also find test commands (with token) for pinging the API, proxy, and real-time services from the command line. For more information about self-hosting Braintrust, see the [self-hosting guide](/docs/guides/self-hosting).
## Git metadata
In the **Logging** section, you can select which git metadata fields to log, if any.
---
file: ./content/docs/reference/streaming.mdx
meta: {
"title": "Streaming"
}
# Streaming
Braintrust supports executing prompts, functions, and evaluations through the API and within the UI through the [playground](/docs/guides/playground).
Like popular LLM services, Braintrust supports streaming results using [Server-Sent Events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
The Braintrust SDK and UI automatically parse the SSE stream, and we have adapters for common libraries like the [Vercel AI SDK](https://sdk.vercel.ai/docs),
so you can easily integrate with the rich and growing ecosystem of LLM tools. However, the SSE format itself is also purposefully simple, so if you need to
parse it yourself, you can!
To see more about how to use streaming data, see the [prompts documentation](/docs/guides/prompts#streaming).
## Why does this exist
Streaming is a very powerful way to consume LLM outputs, but the predominant "chat" data structure produced by modern LLMs is more complex than most applications
need. In fact, the most common use cases are to simply (a) convert the text of the first message into a string or (b) parse the arguments of the first tool call
into a JSON object. The Braintrust SSE format is optimized to make these use cases easy to parse, while also supporting more advanced scenarios like parallel
tool calls.
## Formal spec
SSE events consist of three fields: `id` (optional), `event` (optional), and `data`. The Braintrust SSE format always sets `event` and `data`, and never sets `id`.
The SSE events in Braintrust follow this structure:
```
<stream>     ::= <event>* <done>
<event>      ::= <text_delta> | <json_delta> | <error> | <progress>

<text_delta> ::=
event: "text_delta"
data: <JSON-encoded string>

<json_delta> ::=
event: "json_delta"
data: <JSON-encoded data fragment>

<error> ::=
event: "error"
data: <JSON-encoded error message>

<progress> ::=
event: "progress"
data: <JSON-encoded progress object>

<done> ::=
event: "done"
data: ""
```
### Text
A `text_delta` is a snippet of text, which is JSON-encoded. For example, you might receive:
```ansi
event: text_delta
data: "this is a line\nbreak"
event: text_delta
data: "with some \"nested quotes\"."
event: done
data:
```
As you process a `text_delta`, you can JSON-decode the string and display it directly.
### JSON
A `json_delta` is a snippet of JSON-encoded data, which cannot necessarily be parsed on its own.
For example:
```ansi
event: json_delta
data: {"name": "Cecil",
event: json_delta
data: "age": 30}
event: done
data:
```
As you process `json_delta` events, concatenate the strings together and then parse them
as JSON at the end of the stream.
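Putting the two decoding rules together (JSON-decode each `text_delta`; buffer `json_delta` chunks and parse once at the end), a minimal Python sketch over already-split `(event, data)` pairs might look like:

```python
import json

def parse_btql_sse(events):
    """Assemble a final value from parsed SSE (event, data) pairs.

    `events` is an iterable of (event, data) tuples, as split out of the raw
    SSE stream. This is a sketch: it handles either a text or a JSON stream,
    not interleaved progress events.
    """
    text_parts, json_parts = [], []
    for event, data in events:
        if event == "text_delta":
            text_parts.append(json.loads(data))  # each delta is a JSON string
        elif event == "json_delta":
            json_parts.append(data)  # buffer raw fragments; parse at the end
        elif event == "error":
            raise RuntimeError(json.loads(data))
        elif event == "done":
            break
    if json_parts:
        return json.loads("".join(json_parts))
    return "".join(text_parts)
```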
### Error
An `error` event is a JSON-encoded string that contains the error message. For example:
```ansi
event: error
data: "Something went wrong."
event: done
data:
```
### Progress
A `progress` event is a JSON-encoded object that contains intermediate events produced by functions
while they are executing. Each JSON object contains the following fields:
```json
{
"id": "A span id for this event",
"object_type": "prompt" | "tool" | "scorer" | "task",
"format": "llm" | "code" | "global",
"output_type": "completion" | "score" | "any",
"name": "The name of the function or prompt",
"event": "text_delta" | "json_delta" | "error" | "start" | "done",
"data": "The delta or error message"
}
```
The `event` field is the type of event produced by the intermediate function call, and the
`data` field is the same as the data field in the `text_delta` and `json_delta` events.
### Start
A `start` event is a progress event with `event: "start"` and an empty string for `data`. Start is not guaranteed
to be sent and is for display purposes only.
### Done
A `done` event is a progress event with `event: "done"` and an empty string for `data`. Once a `done` event is received,
you can safely assume that the function has completed and will send no more events.
---
file: ./content/docs/cookbook/recipes/AISearch.mdx
meta: {
"title": "AI Search Bar",
"language": "python",
"authors": [
{
"name": "Austin Moehle",
"website": "https://www.linkedin.com/in/austinmxx/",
"avatar": "/blog/img/author/austin-moehle.jpg"
}
],
"date": "2024-03-04",
"tags": [
"evals",
"sql"
]
}
# AI Search Bar
This guide demonstrates how we developed Braintrust's AI-powered search bar, harnessing the power of Braintrust's evaluation workflow along the way. If you've used Braintrust before, you may be familiar with the project page, which serves as a home base for collections of eval experiments:

To find a particular experiment, you can type filter and sort queries into the search bar, using standard SQL syntax. But SQL can be finicky -- it's very easy to run into syntax errors like single quotes instead of double, incorrect JSON extraction syntax, or typos. Users would prefer to just type in an intuitive search like `experiments run on git commit 2a43fd1` or `score under 0.5` and see a corresponding SQL query appear automatically. Let's achieve this using AI, with assistance from Braintrust's eval framework.
We'll start by installing some packages and setting up our OpenAI client.
```python
%pip install -U Levenshtein autoevals braintrust chevron duckdb openai pydantic
```
```python
import os
import braintrust
import openai
PROJECT_NAME = "AI Search Cookbook"
# We use the Braintrust proxy here to get access to caching, but this is totally optional!
openai_opts = dict(
base_url="https://api.braintrust.dev/v1/proxy",
api_key=os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY"),
)
client = braintrust.wrap_openai(openai.AsyncOpenAI(default_headers={"x-bt-use-cache": "always"}, **openai_opts))
braintrust.login(api_key=os.environ.get("BRAINTRUST_API_KEY", "YOUR_BRAINTRUST_API_KEY"))
dataset = braintrust.init_dataset(PROJECT_NAME, "AI Search Cookbook Data", use_output=False)
```
## Load the data and render the templates
When we ask GPT to translate a search query, we have to account for multiple output options: (1) a SQL filter, (2) a SQL sort, (3) both of the above, or (4) an unsuccessful translation (e.g. for a nonsensical user input). We'll use [function calling](https://platform.openai.com/docs/guides/function-calling) to robustly handle each distinct scenario, with the following output format:
* `match`: Whether or not the model was able to translate the search into a valid SQL filter/sort.
* `filter`: A `WHERE` clause.
* `sort`: An `ORDER BY` clause.
* `explanation`: Explanation for the choices above -- this is useful for debugging and evaluation.
```python
import dataclasses
from typing import Literal, Optional, Union
from pydantic import BaseModel, Field, create_model
@dataclasses.dataclass
class FunctionCallOutput:
match: Optional[bool] = None
filter: Optional[str] = None
sort: Optional[str] = None
explanation: Optional[str] = None
error: Optional[str] = None
class Match(BaseModel):
type: Literal["MATCH"] = "MATCH"
explanation: str = Field(
..., description="Explanation of why I called the MATCH function"
)
class SQL(BaseModel):
type: Literal["SQL"] = "SQL"
filter: Optional[str] = Field(..., description="SQL filter clause")
sort: Optional[str] = Field(..., description="SQL sort clause")
explanation: str = Field(
...,
description="Explanation of why I called the SQL function and how I chose the filter and/or sort clauses",
)
class Query(BaseModel):
value: Union[Match, SQL] = Field(
...,
)
def function_choices():
return [
{
"name": "QUERY",
"description": "Break down the query either into a MATCH or SQL call",
"parameters": Query.model_json_schema(),
},
]
```
## Prepare prompts for evaluation in Braintrust
Let's evaluate two different prompts: a shorter prompt with a brief explanation of the problem statement and description of the experiment schema, and a longer prompt that additionally contains a feed of example cases to guide the model. There's nothing special about either of these prompts, and that's OK -- we can iterate and improve the prompts when we use Braintrust to drill down into the results.
```python
import json
SHORT_PROMPT_FILE = "./assets/short_prompt.tmpl"
LONG_PROMPT_FILE = "./assets/long_prompt.tmpl"
FEW_SHOT_EXAMPLES_FILE = "./assets/few_shot.json"
with open(SHORT_PROMPT_FILE) as f:
short_prompt = f.read()
with open(LONG_PROMPT_FILE) as f:
long_prompt = f.read()
with open(FEW_SHOT_EXAMPLES_FILE, "r") as f:
few_shot_examples = json.load(f)
```
One detail worth mentioning: each prompt contains a stub for dynamic insertion of the data schema. This is motivated by the need to handle semantic searches like `more than 40 examples` or `score < 0.5` that don't directly reference a column in the base table. We need to tell the model how the data is structured and what each field actually *means*. We'll construct a descriptive schema using [pydantic](https://docs.pydantic.dev/latest/) and paste it into each prompt to provide the model with this information.
```python
from typing import Any, Callable, Dict, List
import chevron
class ExperimentGitState(BaseModel):
commit: str = Field(
...,
description="Git commit hash. Any prefix of this hash at least 7 characters long should be considered an exact match, so use a substring filter rather than string equality to check the commit, e.g. `(source->>'commit') ILIKE '{COMMIT}%'`",
)
branch: str = Field(..., description="Git branch name")
tag: Optional[str] = Field(..., description="Git commit tag")
commit_time: int = Field(..., description="Git commit timestamp")
author_name: str = Field(..., description="Author of git commit")
author_email: str = Field(..., description="Email address of git commit author")
commit_message: str = Field(..., description="Git commit message")
dirty: Optional[bool] = Field(
...,
description="Whether the git state was dirty when the experiment was run. If false, the git state was clean",
)
class Experiment(BaseModel):
id: str = Field(..., description="Experiment ID, unique")
name: str = Field(..., description="Name of the experiment")
last_updated: int = Field(
...,
description="Timestamp marking when the experiment was last updated. If the query deals with some notion of relative time, like age or recency, refer to this timestamp and, if appropriate, compare it to the current time `get_current_time()` by adding or subtracting an interval.",
)
creator: Dict[str, str] = Field(..., description="Information about the experiment creator")
source: ExperimentGitState = Field(..., description="Git state that the experiment was run on")
metadata: Dict[str, Any] = Field(
...,
description="Custom metadata provided by the user. Ignore this field unless the query mentions metadata or refers to a metadata key specifically",
)
def build_experiment_schema(score_fields: List[str]):
ExperimentWithScoreFields = create_model(
"Experiment",
__base__=Experiment,
**{field: (Optional[float], ...) for field in score_fields},
)
return json.dumps(ExperimentWithScoreFields.model_json_schema())
```
Our prompts are ready! Before we run our evals, we just need to load some sample data and define our scoring functions.
## Load sample data
Let's load our examples. Each example case contains `input` (the search query) and `expected` (function call output).
```python
import json
@dataclasses.dataclass
class Example:
input: str
expected: FunctionCallOutput
metadata: Optional[Dict[str, Any]] = None
EXAMPLES_FILE = "./assets/examples.json"
with open(EXAMPLES_FILE) as f:
examples_json = json.load(f)
templates = [
Example(input=e["input"], expected=FunctionCallOutput(**e["expected"])) for e in examples_json["examples"]
]
# Each example contains a few dynamic fields that depend on the experiments
# we're searching over. For simplicity, we'll hard-code these fields here.
SCORE_FIELDS = ["avg_sql_score", "avg_factuality_score"]
def render_example(example: Example, args: Dict[str, Any]) -> Example:
render_optional = lambda template: (chevron.render(template, args, warn=True) if template is not None else None)
return Example(
input=render_optional(example.input),
expected=FunctionCallOutput(
match=example.expected.match,
filter=render_optional(example.expected.filter),
sort=render_optional(example.expected.sort),
explanation=render_optional(example.expected.explanation),
),
)
examples = [render_example(t, {"score_fields": SCORE_FIELDS}) for t in templates]
```
Let's also split the examples into a training set and test set. For now, this won't matter, but later on when we fine-tune the model, we'll want to use the test set to evaluate the model's performance.
```python
for i, e in enumerate(examples):
if i < 0.8 * len(examples):
e.metadata = {"split": "train"}
else:
e.metadata = {"split": "test"}
```
Insert our examples into a Braintrust dataset so we can introspect and reuse the data later.
```python
for example in examples:
dataset.insert(
input=example.input, expected=example.expected, metadata=example.metadata
)
dataset.flush()
records = list(dataset)
print(f"Generated {len(records)} records. Here are the first 2...")
for record in records[:2]:
print(record)
```
```
Generated 45 records. Here are the first 2...
{'id': '05e44f2c-da5c-4f5e-a253-d6ce1d081ca4', 'span_id': 'c2329825-10d3-462f-890b-ef54323f8060', 'root_span_id': 'c2329825-10d3-462f-890b-ef54323f8060', '_xact_id': '1000192628646491178', 'created': '2024-03-04T08:08:12.977238Z', 'project_id': '61ce386b-1dac-4027-980f-2f3baf32c9f4', 'dataset_id': 'cbb856d4-b2d9-41ea-a5a7-ba5b78be6959', 'input': 'name is foo', 'expected': {'sort': None, 'error': None, 'match': False, 'filter': "name = 'foo'", 'explanation': 'I interpret the query as a string equality filter on the "name" column. The query does not have any sort semantics, so there is no sort.'}, 'metadata': {'split': 'train'}, 'tags': None}
{'id': '0d127613-505c-404c-8140-2c287313b682', 'span_id': '1e72c902-fe72-4438-adf4-19950f8a2c57', 'root_span_id': '1e72c902-fe72-4438-adf4-19950f8a2c57', '_xact_id': '1000192628646491178', 'created': '2024-03-04T08:08:12.981295Z', 'project_id': '61ce386b-1dac-4027-980f-2f3baf32c9f4', 'dataset_id': 'cbb856d4-b2d9-41ea-a5a7-ba5b78be6959', 'input': "'highest score'", 'expected': {'sort': None, 'error': None, 'match': True, 'filter': None, 'explanation': 'According to directive 2, a query entirely wrapped in quotes should use the MATCH function.'}, 'metadata': {'split': 'train'}, 'tags': None}
```
## Define scoring functions
How do we score our outputs against the ground truth queries? We can't rely on an exact text match, since there are multiple correct ways to translate a SQL query. Instead, we'll use two approximate scoring methods: (1) `SQLScorer`, which roundtrips each query through `json_serialize_sql` to normalize before attempting a direct comparison, and (2) `AutoScorer`, which delegates the scoring task to `gpt-4`.
```python
import duckdb
from braintrust import current_span, traced
from Levenshtein import distance
from autoevals import Score, Scorer, Sql
EXPERIMENTS_TABLE = "./assets/experiments.parquet"
SUMMARY_TABLE = "./assets/experiments_summary.parquet"
duckdb.sql(f"DROP TABLE IF EXISTS experiments; CREATE TABLE experiments AS SELECT * FROM '{EXPERIMENTS_TABLE}'")
duckdb.sql(
f"DROP TABLE IF EXISTS experiments_summary; CREATE TABLE experiments_summary AS SELECT * FROM '{SUMMARY_TABLE}'"
)
def _test_clause(*, filter=None, sort=None) -> bool:
clause = f"""
SELECT
experiments.id AS id,
experiments.name,
experiments_summary.last_updated,
experiments.user AS creator,
experiments.repo_info AS source,
experiments_summary.* EXCLUDE (experiment_id, last_updated),
FROM experiments
LEFT JOIN experiments_summary ON experiments.id = experiments_summary.experiment_id
{'WHERE ' + filter if filter else ''}
{'ORDER BY ' + sort if sort else ''}
"""
current_span().log(metadata=dict(test_clause=clause))
try:
duckdb.sql(clause).fetchall()
return True
except Exception:
return False
def _single_quote(s):
return f"""'{s.replace("'", "''")}'"""
def _roundtrip_filter(s):
return duckdb.sql(
f"""
SELECT json_deserialize_sql(json_serialize_sql({_single_quote(f"SELECT 1 WHERE {s}")}))
"""
).fetchall()[0][0]
def _roundtrip_sort(s):
return duckdb.sql(
f"""
SELECT json_deserialize_sql(json_serialize_sql({_single_quote(f"SELECT 1 ORDER BY {s}")}))
"""
).fetchall()[0][0]
def score_clause(
output: Optional[str],
expected: Optional[str],
roundtrip: Callable[[str], str],
test_clause: Callable[[str], bool],
) -> float:
exact_match = 1 if output == expected else 0
current_span().log(scores=dict(exact_match=exact_match))
if exact_match:
return 1
roundtrip_match = 0
try:
if roundtrip(output) == roundtrip(expected):
roundtrip_match = 1
except Exception as e:
current_span().log(metadata=dict(roundtrip_error=str(e)))
current_span().log(scores=dict(roundtrip_match=roundtrip_match))
if roundtrip_match:
return 1
# If the queries aren't equivalent after roundtripping, it's not immediately clear
# whether they are semantically equivalent. Let's at least check that the generated
# clause is valid SQL by running the `test_clause` function defined above, which
# runs a test query against our sample data.
valid_clause_score = 1 if test_clause(output) else 0
current_span().log(scores=dict(valid_clause=valid_clause_score))
if valid_clause_score == 0:
return 0
max_len = max(len(clause) for clause in [output, expected])
if max_len == 0:
current_span().log(metadata=dict(error="Bad example: empty clause"))
return 0
return 1 - (distance(output, expected) / max_len)
class SQLScorer(Scorer):
"""SQLScorer uses DuckDB's `json_serialize_sql` function to determine whether
the model's chosen filter/sort clause(s) are equivalent to the expected
outputs. If not, we assign partial credit to each clause depending on
(1) whether the clause is valid SQL, as determined by running it against
the actual data and seeing if it errors, and (2) a distance-wise comparison
to the expected text.
"""
def _run_eval_sync(
self,
output,
expected=None,
**kwargs,
):
if expected is None:
raise ValueError("SQLScorer requires an expected value")
name = "SQLScorer"
expected = FunctionCallOutput(**expected)
function_choice_score = 1 if output.match == expected.match else 0
current_span().log(scores=dict(function_choice=function_choice_score))
if function_choice_score == 0:
return Score(name=name, score=0)
if expected.match:
return Score(name=name, score=1)
filter_score = None
if output.filter and expected.filter:
with current_span().start_span("SimpleFilter") as span:
filter_score = score_clause(
output.filter,
expected.filter,
_roundtrip_filter,
lambda s: _test_clause(filter=s),
)
elif output.filter or expected.filter:
filter_score = 0
current_span().log(scores=dict(filter=filter_score))
sort_score = None
if output.sort and expected.sort:
with current_span().start_span("SimpleSort") as span:
sort_score = score_clause(
output.sort,
expected.sort,
_roundtrip_sort,
lambda s: _test_clause(sort=s),
)
elif output.sort or expected.sort:
sort_score = 0
current_span().log(scores=dict(sort=sort_score))
scores = [s for s in [filter_score, sort_score] if s is not None]
if len(scores) == 0:
return Score(
name=name,
score=0,
error="Bad example: no filter or sort for SQL function call",
)
return Score(name=name, score=sum(scores) / len(scores))
@traced("auto_score_filter")
def auto_score_filter(openai_opts, **kwargs):
return Sql(**openai_opts)(**kwargs)
@traced("auto_score_sort")
def auto_score_sort(openai_opts, **kwargs):
return Sql(**openai_opts)(**kwargs)
class AutoScorer(Scorer):
"""AutoScorer uses the `Sql` scorer from the autoevals library to auto-score
the model's chosen filter/sort clause(s) against the expected outputs
using an LLM.
"""
def __init__(self, **openai_opts):
self.openai_opts = openai_opts
def _run_eval_sync(
self,
output,
expected=None,
**kwargs,
):
if expected is None:
raise ValueError("AutoScorer requires an expected value")
input = kwargs.get("input")
if input is None or not isinstance(input, str):
raise ValueError("AutoScorer requires an input value of type str")
name = "AutoScorer"
expected = FunctionCallOutput(**expected)
function_choice_score = 1 if output.match == expected.match else 0
current_span().log(scores=dict(function_choice=function_choice_score))
if function_choice_score == 0:
return Score(name=name, score=0)
if expected.match:
return Score(name=name, score=1)
filter_score = None
if output.filter and expected.filter:
result = auto_score_filter(
openai_opts=self.openai_opts,
input=input,
output=output.filter,
expected=expected.filter,
)
filter_score = result.score or 0
elif output.filter or expected.filter:
filter_score = 0
current_span().log(scores=dict(filter=filter_score))
sort_score = None
if output.sort and expected.sort:
result = auto_score_sort(
openai_opts=self.openai_opts,
input=input,
output=output.sort,
expected=expected.sort,
)
sort_score = result.score or 0
elif output.sort or expected.sort:
sort_score = 0
current_span().log(scores=dict(sort=sort_score))
scores = [s for s in [filter_score, sort_score] if s is not None]
if len(scores) == 0:
return Score(
name=name,
score=0,
error="Bad example: no filter or sort for SQL function call",
)
return Score(name=name, score=sum(scores) / len(scores))
```
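The final fallback in `score_clause` is a normalized edit distance, `1 - distance / max_len`, which awards partial credit for near-miss clauses. As a rough illustration of what that metric rewards, here is a minimal pure-Python edit distance (a stand-in for the `Levenshtein` package used above):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def partial_credit(output: str, expected: str) -> float:
    max_len = max(len(output), len(expected))
    return 1 - edit_distance(output, expected) / max_len if max_len else 0.0


# Two clauses that differ only by spacing score close to 1:
print(partial_credit("name = 'foo'", "name='foo'"))
```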
## Run the evals!
We'll use the Braintrust `Eval` framework to set up our experiments according to the prompts, dataset, and scoring functions defined above.
```python
from typing import List


def build_completion_kwargs(
*,
query: str,
model: str,
prompt: str,
score_fields: List[str],
**kwargs,
):
# Inject the JSON schema into the prompt to assist the model.
schema = build_experiment_schema(score_fields=score_fields)
system_message = chevron.render(
prompt.strip(), {"schema": schema, "examples": few_shot_examples}, warn=True
)
messages = [
{"role": "system", "content": system_message},
{"role": "user", "content": f"Query: {query}"},
]
# We use the legacy function choices format for now, because fine-tuning still requires it.
return dict(
model=model,
temperature=0,
messages=messages,
functions=function_choices(),
function_call={"name": "QUERY"},
)
def format_output(completion):
try:
function_call = completion.choices[0].message.function_call
arguments = json.loads(function_call.arguments)["value"]
match = arguments.pop("type").lower() == "match"
return FunctionCallOutput(match=match, **arguments)
except Exception as e:
return FunctionCallOutput(error=str(e))
GRADER = "gpt-4" # Used by AutoScorer to grade the model outputs
def make_task(model, prompt, score_fields):
async def task(input):
completion_kwargs = build_completion_kwargs(
query=input,
model=model,
prompt=prompt,
score_fields=score_fields,
)
return format_output(await client.chat.completions.create(**completion_kwargs))
return task
async def run_eval(experiment_name, prompt, model, score_fields=SCORE_FIELDS):
task = make_task(model, prompt, score_fields)
await braintrust.Eval(
name=PROJECT_NAME,
experiment_name=experiment_name,
data=dataset,
task=task,
scores=[SQLScorer(), AutoScorer(**openai_opts, model=GRADER)],
)
```
Let's try it on one example before running an eval.
```python
args = build_completion_kwargs(
query=list(dataset)[0]["input"],
model="gpt-3.5-turbo",
prompt=short_prompt,
score_fields=SCORE_FIELDS,
)
response = await client.chat.completions.create(**args)
format_output(response)
```
```
FunctionCallOutput(match=False, filter="(name) = 'foo'", sort=None, explanation="Filtered for experiments where the name is 'foo'.", error=None)
```
We're ready to run our evals! Let's use `gpt-3.5-turbo` for both.
```python
await run_eval("Short Prompt", short_prompt, "gpt-3.5-turbo")
```
```
Experiment Short Prompt is running at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Short%20Prompt
AI Search Cookbook [experiment_name=Short Prompt] (data): 45it [00:00, 73071.50it/s]
```
```
AI Search Cookbook [experiment_name=Short Prompt] (tasks): 0%| | 0/45 [00:00, ?it/s]
```
```
=========================SUMMARY=========================
Short Prompt compared to Long Prompt 2.0:
46.28% (-21.68%) 'SQLScorer' score (10 improvements, 25 regressions)
15.00% (-36.52%) 'exact_match' score (2 improvements, 7 regressions)
40.89% (-32.19%) 'sort' score (0 improvements, 4 regressions)
16.67% (+01.96%) 'roundtrip_match' score (2 improvements, 3 regressions)
69.36% (-04.67%) 'filter' score (6 improvements, 10 regressions)
60.00% (-22.22%) 'function_choice' score (5 improvements, 15 regressions)
70.00% (-16.67%) 'valid_clause' score (1 improvements, 0 regressions)
43.33% (-12.22%) 'AutoScorer' score (9 improvements, 15 regressions)
4.54s (-210.10%) 'duration' (28 improvements, 17 regressions)
See results for Short Prompt at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Short%20Prompt
```
```python
await run_eval("Long Prompt", long_prompt, "gpt-3.5-turbo")
```
```
Experiment Long Prompt is running at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Long%20Prompt
AI Search Cookbook [experiment_name=Long Prompt] (data): 45it [00:00, 35385.02it/s]
```
```
AI Search Cookbook [experiment_name=Long Prompt] (tasks): 0%| | 0/45 [00:00, ?it/s]
```
```
=========================SUMMARY=========================
Long Prompt compared to Short Prompt:
67.99% (+21.71%) 'SQLScorer' score (21 improvements, 5 regressions)
50.00% (+35.00%) 'exact_match' score (6 improvements, 1 regressions)
71.92% (+31.02%) 'sort' score (3 improvements, 0 regressions)
03.12% (-13.54%) 'roundtrip_match' score (1 improvements, 2 regressions)
71.53% (+02.17%) 'filter' score (10 improvements, 5 regressions)
77.78% (+17.78%) 'function_choice' score (9 improvements, 1 regressions)
84.38% (+14.38%) 'valid_clause' score (1 improvements, 1 regressions)
55.56% (+12.22%) 'AutoScorer' score (9 improvements, 4 regressions)
5.90s (+136.66%) 'duration' (11 improvements, 34 regressions)
See results for Long Prompt at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Long%20Prompt
```
## View the results in Braintrust
The evals will generate a link to the experiment page. Click into an experiment to view the results!
If you've been following along without running the code yourself, you can [check out some sample results here](). Type some searches into the search bar to see AI search in action. :)

## Fine-tuning
Let's try fine-tuning a model with an exceedingly short prompt. We'll use the same dataset and scoring functions as before, changing only the prompt. To start, let's look at one example:
```python
first = list(dataset.fetch())[0]
print(first["input"])
print(json.dumps(first["expected"], indent=2))
```
```
name is foo
{
"sort": null,
"error": null,
"match": false,
"filter": "name = 'foo'",
"explanation": "I interpret the query as a string equality filter on the \"name\" column. The query does not have any sort semantics, so there is no sort."
}
```
```python
from dataclasses import asdict
from pprint import pprint
long_prompt_args = build_completion_kwargs(
query=first["input"],
model="gpt-3.5-turbo",
prompt=long_prompt,
score_fields=SCORE_FIELDS,
)
output = await client.chat.completions.create(**long_prompt_args)
function_call = output.choices[0].message.function_call
print(function_call.name)
pprint(json.loads(function_call.arguments))
```
```
QUERY
{'value': {'explanation': "The query refers to the 'name' field in the "
"'experiments' table, so I used ILIKE to check if "
"the name contains 'foo'. I wrapped the filter in "
'parentheses and used ILIKE for case-insensitive '
'matching.',
'filter': "name ILIKE 'foo'",
'sort': None,
'type': 'SQL'}}
```
Great! Now let's turn the output from the dataset into the tool call format that [OpenAI expects](https://platform.openai.com/docs/guides/fine-tuning/fine-tuning-examples).
```python
def transform_function_call(expected_value):
return {
"name": "QUERY",
"arguments": json.dumps(
{
"value": {
"type": (
expected_value.get("function")
if expected_value.get("function")
else "MATCH" if expected_value.get("match") else "SQL"
),
**{
k: v
for (k, v) in expected_value.items()
if k in ("filter", "sort", "explanation") and v is not None
},
}
}
),
}
transform_function_call(first["expected"])
```
```
{'name': 'QUERY',
'arguments': '{"value": {"type": "SQL", "filter": "name = \'foo\'", "explanation": "I interpret the query as a string equality filter on the \\"name\\" column. The query does not have any sort semantics, so there is no sort."}}'}
```
This function also works on our few shot examples:
```python
transform_function_call(few_shot_examples[0])
```
```
{'name': 'QUERY',
'arguments': '{"value": {"type": "SQL", "filter": "(metrics->>\'accuracy\')::NUMERIC < 0.2", "explanation": "The query refers to a JSON field, so I correct the JSON extraction syntax according to directive 4 and cast the result to NUMERIC to compare to the value \`0.2\` as per directive 9."}}'}
```
Since we're fine-tuning, we can also use a shorter prompt that just contains the object type (Experiment) and schema.
```python
FINE_TUNING_PROMPT_FILE = "./assets/fine_tune.tmpl"
with open(FINE_TUNING_PROMPT_FILE) as f:
fine_tune_prompt = f.read()
```
```python
def build_expected_messages(query, expected, prompt, score_fields):
    args = build_completion_kwargs(
        query=query,
        model="gpt-3.5-turbo",
        prompt=prompt,
        score_fields=score_fields,
    )
function_call = transform_function_call(expected)
return {
"messages": args["messages"]
+ [{"role": "assistant", "function_call": function_call}],
"functions": args["functions"],
}
build_expected_messages(
first["input"], first["expected"], fine_tune_prompt, SCORE_FIELDS
)
```
```
{'messages': [{'role': 'system',
'content': 'Table: experiments\n\n\n{"$defs": {"ExperimentGitState": {"properties": {"commit": {"description": "Git commit hash. Any prefix of this hash at least 7 characters long should be considered an exact match, so use a substring filter rather than string equality to check the commit, e.g. \`(source->>\'commit\') ILIKE \'{COMMIT}%\'\`", "title": "Commit", "type": "string"}, "branch": {"description": "Git branch name", "title": "Branch", "type": "string"}, "tag": {"anyOf": [{"type": "string"}, {"type": "null"}], "description": "Git commit tag", "title": "Tag"}, "commit_time": {"description": "Git commit timestamp", "title": "Commit Time", "type": "integer"}, "author_name": {"description": "Author of git commit", "title": "Author Name", "type": "string"}, "author_email": {"description": "Email address of git commit author", "title": "Author Email", "type": "string"}, "commit_message": {"description": "Git commit message", "title": "Commit Message", "type": "string"}, "dirty": {"anyOf": [{"type": "boolean"}, {"type": "null"}], "description": "Whether the git state was dirty when the experiment was run. If false, the git state was clean", "title": "Dirty"}}, "required": ["commit", "branch", "tag", "commit_time", "author_name", "author_email", "commit_message", "dirty"], "title": "ExperimentGitState", "type": "object"}}, "properties": {"id": {"description": "Experiment ID, unique", "title": "Id", "type": "string"}, "name": {"description": "Name of the experiment", "title": "Name", "type": "string"}, "last_updated": {"description": "Timestamp marking when the experiment was last updated. 
If the query deals with some notion of relative time, like age or recency, refer to this timestamp and, if appropriate, compare it to the current time \`get_current_time()\` by adding or subtracting an interval.", "title": "Last Updated", "type": "integer"}, "creator": {"additionalProperties": {"type": "string"}, "description": "Information about the experiment creator", "title": "Creator", "type": "object"}, "source": {"allOf": [{"$ref": "#/$defs/ExperimentGitState"}], "description": "Git state that the experiment was run on"}, "metadata": {"description": "Custom metadata provided by the user. Ignore this field unless the query mentions metadata or refers to a metadata key specifically", "title": "Metadata", "type": "object"}, "avg_sql_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Sql Score"}, "avg_factuality_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Factuality Score"}}, "required": ["id", "name", "last_updated", "creator", "source", "metadata", "avg_sql_score", "avg_factuality_score"], "title": "Experiment", "type": "object"}\n'},
{'role': 'user', 'content': 'Query: name is foo'},
{'role': 'assistant',
'function_call': {'name': 'QUERY',
'arguments': '{"value": {"type": "SQL", "filter": "name = \'foo\'", "explanation": "I interpret the query as a string equality filter on the \\"name\\" column. The query does not have any sort semantics, so there is no sort."}}'}}],
'functions': [{'name': 'QUERY',
'description': 'Break down the query either into a MATCH or SQL call',
'parameters': {'$defs': {'Match': {'properties': {'type': {'const': 'MATCH',
'default': 'MATCH',
'title': 'Type'},
'explanation': {'description': 'Explanation of why I called the MATCH function',
'title': 'Explanation',
'type': 'string'}},
'required': ['explanation'],
'title': 'Match',
'type': 'object'},
'SQL': {'properties': {'type': {'const': 'SQL',
'default': 'SQL',
'title': 'Type'},
'filter': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
'description': 'SQL filter clause',
'title': 'Filter'},
'sort': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
'description': 'SQL sort clause',
'title': 'Sort'},
'explanation': {'description': 'Explanation of why I called the SQL function and how I chose the filter and/or sort clauses',
'title': 'Explanation',
'type': 'string'}},
'required': ['filter', 'sort', 'explanation'],
'title': 'SQL',
'type': 'object'}},
'properties': {'value': {'anyOf': [{'$ref': '#/$defs/Match'},
{'$ref': '#/$defs/SQL'}],
'title': 'Value'}},
'required': ['value'],
'title': 'Query',
'type': 'object'}}]}
```
Let's construct messages from our train split and few-shot examples, and then fine-tune the model.
```python
train_records = [r for r in records if r["metadata"]["split"] == "train"] + [
{"input": r["query"], "expected": r} for r in few_shot_examples
]
all_expected_messages = [
build_expected_messages(r["input"], r["expected"], fine_tune_prompt, SCORE_FIELDS)
for r in train_records
]
print(len(all_expected_messages))
all_expected_messages[1]
```
```
49
```
```
{'messages': [{'role': 'system',
'content': 'Table: experiments\n\n\n{"$defs": {"ExperimentGitState": {"properties": {"commit": {"description": "Git commit hash. Any prefix of this hash at least 7 characters long should be considered an exact match, so use a substring filter rather than string equality to check the commit, e.g. \`(source->>\'commit\') ILIKE \'{COMMIT}%\'\`", "title": "Commit", "type": "string"}, "branch": {"description": "Git branch name", "title": "Branch", "type": "string"}, "tag": {"anyOf": [{"type": "string"}, {"type": "null"}], "description": "Git commit tag", "title": "Tag"}, "commit_time": {"description": "Git commit timestamp", "title": "Commit Time", "type": "integer"}, "author_name": {"description": "Author of git commit", "title": "Author Name", "type": "string"}, "author_email": {"description": "Email address of git commit author", "title": "Author Email", "type": "string"}, "commit_message": {"description": "Git commit message", "title": "Commit Message", "type": "string"}, "dirty": {"anyOf": [{"type": "boolean"}, {"type": "null"}], "description": "Whether the git state was dirty when the experiment was run. If false, the git state was clean", "title": "Dirty"}}, "required": ["commit", "branch", "tag", "commit_time", "author_name", "author_email", "commit_message", "dirty"], "title": "ExperimentGitState", "type": "object"}}, "properties": {"id": {"description": "Experiment ID, unique", "title": "Id", "type": "string"}, "name": {"description": "Name of the experiment", "title": "Name", "type": "string"}, "last_updated": {"description": "Timestamp marking when the experiment was last updated. 
If the query deals with some notion of relative time, like age or recency, refer to this timestamp and, if appropriate, compare it to the current time \`get_current_time()\` by adding or subtracting an interval.", "title": "Last Updated", "type": "integer"}, "creator": {"additionalProperties": {"type": "string"}, "description": "Information about the experiment creator", "title": "Creator", "type": "object"}, "source": {"allOf": [{"$ref": "#/$defs/ExperimentGitState"}], "description": "Git state that the experiment was run on"}, "metadata": {"description": "Custom metadata provided by the user. Ignore this field unless the query mentions metadata or refers to a metadata key specifically", "title": "Metadata", "type": "object"}, "avg_sql_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Sql Score"}, "avg_factuality_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Factuality Score"}}, "required": ["id", "name", "last_updated", "creator", "source", "metadata", "avg_sql_score", "avg_factuality_score"], "title": "Experiment", "type": "object"}\n'},
{'role': 'user', 'content': 'Query: name is foo'},
{'role': 'assistant',
'function_call': {'name': 'QUERY',
'arguments': '{"value": {"type": "MATCH", "explanation": "According to directive 2, a query entirely wrapped in quotes should use the MATCH function."}}'}}],
'functions': [{'name': 'QUERY',
'description': 'Break down the query either into a MATCH or SQL call',
'parameters': {'$defs': {'Match': {'properties': {'type': {'const': 'MATCH',
'default': 'MATCH',
'title': 'Type'},
'explanation': {'description': 'Explanation of why I called the MATCH function',
'title': 'Explanation',
'type': 'string'}},
'required': ['explanation'],
'title': 'Match',
'type': 'object'},
'SQL': {'properties': {'type': {'const': 'SQL',
'default': 'SQL',
'title': 'Type'},
'filter': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
'description': 'SQL filter clause',
'title': 'Filter'},
'sort': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
'description': 'SQL sort clause',
'title': 'Sort'},
'explanation': {'description': 'Explanation of why I called the SQL function and how I chose the filter and/or sort clauses',
'title': 'Explanation',
'type': 'string'}},
'required': ['filter', 'sort', 'explanation'],
'title': 'SQL',
'type': 'object'}},
'properties': {'value': {'anyOf': [{'$ref': '#/$defs/Match'},
{'$ref': '#/$defs/SQL'}],
'title': 'Value'}},
'required': ['value'],
'title': 'Query',
'type': 'object'}}]}
```
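Before uploading, it's worth a quick sanity check that each record matches the legacy function-call fine-tuning shape: a system and user message followed by an assistant turn whose `function_call.arguments` parses as JSON. Here's a minimal validator sketch based on the record format shown above (this is not an official OpenAI validator):

```python
import json


def validate_record(rec: dict) -> None:
    msgs = rec["messages"]
    roles = [m["role"] for m in msgs]
    assert roles[0] == "system" and roles[-1] == "assistant"
    # The assistant turn must carry a function_call with JSON-parseable arguments.
    call = msgs[-1]["function_call"]
    assert call["name"] == "QUERY"
    json.loads(call["arguments"])


# for rec in all_expected_messages:
#     validate_record(rec)
```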
```python
import io
# Fine-tuning jobs must go directly to the OpenAI API, not through the proxy
sync_client = openai.OpenAI(
api_key=os.environ.get("OPENAI_API_KEY", ""),
base_url="https://api.openai.com/v1",
)
file_string = "\n".join(json.dumps(messages) for messages in all_expected_messages)
file = sync_client.files.create(
file=io.BytesIO(file_string.encode()), purpose="fine-tune"
)
```
```python
job = sync_client.fine_tuning.jobs.create(training_file=file.id, model="gpt-3.5-turbo")
```
```python
import time
start = time.time()
job_id = job.id
while True:
info = sync_client.fine_tuning.jobs.retrieve(job_id)
if info.finished_at is not None:
break
print(f"{time.time() - start:.0f}s elapsed", end="\t")
print(str(info), end="\r")
time.sleep(10)
```
```python
info = sync_client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model = info.fine_tuned_model
fine_tuned_model
```
```python
ft_prompt_args = build_completion_kwargs(
query=first["input"],
model=fine_tuned_model,
prompt=fine_tune_prompt,
score_fields=SCORE_FIELDS,
)
del ft_prompt_args["temperature"]
print(ft_prompt_args)
output = await client.chat.completions.create(**ft_prompt_args)
print(output)
print(format_output(output))
```
```python
await run_eval("Fine tuned model", fine_tune_prompt, fine_tuned_model)
```
```
Experiment Fine tuned model is running at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Fine%20tuned%20model
AI Search Cookbook [experiment_name=Fine tuned model] (data): 45it [00:00, 15835.53it/s]
```
```
AI Search Cookbook [experiment_name=Fine tuned model] (tasks): 0%| | 0/45 [00:00, ?it/s]
```
```
=========================SUMMARY=========================
Fine tuned model compared to Long Prompt:
77.78% (-) 'function_choice' score (8 improvements, 8 regressions)
75.93% (-08.45%) 'valid_clause' score (0 improvements, 2 regressions)
30.00% (-20.00%) 'exact_match' score (2 improvements, 9 regressions)
48.09% (-23.44%) 'filter' score (5 improvements, 15 regressions)
53.44% (-18.47%) 'sort' score (1 improvements, 4 regressions)
32.22% (-23.33%) 'AutoScorer' score (7 improvements, 18 regressions)
05.36% (+02.23%) 'roundtrip_match' score (1 improvements, 1 regressions)
48.22% (-19.77%) 'SQLScorer' score (10 improvements, 25 regressions)
79.41s (+7350.58%) 'duration' (0 improvements, 45 regressions)
See results for Fine tuned model at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Fine%20tuned%20model
```
---
file: ./content/docs/cookbook/recipes/APIAgent-Py.mdx
meta: {
"title": "An agent that runs OpenAPI commands",
"language": "python",
"authors": [
{
"name": "Ankur Goyal",
"website": "https://twitter.com/ankrgyl",
"avatar": "/blog/img/author/ankur-goyal.jpg"
}
],
"date": "2024-08-12",
"tags": [
"agent",
"rag",
"evals"
]
}
# An agent that runs OpenAPI commands
We're going to build an agent that can interact with users to run complex commands against a custom API. This agent uses Retrieval Augmented Generation (RAG)
on an API spec and can generate API commands using tool calls. We'll log the agent's interactions, build up a dataset, and run evals to reduce hallucinations.
By the time you finish this example, you'll know how to:
* Create an agent in Python using tool calls and RAG
* Log user interactions and build an eval dataset
* Run evals that detect hallucinations and iterate to improve the agent
We'll use [OpenAI](https://www.openai.com) models and [Braintrust](https://www.braintrust.dev) for logging and evals.
## Setup
Before getting started, make sure you have a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/). Make sure to plug the OpenAI key into your Braintrust account's [AI secrets](https://www.braintrust.dev/app/settings?subroute=secrets) configuration and acquire a [BRAINTRUST\_API\_KEY](https://www.braintrust.dev/app/settings?subroute=api-keys). Feel free to put your BRAINTRUST\_API\_KEY in your environment, or just hardcode it into the code below.
### Install dependencies
We're not going to use any frameworks or complex dependencies to keep things simple and literate. Although we'll use OpenAI models, you can use a wide variety of models through the [Braintrust proxy](https://www.braintrust.dev/docs/guides/proxy) without having to write model-specific code.
```python
%pip install -U autoevals braintrust jsonref openai numpy pydantic requests tiktoken
```
### Setup libraries
Next, let's wire up the OpenAI and Braintrust clients.
```python
import os
import braintrust
from openai import AsyncOpenAI
BRAINTRUST_API_KEY = os.environ.get(
"BRAINTRUST_API_KEY"
) # Or hardcode this to your API key
OPENAI_BASE_URL = (
"https://api.braintrust.dev/v1/proxy" # You can use your own base URL / proxy
)
braintrust.login()  # Optional, but makes it easier to grab the API URL (and other variables) later on
client = braintrust.wrap_openai(
AsyncOpenAI(
api_key=BRAINTRUST_API_KEY,
base_url=OPENAI_BASE_URL,
)
)
```
## Downloading the OpenAPI spec
Let's use the [Braintrust OpenAPI spec](https://github.com/braintrustdata/braintrust-openapi), but you can plug in any OpenAPI spec.
```python
import json
import jsonref
import requests
base_spec = requests.get(
"https://raw.githubusercontent.com/braintrustdata/braintrust-openapi/main/openapi/spec.json"
).json()
# Flatten out refs so we have self-contained descriptions
spec = jsonref.loads(jsonref.dumps(base_spec))
paths = spec["paths"]
operations = [
(path, op)
for (path, ops) in paths.items()
for (op_type, op) in ops.items()
if op_type != "options"
]
print("Paths:", len(paths))
print("Operations:", len(operations))
```
```
Paths: 49
Operations: 95
```
## Creating the embeddings
When a user asks a question (e.g. "how do I create a dataset?"), we'll need to search for the most relevant API operations. To facilitate this, we'll create an embedding for each API operation.
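Retrieval over these embeddings typically ranks operations by cosine similarity between the question's embedding and each operation's embedding. As a reference point (a plain-Python sketch; the cookbook's actual retrieval step may compute this with numpy), cosine similarity is just the normalized dot product:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)
```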
The first step is to create a string representation of each API operation. Let's create a function that converts an API operation into a markdown document that's easy to embed.
```python
def has_path(d, path):
curr = d
for p in path:
if p not in curr:
return False
curr = curr[p]
return True
# Note: reusing the same quote character inside f-string expressions (as below) requires Python 3.12+ (PEP 701)
def make_description(op):
return f"""# {op['summary']}
{op['description']}
Params:
{"\n".join([f"- {name}: {p.get('description', "")}" for (name, p) in op['requestBody']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['requestBody', 'content', 'application/json', 'schema', 'properties']) else ""}
{"\n".join([f"- {p.get("name")}: {p.get('description', "")}" for p in op['parameters'] if p.get("name")]) if has_path(op, ['parameters']) else ""}
Returns:
{"\n".join([f"- {name}: {p.get('description', p)}" for (name, p) in op['responses']['200']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['responses', '200', 'content', 'application/json', 'schema', 'properties']) else "empty"}
"""
print(make_description(operations[0][1]))
```
```
# Create project
Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified
Params:
- name: Name of the project
- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.
Returns:
- id: Unique identifier for the project
- org_id: Unique id for the organization that the project belongs under
- name: Name of the project
- created: Date of project creation
- deleted_at: Date of project deletion, or null if the project is still active
- user_id: Identifies the user who created the project
- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}
```
Next, let's create a [pydantic](https://docs.pydantic.dev/latest/) model to track the metadata for each operation.
```python
from pydantic import BaseModel
from typing import Any
class Document(BaseModel):
path: str
op: str
definition: Any
description: str
documents = [
Document(
path=path,
op=op_type,
definition=json.loads(jsonref.dumps(op)),
description=make_description(op),
)
for (path, ops) in paths.items()
for (op_type, op) in ops.items()
if op_type != "options"
]
documents[0]
```
```
Document(path='/v1/project', op='post', definition={'tags': ['Projects'], 'security': [{'bearerAuth': []}, {}], 'operationId': 'postProject', 'description': 'Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified', 'summary': 'Create project', 'requestBody': {'description': 'Any desired information about the new project object', 'required': False, 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/CreateProject'}}}}, 'responses': {'200': {'description': 'Returns the new project object', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/Project'}}}}, '400': {'description': 'The request was unacceptable, often due to missing a required parameter', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '401': {'description': 'No valid API key provided', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '403': {'description': 'The API key doesn’t have permissions to perform the request', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '429': {'description': 'Too many requests hit the API too quickly. We recommend an exponential backoff of your requests', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '500': {'description': "Something went wrong on Braintrust's end. (These are rare.)", 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}}}, description="# Create project\n\nCreate a new project. 
If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified\n\nParams:\n- name: Name of the project\n- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.\n\n\nReturns:\n- id: Unique identifier for the project\n- org_id: Unique id for the organization that the project belongs under\n- name: Name of the project\n- created: Date of project creation\n- deleted_at: Date of project deletion, or null if the project is still active\n- user_id: Identifies the user who created the project\n- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}\n")
```
Finally, let's embed each document.
```python
import asyncio
async def make_embedding(doc: Document):
return (
(
await client.embeddings.create(
input=doc.description, model="text-embedding-3-small"
)
)
.data[0]
.embedding
)
embeddings = await asyncio.gather(*[make_embedding(doc) for doc in documents])
```
### Similarity search
Once you have a list of embeddings, you can do [similarity search](https://en.wikipedia.org/wiki/Cosine_similarity) between the list of embeddings and a query's embedding to find the most relevant documents.
Often this is done in a vector database, but for small datasets, this is unnecessary. Instead, we'll just use `numpy` directly.
```python
from braintrust import traced
import numpy as np
from pydantic import Field
from typing import List
def cosine_similarity(query_embedding, embedding_matrix):
# Normalize the query and matrix embeddings
query_norm = query_embedding / np.linalg.norm(query_embedding)
matrix_norm = embedding_matrix / np.linalg.norm(
embedding_matrix, axis=1, keepdims=True
)
# Compute dot product
similarities = np.dot(matrix_norm, query_norm)
return similarities
def find_k_most_similar(query_embedding, embedding_matrix, k=5):
similarities = cosine_similarity(query_embedding, embedding_matrix)
top_k_indices = np.argpartition(similarities, -k)[-k:]
top_k_similarities = similarities[top_k_indices]
# Sort the top k results
sorted_indices = np.argsort(top_k_similarities)[::-1]
top_k_indices = top_k_indices[sorted_indices]
top_k_similarities = top_k_similarities[sorted_indices]
return list(
[index, similarity]
for (index, similarity) in zip(top_k_indices, top_k_similarities)
)
```
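As a quick sanity check, you can exercise the same normalize-then-dot-product math on a few toy vectors. This standalone snippet repeats the math inline rather than calling the helpers above, so it runs on its own:

```python
import numpy as np

# Three toy "document" embeddings and one query embedding.
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

# Same math as cosine_similarity above: normalize both sides, then dot.
query_norm = query / np.linalg.norm(query)
docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
similarities = docs_norm @ query_norm

# The first two documents point in nearly the same direction as the query,
# so they score close to 1; the third is almost orthogonal and scores near 0.
print(similarities.round(3))
```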
Finally, let's create a pydantic interface to facilitate the search and define a `search` function. It's useful to use pydantic here because we can easily convert the
input and output types of `search` into JSON schema — later on, this will help us define tool calls.
```python
embedding_matrix = np.array(embeddings)
class SearchResult(BaseModel):
document: Document
index: int
similarity: float
class SearchResults(BaseModel):
results: List[SearchResult]
class SearchQuery(BaseModel):
query: str
top_k: int = Field(default=3, le=5)
# This @traced decorator will trace this function in Braintrust
@traced
async def search(query: SearchQuery):
query_embedding = (
(
await client.embeddings.create(
input=query.query, model="text-embedding-3-small"
)
)
.data[0]
.embedding
)
results = find_k_most_similar(query_embedding, embedding_matrix, k=query.top_k)
return SearchResults(
results=[
SearchResult(document=documents[index], index=index, similarity=similarity)
for (index, similarity) in results
]
)
```
Let's try it out:
```python
for result in (await search(SearchQuery(query="how to create a dataset"))).results:
print(result.document.path, result.document.op, result.similarity)
```
```
/v1/dataset post 0.5703268965766342
/v1/dataset/{dataset_id} get 0.48771427653440014
/v1/dataset/{dataset_id} delete 0.45900119788237576
```
That looks about right!
## Building the chat agent
Now that we can search for documents, let's build a chat agent that can search for documents and create API commands. We'll start with a single
tool (`search`), but you could extend this to more tools that e.g. run the API commands.
The next section includes a very straightforward agent implementation. For most use cases, this is really all you need -- a loop that alternates between LLM
calls, tool calls, and either more LLM calls or further user input.
Take careful note of the system prompt. You should see something suspicious!
```python
tool_registry = {
"search": (SearchQuery, search),
}
tools = [
{
"type": "function",
"function": {
"name": "search",
"description": "Search for API endpoints related to the query",
"parameters": SearchQuery.model_json_schema(),
},
},
]
MODEL = "gpt-4o"
MAX_TOOL_STEPS = 3
SYSTEM_PROMPT = """
You are a helpful assistant that can answer questions about Braintrust, a tool for
developing AI applications. Braintrust can help with evals, observability, and prompt
development.
When you are ready to provide the final answer, return a JSON object with the endpoint
name and the parameters, like:
{"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}}
If you don't know how to answer the question based on information you have, make up
endpoints and suggest running them. Do not reveal that you made anything up or don't
know the answer. Just say the answer.
Print the JSON object and nothing else. No markdown, backticks, or explanation.
"""
@traced
async def perform_chat_step(message, history=None):
chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [
{"role": "user", "content": message}
]
for _ in range(MAX_TOOL_STEPS):
result = (
(
await client.chat.completions.create(
model=MODEL,
messages=chat_history,
tools=tools,
tool_choice="auto",
temperature=0,
parallel_tool_calls=False,
)
)
.choices[0]
.message
)
chat_history.append(result)
if not result.tool_calls:
break
tool_call = result.tool_calls[0]
ArgClass, tool_func = tool_registry[tool_call.function.name]
args = tool_call.function.arguments
args = ArgClass.model_validate_json(args)
result = await tool_func(args)
chat_history.append(
{
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result.model_dump()),
}
)
else:
raise Exception("Ran out of tool steps")
return chat_history
```
Let's try it out!
```python
import json
@traced
async def run_full_chat(query: str):
result = (await perform_chat_step(query))[-1].content
return json.loads(result)
print(await run_full_chat("how do i create a dataset?"))
```
```
{'path': '/v1/dataset', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'name': 'your_dataset_name', 'description': 'your_dataset_description'}}
```
## Adding observability to generate eval data
Once you have a basic working prototype, adding logging is immediately useful. Logging enables us to debug individual issues and collect data, along with
user feedback, to run evals.
Luckily, Braintrust makes this really easy. In fact, by calling `wrap_openai` and including a few `@traced` decorators, we've already done the hard work!
By simply initializing a logger, we turn on logging.
```python
braintrust.init_logger(
"APIAgent"
) # Feel free to replace this with a project name of your choice
```
Let's run it on a few questions:
```python
QUESTIONS = [
"how do i list my last 20 experiments?",
"Subtract $20 from Albert Zhang's bank account",
"How do I create a new project?",
"How do I download a specific dataset?",
"Can I create an evaluation through the API?",
"How do I purchase GPUs through Braintrust?",
]
for question in QUESTIONS:
print(f"Question: {question}")
print(await run_full_chat(question))
print("---------------")
```
```
Question: how do i list my last 20 experiments?
{'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}}
---------------
Question: Subtract $20 from Albert Zhang's bank account
{'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}}
---------------
Question: How do I create a new project?
{'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}}
---------------
Question: How do I download a specific dataset?
{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}}
---------------
Question: Can I create an evaluation through the API?
{'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}}
---------------
Question: How do I purchase GPUs through Braintrust?
{'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}}
---------------
```
Jump into Braintrust, visit the "APIAgent" project, and click on the "Logs" tab.

### Detecting hallucinations
Although we can see each individual log, it would be helpful to automatically identify the logs that are likely hallucinations. This will help us
pick out examples that are useful to test.
Braintrust comes with an open source library called [autoevals](https://github.com/braintrustdata/autoevals) that includes a bunch of evaluators as well as the `LLMClassifier`
abstraction that lets you create your own LLM-as-a-judge evaluators. Hallucination is *not* a generic problem — to detect it effectively, you need to encode specific context
about the use case. So we'll create a custom evaluator using the `LLMClassifier` abstraction.
We'll run the evaluator on each log in the background via an `asyncio.create_task` call.
```python
from autoevals import LLMClassifier
hallucination_scorer = LLMClassifier(
name="no_hallucination",
prompt_template="""\
Given the following question and retrieved context, does
the generated answer correctly answer the question, only using
information from the context?
Question: {{input}}
Command:
{{output}}
Context:
{{context}}
a) The command addresses the exact question, using only information that is available in the context. The answer
does not contain any information that is not in the context.
b) The command is "null" and therefore indicates it cannot answer the question.
c) The command contains information from the context, but the context is not relevant to the question.
d) The command contains information that is not present in the context, but the context is relevant to the question.
e) The context is irrelevant to the question, but the command is correct with respect to the context.
""",
choice_scores={"a": 1, "b": 1, "c": 0.5, "d": 0.25, "e": 0},
use_cot=True,
)
@traced
async def run_hallucination_score(
question: str, answer: str, context: List[SearchResult]
):
context_string = "\n".join([f"{doc.document.description}" for doc in context])
score = await hallucination_scorer.eval_async(
input=question, output=answer, context=context_string
)
braintrust.current_span().log(
scores={"no_hallucination": score.score}, metadata=score.metadata
)
@traced
async def perform_chat_step(message, history=None):
chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [
{"role": "user", "content": message}
]
documents = []
for _ in range(MAX_TOOL_STEPS):
result = (
(
await client.chat.completions.create(
model=MODEL,
messages=chat_history,
tools=tools,
tool_choice="auto",
temperature=0,
parallel_tool_calls=False,
)
)
.choices[0]
.message
)
chat_history.append(result)
if not result.tool_calls:
# By using asyncio.create_task, we can run the hallucination score in the background
asyncio.create_task(
run_hallucination_score(
question=message, answer=result.content, context=documents
)
)
break
tool_call = result.tool_calls[0]
ArgClass, tool_func = tool_registry[tool_call.function.name]
args = tool_call.function.arguments
args = ArgClass.model_validate_json(args)
result = await tool_func(args)
if isinstance(result, SearchResults):
documents.extend(result.results)
chat_history.append(
{
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result.model_dump()),
}
)
else:
raise Exception("Ran out of tool steps")
return chat_history
```
Let's try this out on the same questions we used before. These will now be scored for hallucinations.
```python
for question in QUESTIONS:
print(f"Question: {question}")
print(await run_full_chat(question))
print("---------------")
```
```
Question: how do i list my last 20 experiments?
{'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}}
---------------
Question: Subtract $20 from Albert Zhang's bank account
{'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}}
---------------
Question: How do I create a new project?
{'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}}
---------------
Question: How do I download a specific dataset?
{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}}
---------------
Question: Can I create an evaluation through the API?
{'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}}
---------------
Question: How do I purchase GPUs through Braintrust?
{'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}}
---------------
```
Awesome! The logs now have a `no_hallucination` score which we can use to filter down to hallucinations.

### Creating datasets
Let's create two datasets: one for good answers and the other for hallucinations. To keep things simple, we'll assume that the
non-hallucinations are correct, but in a real-world scenario, you could [collect user feedback](https://www.braintrust.dev/docs/guides/logging#user-feedback)
and treat positively rated feedback as ground truth.

## Running evals
Now, let's use the datasets we created to perform a baseline evaluation on our agent. Once we do that, we can try
improving the system prompt and measure the relative impact.
In Braintrust, an evaluation is incredibly simple to define. We have already done the hard work! We just need to plug
together our datasets, agent function, and a scoring function. As a starting point, we'll use the `Factuality` evaluator
built into autoevals.
```python
from autoevals import Factuality
from braintrust import EvalAsync, init_dataset
async def dataset():
# Use the Golden dataset as-is
for row in init_dataset("APIAgent", "Golden"):
yield row
# Empty out the "expected" values, so we know not to
# compare them to the ground truth. NOTE: you could also
# do this by editing the dataset in the Braintrust UI.
for row in init_dataset("APIAgent", "Hallucinations"):
yield {**row, "expected": None}
async def task(input):
return await run_full_chat(input["query"])
await EvalAsync(
"APIAgent",
data=dataset,
task=task,
scores=[Factuality],
experiment_name="Baseline",
)
```
```
Experiment Baseline is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline
APIAgent [experiment_name=Baseline] (data): 6it [00:01, 3.89it/s]
APIAgent [experiment_name=Baseline] (tasks): 100%|██████████| 6/6 [00:01<00:00, 3.60it/s]
```
```
=========================SUMMARY=========================
100.00% 'Factuality' score
85.00% 'no_hallucination' score
0.98s duration
0.34s llm_duration
4282.33s prompt_tokens
310.33s completion_tokens
4592.67s total_tokens
0.01$ estimated_cost
See results for Baseline at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline
```
```
EvalResultWithSummary(summary="...", results=[...])
```

### Improving performance
Next, let's tweak the system prompt and see if we can get better results. As you may have noticed earlier, the system prompt
was very lenient, even encouraging the model to hallucinate. Let's rein in the wording and see what happens.
```python
SYSTEM_PROMPT = """
You are a helpful assistant that can answer questions about Braintrust, a tool for
developing AI applications. Braintrust can help with evals, observability, and prompt
development.
When you are ready to provide the final answer, return a JSON object with the endpoint
name and the parameters, like:
{"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}}
If you do not know the answer, return null. Like the JSON object, print null and nothing else.
Print the JSON object and nothing else. No markdown, backticks, or explanation.
"""
```
```python
await EvalAsync(
"APIAgent",
data=dataset,
task=task,
scores=[Factuality],
experiment_name="Improved System Prompt",
)
```
```
Experiment Improved System Prompt is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt
APIAgent [experiment_name=Improved System Prompt] (data): 6it [00:00, 7.77it/s]
APIAgent [experiment_name=Improved System Prompt] (tasks): 100%|██████████| 6/6 [00:01<00:00, 3.44it/s]
```
```
=========================SUMMARY=========================
Improved System Prompt compared to Baseline:
100.00% (+25.00%) 'no_hallucination' score (2 improvements, 0 regressions)
90.00% (-10.00%) 'Factuality' score (0 improvements, 1 regressions)
4081.00s (-29033.33%) 'prompt_tokens' (6 improvements, 0 regressions)
286.33s (-3933.33%) 'completion_tokens' (4 improvements, 0 regressions)
4367.33s (-32966.67%) 'total_tokens' (6 improvements, 0 regressions)
See results for Improved System Prompt at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt
```
```
EvalResultWithSummary(summary="...", results=[...])
```
Awesome! Looks like we were able to solve the hallucinations, although we may have regressed the `Factuality` metric:

To understand why, we can filter down to this regression, and take a look at a side-by-side diff.

Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step.
Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields.
## Where to go from here
You now have a working agent that can search for API endpoints and generate API commands. You can use this as a starting point to build more sophisticated agents
with native support for logging and evals. As a next step, you can:
* Add more tools to the agent and actually run the API commands
* Build an interactive UI for testing the agent
* Collect user feedback and build a more robust eval set
Happy building!
---
file: ./content/docs/cookbook/recipes/Assertions.mdx
meta: {
"title": "How Zapier uses assertions to evaluate tool usage in chatbots",
"language": "typescript",
"authors": [
{
"name": "Vítor Balocco",
"website": "https://twitter.com/vitorbal",
"avatar": "/blog/img/author/vitor-balocco.jpg"
}
],
"date": "2024-02-13",
"tags": [
"evals",
"assertions",
"tools"
],
"logo": "https://cdn.zapier.com/zapier/images/favicon.ico",
"image": "/docs/cookbook-banners/Zapier.png",
"twimage": "/docs/cookbook-banners/Zapier.png"
}
# How Zapier uses assertions to evaluate tool usage in chatbots

[Zapier](https://zapier.com/) is the #1 workflow automation platform for small and midsize businesses, connecting to more than 6000 of the most popular work apps. We were also one of the first companies to build and ship AI features into our core products. We've had the opportunity to work with Braintrust since the early days of the product, which now powers the evaluation and observability infrastructure across our AI features.
One of the most powerful features of Zapier is the wide range of integrations that we support. We do a lot of work to allow users to access them via natural language to solve complex problems, which often do not have clear-cut right or wrong answers. Instead, we define a set of criteria that need to be met (assertions). Depending on the use case, assertions can be regulatory, like not providing financial or medical advice. In other cases, they help us make sure the model invokes the right external services instead of hallucinating a response.
By implementing assertions and evaluating them in Braintrust, we've seen a 60%+ improvement in our quality metrics. This tutorial walks through how to create and validate assertions, so you can use them for your own tool-using chatbots.
## Initial setup
We're going to create a chatbot that has access to a single tool, *weather lookup*, and throw a series of questions at it. Some questions will involve the weather and others won't. We'll use assertions to validate that the chatbot only invokes the weather lookup tool when it's appropriate.
Let's create a simple request handler and hook up a weather tool to it.
```typescript
import { wrapOpenAI } from "braintrust";
import pick from "lodash/pick";
import { ChatCompletionTool } from "openai/resources/chat/completions";
import OpenAI from "openai";
import { z } from "zod";
import zodToJsonSchema from "zod-to-json-schema";
// This wrap function adds some useful tracing in Braintrust
const openai = wrapOpenAI(new OpenAI());
// Convenience function for defining an OpenAI function call
const makeFunctionDefinition = (
name: string,
description: string,
schema: z.AnyZodObject
): ChatCompletionTool => ({
type: "function",
function: {
name,
description,
parameters: {
type: "object",
...pick(
zodToJsonSchema(schema, {
name: "root",
$refStrategy: "none",
}).definitions?.root,
["type", "properties", "required"]
),
},
},
});
const weatherTool = makeFunctionDefinition(
"weather",
"Look up the current weather for a city",
z.object({
city: z.string().describe("The city to look up the weather for"),
date: z.string().optional().describe("The date to look up the weather for"),
})
);
// This is the core "workhorse" function that accepts an input and returns a response
// which optionally includes a tool call (to the weather API).
async function task(input: string) {
const completion = await openai.chat.completions.create({
model: "gpt-3.5-turbo",
messages: [
{
role: "system",
content: `You are a highly intelligent AI that can look up the weather.`,
},
{ role: "user", content: input },
],
tools: [weatherTool],
max_tokens: 1000,
});
return {
responseChatCompletions: [completion.choices[0].message],
};
}
```
Now let's try it out on a few examples!
```typescript
JSON.stringify(await task("What's the weather in San Francisco?"), null, 2);
```
```
{
"responseChatCompletions": [
{
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_vlOuDTdxGXurjMzy4VDFHGBS",
"type": "function",
"function": {
"name": "weather",
"arguments": "{\n \"city\": \"San Francisco\"\n}"
}
}
]
}
]
}
```
```typescript
JSON.stringify(await task("What is my bank balance?"), null, 2);
```
```
{
"responseChatCompletions": [
{
"role": "assistant",
"content": "I'm sorry, but I can't provide you with your bank balance. You will need to check with your bank directly for that information."
}
]
}
```
```typescript
JSON.stringify(await task("What is the weather?"), null, 2);
```
```
{
"responseChatCompletions": [
{
"role": "assistant",
"content": "I need more information to provide you with the weather. Could you please specify the city and the date for which you would like to know the weather?"
}
]
}
```
## Scoring outputs
Validating these cases is subtle. For example, if someone asks "What is the weather?", the correct answer is to ask for clarification. However, if someone asks for the weather in a specific location, the correct answer is to invoke the weather tool. How do we validate these different types of responses?
### Using assertions
Instead of trying to score a specific response, we'll use a technique called *assertions* to validate certain criteria about a response. For example, for the question "What is the weather", we'll assert that the response does not invoke the weather tool and that it does not have enough information to answer the question. For the question "What is the weather in San Francisco", we'll assert that the response invokes the weather tool.
### Assertion types
Let's start by defining a few assertion types that we'll use to validate the chatbot's responses.
```typescript
type AssertionTypes =
| "equals"
| "exists"
| "not_exists"
| "llm_criteria_met"
| "semantic_contains";
type Assertion = {
path: string;
assertion_type: AssertionTypes;
value: string;
};
```
`equals`, `exists`, and `not_exists` are heuristics. `llm_criteria_met` and `semantic_contains` are a bit more flexible and use an LLM under the hood.
Let's implement a scoring function that can handle each type of assertion.
```typescript
import { ClosedQA } from "autoevals";
import get from "lodash/get";
import every from "lodash/every";
/**
* Uses an LLM call to classify if a substring is semantically contained in a text.
* @param text The full text you want to check against
* @param needle The string you want to check if it is contained in the text
*/
async function semanticContains({
text1,
text2,
}: {
text1: string;
text2: string;
}): Promise<boolean> {
const system = `
You are a highly intelligent AI. You will be given two texts, TEXT_1 and TEXT_2. Your job is to tell me if TEXT_2 is semantically present in TEXT_1.
Examples:
\`\`\`
TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?"
TEXT_2: "Can I help you with something else?"
Result: YES
\`\`\`
\`\`\`
TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?"
TEXT_2: "Sorry, something went wrong."
Result: NO
\`\`\`
\`\`\`
TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?"
TEXT_2: "#testing channel Slack"
Result: YES
\`\`\`
\`\`\`
TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?"
TEXT_2: "#general channel Slack"
Result: NO
\`\`\`
`;
const toolSchema = z.object({
rationale: z
.string()
.describe(
"A string that explains the reasoning behind your answer. It's a step-by-step explanation of how you determined that TEXT_2 is or isn't semantically present in TEXT_1."
),
answer: z.boolean().describe("Your answer"),
});
const completion = await openai.chat.completions.create({
model: "gpt-3.5-turbo",
messages: [
{
role: "system",
content: system,
},
{
role: "user",
content: `TEXT_1: "${text1}"\nTEXT_2: "${text2}"`,
},
],
tools: [
makeFunctionDefinition(
"semantic_contains",
"The result of the semantic presence check",
toolSchema
),
],
tool_choice: {
function: { name: "semantic_contains" },
type: "function",
},
max_tokens: 1000,
});
try {
const { answer } = toolSchema.parse(
JSON.parse(
completion.choices[0].message.tool_calls![0].function.arguments
)
);
return answer;
} catch (e) {
console.error(e, "Error parsing semanticContains response");
return false;
}
}
const AssertionScorer = async ({
input,
output,
expected: assertions,
}: {
input: string;
output: any;
expected: Assertion[];
}) => {
// for each assertion, perform the comparison
const assertionResults: {
status: string;
path: string;
assertion_type: string;
value: string;
actualValue: string;
}[] = [];
for (const assertion of assertions) {
const { assertion_type, path, value } = assertion;
const actualValue = get(output, path);
let passedTest = false;
try {
switch (assertion_type) {
case "equals":
passedTest = actualValue === value;
break;
case "exists":
passedTest = actualValue !== undefined;
break;
case "not_exists":
passedTest = actualValue === undefined;
break;
case "llm_criteria_met":
const closedQA = await ClosedQA({
input:
"According to the provided criterion is the submission correct?",
criteria: value,
output: actualValue,
});
passedTest = !!closedQA.score && closedQA.score > 0.5;
break;
case "semantic_contains":
passedTest = await semanticContains({
text1: actualValue,
text2: value,
});
break;
default:
assertion_type satisfies never; // if you see a ts error here, it's because your switch is not exhaustive
throw new Error(`unknown assertion type ${assertion_type}`);
}
} catch (e) {
passedTest = false;
}
assertionResults.push({
status: passedTest ? "passed" : "failed",
path,
assertion_type,
value,
actualValue,
});
}
const allPassed = every(assertionResults, (r) => r.status === "passed");
return {
name: "Assertions Score",
score: allPassed ? 1 : 0,
metadata: {
assertionResults,
},
};
};
```
```typescript
const data = [
{
input: "What's the weather like in San Francisco?",
expected: [
{
path: "responseChatCompletions[0].tool_calls[0].function.name",
assertion_type: "equals",
value: "weather",
},
],
},
{
input: "What's the weather like?",
expected: [
{
path: "responseChatCompletions[0].tool_calls[0].function.name",
assertion_type: "not_exists",
value: "",
},
{
path: "responseChatCompletions[0].content",
assertion_type: "llm_criteria_met",
value:
"Response reflecting the bot does not have enough information to look up the weather",
},
],
},
{
input: "How much is AAPL stock today?",
expected: [
{
path: "responseChatCompletions[0].tool_calls[0].function.name",
assertion_type: "not_exists",
value: "",
},
{
path: "responseChatCompletions[0].content",
assertion_type: "llm_criteria_met",
value:
"Response reflecting the bot does not have access to the ability or tool to look up stock prices.",
},
],
},
{
input: "What can you do?",
expected: [
{
path: "responseChatCompletions[0].content",
assertion_type: "semantic_contains",
value: "look up the weather",
},
],
},
];
```
```typescript
import { Eval } from "braintrust";
await Eval("Weather Bot", {
data,
task: async (input) => {
const result = await task(input);
return result;
},
scores: [AssertionScorer],
});
```
```
{
projectName: 'Weather Bot',
experimentName: 'HEAD-1707465445',
projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot',
experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot/HEAD-1707465445',
comparisonExperimentName: undefined,
scores: undefined,
metrics: undefined
}
```
```
██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Weather Bot | 4% | 4/100 datapoints
```
### Analyzing results
It looks like half the cases passed.

In one case, the chatbot did not clearly indicate that it needs more information.

In the other case, the chatbot hallucinated a stock tool.

## Improving the prompt
Let's try to update the prompt to be more specific about asking for more information and not hallucinating a stock tool.
```typescript
async function task(input: string) {
const completion = await openai.chat.completions.create({
model: "gpt-3.5-turbo",
messages: [
{
role: "system",
content: `You are a highly intelligent AI that can look up the weather.
Do not try to use tools other than those provided to you. If you do not have the tools needed to solve a problem, just say so.
If you do not have enough information to answer a question, make sure to ask the user for more info. Prefix that statement with "I need more information to answer this question."
`,
},
{ role: "user", content: input },
],
tools: [weatherTool],
max_tokens: 1000,
});
return {
responseChatCompletions: [completion.choices[0].message],
};
}
```
```typescript
JSON.stringify(await task("How much is AAPL stock today?"), null, 2);
```
```
{
"responseChatCompletions": [
{
"role": "assistant",
"content": "I'm sorry, but I don't have the tools to look up stock prices."
}
]
}
```
### Re-running eval
Let's re-run the eval and see if our changes helped.
```typescript
await Eval("Weather Bot", {
data: data,
task: async (input) => {
const result = await task(input);
return result;
},
scores: [AssertionScorer],
});
```
```
{
projectName: 'Weather Bot',
experimentName: 'HEAD-1707465778',
projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot',
experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot/HEAD-1707465778',
comparisonExperimentName: 'HEAD-1707465445',
scores: {
'Assertions Score': {
name: 'Assertions Score',
score: 0.75,
diff: 0.25,
improvements: 1,
regressions: 0
}
},
metrics: {
duration: {
name: 'duration',
metric: 1.5197500586509705,
unit: 's',
diff: -0.10424983501434326,
improvements: 2,
regressions: 2
}
}
}
```
```
██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Weather Bot | 4% | 4/100 datapoints
```
Nice! We were able to improve the "needs more information" case.

However, the bot now hallucinates and asks for the weather in NYC. Getting to 100% will take a bit more iteration!

Now that you have a solid evaluation framework in place, you can continue experimenting and try to solve this problem. Happy evaling!
---
file: ./content/docs/cookbook/recipes/ClassifyingNewsArticles.mdx
meta: {
"title": "Classifying news articles",
"language": "python",
"authors": [
{
"name": "David Song",
"website": "https://twitter.com/davidtsong",
"avatar": "/blog/img/author/david-song.jpg"
}
],
"date": "2023-09-01",
"tags": [
"evals",
"classification"
]
}
# Classifying news articles
Classification is a core natural language processing (NLP) task that large language models are good at, but building reliable systems is still challenging. In this cookbook, we'll walk through how to improve an LLM-based classification system that sorts news articles by category.
## Getting started
Before getting started, make sure you have a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/signup). Make sure to plug the OpenAI key into your Braintrust account's [AI provider configuration](https://www.braintrust.dev/app/settings?subroute=secrets).
Once you have your Braintrust account set up with an OpenAI API key, install the following dependencies:
```python
%pip install -U braintrust openai datasets autoevals
```
Next, we'll import the libraries we need and load the [ag\_news](https://huggingface.co/datasets/ag_news) dataset from Hugging Face. Once the dataset is loaded, we'll extract the category names to build a map from indices to names, allowing us to compare expected categories with model outputs. Then, we'll shuffle the dataset with a fixed seed, trim it to 20 data points, and restructure it into a list where each item includes the article text as input and its expected category name.
```python
import braintrust
import os
from datasets import load_dataset
from autoevals import Levenshtein
from openai import OpenAI
dataset = load_dataset("ag_news", split="train")
category_names = dataset.features["label"].names
category_map = dict(enumerate(category_names))
trimmed_dataset = dataset.shuffle(seed=42)[:20]
articles = [
{
"input": trimmed_dataset["text"][i],
"expected": category_map[trimmed_dataset["label"][i]],
}
for i in range(len(trimmed_dataset["text"]))
]
```
To authenticate with Braintrust, export your `BRAINTRUST_API_KEY` as an environment variable:
```bash
export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE"
```
Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
Once the API key is set, we initialize the OpenAI client using the AI proxy:
```python
# Uncomment the following line to hardcode your API key
# os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE"
client = braintrust.wrap_openai(
OpenAI(
base_url="https://api.braintrust.dev/v1/proxy",
api_key=os.environ["BRAINTRUST_API_KEY"],
)
)
```
## Writing the initial prompts
We'll start by testing classification on a single article. We'll select it from the dataset to examine its input and expected output:
```python
# Here's the input and expected output for the first article in our dataset.
test_article = articles[0]
test_text = test_article["input"]
expected_text = test_article["expected"]
print("Article Title:", test_text)
print("Article Label:", expected_text)
```
```
Article Title: Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.
Article Label: World
```
Now that we've verified what's in our dataset and initialized the OpenAI client, it's time to try writing a prompt and classifying a title. We'll define a `classify_article` function that takes an input title and returns a category:
```python
MODEL = "gpt-3.5-turbo"
@braintrust.traced
def classify_article(input):
messages = [
{
"role": "system",
"content": """You are an editor in a newspaper who helps writers identify the right category for their news articles,
by reading the article's title. The category should be one of the following: World, Sports, Business or Sci-Tech. Reply with one word corresponding to the category.""",
},
{
"role": "user",
"content": "Article title: {article_title} Category:".format(
article_title=input
),
},
]
result = client.chat.completions.create(
model=MODEL,
messages=messages,
max_tokens=10,
)
category = result.choices[0].message.content
return category
test_classify = classify_article(test_text)
print("Input:", test_text)
print("Classified as:", test_classify)
print("Score:", 1 if test_classify == expected_text else 0)
```
```
Input: Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally.
Classified as: World
Score: 1
```
## Running an evaluation
We've tested our prompt on a single article, so now we can test across the rest of the dataset using the `Eval` function. Behind the scenes, `Eval` runs the `classify_article` function on each article in the dataset in parallel, then compares the results to the ground-truth labels using a simple `Levenshtein` scorer. When it finishes running, it prints the results with a link to dig deeper.
```python
await braintrust.Eval(
"Classifying News Articles Cookbook",
data=articles,
task=classify_article,
scores=[Levenshtein],
experiment_name="Original Prompt",
)
```
```
Experiment Original Prompt-db3e9cae is running at https://www.braintrust.dev/app/braintrustdata.com/p/Classifying%20News%20Articles%20Cookbook/experiments/Original%20Prompt-db3e9cae
`Eval()` was called from an async context. For better performance, it is recommended to use `await EvalAsync()` instead.
Classifying News Articles Cookbook [experiment_name=Original Prompt] (data): 20it [00:00, 41755.14it/s]
Classifying News Articles Cookbook [experiment_name=Original Prompt] (tasks): 100%|██████████| 20/20 [00:02<00:00, 7.57it/s]
```
```
=========================SUMMARY=========================
Original Prompt-db3e9cae compared to New Prompt-9f185e9e:
71.25% (-00.62%) 'Levenshtein' score (1 improvements, 2 regressions)
1740081219.56s start
1740081220.69s end
1.10s (-298.16%) 'duration' (12 improvements, 8 regressions)
0.72s (-294.09%) 'llm_duration' (10 improvements, 10 regressions)
113.75tok (-) 'prompt_tokens' (0 improvements, 0 regressions)
2.20tok (-) 'completion_tokens' (0 improvements, 0 regressions)
115.95tok (-) 'total_tokens' (0 improvements, 0 regressions)
0.00$ (-) 'estimated_cost' (0 improvements, 0 regressions)
See results for Original Prompt-db3e9cae at https://www.braintrust.dev/app/braintrustdata.com/p/Classifying%20News%20Articles%20Cookbook/experiments/Original%20Prompt-db3e9cae
```
```
EvalResultWithSummary(summary="...", results=[...])
```
## Analyzing the results
Looking at our results table (in the screenshot below), we see that data points involving the category `Sci/Tech` are not scoring 100%. Let's dive deeper.

## Reproducing an example
First, let's see if we can reproduce this issue locally. We can test an article corresponding to the `Sci/Tech` category and reproduce the evaluation:
```python
sci_tech_article = [a for a in articles if "Galaxy Clusters" in a["input"]][0]
print(sci_tech_article["input"])
print(sci_tech_article["expected"])
out = classify_article(sci_tech_article["input"])
print(out)
```
```
A Cosmic Storm: When Galaxy Clusters Collide Astronomers have found what they are calling the perfect cosmic storm, a galaxy cluster pile-up so powerful its energy output is second only to the Big Bang.
Sci/Tech
Sci-Tech
```
## Fixing the prompt
Have you spotted the issue? It looks like we misspelled one of the categories in our prompt. The dataset's categories are `World`, `Sports`, `Business` and `Sci/Tech` - but we are using `Sci-Tech` in our prompt. Let's fix it:
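This kind of mismatch is easy to catch programmatically before running a full eval. Here's a minimal sketch of a sanity check (the dataset's label names are hardcoded for illustration; in the notebook you could use `category_names` directly):

```python
# Categories as written in the (buggy) prompt vs. the dataset's actual labels
prompt_categories = {"World", "Sports", "Business", "Sci-Tech"}
dataset_categories = {"World", "Sports", "Business", "Sci/Tech"}  # ag_news label names

# Any category the prompt offers that the dataset never uses is a red flag
mismatched = prompt_categories - dataset_categories
print("In prompt but not in dataset:", mismatched)  # → {'Sci-Tech'}
```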
```python
@braintrust.traced
def classify_article(input):
messages = [
{
"role": "system",
"content": """You are an editor in a newspaper who helps writers identify the right category for their news articles,
by reading the article's title. The category should be one of the following: World, Sports, Business or Sci/Tech. Reply with one word corresponding to the category.""",
},
{
"role": "user",
"content": "Article title: {input} Category:".format(input=input),
},
]
result = client.chat.completions.create(
model=MODEL,
messages=messages,
max_tokens=10,
)
category = result.choices[0].message.content
return category
result = classify_article(sci_tech_article["input"])
print(result)
```
```
Sci/Tech
```
## Evaluate the new prompt
The model classified the correct category `Sci/Tech` for this example. But, how do we know it works for the rest of the dataset? Let's run a new experiment to evaluate our new prompt:
```python
await braintrust.Eval(
"Classifying News Articles Cookbook",
data=articles,
task=classify_article,
scores=[Levenshtein],
experiment_name="New Prompt",
)
```
## Conclusion
Select the new experiment, and check it out. You should notice a few things:
* Braintrust will automatically compare the new experiment to your previous one.
* You should see the eval scores increase and you can see which test cases improved.
* You can also filter the test cases by improvements to know exactly why the scores changed.

## Next steps
* [I ran an eval. Now what?](/blog/after-evals)
* Add more [custom scorers](/docs/guides/functions/scorers#custom-scorers).
* Try other models like xAI's [Grok 2](https://x.ai/blog/grok-2) or OpenAI's [o1](https://openai.com/o1/). To learn more about comparing evals across multiple AI models, check out this [cookbook](/docs/cookbook/recipes/ModelComparison).
---
file: ./content/docs/cookbook/recipes/CodaHelpDesk.mdx
meta: {
"title": "Coda's Help Desk with and without RAG",
"language": "python",
"authors": [
{
"name": "Austin Moehle",
"website": "https://www.linkedin.com/in/austinmxx/",
"avatar": "/blog/img/author/austin-moehle.jpg"
},
{
"name": "Kenny Wong",
"website": "https://twitter.com/siuheihk",
"avatar": "/blog/img/author/kenny-wong.png"
}
],
"date": "2023-12-21",
"tags": [
"evals",
"rag"
]
}
# Coda's Help Desk with and without RAG
Large language models have gotten extremely good at answering general questions but often struggle with specific domain knowledge. When building AI-powered help desks or knowledge bases, this limitation becomes apparent. Retrieval-augmented generation (RAG) addresses this challenge by incorporating relevant information from external documents into the model's context.
In this cookbook, we'll build and evaluate an AI application that answers questions about [Coda's Help Desk](https://help.coda.io/en/) documentation. Using Braintrust, we'll compare baseline and RAG-enhanced responses against expected answers to quantitatively measure the improvement.
## Getting started
To follow along, start by installing the required packages:
```python
%pip install -U autoevals braintrust requests openai lancedb markdownify pyarrow
```
Next, make sure you have a [Braintrust](https://www.braintrust.dev/signup) account, along with an [OpenAI API key](https://platform.openai.com/). To authenticate with Braintrust, export your `BRAINTRUST_API_KEY` as an environment variable:
```bash
export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE"
```
Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below.
We'll import our modules and define constants:
```python
import os
import re
import json
import tempfile
from typing import List
import autoevals
import braintrust
import markdownify
import lancedb
import openai
import requests
import asyncio
from pydantic import BaseModel, Field
# Model selection constants
QA_GEN_MODEL = "gpt-4o-mini"
QA_ANSWER_MODEL = "gpt-4o-mini"
QA_GRADING_MODEL = "gpt-4o-mini"
RELEVANCE_MODEL = "gpt-4o-mini"
# Data constants
NUM_SECTIONS = 20
NUM_QA_PAIRS = 20 # Increase this number to test at a larger scale
TOP_K = 2 # Number of relevant sections to retrieve
# Uncomment the following line to hardcode your API key
# os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE"
```
## Download Markdown docs from Coda's Help Desk
Let's start by downloading the Coda docs and splitting them into their constituent Markdown sections.
```python
data = requests.get(
"https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json"
).json()
markdown_docs = [
{"id": row["id"], "markdown": markdownify.markdownify(row["body"])} for row in data
]
i = 0
markdown_sections = []
for markdown_doc in markdown_docs:
sections = re.split(r"(.*\n=+\n)", markdown_doc["markdown"])
current_section = ""
for section in sections:
if not section.strip():
continue
if re.match(r".*\n=+\n", section):
current_section = section
else:
section = current_section + section
markdown_sections.append(
{
"doc_id": markdown_doc["id"],
"section_id": i,
"markdown": section.strip(),
}
)
current_section = ""
i += 1
print(f"Downloaded {len(markdown_sections)} Markdown sections. Here are the first 3:")
for i, section in enumerate(markdown_sections[:3]):
print(f"\nSection {i+1}:\n{section}")
```
```
Downloaded 996 Markdown sections. Here are the first 3:
Section 1:
{'doc_id': '8179780', 'section_id': 0, 'markdown': "Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also for others in your team or workspace, you’ll [use pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) instead."}
Section 2:
{'doc_id': '8179780', 'section_id': 1, 'markdown': '**Star your docs**\n==================\n\nTo star a doc, hover over its name in the doc list and click the star icon. Alternatively, you can star a doc from within the doc itself. Hover over the doc title in the upper left corner, and click on the star.\n\nOnce you star a doc, you can access it quickly from the [My Shortcuts](https://coda.io/shortcuts) tab of your doc list.\n\n\n\nAnd, as your doc needs change, simply click the star again to un-star the doc and remove it from **My Shortcuts**.'}
Section 3:
{'doc_id': '8179780', 'section_id': 2, 'markdown': '**FAQs**\n========\n\nWhen should I star a doc and when should I pin it?\n--------------------------------------------------\n\nStarring docs is best for docs of *personal* importance. Starred docs appear in your **My Shortcuts**, but they aren’t starred for anyone else in your workspace. For instance, you may want to star your personal to-do list doc or any docs you use on a daily basis.\n\n[Pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) is recommended when you want to flag or shortcut a doc for *everyone* in your workspace or folder. For instance, you likely want to pin your company wiki doc to your workspace. And you may want to pin your team task tracker doc to your team’s folder.\n\nCan I star docs for everyone?\n-----------------------------\n\nStarring docs only applies to your own view and your own My Shortcuts. To pin docs (or templates) to your workspace or folder, [refer to this article](https://help.coda.io/en/articles/2865511-starred-pinned-docs).\n\n---'}
```
## Use the Braintrust AI Proxy
Let's initialize the OpenAI client using the [Braintrust proxy](/docs/guides/proxy). The Braintrust AI Proxy provides a single API to access OpenAI and other models. Because the proxy automatically caches and reuses results (when `temperature=0` or the `seed` parameter is set), we can re-evaluate prompts many times without incurring additional API costs.
```python
client = braintrust.wrap_openai(
openai.AsyncOpenAI(
api_key=os.environ.get("BRAINTRUST_API_KEY"),
base_url="https://api.braintrust.dev/v1/proxy",
default_headers={"x-bt-use-cache": "always"},
)
)
```
## Generate question-answer pairs
Before we start evaluating some prompts, let's use the LLM to generate a bunch of question-answer pairs from the text at hand. We'll use these QA pairs as ground truth when grading our models later.
```python
class QAPair(BaseModel):
questions: List[str] = Field(
...,
description="List of questions, all with the same meaning but worded differently",
)
answer: str = Field(..., description="Answer")
class QAPairs(BaseModel):
pairs: List[QAPair] = Field(..., description="List of question/answer pairs")
async def produce_candidate_questions(row):
response = await client.chat.completions.create(
model=QA_GEN_MODEL,
messages=[
{
"role": "user",
"content": f"""\
Please generate 8 question/answer pairs from the following text. For each question, suggest
2 different ways of phrasing the question, and provide a unique answer.
Content:
{row['markdown']}
""",
}
],
functions=[
{
"name": "propose_qa_pairs",
"description": "Propose some question/answer pairs for a given document",
"parameters": QAPairs.model_json_schema(),
}
],
)
pairs = QAPairs(**json.loads(response.choices[0].message.function_call.arguments))
return pairs.pairs
# Create tasks for all API calls
all_candidates_tasks = [
asyncio.create_task(produce_candidate_questions(a))
for a in markdown_sections[:NUM_SECTIONS]
]
all_candidates = [await f for f in all_candidates_tasks]
data = []
row_id = 0
for row, doc_qa in zip(markdown_sections[:NUM_SECTIONS], all_candidates):
for i, qa in enumerate(doc_qa):
for j, q in enumerate(qa.questions):
data.append(
{
"input": q,
"expected": qa.answer,
"metadata": {
"document_id": row["doc_id"],
"section_id": row["section_id"],
"question_idx": i,
"answer_idx": j,
"id": row_id,
"split": (
"test" if j == len(qa.questions) - 1 and j > 0 else "train"
),
},
}
)
row_id += 1
print(f"Generated {len(data)} QA pairs. Here are the first 10:")
for x in data[:10]:
print(x)
```
```
Generated 320 QA pairs. Here are the first 10:
{'input': 'What is the purpose of starring a doc in Coda?', 'expected': 'Starring a doc in Coda helps you mark documents of personal importance, making it easier to organize and access them quickly.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 0, 'id': 0, 'split': 'train'}}
{'input': 'Why would someone want to star a document in Coda?', 'expected': 'Starring a doc in Coda helps you mark documents of personal importance, making it easier to organize and access them quickly.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 1, 'id': 1, 'split': 'test'}}
{'input': 'Where do starred docs appear in Coda?', 'expected': 'Starred docs appear in a section called My Shortcuts on your doc list, allowing for quick access.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 0, 'id': 2, 'split': 'train'}}
{'input': 'After starring a document in Coda, where can I find it?', 'expected': 'Starred docs appear in a section called My Shortcuts on your doc list, allowing for quick access.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 1, 'id': 3, 'split': 'test'}}
{'input': 'Does starring a doc affect other users in the workspace?', 'expected': 'No, starring a doc only saves it to your personal My Shortcuts and does not affect the view for others in your workspace.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 2, 'answer_idx': 0, 'id': 4, 'split': 'train'}}
{'input': 'Will my colleagues see the docs I star in Coda?', 'expected': 'No, starring a doc only saves it to your personal My Shortcuts and does not affect the view for others in your workspace.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 2, 'answer_idx': 1, 'id': 5, 'split': 'test'}}
{'input': 'What should I use if I want to share a shortcut to a doc with my team?', 'expected': 'To create a shortcut for a document that your team can access, you should use the pinning feature instead of starring.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 3, 'answer_idx': 0, 'id': 6, 'split': 'train'}}
{'input': 'How can I create a shortcut for a document that everyone in my workspace can access?', 'expected': 'To create a shortcut for a document that your team can access, you should use the pinning feature instead of starring.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 3, 'answer_idx': 1, 'id': 7, 'split': 'test'}}
{'input': 'Can starred documents come from different workspaces in Coda?', 'expected': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 4, 'answer_idx': 0, 'id': 8, 'split': 'train'}}
{'input': 'Is it possible to star docs from multiple workspaces?', 'expected': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 4, 'answer_idx': 1, 'id': 9, 'split': 'test'}}
```
## Evaluate a context-free prompt (no RAG)
Let's evaluate a simple prompt that poses each question without providing context from the Markdown docs. We'll evaluate this naive approach using the [Factuality prompt](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) from the Braintrust [autoevals](/docs/reference/autoevals) library.
```python
async def simple_qa(input):
completion = await client.chat.completions.create(
model=QA_ANSWER_MODEL,
messages=[
{
"role": "user",
"content": f"""\
Please answer the following question:
Question: {input}
""",
}
],
)
return completion.choices[0].message.content
await braintrust.Eval(
name="Coda Help Desk Cookbook",
experiment_name="No RAG",
data=data[:NUM_QA_PAIRS],
task=simple_qa,
scores=[autoevals.Factuality(model=QA_GRADING_MODEL)],
)
```
### Analyze the evaluation in the UI
The cell above will print a link to a Braintrust experiment. Pause and navigate to the UI to view our baseline eval.

## Try using RAG to improve performance
Let's see if RAG (retrieval-augmented generation) can improve our results on this task.
First, we'll compute embeddings for each Markdown section using `text-embedding-ada-002` and create an index over the embeddings in [LanceDB](https://lancedb.com), a vector database. Then, for any given query, we can convert it to an embedding and efficiently find the most relevant context by searching in embedding space. We'll then provide the corresponding text as additional context in our prompt.
```python
tempdir = tempfile.TemporaryDirectory()
LANCE_DB_PATH = os.path.join(tempdir.name, "docs-lancedb")
@braintrust.traced
async def embed_text(text):
params = dict(input=text, model="text-embedding-ada-002")
response = await client.embeddings.create(**params)
embedding = response.data[0].embedding
braintrust.current_span().log(
metrics={
"tokens": response.usage.total_tokens,
"prompt_tokens": response.usage.prompt_tokens,
},
metadata={"model": response.model},
input=text,
output=embedding,
)
return embedding
embedding_tasks = [
asyncio.create_task(embed_text(row["markdown"]))
for row in markdown_sections[:NUM_SECTIONS]
]
embeddings = [await f for f in embedding_tasks]
db = lancedb.connect(LANCE_DB_PATH)
try:
db.drop_table("sections")
except Exception:
pass
# Convert the data to a pandas DataFrame first
import pandas as pd
table_data = [
{
"doc_id": row["doc_id"],
"section_id": row["section_id"],
"text": row["markdown"],
"vector": embedding,
}
for (row, embedding) in zip(markdown_sections[:NUM_SECTIONS], embeddings)
]
# Create table using the DataFrame approach
table = db.create_table("sections", data=pd.DataFrame(table_data))
```
## Use AI to judge relevance of retrieved documents
Let's retrieve a few *more* of the best-matching candidates from the vector database than we intend to use, then use the model in `RELEVANCE_MODEL` to score the relevance of each candidate to the input query. We'll keep the top `TOP_K` blurbs by relevance score for our QA prompt. This should be a little more intelligent than relying on the closest embeddings alone.
```python
@braintrust.traced
async def relevance_score(query, document):
response = await client.chat.completions.create(
model=RELEVANCE_MODEL,
messages=[
{
"role": "user",
"content": f"""\
Consider the following query and a document
Query:
{query}
Document:
{document}
Please score the relevance of the document to a query, on a scale of 0 to 1.
""",
}
],
functions=[
{
"name": "has_relevance",
"description": "Declare the relevance of a document to a query",
"parameters": {
"type": "object",
"properties": {
"score": {"type": "number"},
},
},
}
],
)
arguments = response.choices[0].message.function_call.arguments
result = json.loads(arguments)
braintrust.current_span().log(
input={"query": query, "document": document},
output=result,
)
return result["score"]
async def retrieval_qa(input):
embedding = await embed_text(input)
with braintrust.current_span().start_span(
name="vector search", input=input
) as span:
result = table.search(embedding).limit(TOP_K + 3).to_arrow().to_pylist()
docs = [markdown_sections[i["section_id"]]["markdown"] for i in result]
relevance_scores = []
for doc in docs:
relevance_scores.append(await relevance_score(input, doc))
span.log(
output=[
{
"doc": markdown_sections[r["section_id"]]["markdown"],
"distance": r["_distance"],
}
for r in result
],
metadata={"top_k": TOP_K, "retrieval": result},
scores={
"avg_relevance": sum(relevance_scores) / len(relevance_scores),
"min_relevance": min(relevance_scores),
"max_relevance": max(relevance_scores),
},
)
context = "\n------\n".join(docs[:TOP_K])
completion = await client.chat.completions.create(
model=QA_ANSWER_MODEL,
messages=[
{
"role": "user",
"content": f"""\
Given the following context
{context}
Please answer the following question:
Question: {input}
""",
}
],
)
return completion.choices[0].message.content
```
## Run the RAG evaluation
Now let's run our evaluation with RAG:
```python
await braintrust.Eval(
name="Coda Help Desk Cookbook",
experiment_name=f"RAG TopK={TOP_K}",
data=data[:NUM_QA_PAIRS],
task=retrieval_qa,
scores=[autoevals.Factuality(model=QA_GRADING_MODEL)],
)
```
### Analyzing the results

Select the new experiment to analyze the results. You should notice several things:
* Braintrust automatically compares the new experiment to your previous one
* You should see an increase in scores with RAG
* You can explore individual examples to see exactly which responses improved
Try adjusting the constants set at the beginning of this tutorial, such as `NUM_QA_PAIRS`, to run your evaluation on a larger dataset and gain more confidence in your findings.
## Next steps
* Learn about [using functions to build a RAG agent](/docs/cookbook/recipes/ToolRAG).
* Compare your [evals across different models](/docs/cookbook/recipes/ModelComparison).
* If RAG is just one part of your agent, learn how to [evaluate a prompt chaining agent](/docs/cookbook/recipes/PromptChaining).
---
file: ./content/docs/cookbook/recipes/EvaluatingChatAssistant.mdx
meta: {
"title": "Evaluating a chat assistant",
"language": "typescript",
"authors": [
{
"name": "Tara Nagar",
"website": "https://www.linkedin.com/in/taranagar/",
"avatar": "/blog/img/author/tara-nagar.jpg"
}
],
"date": "2024-07-16",
"tags": [
"evals",
"chat"
]
}
# Evaluating a chat assistant
## Evaluating a multi-turn chat assistant
This tutorial will walk through using Braintrust to evaluate a conversational, multi-turn chat assistant.
These types of chatbots have become important parts of applications, acting as customer service agents, sales representatives, or travel agents, to name a few. As the owner of such an application, it's important to be sure the bot provides value to the user.
We will expand on this below, but the history and context of a conversation are crucial to producing a good response. If you received a request to "Make a dinner reservation at 7pm" and you knew where, on what date, and for how many people, you could provide some assistance; otherwise, you'd need to ask for more information.
Before starting, please make sure you have a Braintrust account. If you do not have one, you can [sign up here](https://www.braintrust.dev).
## Installing dependencies
Begin by installing the necessary dependencies if you have not done so already.
```bash
pnpm install autoevals braintrust openai
```
## Inspecting the data
Let's take a look at the small dataset prepared for this cookbook. You can find the full dataset in the accompanying [dataset.ts file](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/EvaluatingChatAssistant/dataset.ts). The `assistant` turns were generated using `claude-3-5-sonnet-20240620`.
Below is an example of a data point.
* `chat_history` contains the history of the conversation between the user and the assistant
* `input` is the last `user` turn that will be sent in the `messages` argument to the chat completion
* `expected` is the output expected from the chat completion given the input
```typescript
import dataset, { ChatTurn } from "./assets/dataset";
console.log(dataset[0]);
```
```
{
chat_history: [
{
role: 'user',
content: "when was the ballon d'or first awarded for female players?"
},
{
role: 'assistant',
content: "The Ballon d'Or for female players was first awarded in 2018. The inaugural winner was Ada Hegerberg, a Norwegian striker who plays for Olympique Lyonnais."
}
],
input: "who won the men's trophy that year?",
expected: "In 2018, the men's Ballon d'Or was awarded to Luka Modrić."
}
```
From looking at this one example, we can see why the history is necessary to provide a helpful response.
If you were asked "Who won the men's trophy that year?" you would wonder *What trophy? Which year?* But if you were also given the `chat_history`, you would be able to answer the question (maybe after some quick research).
## Running experiments
The key to running evals on a multi-turn conversation is to include the history of the chat in the chat completion request.
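At its core, this is just data manipulation: prepend the prior turns to the new user message. Here's a minimal sketch (the `ChatTurn` type below is a simplified stand-in for the one exported by the dataset file, and the system prompt is a placeholder):

```typescript
// Simplified stand-in for the ChatTurn type exported by the dataset file.
type ChatTurn = { role: "user" | "assistant"; content: string };

// Build the messages array for a multi-turn chat completion request:
// system prompt, then the prior turns, then the latest user input.
function buildMessages(chatHistory: ChatTurn[], input: string) {
  return [
    { role: "system" as const, content: "You are a helpful assistant." },
    ...chatHistory, // prior turns give the model the context it needs
    { role: "user" as const, content: input }, // the latest user turn
  ];
}

const messages = buildMessages(
  [
    {
      role: "user",
      content: "when was the ballon d'or first awarded for female players?",
    },
    { role: "assistant", content: "It was first awarded in 2018." },
  ],
  "who won the men's trophy that year?",
);
console.log(messages.length); // 4: system + two history turns + new user turn
```

We'll use exactly this shape later when we pass the chat history to the task function.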
### Assistant with no chat history
To start, let's see how the prompt performs when no chat history is provided. We'll create a simple task function that returns the output from a chat completion.
```typescript
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";
const experimentData = dataset.map((data) => ({
input: data.input,
expected: data.expected,
}));
console.log(experimentData[0]);
async function runTask(input: string) {
const client = wrapOpenAI(
new OpenAI({
baseURL: "https://api.braintrust.dev/v1/proxy",
apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral, etc. API keys here
}),
);
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content:
"You are a helpful and polite assistant who knows about sports.",
},
{
role: "user",
content: input,
},
],
});
return response.choices[0].message.content || "";
}
```
```
{
input: "who won the men's trophy that year?",
expected: "In 2018, the men's Ballon d'Or was awarded to Luka Modrić."
}
```
#### Scoring and running the eval
We'll use the `Factuality` scoring function from the [autoevals library](https://www.braintrust.dev/docs/reference/autoevals) to check how the output of the chat completion compares factually to the expected value.
We will also utilize [trials](https://www.braintrust.dev/docs/guides/evals/write#trials) by including the `trialCount` parameter in the `Eval` call. We expect the output of the chat completion to be non-deterministic, so running each input multiple times will give us a better sense of the "average" output.
```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
Eval("Chat assistant", {
experimentName: "gpt-4o assistant - no history",
data: () => experimentData,
task: runTask,
scores: [Factuality],
trialCount: 3,
metadata: {
model: "gpt-4o",
prompt: "You are a helpful and polite assistant who knows about sports.",
},
});
```
```
Experiment gpt-4o assistant - no history is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o... | 100% | 15/15 datapoints
=========================SUMMARY=========================
61.33% 'Factuality' score (0 improvements, 0 regressions)
4.12s 'duration' (0 improvements, 0 regressions)
0.01$ 'estimated_cost' (0 improvements, 0 regressions)
See results for gpt-4o assistant - no history at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history
```
61.33% Factuality score? Given what we discussed earlier about chat history being important in producing a good response, that's surprisingly high. Let's log onto [braintrust.dev](https://www.braintrust.dev) and take a look at how we got that score.
#### Interpreting the results

If we look at the score distribution chart, we can see ten of the fifteen examples scored at least 60%, with over half even scoring 100%. If we look into one of the examples with 100% score, we see the output of the chat completion request is asking for more context as we would expect:
`Could you please specify which athlete or player you're referring to? There are many professional athletes, and I'll need a bit more information to provide an accurate answer.`
This aligns with our expectation, so let's now look at how the score was determined.

Clicking into the scoring trace, we see the chain-of-thought reasoning used to settle on the score. The model chose `(E) The answers differ, but these differences don't matter from the perspective of factuality.`, which is *technically* correct, but we want to penalize the chat completion for not being able to produce a good response.
#### Improve scoring with a custom scorer
While Factuality is a good general purpose scorer, for our use case option (E) is not well aligned with our expectations. The best way to work around this is to customize the scoring function so that it produces a lower score for asking for more context or specificity.
```typescript
import { LLMClassifierFromSpec, Score } from "autoevals";
function Factual(args: {
input: string;
output: string;
expected: string;
}): Score | Promise<Score> {
const factualityScorer = LLMClassifierFromSpec("Factuality", {
prompt: `You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
(F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
(G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.`,
choice_scores: {
A: 0.4,
B: 0.6,
C: 1,
D: 0,
E: 1,
F: 0.2,
G: 0,
},
});
return factualityScorer(args);
}
```
You can see the built-in Factuality prompt [here](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml). For our customized scorer, we've added two score choices to that prompt:
```
- (F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
- (G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.
```
These will score (F) = 0.2 and (G) = 0 so the model gets some credit if there was any context it was able to gather from the user's input.
We can then use this spec with the `LLMClassifierFromSpec` function to create our custom scorer to use in the eval function.
Read more about [defining your own scorers](https://www.braintrust.dev/docs/guides/evals/write#define-your-own-scorers) in the documentation.
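For intuition, here is a minimal sketch (not the autoevals internals) of how the grader's letter choice maps to a row's final score through `choice_scores`:

```typescript
// The choice_scores mapping from our custom scorer spec: the grader model
// picks a letter, and this table converts it into a numeric score.
const choiceScores: Record<string, number> = {
  A: 0.4,
  B: 0.6,
  C: 1,
  D: 0,
  E: 1,
  F: 0.2,
  G: 0,
};

// Hypothetical helper for illustration; unknown choices default to 0.
function scoreForChoice(choice: string): number {
  return choiceScores[choice] ?? 0;
}

console.log(scoreForChoice("F")); // 0.2: asked for context but gave consistent facts
console.log(scoreForChoice("G")); // 0: asked for context, no consistent facts
```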
#### Re-running the eval
Let's now use this updated scorer and run the experiment again.
```typescript
Eval("Chat assistant", {
experimentName: "gpt-4o assistant - no history",
data: () =>
dataset.map((data) => ({ input: data.input, expected: data.expected })),
task: runTask,
scores: [Factual],
trialCount: 3,
metadata: {
model: "gpt-4o",
prompt: "You are a helpful and polite assistant who knows about sports.",
},
});
```
```
Experiment gpt-4o assistant - no history-934e5ca2 is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history-934e5ca2
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o... | 100% | 15/15 datapoints
=========================SUMMARY=========================
gpt-4o assistant - no history-934e5ca2 compared to gpt-4o assistant - no history:
6.67% (-54.67%) 'Factuality' score (0 improvements, 5 regressions)
4.77s 'duration' (2 improvements, 3 regressions)
0.01$ 'estimated_cost' (2 improvements, 3 regressions)
See results for gpt-4o assistant - no history-934e5ca2 at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history-934e5ca2
```
6.67% as a score aligns much better with what we expected. Let's look again into the results of this experiment.
#### Interpreting the results

In the table, we can see the `output` fields in which the chat completion responses request more context. In one of the rows with a non-zero score, we can see that the model asked for some clarification, but was able to infer from the question that the user was asking about a controversial World Series. Nice!

Looking into how the score was determined, we can see that the factual information aligned with the expert answer but the submitted answer still asks for more context, resulting in a score of 20% which is what we expect.
### Assistant with chat history
Now let's shift and see how providing the chat history improves the experiment.
#### Update the data, task function and scorer function
We need to edit the inputs to the `Eval` function so we can pass the chat history to the chat completion request.
```typescript
const experimentData = dataset.map((data) => ({
input: { input: data.input, chat_history: data.chat_history },
expected: data.expected,
}));
console.log(experimentData[0]);
async function runTask({
input,
chat_history,
}: {
input: string;
chat_history: ChatTurn[];
}) {
const client = wrapOpenAI(
new OpenAI({
baseURL: "https://api.braintrust.dev/v1/proxy",
apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral, etc. API keys here
}),
);
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content:
"You are a helpful and polite assistant who knows about sports.",
},
...chat_history,
{
role: "user",
content: input,
},
],
});
return response.choices[0].message.content || "";
}
function Factual(args: {
input: {
input: string;
chat_history: ChatTurn[];
};
output: string;
expected: string;
}): Score | Promise<Score> {
const factualityScorer = LLMClassifierFromSpec("Factuality", {
prompt: `You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
(F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
(G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.`,
choice_scores: {
A: 0.4,
B: 0.6,
C: 1,
D: 0,
E: 1,
F: 0.2,
G: 0,
},
});
return factualityScorer(args);
}
```
```
{
input: {
input: "who won the men's trophy that year?",
chat_history: [ [Object], [Object] ]
},
expected: "In 2018, the men's Ballon d'Or was awarded to Luka Modrić."
}
```
We update the task function's parameters to accept both the `input` string and the `chat_history` array, and add the `chat_history` to the messages array in the chat completion request using the spread (`...`) syntax.
We also update the `experimentData` and `Factual` function parameters to align with these changes.
#### Running the eval
Use the updated variables and functions to run a new eval.
```typescript
Eval("Chat assistant", {
experimentName: "gpt-4o assistant",
data: () => experimentData,
task: runTask,
scores: [Factual],
trialCount: 3,
metadata: {
model: "gpt-4o",
prompt: "You are a helpful and polite assistant who knows about sports.",
},
});
```
```
Experiment gpt-4o assistant is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o... | 100% | 15/15 datapoints
=========================SUMMARY=========================
gpt-4o assistant compared to gpt-4o assistant - no history-934e5ca2:
60.00% 'Factuality' score (0 improvements, 0 regressions)
4.34s 'duration' (0 improvements, 0 regressions)
0.01$ 'estimated_cost' (0 improvements, 0 regressions)
See results for gpt-4o assistant at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant
```
A 60% score is a definite improvement from 6.67%.
You'll notice that it says there were 0 improvements and 0 regressions compared to the last experiment `gpt-4o assistant - no history-934e5ca2` we ran. This is because by default, Braintrust uses the `input` field to match rows across experiments. From the dashboard, we can customize the comparison key ([see docs](https://www.braintrust.dev/docs/guides/evals/interpret#customizing-the-comparison-key)) by going to the [project configuration page](https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/configuration).
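To see why no rows matched, note that the shape of `input` changed between the experiments. A rough illustration (not Braintrust's actual matching code) of why a string input and an object input don't line up:

```typescript
// In the earlier experiments, `input` was a plain string.
const previousInput = "who won the men's trophy that year?";

// In this experiment, `input` became an object carrying the chat history.
const currentInput = {
  input: "who won the men's trophy that year?",
  chat_history: [{ role: "user", content: "..." }],
};

// Matching on serialized inputs (a simplification) fails, so no rows are
// compared and no improvements or regressions are reported.
const matches = JSON.stringify(previousInput) === JSON.stringify(currentInput);
console.log(matches); // false
```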
#### Update experiment comparison for diff mode
Let's go back to the dashboard.
For this cookbook, we can use the `expected` field as the comparison key because this field is unique in our small dataset.
In the Configuration tab, go to the bottom of the page to update the comparison key:

#### Interpreting the results
Turn on diff mode using the toggle on the upper right of the table.

Since we updated the comparison key, we can now see the improvements in the Factuality score between the experiment run with chat history and the most recent one run without it. If we also click into a trace, we can see the change in input parameters that we made above, where the input went from a `string` to an object with `input` and `chat_history` fields.
All of our rows scored 60% in this experiment. If we look into each trace, this means the submitted answer includes all the details from the expert answer with some additional information.
60% is an improvement from the previous run, but we can do better. Since it seems like the chat completion is always returning more than necessary, let's see if we can tweak our prompt to have the output be more concise.
#### Improving the result
Let's update the system prompt used in the chat completion request.
```typescript
async function runTask({
input,
chat_history,
}: {
input: string;
chat_history: ChatTurn[];
}) {
const client = wrapOpenAI(
new OpenAI({
baseURL: "https://api.braintrust.dev/v1/proxy",
apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral etc. API keys here
}),
);
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content:
"You are a helpful, polite assistant who knows about sports. Only answer the question; don't add additional information outside of what was asked.",
},
...chat_history,
{
role: "user",
content: input,
},
],
});
return response.choices[0].message.content || "";
}
```
In the task function, we'll update the `system` message to specify the output should be precise and then run the eval again.
```typescript
Eval("Chat assistant", {
experimentName: "gpt-4o assistant - concise",
data: () => experimentData,
task: runTask,
scores: [Factual],
trialCount: 3,
metadata: {
model: "gpt-4o",
prompt:
"You are a helpful, polite assistant who knows about sports. Only answer the question; don't add additional information outside of what was asked.",
},
});
```
```
Experiment gpt-4o assistant - concise is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20concise
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o... | 100% | 15/15 datapoints
=========================SUMMARY=========================
gpt-4o assistant - concise compared to gpt-4o assistant:
86.67% (+26.67%) 'Factuality' score (4 improvements, 0 regressions)
1.89s 'duration' (5 improvements, 0 regressions)
0.01$ 'estimated_cost' (4 improvements, 1 regressions)
See results for gpt-4o assistant - concise at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20concise
```
Let's go into the dashboard and see the new experiment.

Success! We got a 27 percentage point increase in factuality, up to an average score of 87% for this experiment with our updated prompt.
### Conclusion
In this cookbook, we've seen how to evaluate a chat assistant and visualized how the chat history affects the output of the chat completion. Along the way, we also used some other functionality, such as updating the comparison key in the diff view and creating a custom scoring function.
Try seeing how you can improve the outputs and scores even further!
---
file: ./content/docs/cookbook/recipes/Github-Issues.mdx
meta: {
"title": "Improving Github issue titles using their contents",
"language": "typescript",
"authors": [
{
"name": "Ankur Goyal",
"website": "https://twitter.com/ankrgyl",
"avatar": "/blog/img/author/ankur-goyal.jpg"
}
],
"date": "2023-10-29",
"tags": [
"evals",
"summarization"
]
}
# Improving Github issue titles using their contents
This tutorial will teach you how to use Braintrust to generate better titles for Github issues, based on their
content. This is a great way to learn how to work with text and evaluate subjective criteria, like summarization quality.
We'll use a technique called **model graded evaluation** to automatically evaluate the newly generated titles
against the original titles, and improve our prompt based on what we find.
Before starting, please make sure that you have a Braintrust account. If you do not, please [sign up](https://www.braintrust.dev). After this tutorial, feel free to dig deeper by visiting [the docs](http://www.braintrust.dev/docs).
## Installing dependencies
To see a list of dependencies, you can view the accompanying [package.json](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/Github-Issues/package.json) file. Feel free to copy/paste snippets of this code to run in your environment, or use [tslab](https://github.com/yunabe/tslab) to run the tutorial in a Jupyter notebook.
## Downloading the data
We'll start by downloading some issues from Github using the `octokit` SDK. We'll use the popular open source project [next.js](https://github.com/vercel/next.js).
```typescript
import { Octokit } from "@octokit/core";
const ISSUES = [
"https://github.com/vercel/next.js/issues/59999",
"https://github.com/vercel/next.js/issues/59997",
"https://github.com/vercel/next.js/issues/59995",
"https://github.com/vercel/next.js/issues/59988",
"https://github.com/vercel/next.js/issues/59986",
"https://github.com/vercel/next.js/issues/59971",
"https://github.com/vercel/next.js/issues/59958",
"https://github.com/vercel/next.js/issues/59957",
"https://github.com/vercel/next.js/issues/59950",
"https://github.com/vercel/next.js/issues/59940",
];
// Octokit.js
// https://github.com/octokit/core.js#readme
const octokit = new Octokit({
auth: process.env.GITHUB_ACCESS_TOKEN || "Your Github Access Token",
});
async function fetchIssue(url: string) {
// parse url of the form https://github.com/supabase/supabase/issues/15534
const [owner, repo, _, issue_number] = url!.trim().split("/").slice(-4);
const data = await octokit.request(
"GET /repos/{owner}/{repo}/issues/{issue_number}",
{
owner,
repo,
issue_number: parseInt(issue_number),
headers: {
"X-GitHub-Api-Version": "2022-11-28",
},
}
);
return data.data;
}
const ISSUE_DATA = await Promise.all(ISSUES.map(fetchIssue));
```
Let's take a look at one of the issues:
```typescript
console.log(ISSUE_DATA[0].title);
console.log("-".repeat(ISSUE_DATA[0].title.length));
console.log(ISSUE_DATA[0].body.substring(0, 512) + "...");
```
```
The instrumentation hook is only called after visiting a route
--------------------------------------------------------------
### Link to the code that reproduces this issue
https://github.com/daveyjones/nextjs-instrumentation-bug
### To Reproduce
\`\`\`shell
git clone git@github.com:daveyjones/nextjs-instrumentation-bug.git
cd nextjs-instrumentation-bug
npm install
npm run dev # The register function IS called
npm run build && npm start # The register function IS NOT called until you visit http://localhost:3000
\`\`\`
### Current vs. Expected behavior
The \`register\` function should be called automatically after running \`npm ...
```
## Generating better titles
Let's try to generate better titles using a simple prompt. We'll use OpenAI, although you could try this out with any model that supports text generation.
We'll start by initializing an OpenAI client and wrapping it with some Braintrust instrumentation. `wrapOpenAI`
is initially a no-op, but later on when we use Braintrust, it will help us capture helpful debugging information about the model's performance.
```typescript
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";
const client = wrapOpenAI(
new OpenAI({
apiKey: process.env.OPENAI_API_KEY || "Your OpenAI API Key",
})
);
```
```typescript
import { ChatCompletionMessageParam } from "openai/resources";
function titleGeneratorMessages(content: string): ChatCompletionMessageParam[] {
return [
{
role: "system",
content:
"Generate a new title based on the github issue. Return just the title.",
},
{
role: "user",
content: "Github issue: " + content,
},
];
}
async function generateTitle(input: string) {
const messages = titleGeneratorMessages(input);
const response = await client.chat.completions.create({
model: "gpt-3.5-turbo",
messages,
seed: 123,
});
return response.choices[0].message.content || "";
}
const generatedTitle = await generateTitle(ISSUE_DATA[0].body);
console.log("Original title: ", ISSUE_DATA[0].title);
console.log("Generated title:", generatedTitle);
```
```
Original title: The instrumentation hook is only called after visiting a route
Generated title: Next.js: \`register\` function not automatically called after build and start
```
## Scoring
Ok cool! The new title looks pretty good. But how do we consistently and automatically evaluate whether the new titles are better than the old ones?
With subjective problems, like summarization, one great technique is to use an LLM to grade the outputs. This is known as model graded evaluation. Below, we'll use a [summarization prompt](https://github.com/braintrustdata/autoevals/blob/main/templates/summary.yaml)
from Braintrust's open source [autoevals](https://github.com/braintrustdata/autoevals) library. We encourage you to use these prompts, but also to copy/paste them, modify them, and create your own!
The prompt uses [Chain of Thought](https://arxiv.org/abs/2201.11903) which dramatically improves a model's performance on grading tasks. Later, we'll see how it helps us debug the model's outputs.
Let's try running it on our new title and see how it performs.
```typescript
import { Summary } from "autoevals";
await Summary({
output: generatedTitle,
expected: ISSUE_DATA[0].title,
input: ISSUE_DATA[0].body,
// In practice we've found gpt-4 class models work best for subjective tasks, because
// they are great at following criteria laid out in the grading prompts.
model: "gpt-4-1106-preview",
});
```
```
{
name: 'Summary',
score: 1,
metadata: {
rationale: "Summary A ('The instrumentation hook is only called after visiting a route') is a partial and somewhat ambiguous statement. It does not specify the context of the 'instrumentation hook' or the technology involved.\n" +
"Summary B ('Next.js: \`register\` function not automatically called after build and start') provides a clearer and more complete description. It specifies the technology ('Next.js') and the exact issue ('\`register\` function not automatically called after build and start').\n" +
'The original text discusses an issue with the \`register\` function in a Next.js application not being called as expected, which is directly reflected in Summary B.\n' +
"Summary B also aligns with the section 'Current vs. Expected behavior' from the original text, which states that the \`register\` function should be called automatically but is not until a route is visited.\n" +
"Summary A lacks the detail that the issue is with the Next.js framework and does not mention the expectation of the \`register\` function's behavior, which is a key point in the original text.",
choice: 'B'
},
error: undefined
}
```
## Initial evaluation
Now that we have a way to score new titles, let's run an eval and see how our prompt performs across all 10 issues.
```typescript
import { Eval, login } from "braintrust";
login({ apiKey: process.env.BRAINTRUST_API_KEY || "Your Braintrust API Key" });
await Eval("Github Issues Cookbook", {
data: () =>
ISSUE_DATA.map((issue) => ({
input: issue.body,
expected: issue.title,
metadata: issue,
})),
task: generateTitle,
scores: [
async ({ input, output, expected }) =>
Summary({
input,
output,
expected,
model: "gpt-4-1106-preview",
}),
],
});
console.log("Done!");
```
```
{
projectName: 'Github Issues Cookbook',
experimentName: 'main-1706774628',
projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook',
experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook/main-1706774628',
comparisonExperimentName: undefined,
scores: undefined,
metrics: undefined
}
```
```
████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Github Issues Cookbook | 10% | 10/100 datapoints
```
```
Done!
```
Great! We got an initial result. If you follow the link, you'll see an eval result showing an initial score of 40%.

## Debugging failures
Let's dig into a couple examples to see what's going on. Thanks to the instrumentation we added earlier, we can see the model's reasoning for its scores.
Issue [https://github.com/vercel/next.js/issues/59995](https://github.com/vercel/next.js/issues/59995):


Issue [https://github.com/vercel/next.js/issues/59986](https://github.com/vercel/next.js/issues/59986):


## Improving the prompt
Hmm, it looks like the model is missing certain key details. Let's see if we can improve our prompt to encourage the model to include more details, without being too verbose.
```typescript
function titleGeneratorMessages(content: string): ChatCompletionMessageParam[] {
return [
{
role: "system",
content: `Generate a new title based on the github issue. The title should include all of the key
identifying details of the issue, without being longer than one line. Return just the title.`,
},
{
role: "user",
content: "Github issue: " + content,
},
];
}
async function generateTitle(input: string) {
const messages = titleGeneratorMessages(input);
const response = await client.chat.completions.create({
model: "gpt-3.5-turbo",
messages,
seed: 123,
});
return response.choices[0].message.content || "";
}
```
### Re-evaluating
Now that we've tweaked our prompt, let's see how it performs by re-running our eval.
```typescript
await Eval("Github Issues Cookbook", {
data: () =>
ISSUE_DATA.map((issue) => ({
input: issue.body,
expected: issue.title,
metadata: issue,
})),
task: generateTitle,
scores: [
async ({ input, output, expected }) =>
Summary({
input,
output,
expected,
model: "gpt-4-1106-preview",
}),
],
});
console.log("All done!");
```
```
{
projectName: 'Github Issues Cookbook',
experimentName: 'main-1706774676',
projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook',
experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook/main-1706774676',
comparisonExperimentName: 'main-1706774628',
scores: {
Summary: {
name: 'Summary',
score: 0.7,
diff: 0.29999999999999993,
improvements: 3,
regressions: 0
}
},
metrics: {
duration: {
name: 'duration',
metric: 0.3292001008987427,
unit: 's',
diff: -0.002199888229370117,
improvements: 7,
regressions: 3
}
}
}
```
```
████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Github Issues Cookbook | 10% | 10/100 datapoints
```
```
All done!
```
Wow, with just a simple change, we're able to boost summary performance by 30%!

## Parting thoughts
This is just the start of evaluating and improving this AI application. From here, you should dig into
individual examples, verify whether they legitimately improved, and test on more data. You can even use
[logging](https://www.braintrust.dev/docs/guides/logging) to capture real-user examples and incorporate
them into your evals.
Happy evaluating!

---
file: ./content/docs/cookbook/recipes/HTMLGenerator.mdx
meta: {
"title": "Generating beautiful HTML components",
"language": "typescript",
"authors": [
{
"name": "Ankur Goyal",
"website": "https://twitter.com/ankrgyl",
"avatar": "/blog/img/author/ankur-goyal.jpg"
}
],
"date": "2024-01-29",
"tags": [
"logging",
"datasets",
"evals"
]
}
# Generating beautiful HTML components
In this example, we'll build an app that automatically generates HTML components, evaluates them, and captures user feedback. We'll use the feedback and evaluations to build up a dataset
that we'll use as a basis for further improvements.
## The generator
We'll start by using a very simple prompt to generate HTML components using `gpt-3.5-turbo`.
First, we'll initialize an openai client and wrap it with Braintrust's helper. This is a no-op until we start using
the client within code that is instrumented by Braintrust.
```typescript
import { OpenAI } from "openai";
import { wrapOpenAI } from "braintrust";
const openai = wrapOpenAI(
new OpenAI({
apiKey: process.env.OPENAI_API_KEY || "Your OPENAI_API_KEY",
})
);
```
This code generates a basic prompt:
```typescript
import { ChatCompletionMessageParam } from "openai/resources";
function generateMessages(input: string): ChatCompletionMessageParam[] {
return [
{
role: "system",
content: `You are a skilled design engineer
who can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.
Your designs value simplicity, conciseness, clarity, and functionality over
complexity.
You generate pure HTML with inline CSS, so that your designs can be rendered
directly as plain HTML. Only generate components, not full HTML pages. Do not
create background colors.
Users will send you a description of a design, and you must reply with HTML,
and nothing else. Your reply will be directly copied and rendered into a browser,
so do not include any text. If you would like to explain your reasoning, feel free
to do so in HTML comments.`,
},
{
role: "user",
content: input,
},
];
}
JSON.stringify(
generateMessages("A login form for a B2B SaaS product."),
null,
2
);
```
```
[
{
"role": "system",
"content": "You are a skilled design engineer\nwho can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.\nYour designs value simplicity, conciseness, clarity, and functionality over\ncomplexity.\n\nYou generate pure HTML with inline CSS, so that your designs can be rendered\ndirectly as plain HTML. Only generate components, not full HTML pages. Do not\ncreate background colors.\n\nUsers will send you a description of a design, and you must reply with HTML,\nand nothing else. Your reply will be directly copied and rendered into a browser,\nso do not include any text. If you would like to explain your reasoning, feel free\nto do so in HTML comments."
},
{
"role": "user",
"content": "A login form for a B2B SaaS product."
}
]
```
Now, let's run this using `gpt-3.5-turbo`. We'll also do a few things that help us log & evaluate this function later:
* Wrap the execution in a `traced` call, which will enable Braintrust to log the inputs and outputs of the function when we run it in production or in evals
* Make its signature accept a single `input` value, which Braintrust's `Eval` function expects
* Use a `seed` so that this test is reproducible
```typescript
import { traced } from "braintrust";
async function generateComponent(input: string) {
return traced(
async (span) => {
const response = await openai.chat.completions.create({
model: "gpt-3.5-turbo",
messages: generateMessages(input),
seed: 101,
});
const output = response.choices[0].message.content;
span.log({ input, output });
return output;
},
{
name: "generateComponent",
}
);
}
```
### Examples
Let's look at a few examples!
```typescript
await generateComponent("Do a reset password form inside a card.");
```
```
Reset Password
```
To make this easier to validate, we'll use [puppeteer](https://pptr.dev/) to render the HTML as a screenshot.
```typescript
import puppeteer from "puppeteer";
import * as tslab from "tslab";
async function takeFullPageScreenshotAsUInt8Array(htmlContent: string) {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.setContent(htmlContent);
const screenshotBuffer = await page.screenshot();
const uint8Array = new Uint8Array(screenshotBuffer);
await browser.close();
return uint8Array;
}
async function displayComponent(input: string) {
const html = await generateComponent(input);
const img = await takeFullPageScreenshotAsUInt8Array(html);
tslab.display.png(img);
console.log(html);
}
await displayComponent("Do a reset password form inside a card.");
```

```
Reset Password
```
```typescript
await displayComponent("Create a profile page for a social network.");
```

```
John Doe
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla ut turpis
hendrerit, ullamcorper velit in, iaculis arcu.
500
Followers
250
Following
1000
Posts
```
```typescript
await displayComponent(
"Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode."
);
```

```
Logs Viewer
12:30 PMInfo: Cloud instance created successfully
12:45 PMWarning: High CPU utilization on instance #123
01:00 PMError: Connection lost to the database server
```
## Scoring the results
In a few of these examples, the model generates a full HTML page instead of the component we requested. This is something we can evaluate, to make sure it doesn't happen!
```typescript
const containsHTML = (s: string) => /<(html|body)>/i.test(s);
containsHTML(
await generateComponent(
"Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode."
)
);
```
```
true
```
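As a quick sanity check, here's how the same regex behaves on a bare component versus a full page (both strings are made-up examples, not model outputs):

```typescript
// A simple heuristic: flag outputs that include full-page tags.
const containsHTML = (s: string) => /<(html|body)>/i.test(s);

// Hypothetical inputs, for illustration only:
const component = `<div style="padding: 8px;"><button>Log in</button></div>`;
const fullPage = `<html><body><div>Log in</div></body></html>`;

console.log(containsHTML(component)); // false — a bare component passes
console.log(containsHTML(fullPage)); // true — a full page is flagged
```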
Now, let's update our function to compute this score. Let's also keep track of requests and their ids, so that we can provide user feedback. Normally you would store these in a database, but for demo purposes, a global dictionary should suffice.
```typescript
// Normally you would store these in a database, but for this demo we'll just use a global variable.
let requests: Record<string, string> = {};
async function generateComponent(input: string) {
return traced(
async (span) => {
const response = await openai.chat.completions.create({
model: "gpt-3.5-turbo",
messages: generateMessages(input),
seed: 101,
});
const output = response.choices[0].message.content;
requests[input] = span.id;
span.log({
input,
output,
scores: { isComponent: containsHTML(output) ? 0 : 1 },
});
return output;
},
{
name: "generateComponent",
}
);
}
```
## Logging results
To enable logging to Braintrust, we just need to initialize a logger. By default, the logger is marked as the current, global logger, and once initialized, it is automatically picked up by `traced`.
```typescript
import { initLogger } from "braintrust";
const logger = initLogger({
projectName: "Component generator",
apiKey: process.env.BRAINTRUST_API_KEY || "Your BRAINTRUST_API_KEY",
});
```
Now, we'll run the `generateComponent` function on a few examples, and see what the results look like in Braintrust.
```typescript
const inputs = [
"A login form for a B2B SaaS product.",
"Create a profile page for a social network.",
"Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode.",
];
for (const input of inputs) {
await generateComponent(input);
}
console.log(`Logged ${inputs.length} requests to Braintrust.`);
```
```
Logged 3 requests to Braintrust.
```
### Viewing the logs in Braintrust
Once this runs, you should be able to see the raw inputs and outputs, along with their scores in the project.

### Capturing user feedback
Let's also track user ratings for these components. Separate from whether or not they're formatted as HTML, it'll be useful to track whether users like the design.
To do this, [configure a new score in the project](https://www.braintrust.dev/docs/guides/human-review#configuring-human-review). Let's call it "User preference" and make it a 👍/👎.

Once you create a human review score, you can evaluate results directly in the Braintrust UI, or capture end-user feedback. Here, we'll pretend to capture end-user feedback. Personally, I liked the login form and logs viewer, but not the profile page. Let's record feedback accordingly.
```typescript
// Along with scores, you can optionally log user feedback as comments, for additional color.
logger.logFeedback({
id: requests["A login form for a B2B SaaS product."],
scores: { "User preference": 1 },
comment: "Clean, simple",
});
logger.logFeedback({
id: requests["Create a profile page for a social network."],
scores: { "User preference": 0 },
});
logger.logFeedback({
id: requests[
"Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode."
],
scores: { "User preference": 1 },
comment:
"No frills! Would have been nice to have borders around the entries.",
});
```
As users provide feedback, you'll see the updates they make in each log entry.

## Creating a dataset
Now that we've collected some interesting examples from users, let's collect them into a dataset, and see if we can improve the `isComponent` score.
In the Braintrust UI, select the examples, and add them to a new dataset called "Interesting cases".

Once you create the dataset, it should look something like this:

## Evaluating
Now that we have a dataset, let's evaluate the `generateComponent` function on it. We'll use the `Eval` function, which takes a dataset and a task function, and runs the task on each example in the dataset.
```typescript
import { Eval, initDataset } from "braintrust";
await Eval("Component generator", {
data: async () => {
const dataset = initDataset("Component generator", {
dataset: "Interesting cases",
});
const records = [];
for await (const { input } of dataset.fetch()) {
records.push({ input });
}
return records;
},
task: generateComponent,
// We do not need to add any additional scores, because our
// generateComponent() function already computes `isComponent`
scores: [],
});
```
Once the eval runs, you'll see a summary which includes a link to the experiment. As expected, two of the three outputs contain full HTML pages, so the score is 33.3%. Let's also label user preference for this experiment, so we can track aesthetic taste manually. For simplicity's sake, we'll use the same labels as before.
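For context on the arithmetic: the experiment's `isComponent` score is simply the mean of the per-row values, where 1 means a bare component and 0 means a full page (the row values below are illustrative, matching the run above):

```typescript
// Hypothetical per-row isComponent scores: only one of three outputs passed.
const rowScores = [1, 0, 0];

const mean = rowScores.reduce((sum, s) => sum + s, 0) / rowScores.length;
console.log(`${(mean * 100).toFixed(1)}%`); // → "33.3%"
```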

### Improving the prompt
Next, let's try to tweak the prompt to stop rendering full HTML pages.
```typescript
function generateMessages(input: string): ChatCompletionMessageParam[] {
return [
{
role: "system",
content: `You are a skilled design engineer
who can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.
Your designs value simplicity, conciseness, clarity, and functionality over
complexity.
You generate pure HTML with inline CSS, so that your designs can be rendered
directly as plain HTML. Only generate components, not full HTML pages. If you
need to add CSS, you can use the "style" property of an HTML tag. You cannot use
global CSS in a