# integration-inspect-osworld (OSWorld via Inspect)

This example runs a real [OSWorld](https://github.com/xlang-ai/OSWorld) task through promptfoo by wrapping the Inspect-native implementation in [`inspect_evals/osworld`](https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/osworld).

OSWorld is a multimodal computer-use benchmark where an agent observes an Ubuntu desktop via screenshots, acts with mouse and keyboard tools, and is graded by task-specific checks against VM state. The benchmark is described in [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https://arxiv.org/abs/2404.07972).

This is an orchestration wrapper, not a from-scratch promptfoo-native computer-use agent loop. Inspect owns the Docker sandbox, `basic_agent` solver, `computer` tool, screenshots, model calls, and OSWorld scorer. Promptfoo starts one Inspect eval, dumps the `.eval` log to JSON, parses the final score, and applies a normal promptfoo assertion.

## Prerequisites

You need:

- Docker Engine 24.0.6 or newer, running and usable by your current user.
- Docker Compose V2 available as `docker compose`. Inspect validates this with `docker compose version --format json`; a standalone `docker-compose` binary is not enough unless your `docker` command exposes it as `docker compose`.
- Python with Inspect's OSWorld dependencies, Promptfoo's Python OpenTelemetry dependencies, and the SDK for whichever model provider you choose. This installs both SDKs used below:

  ```bash
  pip install 'inspect-evals[osworld]' openai anthropic opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
  ```

- A computer-use-capable model and API key. For the default config, export `OPENAI_API_KEY`. To use Anthropic instead, export `ANTHROPIC_API_KEY` and set `vars.model` or `providers[0].config.defaultModel` to an Inspect model such as `anthropic/claude-sonnet-4-5`.
- Disk and time for Inspect's OSWorld Docker image.
  The first run builds an image of roughly 8GB and can take several minutes before the sample starts.
- Budget for a non-trivial model run. Start with one exact sample before expanding to a larger subset or the full suite.

The default config uses `inspect_evals/osworld_small`, the smaller OSWorld corpus supported by Inspect. `promptfooconfig.full.yaml` switches to `inspect_evals/osworld` with `include_connected=true`, which loads every Inspect-supported full-corpus sample. In the Inspect version used for this example, that is 246 samples, not the 369-task upstream OSWorld paper corpus.

## Run

For the first real verification from the repository root, run one exact sample:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  npm run local -- eval -c examples/integration-inspect-osworld/promptfooconfig.yaml --no-cache \
  --filter-metadata sample_id=42e0a640-4f19-4b28-973d-729602b5a4a7
```

Or, after copying the example with `npx promptfoo@latest init --example integration-inspect-osworld`, run:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  promptfoo eval -c promptfooconfig.yaml --no-cache \
  --filter-metadata sample_id=42e0a640-4f19-4b28-973d-729602b5a4a7
```

After that succeeds, broaden to an app subset:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  promptfoo eval -c promptfooconfig.yaml --no-cache \
  --filter-metadata app=libreoffice_calc --max-concurrency 1 \
  -o osworld-libreoffice-calc.json
```

App filters are still multi-sample runs. In the current `osworld_small` set, `app=libreoffice_calc` selects three samples; in one local GPT-5.5 verification on April 29, 2026, that sequential subset took 12m31s and used 533,101 total tokens. Treat that as scale guidance, not a fixed benchmark.
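
For rough budgeting, the subset figures above can be scaled to a larger run. The sketch below is only a back-of-the-envelope estimate: it assumes tokens and sequential wall time grow linearly with sample count, which real OSWorld samples will not follow exactly. The `SubsetRun` and `extrapolate` names are hypothetical helpers, not part of this example's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubsetRun:
    """Observed totals for one sequential run (figures quoted above)."""

    samples: int
    wall_seconds: int
    total_tokens: int


def extrapolate(run: SubsetRun, target_samples: int) -> tuple[int, int]:
    """Scale tokens and sequential wall time linearly to target_samples.

    Linear scaling is an assumption; per-sample cost varies widely.
    """
    tokens = round(run.total_tokens / run.samples * target_samples)
    seconds = round(run.wall_seconds / run.samples * target_samples)
    return tokens, seconds


# The three-sample libreoffice_calc subset: 12m31s, 533,101 tokens.
calc_subset = SubsetRun(samples=3, wall_seconds=12 * 60 + 31, total_tokens=533_101)

# Estimate for a 21-sample run executed one sample at a time.
tokens, seconds = extrapolate(calc_subset, target_samples=21)
print(f"~{tokens:,} tokens, ~{seconds // 60}m{seconds % 60:02d}s sequential")
```

Higher `--max-concurrency` values shorten wall time but not token spend, so treat the time estimate as an upper bound for a sequential run only.
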

To run the full supported small suite, remove the metadata filter and set a concurrency appropriate for your machine:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  promptfoo eval -c promptfooconfig.yaml --no-cache --max-concurrency 6 \
  -o osworld-results.json
```

To run Inspect's full supported corpus through Promptfoo, use the dedicated full-suite config:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  promptfoo eval -c promptfooconfig.full.yaml --no-cache --max-concurrency 3 \
  -o osworld-full-results.json
```

That config keeps the same wrapper but switches both moving pieces that define the run:

```yaml
providers:
  - id: file://provider.py
    config:
      task: inspect_evals/osworld
      taskParameters:
        include_connected: true

tests: file://osworld_tests.py:generate_full_tests
```

Because the full config includes connected samples, it is more sensitive to the runtime network environment than the default small-suite config.

The full config also uses larger timeouts than the small config:

- `timeout: 7500000` gives Promptfoo's Python worker a little over two hours.
- `timeoutSeconds: 7200` gives the inner Inspect subprocess two hours.

Some full-suite Writer rows can exceed the small config's 30-minute timeout budget, so keep the full-suite timeouts larger than the exact-sample and small-suite defaults.

`promptfooconfig.yaml` keeps the shared assertion and tracing metadata in `defaultTest`, then asks `osworld_tests.py` to generate the OSWorld rows:

```yaml
defaultTest:
  metadata:
    tracingEnabled: true
  assert:
    - type: python
      value: file://assertion.py

tests: file://osworld_tests.py:generate_tests
```

The loader calls Inspect's `osworld_small().dataset` or `osworld(include_connected=True).dataset` and returns one Promptfoo test case per supported sample. Each row sets `vars.prompt`, `vars.app`, `vars.sample_id`, and matching filterable metadata.
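
The row shape described above can be sketched as follows. This is not the code in `osworld_tests.py`: `FakeSample` is a hypothetical stand-in for an Inspect dataset sample, and the way the real loader derives the app id from a sample is an assumption here. Only the `vars` and `metadata` keys come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FakeSample:
    """Hypothetical stand-in for one Inspect OSWorld dataset sample."""

    id: str
    input: str
    app: str  # assumption: the real loader derives this from the sample


def to_promptfoo_tests(samples: Iterable[FakeSample]) -> list[dict]:
    """Build one Promptfoo test case per sample with matching vars and metadata."""
    tests = []
    for sample in samples:
        tests.append(
            {
                "description": f"{sample.app}: {sample.id}",
                "vars": {
                    "prompt": sample.input,
                    "app": sample.app,
                    "sample_id": sample.id,
                },
                # Duplicated into metadata so --filter-metadata can select rows.
                "metadata": {"app": sample.app, "sample_id": sample.id},
            }
        )
    return tests


# Illustrative sample only; the id and prompt are made up.
rows = to_promptfoo_tests(
    [FakeSample(id="example-sample-id", input="Sort column A.", app="libreoffice_calc")]
)
print(rows[0]["metadata"])
```

Keeping `app` and `sample_id` in both `vars` and `metadata` is what lets the same row feed the provider and remain selectable with `--filter-metadata`.
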
Because Inspect supplies the sample ids, updating `inspect-evals` updates the generated row list without maintaining a local copy.

The default config runs the full Inspect-supported `osworld_small` suite, which is 21 samples in the version used for the reference run below. To run a broader subset after the exact-sample check, filter by app metadata:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  promptfoo eval -c promptfooconfig.yaml --no-cache --filter-metadata app=libreoffice_calc
```

To run the smallest real end-to-end validation, filter by `sample_id`:

```bash
PROMPTFOO_ENABLE_OTEL=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318 \
  promptfoo eval -c promptfooconfig.yaml --no-cache \
  --filter-metadata sample_id=42e0a640-4f19-4b28-973d-729602b5a4a7
```

For custom subsets, filter by metadata at the CLI. The generated metadata uses OSWorld app ids, normalizes VS Code to `vscode`, and keeps multi-app tasks under `multi_apps`.

Use the run scopes intentionally:

1. `mockllm/model --limit 0` checks the Inspect CLI shape without model spend.
2. `--filter-metadata sample_id=...` is the smallest real end-to-end validation.
3. `--filter-metadata app=...` is a broader app slice and may include multiple samples.
4. No filter on `promptfooconfig.yaml` runs the full small suite.
5. `promptfooconfig.full.yaml` runs Inspect's full supported corpus and is the benchmark-style configuration.

The example config sets two timeouts because both layers need enough time:

- `providers[0].config.timeout` is promptfoo's Python worker timeout in milliseconds.
- `providers[0].config.timeoutSeconds` is the inner Inspect subprocess timeout in seconds.
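
Because the two timeouts use different units, it is easy to set one that silently undercuts the other. The sketch below compares them in seconds, assuming the intent is that the outer worker timeout should not fire before the inner Inspect subprocess timeout. The helper names are hypothetical; the values are the ones quoted for `promptfooconfig.full.yaml`.

```python
def timeout_headroom_seconds(provider_config: dict) -> float:
    """Return how many seconds the outer worker timeout exceeds the inner one.

    `timeout` is in milliseconds and `timeoutSeconds` is in seconds, matching
    the two provider config keys described above.
    """
    outer_seconds = provider_config["timeout"] / 1000
    inner_seconds = provider_config["timeoutSeconds"]
    return outer_seconds - inner_seconds


def check_timeouts(provider_config: dict) -> float:
    """Raise if the worker could time out before the Inspect subprocess does."""
    headroom = timeout_headroom_seconds(provider_config)
    if headroom <= 0:
        raise ValueError(
            f"worker timeout leaves {headroom:.0f}s of headroom; "
            "raise `timeout` or lower `timeoutSeconds`"
        )
    return headroom


# Values quoted for the full-suite config.
full_config = {"timeout": 7_500_000, "timeoutSeconds": 7_200}
print(f"headroom: {check_timeouts(full_config):.0f}s")
```

With the full-suite values, the worker outlasts the subprocess by five minutes, which leaves time for log dumping and parsing after Inspect exits.
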

## Expected output

The provider returns text like:

```text
Sample on app libreoffice_calc: score=1.0 status=pass
Final answer:
```

It also returns metadata for the promptfoo UI and assertions:

```json
{
  "inspect_log_path": "/absolute/path/to/examples/integration-inspect-osworld/inspect_logs/.../*.eval",
  "score": 1.0,
  "status": "pass",
  "sample_id": "...",
  "model": "openai/gpt-5.5",
  "num_messages": 42,
  "duration_seconds": 600.0
}
```

The Python assertion passes when `metadata.score >= 1.0` or `metadata.status == "pass"`.

If Inspect exits before a scored sample is available, or if the selected sample has no OSWorld scorer result, the provider returns an error instead of converting that condition into a benchmark failure. For subprocess failures, Promptfoo stores only a concise error plus the local log path/status/duration; inspect the local Inspect logs when you need the detailed trajectory or raw tool output.

## Reference GPT-5.5 run

A local traced run of the generated 21-sample suite with exact `sample_id` selectors and `--max-concurrency 6` completed in 20m 9s. GPT-5.5 passed 13 samples and produced 7 scored failures. One concurrent run hit an Inspect computer-tool runtime error before scoring; rerunning that exact `sample_id` alone with `--max-concurrency 1` completed normally with score `0.0`. After that rerun, the report had 13 passes, 8 scored failures, 0 provider errors, and mean OSWorld score `0.665`. Promptfoo recorded 21 trace records and 21 Python provider spans for the concurrent run.

For larger benchmark reports, rerun provider-error samples by exact `sample_id` before publishing a pass rate. Count reruns that produce an OSWorld score as normal passes or failures, and keep repeated provider errors separate from scored benchmark failures.

For the full supported corpus, a local GPT-5.5 run on April 30, 2026 used `promptfooconfig.full.yaml`, `--max-concurrency 3`, and a 6-vCPU / 16-GiB Colima VM.
The 246-sample run took 5h27m5s and used 54,421,072 total tokens. The raw run ended at 138 passes, 101 scored failures, and 7 provider errors. Rerunning those seven rows one at a time recovered one pass and two ordinary scored failures; four rows repeated as provider errors. The reconciled report was therefore 139 passes, 103 scored failures, 4 provider errors, and mean OSWorld score `0.594` across the 242 scored rows. The seven targeted reruns added 1,917,890 tokens.

The repeated provider errors were not model failures: one row reproduced an Inspect computer-tool runtime error, one row reproduced an OSWorld scorer missing-image-artifact error, and two VLC rows reproduced an OSWorld scorer environment error. Keep those rows outside the scored denominator unless a later rerun produces an OSWorld score.

## Inspect logs and traces

Inspect writes `.eval` files under `examples/integration-inspect-osworld/inspect_logs/`. They are ignored by git because they can include screenshots, trajectories, tool calls, model outputs, and other large run artifacts.

For trace-level visibility into the OSWorld desktop trajectory, use Inspect's viewer:

```bash
inspect view --log-dir examples/integration-inspect-osworld/inspect_logs
```

The example config enables Promptfoo OpenTelemetry tracing. Set `PROMPTFOO_ENABLE_OTEL=true` for Python provider spans. This records the Python provider call and links it to the eval result, but it does not translate Inspect's internal screenshots, mouse moves, keyboard actions, or scorer events into Promptfoo trajectory spans. Use Inspect's `.eval` log for those steps.

## Smoke test without model spend

To check the Inspect CLI shape without running a full OSWorld sample:

```bash
inspect eval inspect_evals/osworld_small --model mockllm/model --limit 0 --log-dir
inspect log dump
```

A real end-to-end OSWorld run still requires Docker, the first-run image build, and provider credentials. Use one exact sample before spending on larger slices.