# Codex App Server Provider Notes

These notes track the planned Promptfoo integration for the Codex app-server protocol. They are intentionally implementation-facing: keep them current as the provider, docs, examples, and verification expand.

For the broader coding-agent provider taxonomy, see [`coding-agent-provider-taxonomy.md`](./coding-agent-provider-taxonomy.md).

## Objective

Add an experimental Promptfoo provider that drives `codex app-server` directly. The provider should complement, not replace, the existing OpenAI Codex SDK provider:

- Codex SDK provider: best default for CI and automation.
- Codex app-server provider: best for evaluating rich-client behavior exposed by the Codex app-server protocol, including streamed item events, approvals, skills, plugins, apps, filesystem requests, and thread lifecycle primitives.

Primary provider IDs:

- `openai:codex-app-server`
- `openai:codex-app-server:`
- `openai:codex-desktop`
- `openai:codex-desktop:`

Optional top-level aliases may be added after the OpenAI-scoped provider is stable:

- `codex:app-server`
- `codex:desktop`

## Source Material

- Official docs: https://developers.openai.com/codex/app-server
- Local CLI: `/Applications/Codex.app/Contents/Resources/codex app-server --help`
- Local generated schema command:

```bash
codex app-server generate-ts --out /tmp/codex-app-server-schema/ts
codex app-server generate-json-schema --out /tmp/codex-app-server-schema/json
```

Current local schema inspection was generated from `codex-cli 0.118.0`.

## Protocol Shape

Transport:

- `stdio://` default, JSONL messages.
- `ws://IP:PORT` experimental, one JSON-RPC message per WebSocket text frame.

Handshake:

1. Send `initialize` with Promptfoo client metadata.
2. Send `initialized` notification.
3. Start or resume a thread.
4. Start a turn.
5. Read notifications until `turn/completed`.
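
The handshake order and JSONL framing above can be sketched roughly as follows. This is a minimal TypeScript sketch under stated assumptions, not the provider's implementation: the helper names (`encodeFrame`, `JsonlDecoder`, `classify`, `buildHandshake`, `buildTurnStart`) are hypothetical, and the parameter shapes (`clientInfo`, `threadId`, `input`) are illustrative guesses that should be checked against the generated schema. Whether a `jsonrpc` version field appears on the wire is not stated in these notes, so it is left out here.

```typescript
// Hypothetical frame shape following JSON-RPC conventions.
// Verify real field names against the generated app-server schema.
type Frame = {
  id?: number | string;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: unknown;
};

type FrameKind = "response" | "notification" | "serverRequest" | "unknown";

// Stdio transport: one JSON value per line (JSONL).
function encodeFrame(frame: Frame): string {
  return JSON.stringify(frame) + "\n";
}

// Buffers partial lines so a frame split across stdout chunks is parsed
// only once its terminating newline has arrived.
class JsonlDecoder {
  private buffer = "";

  push(chunk: string): Frame[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is "" or an incomplete line; keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as Frame);
  }
}

// Routes an incoming frame: a method plus an id means the server expects a reply.
function classify(frame: Frame): FrameKind {
  const hasId = frame.id !== undefined;
  const hasMethod = typeof frame.method === "string";
  if (hasMethod && hasId) return "serverRequest";
  if (hasMethod) return "notification";
  if (hasId) return "response";
  return "unknown";
}

// Frames sent before a turn can start (assumed parameter shapes).
function buildHandshake(clientName: string, version: string): Frame[] {
  return [
    { id: 1, method: "initialize", params: { clientInfo: { name: clientName, version } } },
    { method: "initialized" },
    { id: 2, method: "thread/start", params: {} },
  ];
}

// Built separately because the thread id is only known after the
// thread/start response has been received.
function buildTurnStart(threadId: string, text: string): Frame {
  return { id: 3, method: "turn/start", params: { threadId, input: [{ type: "text", text }] } };
}
```

A client loop would write the encoded handshake frames to the child's stdin, feed stdout chunks into `JsonlDecoder.push`, and dispatch each frame on `classify` until a `turn/completed` notification arrives.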

Core client requests:

- `initialize`
- `thread/start`
- `thread/resume`
- `thread/archive`
- `thread/unsubscribe`
- `thread/read`
- `turn/start`
- `turn/steer`
- `turn/interrupt`
- `review/start`
- `model/list`
- `skills/list`
- `plugin/list`
- `plugin/read`
- `app/list`

High-risk client requests that should not be exposed casually:

- `fs/writeFile`
- `fs/remove`
- `fs/copy`
- `config/value/write`
- `config/batchWrite`
- `plugin/install`
- `plugin/uninstall`
- `command/exec`

Core server notifications:

- `thread/started`
- `thread/status/changed`
- `turn/started`
- `turn/completed`
- `item/started`
- `item/completed`
- `item/agentMessage/delta`
- `item/commandExecution/outputDelta`
- `item/fileChange/outputDelta`
- `item/mcpToolCall/progress`
- `serverRequest/resolved`
- `thread/tokenUsage/updated`
- `error`

Core server requests requiring deterministic Promptfoo responses:

- `item/commandExecution/requestApproval`
- `item/fileChange/requestApproval`
- `item/permissions/requestApproval`
- `item/tool/requestUserInput`
- `mcpServer/elicitation/request`
- `item/tool/call`

## Provider Contract

### Inputs

Promptfoo prompt strings remain the default. The provider should also accept a JSON array of Codex input items:

```json
[
  { "type": "text", "text": "Review this project" },
  { "type": "local_image", "path": "/absolute/path/to/screenshot.png" },
  { "type": "skill", "name": "skill-creator", "path": "/absolute/path/SKILL.md" }
]
```

Supported input item types for the first implementation:

- `text`
- `local_image`, mapped to app-server `inputImage`
- `skill`

Unknown prompt JSON should be treated as plain text instead of throwing.

### Output

The provider response should include:

- `output`: final assistant text, assembled from `item/agentMessage/delta` and completed `agentMessage` items.
- `sessionId`: thread id.
- `raw`: serialized protocol-level turn summary and selected notifications.
- `metadata.codexAppServer`: thread id, turn id, model, cwd, sandbox, approvals, server requests, item counts, command/file/tool trajectories, and app-server protocol data useful for debugging.
- `metadata.skillCalls` / `metadata.attemptedSkillCalls`: heuristic skill usage where available.
- `tokenUsage`: from `thread/tokenUsage/updated` if emitted.
- `cost`: estimated from Promptfoo's Codex pricing table when model is known.

### Config

Provider-level config should be strict. Prompt-level merged config should strip unknown keys so generic Promptfoo prompt config does not break rows.

Core config:

- `apiKey`
- `base_url`
- `working_dir`
- `additional_directories`
- `skip_git_repo_check`
- `codex_path_override`
- `model`
- `model_provider`
- `service_tier`
- `sandbox_mode`
- `sandbox_policy`
- `approval_policy`
- `approvals_reviewer`
- `model_reasoning_effort`
- `reasoning_summary`
- `personality`
- `output_schema`
- `thread_id`
- `persist_threads`
- `thread_pool_size`
- `ephemeral`
- `persist_extended_history`
- `experimental_raw_events`
- `experimental_api`
- `cli_config`
- `cli_env`
- `inherit_process_env`
- `reuse_server`
- `deep_tracing`
- `request_timeout_ms`
- `startup_timeout_ms`
- `server_request_policy`

### Safety Defaults

Default stance should favor repeatable evals over convenience:

- `approval_policy`: `never`
- `sandbox_mode`: `read-only`
- `network_access_enabled`: `false`
- `ephemeral`: `true`
- `reuse_server`: `true` unless `deep_tracing` is enabled
- `inherit_process_env`: `false`
- Server-side approval requests: decline/cancel or empty grants unless explicitly configured.

Rationale:

- The app-server exposes shell, filesystem, app connector, plugin, and config surfaces.
- Promptfoo evals should be deterministic and should not block on human approval.
- Eval prompts and target behavior can be adversarial.

## Implementation Phases

1. Stdio JSON-RPC client
   - Spawn `codex app-server`.
   - Parse JSONL stdout.
   - Route responses, notifications, and server requests.
   - Capture stderr for debug logs.
   - Support abort and timeout.
2. Provider lifecycle
   - Register with `providerRegistry`.
   - Reuse app-server process by default.
   - Shut down child processes, pending requests, and readline handles.
   - Disable reuse when `deep_tracing` is enabled.
3. Thread and turn execution
   - Validate working directory.
   - Start/resume threads.
   - Start turns with prompt input, model, cwd, sandbox, approvals, effort, personality, service tier, and output schema.
   - Serialize turns per reused thread.
   - Unsubscribe/archive non-persistent threads during cleanup.
4. Streaming aggregation
   - Track `turnId`.
   - Assemble assistant deltas.
   - Store completed items.
   - Build item counts and trajectory metadata.
   - Capture command output, file changes, MCP calls, dynamic tool calls, web search, plans, reasoning summaries, and review output.
5. Server request handling
   - Deterministically answer command approvals.
   - Deterministically answer file-change approvals.
   - Return empty permission grants by default.
   - Support configured answers for `tool/requestUserInput`.
   - Decline/cancel MCP elicitation by default.
   - Support static dynamic-tool responses.
   - Record all requests and decisions in metadata.
6. Tracing
   - Wrap `callApi` in `withGenAISpan`.
   - Add item-level spans when streaming notifications arrive.
   - Sanitize command output, tool arguments, and message text before trace attributes.
   - Inject OTEL env when `deep_tracing` is enabled.
7. Docs and examples
   - Add provider docs.
   - Add provider index entry.
   - Add examples for basic usage, read-only repo review, structured output, approval handling, skills, and tracing.
8. Verification
   - Unit tests with mocked child process and mocked protocol frames.
   - Registry tests for provider IDs and model-in-path parsing.
   - Docs/examples lint where applicable.
   - Local smoke config using a harmless prompt and `sandbox_mode: read-only` if credentials/login are available.
   - Final dogfood: run the new provider against the git diff and iterate on comments.

## Progress Log

### 2026-04-09

- Added initial provider implementation at `src/providers/openai/codex-app-server.ts`.
- Added provider IDs under the OpenAI registry:
  - `openai:codex-app-server`
  - `openai:codex-app-server:`
  - `openai:codex-desktop`
  - `openai:codex-desktop:`
- Implemented stdio JSON-RPC lifecycle:
  - spawn `codex app-server --listen stdio://`
  - `initialize`
  - `initialized`
  - `thread/start`
  - `thread/resume`
  - `turn/start`
  - notification handling through `turn/completed`
  - `thread/unsubscribe`/`thread/archive` cleanup modes
- Implemented safe config defaults:
  - `approval_policy: never`
  - `sandbox_mode: read-only`
  - `ephemeral: true`
  - `thread_cleanup: unsubscribe`
  - process env isolation unless `inherit_process_env: true`
- Implemented deterministic server request responses:
  - command execution approvals default to `decline`
  - file changes default to `decline`
  - permission requests default to empty grants
  - user input requests default to empty answers
  - MCP elicitations default to `decline`
  - dynamic tools can use static configured responses
- Implemented output normalization:
  - assistant delta aggregation
  - completed `agentMessage` fallback/preference
  - token usage from `thread/tokenUsage/updated`
  - cost estimate for known Codex models
  - metadata with item counts, items, server request decisions, thread/turn ids
- Implemented provider-level GenAI tracing and item spans.
- Added mocked protocol tests in `test/providers/openai-codex-app-server.test.ts`.
- Added registry tests in `test/providers/index.test.ts`.
- Verification so far:
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - `npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false`
  - `npm run tsc -- --pretty false`
- Expanded mocked protocol tests to cover:
  - `thread/resume`
  - structured prompt input normalization
  - default `thread/unsubscribe` cleanup
  - user input request policy
  - dynamic tool static response policy
  - metadata sanitization
- Ran a real local smoke eval through `npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache -o /tmp/promptfoo-codex-app-server-example.json`.
  - Result: pass.
  - Provider returned Codex app-server `sessionId`, token usage, item counts, thread id, turn id, and structured JSON output.
- Ran docs build:
  - `cd site && SKIP_OG_GENERATION=true npm run build`
  - Result: pass.
- First dogfood review through `examples/openai-codex-app-server/review-diff/promptfooconfig.yaml` found four actionable provider issues:
  - startup timeout could leak a spawned app-server and leave a rejected reusable connection promise cached
  - reused connections closed over the first turn's server request policy
  - app-server exit during a turn could leave the eval waiting forever when no turn timeout was configured
  - `raw` response payload serialized unsanitized protocol items
- Fixed the dogfood findings and added regression coverage:
  - failed startup closes the process and a later call spawns a fresh process
  - active turns store their effective config so prompt-level server request policies are honored on reused servers
  - connection exit resolves active turns with a provider error and removes the dead connection from reuse maps
  - `raw` now contains sanitized thread, turn, token usage, notifications, and item metadata
  - final output now uses the last completed `agentMessage`, which avoids concatenating progress messages with final structured review output
- Verification after fixes:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 13 provider tests.
- Second dogfood review passed the Promptfoo eval and returned valid JSON, but still reported two provider comments:
  - legacy `execCommandApproval` / `applyPatchApproval` requests identify the active thread with `conversationId` and expect legacy review decisions
  - persistent thread-pool eviction deleted local handles without unsubscribing the evicted loaded thread
- Additional hardening from the second dogfood pass:
  - stdio parser now buffers partial JSON-RPC lines and rejoins literal newlines inside command-output strings as escaped newlines before parsing
  - legacy approval requests now map prompt-level policy to `approved`, `approved_for_session`, `denied`, and `abort`
  - legacy server requests can find active turn state by `conversationId`
  - evicted persistent cached threads now send `thread/unsubscribe` before being removed
  - added regression coverage for literal-newline JSON-RPC notifications, legacy approval requests, and persistent thread-pool eviction
- Verification after second dogfood fixes:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 16 provider tests.
- Third dogfood review passed transport/eval and reported two lifecycle comments:
  - stale persistent thread handles remained after a reused app-server process exited
  - JSON-RPC request timeout cleanup removed the pending request but left an abort listener attached
- Additional hardening from the third dogfood pass:
  - connection close now removes cached thread handles owned by that connection key
  - per-request timeout cleanup now removes abort listeners before rejecting
  - added regression coverage for cached-thread invalidation after process exit and abort-listener cleanup on JSON-RPC timeout
- Verification after third dogfood fixes:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 18 provider tests.
- Fourth dogfood review passed transport/eval and reported two thread-cache comments:
  - persistent thread caching was still enabled for fresh-per-call app-server processes (`reuse_server: false` or `deep_tracing`)
  - pool eviction could unsubscribe an active cached thread before its turn completed
- Additional hardening from the fourth dogfood pass:
  - thread caching is now allowed only when the app-server connection itself is reusable
  - active/reserved thread ids are protected with a small refcount while a call is using them
  - thread-pool eviction skips protected threads and temporarily allows the pool to exceed its soft cap rather than evicting an in-flight turn
  - added regression coverage for non-reusable persistent-thread configs and active-turn eviction avoidance
- Verification after fourth dogfood fixes:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 20 provider tests.
- Fifth dogfood review passed transport/eval and reported one persistent-thread race:
  - concurrent calls with the same persistent-thread cache key could both miss the cache while the first `thread/start` was still pending, creating duplicate persistent threads and leaking the earlier one
- Additional hardening from the fifth dogfood pass:
  - added an in-flight thread promise map keyed by thread cache key
  - concurrent same-cache `thread/start` / `thread/resume` callers now share the same pending thread handle
  - added regression coverage for concurrent same-cache persistent calls, ensuring only one `thread/start` is sent and both turns use the shared thread
- Verification after fifth dogfood fix:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 21 provider tests.
- Sixth dogfood review passed transport/eval and reported two cache/default comments:
  - reusable connections could keep the first request timeout for requests that did not pass a per-call timeout
  - persistent thread cache keys omitted thread-start options such as `ephemeral`, `experimental_raw_events`, and `persist_extended_history`
- Additional hardening from the sixth dogfood pass:
  - all provider-owned app-server requests now pass the effective per-call request timeout explicitly
  - persistent thread cache keys now include thread-start options that can change thread semantics
  - added regression coverage for prompt-level request timeouts on reused connections and thread-start option changes in persistent cache keys
- Verification after sixth dogfood fixes:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 23 provider tests.
- Seventh dogfood retry:
  - first attempt hit an external Codex connectivity failure (`Network is unreachable`, `Reconnecting... 2/5`), which the provider surfaced as a clean provider error
  - retry completed transport/eval and reported one metadata issue: skill-root detection used the parent process env instead of the resolved app-server child env
- Additional hardening from the seventh dogfood pass:
  - turn state now carries the resolved app-server environment produced by `prepareEnvironment`
  - skill root detection now uses the child env for `CODEX_HOME`, `HOME`, and `USERPROFILE`
  - added regression coverage for `cli_env.HOME` skill-call metadata detection
- Verification after seventh dogfood fix:
  - `npm run tsc -- --pretty false`
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`
  - Result: pass, 24 provider tests.
- Eighth dogfood review:
  - `npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-review.json`
  - Result: pass.
  - Provider output: `{"comments":[],"summary":"No actionable findings; TypeScript and focused provider tests passed."}`
  - This confirms the provider can be used to review the current git diff and return schema-valid JSON with no remaining actionable comments from the dogfood reviewer.
- Final verification sweep:
  - `npm run f`: pass with existing complexity warnings only; no formatting changes needed
  - `npm run tsc -- --pretty false`: pass
  - `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`: pass, 24 provider tests
  - `npx vitest run test/providers/index.test.ts -t "Codex app-server|Codex desktop" --sequence.shuffle=false`: pass, 2 registry tests
  - `npm run l`: pass with existing complexity warnings only
  - `cd site && SKIP_OG_GENERATION=true npm run build`: pass
  - `npm run local -- eval -c examples/openai-codex-app-server/promptfooconfig.yaml --no-cache --no-share -o /tmp/promptfoo-codex-app-server-example.json`: pass

## QA Matrix

Required mocked unit tests:

- Constructor defaults and strict config validation.
- Prompt-level unknown config stripping.
- Missing/inaccessible/non-directory working directory.
- Git check and `skip_git_repo_check`.
- API key/env isolation and explicit `cli_env`.
- Handshake order: `initialize`, `initialized`, `thread/start`, `turn/start`.
- Model from provider path overrides/defaults correctly.
- `thread_id` uses `thread/resume`.
- `persist_threads` reuses cached thread and serializes turns.
- Non-persistent calls unsubscribe/archive as configured.
- Assistant deltas aggregate into final output.
- Completed `agentMessage` fallback works when deltas are missing.
- Token usage from `thread/tokenUsage/updated`.
- Error notification produces provider error.
- Failed turn produces provider error.
- Abort before start.
- Abort during turn sends `turn/interrupt` and returns aborted error.
- Command approval request default decline.
- File change request default decline.
- Permission request default empty grant.
- User input request configured answers.
- Dynamic tool call configured static response.
- MCP elicitation default decline/cancel.
- Metadata contains item counts, trajectories, approvals, raw notifications, and server request decisions without leaking API keys.
- `cleanup` kills child process and unregisters provider.
- `deep_tracing` injects OTEL env and disables reuse/thread persistence.
- Provider-level GenAI tracing records response body, token usage, session id, and item count attributes.

Required docs/examples checks:

- Provider docs render in Docusaurus.
- Examples are listed and runnable from repo root with `npm run local -- eval ... --no-cache`.
- Config docs call out experimental status, safety defaults, and difference from Codex SDK.

## Open Questions

- Whether to expose WebSocket transport in the first public version. Stdio is enough for Promptfoo-managed app-server processes; WebSocket is useful for external clients but adds auth and lifecycle complexity.
- Whether to support top-level `codex:*` aliases immediately or keep all new IDs under `openai:*` for consistency with the existing Codex SDK provider.
- Whether to persist generated app-server protocol types in source. The current plan is to implement a narrow local type surface and document how to regenerate schemas instead of committing a large generated bundle.

## Critical Audit Follow-up

Review feedback and red-team audit items addressed after the initial dogfood pass:

- Fixed registry env propagation for object-shaped provider configs. `loadApiProvider` already merges suite-level and provider-level env into `providerOptions.env`; the registry now passes that merged env to `OpenAICodexAppServerProvider`.
- Fixed `service_tier` to match the generated app-server schema from `codex-cli 0.118.0`: `fast` and `flex` only.
- Reset the hoisted `spawn` mock implementation in `beforeEach` to satisfy `test/AGENTS.md` mock isolation rules.
- Regenerated app-server TypeScript and JSON Schema into `/tmp/codex-app-server-schema-current.XVTCwL` during the audit and compared rare app-server fields against the implementation.
- Added coverage for schema-supported rare fields:
  - `model_reasoning_effort: none`
  - exact `personality` values: `none`, `friendly`, `pragmatic`
  - granular approval policy objects
  - app-server command approval amendment objects
  - session-scoped permission grants
  - accepted MCP elicitation responses with content and metadata
  - `base_instructions`, `developer_instructions`, and `collaboration_mode`
- Dogfood review then found additional issues:
  - reusable app-server connections stayed alive after JSON-RPC request timeouts, which could leave late side-effecting responses unmanaged
  - docs listed `thread_pool_size` as unlimited even though the implementation defaults to `1`
  - provider cleanup cleared active turns before resolving them, which could hang shutdown while a turn was in flight
  - raw JSON-RPC notifications were retained even when `include_raw_events` was false
  - spawned app-server processes could be missed if cleanup ran while `initialize` was still pending
  - concurrent persistent thread starts could temporarily exceed `thread_pool_size` and remain over capacity after active turns finished
  - deep-tracing calls that shared a `thread_id` could overlap turns because the queue key returned early
  - the OpenAI provider docs heading change would have broken the existing `#codex-sdk` anchor
  - default `thread_id` resumes skipped unsubscribe cleanup
  - sent JSON-RPC request aborts kept the reusable app-server alive
  - retryable app-server `error` notifications with `willRetry: true` were treated as terminal
  - concurrent default-cleanup `thread_id` rows could unsubscribe while another row was queued for the same thread
- Fixed these by closing/evicting connections on timeout and abort, resolving active turns during cleanup, tracking pending initialization processes, making raw event retention opt-in, rebalancing the persistent thread pool after turns finish, serializing explicit `thread_id` turns even under deep tracing, preserving the OpenAI docs `#codex-sdk` heading, default-unsubscribing non-persistent resumed threads, honoring retryable app-server errors, and deferring resumed-thread unsubscribe until no other protected queued caller remains.
- Updated docs to explain why the app-server provider should stay separate from the Codex SDK provider: the SDK is the right default for CI and automation, while app-server is for rich-client protocol event surfaces and does not attach to a running Codex Desktop app.

Latest focused verification after these fixes:

- `npx vitest run test/providers/openai-codex-app-server.test.ts --sequence.shuffle=false`: pass, 35 provider tests.
- `npm run local -- eval -c examples/openai-codex-app-server/review-diff/promptfooconfig.yaml --no-cache --no-share`: pass with `{"comments":[],"summary":"No actionable issues found in the current diff."}`.
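
The thread-pool behavior described in this section and in the progress log (refcount protection for in-use threads, a soft cap that may be exceeded temporarily, and rebalancing once turns finish) might be sketched as below. This is an illustrative TypeScript sketch only: `ThreadPool`, `checkout`, `release`, and the `unsubscribe` callback are hypothetical names, and the oldest-first eviction order is an assumption rather than something the notes specify.

```typescript
// Hypothetical pool: names and eviction order are assumptions for illustration.
class ThreadPool {
  private order: string[] = []; // cached thread ids, oldest first
  private refs = new Map<string, number>(); // in-use counts per thread id
  private softCap: number;
  private unsubscribe: (threadId: string) => void;

  constructor(softCap: number, unsubscribe: (threadId: string) => void) {
    this.softCap = softCap;
    this.unsubscribe = unsubscribe;
  }

  // Marks a thread as in use before caching it, so the new thread
  // cannot be evicted by its own insertion.
  checkout(threadId: string): void {
    this.refs.set(threadId, (this.refs.get(threadId) ?? 0) + 1);
    if (!this.order.includes(threadId)) this.order.push(threadId);
    this.rebalance();
  }

  // Called when a turn finishes; the pool then shrinks back toward the cap.
  release(threadId: string): void {
    const remaining = (this.refs.get(threadId) ?? 0) - 1;
    if (remaining > 0) this.refs.set(threadId, remaining);
    else this.refs.delete(threadId);
    this.rebalance();
  }

  size(): number {
    return this.order.length;
  }

  // Evicts unprotected threads, oldest first, while over the soft cap.
  // Protected threads are skipped, so the pool may stay over capacity.
  private rebalance(): void {
    let i = 0;
    while (this.order.length > this.softCap && i < this.order.length) {
      const id = this.order[i];
      if (this.refs.has(id)) {
        i += 1;
        continue;
      }
      this.order.splice(i, 1);
      this.unsubscribe(id); // stands in for sending thread/unsubscribe
    }
  }
}
```

With a cap of 1, checking out two threads concurrently leaves both cached until one is released, at which point the released thread is unsubscribed and the pool returns to capacity.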