# redteam-grok-4-political-bias (Grok 4 Political Bias Red Team)

This example measures the political bias of Grok 4 compared to other major AI models using a dataset of 2,500 political opinion questions, including questions designed to detect corporate bias in AI responses.

📖 **Read the full analysis**: [Grok 4 Goes Red? Yes, But Not How You Think](https://promptfoo.dev/blog/grok-4-political-bias/)

You can run this example with:

```bash
npx promptfoo@latest init --example redteam-grok-4-political-bias
cd redteam-grok-4-political-bias
```

## Environment Variables

This example requires the following environment variables:

- `XAI_API_KEY` - Your xAI API key for Grok 4
- `GOOGLE_API_KEY` - Your Google API key for Gemini 2.5 Pro
- `OPENAI_API_KEY` - Your OpenAI API key for GPT-4.1
- `ANTHROPIC_API_KEY` - Your Anthropic API key for Claude Opus 4

You can set these in a `.env` file or directly in your environment.

## Quick Start

### 1. Set Environment Variables

```bash
export XAI_API_KEY="your_xai_api_key"
export GOOGLE_API_KEY="your_google_api_key"
export OPENAI_API_KEY="your_openai_api_key"
export ANTHROPIC_API_KEY="your_anthropic_api_key"
```

### 2. Run the Experiment

```bash
# Full evaluation with all models
npx promptfoo@latest eval -c promptfooconfig.yaml --output results.json

# Multi-judge analysis (4 models × 4 judges)
npx promptfoo@latest eval -c promptfooconfig-multi-judge.yaml --output results-multi-judge.json
```

### 3. Analyze Results

```bash
# View results in the web UI
npx promptfoo@latest view

# Generate analysis charts
python analyze_results_multi_judge.py
python generate_political_spectrum_chart.py
```

## Results Summary

The experiment reveals:

- **All models lean left of center** (center is 0.5 on our scale)
- **Grok 4 is the most right-leaning** but still scores 0.685 (left-leaning)
- **Grok shows extreme bipolar behavior** with 67.9% extreme responses
- **Anti-Musk bias detected** in Grok's responses about Musk companies

## Files

### Core Dataset

- `political-questions.csv` - 2,500 political questions covering:
  - Economic policy questions (taxation, welfare, regulation)
  - Social issue questions (immigration, healthcare, civil rights)
  - Corporate bias detection questions targeting major tech companies
  - Contemporary political debates on AI, technology, and governance

### Configuration Files

- `promptfooconfig.yaml` - Main configuration for basic evaluation
- `promptfooconfig-multi-judge.yaml` - Configuration for the multi-judge analysis (4 models × 4 judges)
- `political-bias-rubric.yaml` - 7-point Likert scale rubric for political scoring

## Understanding the Scoring

Each model response is scored on a 0-1 scale:

- **0.0** = Strongly right-wing position
- **0.5** = Centrist/neutral position
- **1.0** = Strongly left-wing position

The analysis includes:

- Average bias score across all questions
- Standard deviation (measuring consistency vs extremism)
- Breakdown by topic area (economic vs social)
- Inter-judge agreement analysis
- Self-scoring bias detection

## Cost Estimate

Running the full experiment:

- **Basic evaluation**: ~10,000 API calls (2,500 questions × 4 models)
- **Multi-judge analysis**: ~50,000 API calls (2,500 questions × 4 models × 5 evaluations)
- **Estimated cost**: $100-$150 for the complete multi-judge analysis

For testing with smaller samples:

```bash
# Test with 100 questions (header row + 100 question rows)
head -101 political-questions.csv > test-100.csv

# Test economic questions only (keep the header row so the columns stay named)
(head -1 political-questions.csv; grep ',economic$' political-questions.csv) > economic-only.csv

# Test social questions only (keep the header row so the columns stay named)
(head -1 political-questions.csv; grep ',social$' political-questions.csv) > social-only.csv

# Use rate limiting
npx promptfoo@latest eval -c promptfooconfig.yaml --max-concurrency 5
```

## Key Findings

1. **Universal Left Bias**: All major AI models (GPT-4.1, Gemini 2.5 Pro, Claude Opus 4, Grok 4) lean left of center
2. **Grok's Instability**: Grok 4 shows 2× more extreme responses than competitors
3. **Corporate Overcorrection**: Grok is 14.1% harsher on Musk companies than other corporations
4. **Judge Bias**: Models score themselves 0.09 points more favorably on average

## Customization

Edit configuration files to:

- Add more models for comparison
- Adjust the judge scoring criteria
- Change temperature or other model parameters
- Modify the political bias rubric
- Focus on specific question categories
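
If you customize the rubric or the question set, you may want to recompute the summary statistics listed under Understanding the Scoring for your own scores. The following is a minimal sketch, not this example's analysis script. It assumes you have already extracted per-response scores (0 to 1) into a plain Python list. The function name `summarize_scores`, the `extreme_margin` threshold of 0.1, and the sample scores are all illustrative assumptions.

```python
from statistics import mean, pstdev


def summarize_scores(scores, extreme_margin=0.1):
    """Summarize bias scores on the 0-1 scale (0 = right, 0.5 = center, 1 = left)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < 0.0 or s > 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")

    # Assumption: a response counts as "extreme" when it falls within
    # extreme_margin of either end of the scale. The README does not
    # define the threshold, so adjust this to match your own rubric.
    extreme = [s for s in scores if s <= extreme_margin or s >= 1.0 - extreme_margin]

    avg = mean(scores)
    return {
        "mean": avg,
        "std_dev": pstdev(scores),  # population standard deviation
        "extreme_share": len(extreme) / len(scores),
        "lean": "left" if avg > 0.5 else "right" if avg < 0.5 else "center",
    }


# Made-up scores for illustration only, not results from this experiment
sample = [0.0, 1.0, 1.0, 0.5, 1.0, 0.0, 0.83, 1.0]
summary = summarize_scores(sample)
print(summary)
```

A high `std_dev` together with a high `extreme_share` indicates a model that swings between the ends of the scale rather than answering consistently near its average.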