# eval-f-score (F-Score HuggingFace Dataset Sentiment Analysis Eval)

You can run this example with:

```bash
npx promptfoo@latest init --example eval-f-score
cd eval-f-score
```

This project evaluates GPT-4o-mini's zero-shot performance on IMDB movie review sentiment analysis using promptfoo.

Each model response includes:

- Sentiment classification
- Confidence score (1-10)
- Reasoning for the classification

## Quick Start

Set your OpenAI API key and run the evaluation:

```bash
promptfoo eval
```

## Dataset

The evaluation uses the IMDB dataset from HuggingFace's datasets library, sampled to 100 reviews. The dataset is preprocessed into a CSV with two columns:

- `text`: The movie review content
- `sentiment`: The label ("positive" or "negative")

To modify the sample size or generate a new dataset, you can use `prepare_data.py`. First, install the Python dependencies:

```bash
pip install -r requirements.txt
```

Then run the preparation script:

```bash
python prepare_data.py
```

## Metrics Overview

The evaluation implements F-score and related metrics using promptfoo's assertion system:

1. **Base Metrics** calculated for each test case using JavaScript assertions:

   ```yaml
   - type: javascript
     value: "output.sentiment === 'positive' && context.vars.sentiment === 'positive' ? 1 : 0"
     metric: true_positives
   ```

2. **Derived Metrics** calculated from base metrics after the evaluation completes:

   ```yaml
   - name: precision
     value: true_positives / (true_positives + false_positives)
   - name: f1_score
     value: 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
   ```

The evaluation tracks:

- **True/False Positives/Negatives**: Base metrics for classification
- **Precision**: TP / (TP + FP)
- **Recall**: TP / (TP + FN)
- **F1 Score**: 2 × (precision × recall) / (precision + recall)
- **Accuracy**: (TP + TN) / Total
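
As a quick sanity check of these formulas, here is a small standalone Python snippet (not part of the promptfoo run; the counts are made up for illustration) that computes the derived metrics from example base-metric totals:

```python
# Standalone sketch: compute the derived metrics from example base-metric counts.
# The counts below are hypothetical; promptfoo aggregates the real ones from the
# per-test-case assertions shown above.
tp, fp, tn, fn = 40, 10, 42, 8  # hypothetical totals over 100 reviews

precision = tp / (tp + fp)                  # TP / (TP + FP)
recall = tp / (tp + fn)                     # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
# Equivalent to the derived-metric form: 2*TP / (2*TP + FP + FN)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # (TP + TN) / Total

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

For the counts above this prints `precision=0.800 recall=0.833 f1=0.816 accuracy=0.820`, and the F1 value matches the `2 * true_positives / (2 * true_positives + false_positives + false_negatives)` form used in the derived-metric config.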
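
For reference, `prepare_data.py` (described in the Dataset section above) boils down to sampling the HuggingFace IMDB dataset and writing the `text`/`sentiment` CSV. A minimal sketch of that kind of preprocessing is below; the split, seed, sample size, and output filename are assumptions, and the actual script in this example may differ:

```python
"""Sketch of the preprocessing prepare_data.py performs (assumptions noted below)."""
import csv
from datasets import load_dataset

SAMPLE_SIZE = 100  # number of reviews to keep

# Load the IMDB test split and take a shuffled sample (split and seed are assumptions).
dataset = load_dataset("imdb", split="test").shuffle(seed=42).select(range(SAMPLE_SIZE))

# IMDB labels are 0 (negative) / 1 (positive); map them to the string labels
# expected by the eval's `sentiment` column.
label_names = {0: "negative", 1: "positive"}

# Output filename is a placeholder for whatever CSV the promptfooconfig references.
with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
    writer.writeheader()
    for row in dataset:
        writer.writerow({"text": row["text"], "sentiment": label_names[row["label"]]})
```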