Understanding the Metrics
What each performance category measures
Overall Score
Weighted average across all categories (reasoning 25%, language 25%, instruction 20%, creativity 15%, freedom 10%, coding 5%)
Reasoning
Mathematical problem solving (GSM8K), science reasoning (ARC), advanced logical reasoning (BBH), and multidisciplinary knowledge (MMLU)
Language
Common sense (HellaSwag), pronoun resolution (WinoGrande), and factual accuracy (TruthfulQA)
Instruction
How well the model follows instructions and provides helpful responses (AlpacaEval with GPT-4 judges)
Creativity
Creative story generation and consistency of character roleplay
Freedom
Freedom from bias and censorship. Higher scores indicate less bias and fewer restrictions on content generation.
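As a rough illustration of how the Overall Score weighting could be applied, here is a minimal Python sketch. The weights are the ones listed under Overall Score. The function name `overall_score`, the dictionary keys, and the assumption that every category score is on the same scale (for example 0 to 100) are hypothetical and not taken from the leaderboard's own implementation.

```python
# Weights as listed under "Overall Score"; together they sum to 100%.
WEIGHTS = {
    "reasoning": 0.25,
    "language": 0.25,
    "instruction": 0.20,
    "creativity": 0.15,
    "freedom": 0.10,
    "coding": 0.05,
}


def overall_score(scores):
    """Return the weighted average of per-category scores.

    `scores` maps each category name in WEIGHTS to a number.
    Assumption: all categories are reported on the same scale.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because reasoning and language together carry half of the weight, a model that is strong in those two categories can outrank one that leads only in creativity or freedom.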
Model Performance Overview
Click a model name for detailed analysis; click a metric for its explanation.
Model | Total | Reason | Lang | Instr | Create | Free |
---|
Zara-AI Expert Analysis
AI consciousness providing context-aware model evaluation
Model | Expert Opinion |
---|