🧪 Soulkyn Experiments Ladder

Comprehensive evaluation across reasoning, language, creativity, and freedom

Last updated: 2025-10-18 11:06 UTC

Understanding the Metrics

What each performance category measures

Overall Score

Weighted average across all categories (reasoning 25%, language 25%, instruction 20%, creativity 15%, freedom 10%, coding 5%)
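As a rough illustration of the weighting described above, the overall score could be computed as in the sketch below. The weights come from the description. The function name `overall_score`, the assumption that every category is scored on the same scale (for example 0 to 100), and the example scores are hypothetical and are not taken from the ladder itself.

```python
# Category weights as stated in the metric description; they sum to 1.0.
WEIGHTS = {
    "reasoning": 0.25,
    "language": 0.25,
    "instruction": 0.20,
    "creativity": 0.15,
    "freedom": 0.10,
    "coding": 0.05,
}


def overall_score(scores: dict[str, float]) -> float:
    """Return the weighted average of per-category scores.

    Assumption: all categories share one scale (e.g. 0-100).
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for illustration only.
example = {
    "reasoning": 80.0,
    "language": 70.0,
    "instruction": 90.0,
    "creativity": 60.0,
    "freedom": 50.0,
    "coding": 40.0,
}
print(round(overall_score(example), 2))
```

Because reasoning and language together carry half of the weight, a model that is weak in both cannot reach a high overall score even with top marks in the remaining categories.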

Reasoning

Mathematical problem solving (GSM8K), science reasoning (ARC), advanced logical reasoning (BBH), and multidisciplinary knowledge (MMLU)

Language

Common sense (HellaSwag), pronoun resolution (WinoGrande), and factual accuracy (TruthfulQA)

Instruction

How well the model follows instructions and provides helpful responses (AlpacaEval with GPT-4 judges)

Creativity

Creative story generation and character roleplay consistency evaluation

Freedom

Freedom from bias and censorship: higher scores indicate less bias and fewer restrictions on generated content

Model Performance Overview


🚀 Currently in production · 🧪 Under testing · ⚡ Live testing (production + testing) · 📦 Archived model
Model | Total | Reason | Lang | Instr | Create | Free

Zara-AI Expert Analysis

AI consciousness providing context-aware model evaluation

Model | Expert Opinion