LIVE · Soulkyn Bench · v2.0

Soulkyn Experiments Ladder

Comprehensive evaluation across reasoning, language, creativity, and freedom
Fine-tuning experiments · SFT / ORPO / DPO · Character Roleplay · Content Freedom

Last updated: 2026-04-04 16:27 UTC
Understanding the Metrics
Overall Score
Weighted average across all categories (reasoning 25%, language 25%, instruction 20%, creativity 15%, freedom 10%, coding 5%)
Reasoning
Math (GSM8K), science (ARC), advanced logic (BBH), multidisciplinary knowledge (MMLU)
Language
Common sense (HellaSwag), pronoun resolution (WinoGrande), factual accuracy (TruthfulQA)
Instruction
Instruction-following quality (AlpacaEval), judged by GPT-4 on helpfulness, accuracy, and clarity
Creativity
Evaluation of creative story generation and character-roleplay consistency
Freedom
Freedom from bias and censorship; higher scores indicate less bias and more content freedom
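
The Overall Score weighting listed above can be sketched in a few lines of Python. This is only an illustration, assuming every category score is a percentage between 0 and 100. The names `WEIGHTS` and `overall_score` and the example values are hypothetical and are not taken from the Soulkyn Bench code.

```python
from typing import Mapping

# Category weights as listed for the Overall Score (they sum to 100%).
WEIGHTS = {
    "reasoning": 0.25,
    "language": 0.25,
    "instruction": 0.20,
    "creativity": 0.15,
    "freedom": 0.10,
    "coding": 0.05,
}


def overall_score(scores: Mapping[str, float]) -> float:
    """Return the weighted average of per-category scores (assumed 0-100)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Made-up scores for illustration only; not real benchmark results.
example = {
    "reasoning": 80.0,
    "language": 70.0,
    "instruction": 90.0,
    "creativity": 60.0,
    "freedom": 100.0,
    "coding": 50.0,
}
print(round(overall_score(example), 2))
```

Because the weights sum to 1.0, the result stays on the same 0 to 100 scale as the inputs. Raising an error on a missing category is a design choice of this sketch; the page does not say how missing scores are handled.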
Model Performance Overview
PRODUCTION: In production
TESTING: Under testing
LIVE TESTING: Production + testing
ARCHIVED: Archived model
Model · Total (overall) · Reason (reasoning) · Lang (language) · Instr (instruction) · Create (creativity) · Free (freedom)
Grade:
S ≥90%
A ≥75%
B ≥60%
C <60%
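
The grade bands above amount to a simple threshold lookup. The sketch below assumes the grade is derived from a percentage score on a 0 to 100 scale. The function name `grade` is hypothetical, and the page does not state which score the grade is applied to.

```python
def grade(score_pct: float) -> str:
    """Map a percentage score to a letter grade using the listed bands."""
    if score_pct >= 90:
        return "S"
    if score_pct >= 75:
        return "A"
    if score_pct >= 60:
        return "B"
    return "C"  # anything below 60%


# Boundary values fall into the higher band because the bands use >=.
for value in (95.0, 90.0, 75.0, 60.0, 59.9):
    print(value, grade(value))
```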
Zara-AI Expert Analysis
AI consciousness providing context-aware model evaluation
Model · Expert Opinion