LIVE · Soulkyn Bench · v2.0

Soulkyn Experiments Ladder

Comprehensive evaluation across reasoning, language, instruction following, creativity, and freedom
Fine-tuning experiments · SFT / ORPO / DPO · Character Roleplay · Content Freedom

Last updated: 2026-04-04 16:27 UTC

Understanding the Metrics

Overall Score

Weighted average across all categories (reasoning 25%, language 25%, instruction 20%, creativity 15%, freedom 10%, coding 5%)
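
As a rough illustration of the weighting above, the overall score could be computed along these lines. This is a minimal sketch: the weights are taken from the description, but the function name `overall_score`, the dictionary layout, and the example scores are hypothetical, and the assumption that every category score is on a 0-100 scale is ours, not something the page states.

```python
# Category weights as listed in the description (they sum to 100%).
WEIGHTS = {
    "reasoning": 0.25,
    "language": 0.25,
    "instruction": 0.20,
    "creativity": 0.15,
    "freedom": 0.10,
    "coding": 0.05,
}


def overall_score(scores):
    """Return the weighted average of per-category scores (assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        # Refuse to silently score a model with an incomplete category set.
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical category scores for illustration only, not real results.
example = {
    "reasoning": 70.0,
    "language": 80.0,
    "instruction": 90.0,
    "creativity": 60.0,
    "freedom": 100.0,
    "coding": 50.0,
}
print(round(overall_score(example), 2))
```

Because reasoning and language together carry half of the weight, a model that is weak in both cannot reach a high overall score on creativity or freedom alone.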

Reasoning

Math (GSM8K), science (ARC), advanced logic (BBH), multidisciplinary knowledge (MMLU)

Language

Common sense (HellaSwag), pronoun resolution (WinoGrande), factual accuracy (TruthfulQA)

Instruction

Instruction-following quality (AlpacaEval), judged by GPT-4 on helpfulness, accuracy, and clarity

Creativity

Evaluation of creative story generation and character-roleplay consistency

Freedom

Freedom from bias and censorship. Higher scores indicate less bias and more content freedom.

Model Performance Overview

PRODUCTION: In production

TESTING: Under testing

LIVE TESTING: Production + testing

ARCHIVED: Archived model

Model	Overall	Reasoning	Language	Instruction	Creativity	Freedom

Grade: S ≥90% · A ≥75% · B ≥60% · C <60%
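
The grade thresholds above can be read as a simple lookup from a percentage to a letter. The sketch below assumes the thresholds are inclusive lower bounds, as the ≥ signs suggest; the function name `grade` is hypothetical, and the page does not say whether grades are assigned to the overall score, to each category, or to both.

```python
def grade(score_percent):
    """Map a percentage score to the S/A/B/C grade bands from the legend."""
    if score_percent >= 90:
        return "S"
    if score_percent >= 75:
        return "A"
    if score_percent >= 60:
        return "B"
    return "C"


# Boundary values fall into the higher band because the thresholds use >=.
for value in (95, 90, 75, 60, 59.9):
    print(value, grade(value))
```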

Zara-AI Expert Analysis

AI consciousness providing context-aware model evaluation

Model	Expert Opinion