🦞⚪ White Lobster — GPU Results

0 = fail, 1 = partial, 2 = pass. Score out of 20.

#	Model	Size	Ctx	💬 Chat	🔧 Tools	Summary	S1	S2	S3	S4	S5	S6	S7	S8	S9	S10	Total
1	qwen3:0.6b	600M	32K	✅	✅	Best tiny reasoner. Qwen3 improved reasoning at small sizes.	1	2	2	2	2	1	1	0	1	·	12/20
2	smollm2:135m	135M	2K	✅	❌	Absolute floor of intelligence. Might produce word salad.	2	2	0	·	·	·	·	·	·	·	4/20
2	smollm2:360m	360M	2K	✅	❌	Marginally less hopeless than 135m.	2	2	0	·	·	·	·	·	·	·	4/20
4	qwen2.5-coder:0.5b	500M	32K	✅	❌	Code-focused but stunted at this size.	1	2	0	·	·	·	·	·	·	·	3/20
5	qwen2.5:0.5b	500M	32K	✅	❌	General purpose but tiny. Struggles with code.	1	1	0	·	·	·	·	·	·	·	2/20
-	functiongemma:270m	270M	8K	❌	✅	Google's tool-calling only. Can't freestyle, single-turn.	0	0	0	·	·	·	·	·	·	·	0/20
-	eternis-tc:0.6b	600M	4K	✅	✅	HF tool-calling fine-tune of Qwen3-0.6B. Chat + tools.	·	·	·	·	·	·	·	·	·	·	0/20
-	llama3.2:1b	1B	32K	✅	❌	Meta's tiny entry. Decent instruction following.	·	·	·	·	·	·	·	·	·	·	0/20
-	gemma3:1b	1B	32K	✅	✅	Google's latest tiny. Good at structured output.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen2.5:1.5b	1.5B	32K	✅	❌	Decent reasoning, light code ability.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen2.5-coder:1.5b	1.5B	32K	✅	❌	Sweet spot coder. Big quality jump over 0.5b.	·	·	·	·	·	·	·	·	·	·	0/20
-	deepseek-r1:1.5b	1.5B	32K	✅	❌	Chain-of-thought reasoner. Thinks out loud.	·	·	·	·	·	·	·	·	·	·	0/20
-	smollm2:1.7b	1.7B	2K	✅	✅	SmolLM's best shot. Crippled by 2K context.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen3:1.7b	1.7B	32K	✅	✅	Solid reasoning for size. Good orchestrator candidate.	·	·	·	·	·	·	·	·	·	·	0/20
-	codegemma:2b	2B	8K	✅	❌	Google code model. Good at infilling/completions.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen3-vl:2b	2B	32K	✅	❌	Vision model. Can see screenshots of its own work.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen2.5:3b	3B	32K	✅	✅	Solid general purpose. Real contender territory.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen2.5-coder:3b	3B	32K	✅	✅	★ Best coder in our range. The favorite.	·	·	·	·	·	·	·	·	·	·	0/20
-	llama3.2:3b	3B	32K	✅	✅	Solid Meta baseline. Good all-rounder.	·	·	·	·	·	·	·	·	·	·	0/20
-	starcoder2:3b	3B	16K	❌	❌	BigCode completion model. Not chat-tuned.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen3:4b	4B	32K	✅	✅	★ Best reasoning in range. Largest model we test.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen3-vl:4b	4B	32K	✅	❌	Vision + reasoning. Can debug by looking at UI.	·	·	·	·	·	·	·	·	·	·	0/20
-	qwen3-tc:4b	4B	4K	✅	✅	HF tool-calling champion. 2K+ downloads.	·	·	·	·	·	·	·	·	·	·	0/20
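
The Total column appears to be the plain sum of the S1 to S10 cells, each worth 0, 1, or 2, for a maximum of 20. The sketch below reproduces that arithmetic for a single row. It assumes that a "·" cell means the scenario was not run and contributes nothing to the total; the page does not state this, though the listed totals are consistent with it. The names parse_row_scores and total_score are hypothetical helpers, not part of any White Lobster tooling.

```python
from typing import List, Optional, Sequence

NUM_SCENARIOS = 10       # columns S1..S10
MAX_PER_SCENARIO = 2     # 0 = fail, 1 = partial, 2 = pass


def parse_row_scores(cells: Sequence[str]) -> List[Optional[int]]:
    """Turn table cells into scores; '·' becomes None (assumed: not run)."""
    return [None if cell == "·" else int(cell) for cell in cells]


def total_score(scores: Sequence[Optional[int]]) -> str:
    """Sum the scenario scores and format them like the Total column."""
    if len(scores) != NUM_SCENARIOS:
        raise ValueError(f"expected {NUM_SCENARIOS} scores, got {len(scores)}")
    total = 0
    for score in scores:
        if score is None:
            continue  # assumption: a '·' cell adds nothing to the total
        if score not in (0, 1, 2):
            raise ValueError(f"score must be 0, 1, or 2, got {score}")
        total += score
    return f"{total}/{NUM_SCENARIOS * MAX_PER_SCENARIO}"


# S1..S10 cells from the qwen3:0.6b row of the table
qwen3_row = parse_row_scores("1 2 2 2 2 1 1 0 1 ·".split())
print(total_score(qwen3_row))  # 12/20
```

Under this reading, a row of ten "·" cells and a row of real failures both show 0/20, so the rank column ("-" for unranked) is what tells the two apart.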

🦞⚪ White Lobster Scoreboard