Benchmarks of Radeon 780M iGPU with shared 128GB DDR5 RAM running various MoE models under llama.cpp

I've been looking for a budget system capable of running recent MoE models for basic one-shot queries. The main goal was finding something energy efficient enough to keep online 24/7 without racking up an exorbitant electricity bill.

I eventually settled on a refurbished Minisforum UM890 Pro, which at the time (September) seemed like the most cost-efficient option for my needs.

UM890 Pro

AMD Radeon™ 780M iGPU

128GB DDR5 (Crucial DDR5 RAM 128GB Kit (2x64GB) 5600MHz SODIMM CL46)

2TB M.2

Linux Mint 22.2

ROCm 7.1.1 with HSA_OVERRIDE_GFX_VERSION=11.0.0 override

llama.cpp build: b13771887 (7699)

Below are some benchmarks using various MoE models. Llama 7B is included for comparison since there's an ongoing thread gathering data for various AMD cards under ROCm: "Performance of llama.cpp on AMD ROCm (HIP)" (#15021).

I also tested various Vulkan builds, but their performance was too close to justify switching, since I'm also testing other ROCm AMD cards on this system over OCuLink.

llama-bench -ngl 99 -fa 1 -d 0,4096,8192,16384 -m [model]

model	size	params	backend	ngl	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512	514.88 ± 4.82
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128	19.27 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512 @ d4096	288.95 ± 3.71
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128 @ d4096	11.59 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512 @ d8192	183.77 ± 2.49
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128 @ d8192	8.36 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	pp512 @ d16384	100.00 ± 1.45
llama 7B Q4_0	3.56 GiB	6.74 B	ROCm	99	1	tg128 @ d16384	5.49 ± 0.00

model	size	params	backend	ngl	fa	test	t/s
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	pp512	575.41 ± 8.62
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	tg128	28.34 ± 0.01
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	pp512 @ d4096	390.27 ± 5.73
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	tg128 @ d4096	16.25 ± 0.01
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	pp512 @ d8192	303.25 ± 4.06
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	tg128 @ d8192	10.09 ± 0.00
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	pp512 @ d16384	210.54 ± 2.23
gpt-oss 20B MXFP4 MoE	11.27 GiB	20.91 B	ROCm	99	1	tg128 @ d16384	6.11 ± 0.00

model	size	params	backend	ngl	fa	test	t/s
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	pp512	217.08 ± 3.58
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	tg128	20.14 ± 0.01
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	pp512 @ d4096	174.96 ± 3.57
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	tg128 @ d4096	11.22 ± 0.00
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	pp512 @ d8192	143.78 ± 1.36
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	tg128 @ d8192	6.88 ± 0.00
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	pp512 @ d16384	109.48 ± 1.07
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	ROCm	99	1	tg128 @ d16384	4.13 ± 0.00

model	size	params	backend	ngl	fa	test	t/s
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	pp512	265.07 ± 3.95
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	tg128	25.83 ± 0.00
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	pp512 @ d4096	168.86 ± 1.58
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	tg128 @ d4096	6.01 ± 0.00
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	pp512 @ d8192	124.47 ± 0.68
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	tg128 @ d8192	3.41 ± 0.00
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	pp512 @ d16384	81.27 ± 0.46
qwen3vlmoe 30B.A3B Q6_K	23.36 GiB	30.53 B	ROCm	99	1	tg128 @ d16384	2.10 ± 0.00

model	size	params	backend	ngl	fa	test	t/s
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	pp512	138.44 ± 1.52
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	tg128	12.45 ± 0.00
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	pp512 @ d4096	131.49 ± 1.24
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	tg128 @ d4096	10.46 ± 0.00
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	pp512 @ d8192	122.66 ± 1.85
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	tg128 @ d8192	8.80 ± 0.00
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	pp512 @ d16384	107.32 ± 1.59
qwen3next 80B.A3B Q6_K	63.67 GiB	79.67 B	ROCm	99	1	tg128 @ d16384	6.73 ± 0.00
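
To make the effect of context depth easier to compare across models, here is a minimal Python sketch that computes how much of the depth-0 generation speed each model keeps at larger depths. The tg128 values are copied from the tables above; the names `TG128` and `retained` are only illustrative and not part of llama-bench.

```python
# tg128 results in tokens/s copied from the tables above, keyed by context depth.
TG128 = {
    "llama 7B Q4_0": {0: 19.27, 4096: 11.59, 8192: 8.36, 16384: 5.49},
    "gpt-oss 20B MXFP4 MoE": {0: 28.34, 4096: 16.25, 8192: 10.09, 16384: 6.11},
    "gpt-oss 120B MXFP4 MoE": {0: 20.14, 4096: 11.22, 8192: 6.88, 16384: 4.13},
    "qwen3vlmoe 30B.A3B Q6_K": {0: 25.83, 4096: 6.01, 8192: 3.41, 16384: 2.10},
    "qwen3next 80B.A3B Q6_K": {0: 12.45, 4096: 10.46, 8192: 8.80, 16384: 6.73},
}


def retained(speeds, depth):
    """Fraction of the depth-0 speed that is still available at `depth`."""
    return speeds[depth] / speeds[0]


for model, speeds in TG128.items():
    # Skip depth 0 itself, since it is the baseline.
    parts = [f"d{d}: {retained(speeds, d):.0%}" for d in sorted(speeds) if d]
    print(f"{model:<26} " + ", ".join(parts))
```

On these numbers, qwen3next keeps roughly half of its depth-0 generation speed at a depth of 16384, while qwen3vlmoe drops to under a tenth.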

So, am I satisfied with the system? Yes, it performs about as well as I was hoping. Power draw is 10-13 W at idle with gpt-oss 120B loaded, and inference brings that up to around 75 W. As an added bonus, the system is so quiet that I had to check that the fan was actually running the first time I started it.
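
For anyone wanting to translate the idle and inference power figures into a yearly bill, a rough Python sketch follows. Only the 10-13 W idle and 75 W inference figures come from the measurements above; the electricity price and the hours of inference per day are hypothetical placeholders to replace with your own values.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def annual_kwh(idle_watts, load_watts=0.0, load_hours_per_day=0.0):
    """Energy per year in kWh for a box that idles except for some hours under load."""
    idle_hours = HOURS_PER_DAY - load_hours_per_day
    daily_wh = idle_watts * idle_hours + load_watts * load_hours_per_day
    return daily_wh * DAYS_PER_YEAR / 1000


PRICE_PER_KWH = 0.30  # hypothetical price per kWh, in your local currency
LOAD_HOURS = 1.0      # hypothetical hours of inference per day

for idle in (10, 13):  # measured idle range in watts
    kwh = annual_kwh(idle, load_watts=75, load_hours_per_day=LOAD_HOURS)
    print(f"idle {idle} W: {kwh:.1f} kWh/year, about {kwh * PRICE_PER_KWH:.2f} per year")
```

With these placeholder inputs the yearly total comes to roughly 111 to 137 kWh, depending on where in the idle range the system sits.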

The shared memory means it's possible to run Q8+ quants of many models and keep the cache at f16+ for higher-quality output. Having roughly 120GB available also allows more than one model to be loaded at once; personally, I've been running Qwen3-VL-30B-A3B-Instruct as a visual assistant for gpt-oss 120B. I found this combo very handy for transcribing handwritten letters for translation.

Token generation isn't stellar, as expected for a dual-channel system, but it's acceptable for MoE one-shots, and this is a secondary system that can chug along while I do something else. There's also the option of using one of the two M.2 slots for an OCuLink eGPU to increase performance.
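
The dual-channel limit can be put into numbers with a back-of-the-envelope Python sketch. It assumes the RAM actually runs at its rated 5600 MT/s over two 64-bit channels (not verified on this machine), and that generating one token on a dense model reads every weight once, which ignores the KV cache and any caching effects. The function names are illustrative only.

```python
GIB = 2**30


def peak_bandwidth(mt_per_s=5600e6, channels=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in bytes/s (assumed DDR5-5600, dual channel)."""
    return mt_per_s * channels * bytes_per_transfer


def dense_tg_ceiling(model_gib, bandwidth):
    """Upper bound on tokens/s if every token has to read the whole model once."""
    return bandwidth / (model_gib * GIB)


def implied_gib_per_token(tokens_per_s, bandwidth):
    """Most data that can be read per token at a given measured speed."""
    return bandwidth / tokens_per_s / GIB


bw = peak_bandwidth()
ceiling = dense_tg_ceiling(3.56, bw)  # llama 7B Q4_0 size from the table
measured = 19.27                      # llama 7B Q4_0 tg128 at depth 0

print(f"peak bandwidth: {bw / 1e9:.1f} GB/s")
print(f"llama 7B ceiling: {ceiling:.1f} t/s, measured share: {measured / ceiling:.0%}")
print(f"gpt-oss 120B reads at most {implied_gib_per_token(20.14, bw):.2f} GiB per token")
```

Under these assumptions the dense 7B model already reaches about 82% of the theoretical ceiling, which suggests generation speed here is limited mainly by memory bandwidth. The same bound shows why the MoE models do comparatively well: gpt-oss 120B at 20.14 t/s cannot be reading more than about 4.1 GiB of its 59.02 GiB per token.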

Another perk is portability: at 130 x 126 x 52.3 mm, it fits easily into a backpack or suitcase.

So, do I recommend this system? Unfortunately no, and that's solely due to the current prices of RAM and other hardware. I suspect assembling the system today would cost at least three times as much, making the price/performance ratio considerably less appealing.

Disclaimer: I'm not an experienced Linux user so there's likely some performance left on the table.