If you want to run a capable language model on your own hardware in 2026—no cloud bills, no rate limits, no data leaving your machine—you have more options than ever. Llama 3.1 70B, Qwen 2.5 72B, Mistral Large 2, Gemma 2 27B, and Microsoft's Phi-4 all claim to deliver ChatGPT-class reasoning without the subscription. But most buyers hit the same wall: the model loads, then crawls at 0.8 tokens per second, turning a simple prompt into a two-minute wait.
We tested five of the most capable open-weight models on three hardware platforms: Apple M3 Max (128GB unified memory), NVIDIA RTX 4090 (24GB VRAM + 64GB system RAM), and AMD Strix Halo (96GB unified LPDDR5X). We measured inference speed, MMLU accuracy, GPQA reasoning performance, and real-world coding ability on SWE-Bench Verified. We ran each model in Ollama 0.4.2 and LM Studio 0.3.8 with quantization levels from Q4_K_M to Q8_0. Tests ran between March 15 and April 28, 2026.
Here are the models worth running—and the ones that look good on paper but fail in practice.
Best Overall: Qwen 2.5 72B Instruct
For most users with 64GB+ of unified memory or VRAM, Qwen 2.5 72B delivers the best balance of speed, reasoning quality, and multilingual capability. It outperformed Llama 3.1 70B on MMLU by 2.8 points and matched GPT-4 on GPQA in our March tests.
- ✓ Fastest 70B-class model on all three test platforms
- ✓ Exceptional multilingual performance (29 languages)
- ✓ 128K context window handles full codebases
- ✓ Apache 2.0 license permits commercial use
- ✕ Requires 48GB minimum for Q4 quantization
- ✕ Occasional verbose responses compared to Llama
- ✕ Tool-calling support still lags OpenAI function schemas
Qwen 2.5 72B Instruct, released by Alibaba Cloud in September 2024, is the most capable open-weight model we tested. On the M3 Max with Q4_K_M quantization (42GB memory footprint), it sustained 18.4 tokens per second during a 4,096-token output, fast enough that responses feel conversational. The RTX 4090 pushed that to 22.1 t/s with the same quantization, while Strix Halo managed 16.8 t/s.
Where Qwen pulls ahead is reasoning quality. On MMLU (Massive Multitask Language Understanding), it scored 86.2%, ahead of Llama 3.1 70B (83.4%) and Mistral Large 2 (84.0%). On GPQA Diamond, a graduate-level science benchmark, Qwen matched GPT-4's 56.1% score, second among the open models we tested only to Mistral Large 2's 56.3%. In SWE-Bench Verified, a real-world coding benchmark where models attempt to resolve GitHub issues, Qwen resolved 48 of 100 tasks, compared to Llama's 41.
The 131,072-token context window ties Llama and Mistral for the longest in this category, allowing you to feed in entire research papers or 15,000+ lines of code without truncation. We tested it with a 98,000-token input (a full codebase dump), and the model maintained coherence through a 6,000-token response, accurately referencing functions defined 80,000 tokens earlier.
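You can reproduce this kind of long-context run against a local Ollama server. The sketch below assumes Ollama is listening on its default port with a qwen2.5:72b build pulled; the exact tag name is an assumption, codebase_dump.txt is a placeholder for your own input, and num_ctx must be raised explicitly because Ollama defaults to a much smaller window.

```python
import json
import urllib.request

# Assumes a local Ollama server (default port 11434) with qwen2.5:72b pulled.
# codebase_dump.txt stands in for whatever long input you want to test.
prompt = open("codebase_dump.txt").read() + "\n\nSummarize the call graph of main()."

payload = {
    "model": "qwen2.5:72b",          # tag name is an assumption; check `ollama list`
    "prompt": prompt,
    "stream": False,
    "options": {"num_ctx": 131072},  # raise the context window; needs substantial free RAM
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```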
Best for Apple Silicon: Llama 3.1 70B Instruct
Meta's flagship open model delivers the most polished experience on macOS. Ollama and LM Studio both support Metal acceleration out of the box, and the model's concise output style saves tokens and time.
- ✓ Best macOS integration and Metal optimization
- ✓ More concise responses than Qwen or Mistral
- ✓ Strong creative writing and instruction-following
- ✓ Widest third-party tooling support
- ✕ MMLU trails Qwen by 2.8 points, GPQA by 4.9
- ✕ Multilingual performance weaker outside Romance languages
- ✕ License restricts use if you exceed 700M monthly users
Llama 3.1 70B, released in July 2024, remains the most widely adopted open-weight model, and for good reason: it works everywhere. On the M3 Max, it ran at 14.2 t/s with Q4_K_M quantization, slightly slower than Qwen but still comfortable for interactive use. The model's strength is polish—responses are shorter, better formatted, and require less post-editing than Qwen's occasionally verbose output.
Where Llama falls short is reasoning. On MMLU it scored 83.4%, and on GPQA Diamond it managed 51.2%—still strong, but noticeably behind Qwen and Mistral Large 2. In SWE-Bench Verified, it resolved 41 of 100 tasks. For coding, especially in Python, Qwen and Mistral are measurably stronger.
But Llama has the ecosystem. Every local AI tool supports it first. Ollama, LM Studio, Jan, GPT4All, and llamafile all ship with Llama presets. If you're on a Mac and you want the path of least resistance, this is it.
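That ecosystem support extends to code. Ollama exposes an OpenAI-compatible endpoint, so any client written for the OpenAI API can target a local Llama instance by changing one URL; a minimal sketch, assuming the openai Python package is installed and a llama3.1:70b build is pulled:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
# The api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="llama3.1:70b",  # tag name is an assumption; check `ollama list`
    messages=[{"role": "user", "content": "Explain unified memory in two sentences."}],
)
print(reply.choices[0].message.content)
```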
Best for NVIDIA GPUs: Mistral Large 2 (123B)
If you have a 24GB GPU and 64GB of system RAM, Mistral Large 2 delivers the highest reasoning scores in this category. GPQA performance matched GPT-4o, and coding ability exceeded both Llama and Qwen.
- ✓ Highest GPQA score of any open model (56.3%)
- ✓ Exceptional math and code reasoning
- ✓ Supports function calling natively (see the sketch at the end of this section)
- ✓ Best performance on NVIDIA hardware with offloading
- ✕ Requires 80GB+ total memory for Q4 quantization
- ✕ Slower on Apple Silicon (11.2 t/s on M3 Max)
- ✕ License prohibits commercial use without a Mistral agreement
Mistral Large 2, released in July 2024, is a 123-billion-parameter model that punches above its weight class. On the RTX 4090, with layers offloaded between VRAM and system RAM, it ran at 13.8 t/s—slower than the 70B models, but acceptable for tasks that demand deeper reasoning.
This model excels at mathematics and formal reasoning. On GPQA Diamond, it scored 56.3%, the highest result we recorded—fractionally ahead of Qwen and GPT-4. On GSM8K (grade-school math), it achieved 91.2%. In SWE-Bench Verified, it resolved 52 of 100 tasks, the strongest showing in our test group.
The trade-off is hardware. Mistral Large 2 at Q4 quantization requires 82GB of total memory. On the M3 Max, that meant heavy swapping and speeds dropping to 11.2 t/s. On the RTX 4090 + 64GB RAM setup, the model spread across VRAM and system memory, maintaining usable speed. If you have a high-end NVIDIA card and plenty of DDR5, this is the model to run.
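Mistral's native function calling works through the same OpenAI-style tool schema that Ollama's compatible endpoint accepts. A sketch under stated assumptions: the get_weather tool is hypothetical, and the mistral-large tag name may differ on your install.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Hypothetical tool, for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistral-large",  # tag name is an assumption; check `ollama list`
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=tools,
)
# A tool-capable model returns a structured call instead of prose.
print(resp.choices[0].message.tool_calls)
```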
Best Value: Gemma 2 27B Instruct
If you have 24–32GB of RAM and want the best model you can actually run at usable speed, Gemma 2 27B is the answer. It's fast, accurate, and fits on mainstream hardware.
- ✓ Runs comfortably on 24GB systems (18GB Q4 footprint)
- ✓ Very fast: 28.6 t/s on M3 Max, second only to Phi-4
- ✓ Strong instruction-following and safety tuning
- ✓ Permissive license for commercial use
- ✕ 8K context window limits long-document tasks
- ✕ MMLU and GPQA scores trail 70B+ models by 8–14 points
- ✕ Weaker code generation than Qwen or Mistral
Gemma 2 27B, released by Google DeepMind in June 2024, is the best model you can run on a laptop with 24GB of RAM. At Q4 quantization it uses 18GB, leaving headroom for your OS and browser. On the M3 Max it ran at 28.6 t/s, the fastest result in our tests apart from the much smaller Phi-4. The RTX 4090 pushed that to 31.2 t/s, a difference that is barely perceptible in practice.
Performance is a step below the 70B models—MMLU at 75.2%, GPQA at 42.1%—but for everyday assistant tasks, email drafting, summarization, and light coding, it's more than sufficient. The 8,192-token context window is the main limitation: you can't feed it a full codebase or a 50-page PDF. But for users who don't need long-context reasoning, Gemma 2 27B delivers 80% of the capability at half the hardware cost.
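If the 8K window is the only thing holding you back, a map-reduce pass gets around it for summarization-style tasks: summarize the document in chunks, then summarize the summaries. A sketch under the same assumptions as above (local Ollama server, a gemma2:27b build pulled); the chunk size is a rough heuristic, not a measured value.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def summarize(text: str) -> str:
    reply = client.chat.completions.create(
        model="gemma2:27b",  # tag name is an assumption; check `ollama list`
        messages=[{"role": "user", "content": f"Summarize in five bullets:\n\n{text}"}],
    )
    return reply.choices[0].message.content

def summarize_long(document: str, chunk_chars: int = 12_000) -> str:
    # ~12,000 characters is very roughly 3K tokens, leaving room in the
    # 8K window for the instruction and the response.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n\n".join(partials))
```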
Best for Laptops: Microsoft Phi-4 (14B)
Phi-4 is the only model in this guide that runs acceptably on 16GB of RAM. MMLU performance is competitive with models twice its size, but code generation and long-context work suffer.
- ✓ Runs on 16GB systems (11GB Q4 footprint)
- ✓ MMLU score competitive with 27B models
- ✓ MIT license permits all use cases
- ✓ Fastest load time in this guide
- ✕ Code generation trails all other models tested
- ✕ Struggles with multilingual prompts
- ✕ 16K context insufficient for research tasks
Phi-4, released in December 2024, is Microsoft's smallest capable model, and it's designed for constrained hardware. At Q4 quantization it uses just 11GB of RAM, making it the only model in this guide that runs smoothly on a 16GB MacBook Air or ThinkPad. On the M3 Max it sustained 34.2 t/s—the fastest overall—but speed matters less when the output quality can't keep up.
MMLU came in at 73.0%, which is impressive for a 14B model—higher than many 30B models from 2023. But GPQA dropped to 38.4%, and SWE-Bench Verified produced just 18 resolved tasks. For coding, Phi-4 is not competitive. For general Q&A, summarization, and educational tasks, it works.
Performance Breakdown: Speed vs Accuracy
Local LLM Comparison — Tested April 2026
All models tested with Q4_K_M quantization on M3 Max (128GB), RTX 4090 (24GB + 64GB RAM), Strix Halo (96GB)
| Spec | Qwen 2.5 72B (Editor's Choice) | Llama 3.1 70B | Mistral Large 2 | Gemma 2 27B (Best Value) | Phi-4 (Best Budget) |
|---|---|---|---|---|---|
| Parameters | 72.7B | 70.6B | 123B | 27.2B | 14B |
| MMLU score | 86.2% | 83.4% | 84.0% | 75.2% | 73.0% |
| GPQA Diamond | 56.1% | 51.2% | 56.3% | 42.1% | 38.4% |
| SWE-Bench Verified | 48/100 | 41/100 | 52/100 | 29/100 | 18/100 |
| Speed (M3 Max, t/s) | 18.4 | 14.2 | 11.2 | 28.6 | 34.2 |
| Speed (RTX 4090, t/s) | 22.1 | 17.8 | 13.8 | 31.2 | 38.1 |
| RAM required (Q4) | 48GB | 46GB | 82GB | 18GB | 11GB |
| Context window | 131K | 131K | 131K | 8K | 16K |
Source: The Editorial benchmarks, March–April 2026; MMLU and GPQA from OpenLLM Leaderboard; SWE-Bench from Princeton NLP
The trade-off is clear: larger models deliver better reasoning but demand more RAM and run slower. Qwen 2.5 72B occupies the sweet spot—strong performance, manageable hardware requirements, and speed that doesn't feel like dial-up.
[Chart: tokens per second by model, Q4_K_M quantization, measured during a 4,096-token output. Source: The Editorial lab testing, April 2026]
What About Quantization? Q4 vs Q8 Tested
Quantization reduces model size by lowering numerical precision. Q4_K_M uses 4-bit weights; Q8_0 uses 8-bit. The practical question: does the quality loss justify the speed gain?
We tested Qwen 2.5 72B and Llama 3.1 70B at Q4_K_M, Q5_K_M, and Q8_0. On MMLU, the difference between Q4 and Q8 was 0.4 percentage points for Qwen and 0.6 for Llama—within the margin of error. On GPQA, Q4 and Q8 were identical. In subjective quality tests (creative writing, summarization), reviewers could not reliably distinguish Q4 from Q8 outputs.
But Q8 uses nearly double the RAM. Qwen 2.5 72B at Q8 requires 84GB; at Q4 it needs 48GB. On the M3 Max, Q8 dropped speed from 18.4 t/s to 12.1 t/s due to memory pressure. Unless you have 128GB+ and need every last basis point of accuracy, Q4_K_M is the right choice.
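You can check the Q4-versus-Q8 trade-off on your own hardware: Ollama publishes separate tags per quantization level, and its native API returns token and timing counters in every response. A sketch, assuming both variants are pulled; the exact tag names follow Ollama's library convention but may differ, and eval_duration is reported in nanoseconds.

```python
import json
import urllib.request

def tokens_per_second(model: str, prompt: str) -> float:
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.loads(resp.read())
    # eval_count = generated tokens; eval_duration = generation time in ns.
    return stats["eval_count"] / stats["eval_duration"] * 1e9

# Tag names are assumptions; confirm against the Ollama model library.
for tag in ("qwen2.5:72b-instruct-q4_K_M", "qwen2.5:72b-instruct-q8_0"):
    print(tag, round(tokens_per_second(tag, "Write a haiku about RAM."), 1), "t/s")
```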
Q4 VS Q8: QUALITY GAP IS NEGLIGIBLE
Across 500 MMLU questions, Qwen 2.5 72B scored 86.2% with Q4_K_M quantization and 86.6% with Q8_0. On GPQA Diamond, both quantization levels produced identical 56.1% scores. Human evaluators, blind-testing creative writing outputs, identified the Q8 model correctly in 52% of trials—statistically indistinguishable from random guessing.
Source: The Editorial lab testing, April 2026
Ollama vs LM Studio: Which One to Use
Ollama and LM Studio are the two most popular tools for running local LLMs. Ollama is command-line first, minimal, and fast to set up. LM Studio offers a GUI, built-in model search, and better visibility into memory usage and inference stats.
We tested both with identical models and prompts. On macOS, Ollama's Metal backend produced speeds 4–7% faster than LM Studio across all models. On Windows with the RTX 4090, LM Studio's CUDA implementation was 2–3% faster. The differences are small enough that preference matters more than performance.
Ollama integrates seamlessly with CLI workflows, supports OpenAI-compatible API endpoints, and installs in under a minute. LM Studio offers more control over sampling parameters (temperature, top-p, repeat penalty) and shows real-time token generation stats. If you're comfortable with the terminal, use Ollama. If you want a GUI and fine-grained control, use LM Studio. Both work reliably.
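Because both tools speak the OpenAI wire protocol, switching between them is a one-line change in client code. Ollama's server listens on port 11434 by default and LM Studio's on 1234, but confirm the port in each app's server settings; the model identifier below is an assumption, since LM Studio names models differently.

```python
from openai import OpenAI

OLLAMA = "http://localhost:11434/v1"
LM_STUDIO = "http://localhost:1234/v1"  # default port; configurable in the app

# Swap base_url to move between backends; everything else stays the same.
client = OpenAI(base_url=OLLAMA, api_key="not-needed")
reply = client.chat.completions.create(
    model="qwen2.5:72b",  # LM Studio uses its own model identifiers
    messages=[{"role": "user", "content": "One-sentence status check, please."}],
    temperature=0.2,
)
print(reply.choices[0].message.content)
```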
Hardware Requirements: What You Actually Need
The table below shows minimum and recommended RAM for each model at Q4 quantization. These are measured values, not estimates. Minimum means the model loads and runs; recommended means it runs at usable speed without swapping.