Unlocking Local LLMs and Coding Assistants: Real-World Setups and Workflows

November 8, 2025

The landscape of local LLMs and coding assistants is rapidly evolving, driven by developers seeking greater privacy, independence from cloud services, and the ability to work offline. While a performance and quality gap still exists compared to the most advanced commercial cloud models, many users are finding effective local setups for a range of coding tasks.

Popular Hardware Setups

Many contributors highlight the importance of substantial RAM and dedicated GPU VRAM.

  • Apple Silicon: MacBooks and Mac Studio models with M-series chips (M1, M2, M3, M4 Max/Ultra) are frequently recommended, especially those with 64GB, 128GB, or even 256GB of unified memory. The M4 Max with 128GB RAM is cited as "shockingly usable" for models like gpt-oss-120b. However, thermal management (e.g., choosing a MacBook Pro over an Air, a Mac Studio over a Mini, or using fan-control tools like TG Pro) is crucial for sustained workloads.
  • AMD/Intel Desktops: Custom desktop rigs featuring high-core-count CPUs (e.g., Ryzen 9 5950X, Threadripper 3960X) paired with powerful NVIDIA GPUs (RTX 3080, 3090, or RTX PRO 4000/6000 Blackwell workstation cards with 20GB+ VRAM) are also popular, particularly for running larger models or multiple models concurrently. Framework Desktops and laptops built around AMD Ryzen AI Max+ ("Strix Halo") processors are emerging as viable options, especially with substantial RAM (128GB+).

Preferred Models and Runners

The choice of model often balances capability with hardware constraints.

  • GPT-OSS-120b/20b: These open-weight models are consistently praised for strong performance, especially the 120b variant, which runs surprisingly well even on high-RAM MacBooks. However, getting optimal performance sometimes requires tweaking inference parameters (e.g., top_k, top_p, temperature) and reasoning_effort settings.
  • Qwen3-Coder-30b (and variants): Highly regarded for coding tasks, especially in quantized form, and often considered a good compromise between speed and quality for specific coding use cases. Smaller previous-generation models like Qwen2.5-Coder-14b/7b are used for faster completion.
  • Other Notable Models: Gemma3:12b (for smaller machines), DeepSeek-Coder-V2-Lite-Instruct, Devstral 24b, and IBM Granite models (especially Nano/Micro for quick, small tasks) are also in use. Some experiment with Magistral for multimodal applications or fine-tuned custom models.
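The parameter tweaking mentioned for gpt-oss usually happens in the request body sent to the runner. A minimal sketch of such a payload for an OpenAI-compatible chat endpoint follows; the helper name is hypothetical, the sampling values are illustrative starting points rather than tuned recommendations, and the "Reasoning:" system-prompt convention applies to gpt-oss specifically:

```python
import json

def build_payload(model: str, prompt: str, reasoning_effort: str = "medium") -> dict:
    """Build a chat-completions payload for a local OpenAI-compatible server."""
    return {
        "model": model,
        "messages": [
            # gpt-oss reads its reasoning level from the system prompt;
            # most other models simply ignore this line.
            {"role": "system", "content": f"Reasoning: {reasoning_effort}"},
            {"role": "user", "content": prompt},
        ],
        # Sampling knobs frequently tweaked for gpt-oss (illustrative values):
        "temperature": 1.0,
        "top_p": 1.0,
        "top_k": 0,  # 0 disables top-k filtering in llama.cpp-based servers
        "max_tokens": 512,
    }

payload = build_payload("gpt-oss-120b", "Write a binary search in Python.", "high")
print(json.dumps(payload, indent=2))
```

Runners differ in which of these keys they honor, so it is worth checking each server's documentation before assuming a knob has any effect.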

Runners:

  • Ollama and LM Studio: These are the most common platforms due to their ease of use, providing simple interfaces for downloading and running GGUF models.
  • llama.cpp: Favored for its speed and flexibility, allowing direct integration into custom scripts or IDE plugins. Users chasing throughput on NVIDIA hardware instead reach for vLLM (batched serving) or TensorRT-LLM (maximum inference speed).
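One reason these runners are interchangeable in scripts is that Ollama, LM Studio, and llama.cpp's llama-server all expose an OpenAI-compatible HTTP endpoint; only the default port differs (11434, 1234, and 8080 respectively). A stdlib-only sketch of the client side, with the model name and prompt as placeholders:

```python
import json
import urllib.request

# llama-server default; swap the port for Ollama (11434) or LM Studio (1234).
API_URL = "http://localhost:8080/v1/chat/completions"

def chat_request(prompt: str, model: str = "qwen3-coder-30b") -> urllib.request.Request:
    """Build (but don't send) a chat request for a local runner."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )

req = chat_request("Explain this regex: ^\\d{4}-\\d{2}$")
# Sending it requires a running server:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because the wire format is shared, the same script can point at any of the three runners by changing `API_URL` alone.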

Integrations and Workflow

Integrating LLMs into daily coding flows is crucial for productivity.

  • VS Code Extensions: continue.dev and llama.vscode are popular choices, allowing users to connect their local LLMs for code completion, refactoring, and general assistance. continue.dev is praised for its ability to set custom system prompts.
  • Coding Assistants/Agents: Aider, Codex CLI, and self-built agents (e.g., VT Code) are used to orchestrate more complex multi-step coding tasks. Users often define specific AGENTS.md prompts to guide model behavior, sometimes leveraging "Agent Organizers" for sophisticated task management.
  • User Interfaces: Open WebUI provides a convenient browser-based interface for interacting with local LLMs, often hosted on a home server and accessed remotely via VPN.

Use Cases and Reliability

Local LLMs are employed for a variety of tasks, with varying degrees of success.

  • Code Completion/Generation: Smaller models excel here, offering low-latency suggestions that can keep pace with typing, often matching or exceeding the utility of online search for quick syntax lookups or boilerplate.
  • Refactoring, Debugging, Code Review: Larger models, particularly gpt-oss-120b and Qwen3-Coder-30b, are used for these more complex tasks, though they may still fall short compared to frontier cloud models, especially for highly intricate problems.
  • Documentation and Summarization: Generating documentation, summarizing local git repositories, or extracting insights from large document collections (often via custom RAG agents) are strong use cases.
  • Offline Utility: A major motivation is having a reliable assistant when internet access is unavailable (e.g., during travel) or when cloud services are down.
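The custom RAG agents mentioned above boil down to a retrieve-then-prompt loop. A toy sketch, with keyword overlap standing in for the embedding similarity a real setup would use, and all document snippets hypothetical:

```python
import re

def overlap(chunk: str, query: str) -> int:
    """Score a chunk by how many query terms it shares (toy retrieval)."""
    q = set(re.findall(r"\w+", query.lower()))
    return len(q & set(re.findall(r"\w+", chunk.lower())))

def build_prompt(chunks: list[str], query: str, k: int = 2) -> str:
    """Pack the k best-matching chunks into a prompt for a local model."""
    top = sorted(chunks, key=lambda c: overlap(c, query), reverse=True)[:k]
    context = "\n---\n".join(top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

notes = [
    "The deploy script reads credentials from .env.production.",
    "Unit tests live under tests/ and run via pytest.",
    "Deploys are triggered by tagging a release on the main branch.",
]
prompt = build_prompt(notes, "How are deploys triggered?")
```

The resulting prompt is then sent to whichever local model is running; keeping retrieval and generation separate like this is what lets the same agent summarize a git repository one day and a document collection the next.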

Reliability often depends on the task complexity and model size. Simple, well-defined tasks perform well, while highly complex or multi-step agentic workflows can still be challenging.

Challenges and Motivations

Challenges:

  • Quality Gap: The most significant challenge is the performance disparity between local open models and top-tier commercial cloud models, especially for deep reasoning or complex problem-solving.
  • Hardware Cost and Management: Acquiring and maintaining powerful local hardware, including managing model downloads on the order of 100GB and dealing with thermal issues, can be a barrier.
  • Software Complexity: Setting up and optimizing various runners, dealing with driver issues (especially on Linux with AMD NPUs), and fine-tuning model parameters requires technical expertise.
  • Context Window Limitations: Smaller RAM configurations quickly cap the usable context window, forcing users to trim or restart conversations frequently.
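The RAM/context trade-off comes from the KV cache, which grows linearly with context length. A back-of-envelope estimate, using illustrative figures shaped like a mid-size dense model rather than any specific model card:

```python
# Per token, every layer stores one key and one value vector per KV head.
n_layers = 48        # illustrative
n_kv_heads = 8       # grouped-query attention keeps this small
head_dim = 128
bytes_per_elem = 2   # fp16/bf16; quantized KV caches can halve this

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
ctx = 32_768         # a 32k-token context window

kv_cache_gib = bytes_per_token * ctx / 2**30
print(f"{bytes_per_token / 1024:.0f} KiB/token -> {kv_cache_gib:.1f} GiB at {ctx} tokens")
```

At these numbers the cache alone costs about 6 GiB at 32k tokens, on top of the model weights, which is why long contexts are the first thing sacrificed on smaller machines.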

Motivations:

  • Privacy and Data Control: This is a primary driver. Users want assurance that their proprietary code or sensitive information is not used for training by third-party providers; many adopt a zero-trust stance toward remote systems.
  • Independence and Offline Access: Freedom from relying on external services, internet connectivity, or specific cloud provider policies is highly valued.
  • Cost Efficiency: Avoiding recurring subscription fees or variable token costs associated with cloud LLMs, especially for frequent or experimental use.
  • Tinkering and Learning: Many enjoy the process of setting up and optimizing local models, gaining a deeper understanding of LLM mechanics and capabilities.
  • Consistent Model Behavior: Local models offer stability; their performance characteristics don't change without user-initiated updates, unlike cloud models that can be silently altered by providers.
