Last week, we were asked “What’s the best local LLM that you are using?” and it’s a good question.
And here’s the thing: there is no single “best” local LLM. The answer is always contextual, shaped by the tension between what’s theoretically possible and what actually works in the real world. A powerful model means nothing if it drains battery life, takes minutes to respond, or requires users to download gigabytes of data over patchy mobile networks on a five-year-old device.
Why “Best” Is the Wrong Question
When evaluating local LLMs, “best” depends entirely on your constraints. What’s your audience’s available RAM? How much can users realistically download? What latency can your use case tolerate? And critically, what are the battery and thermal limits of the devices you’re targeting?
A flagship phone with 12GB RAM and unlimited Wi-Fi can handle an 8GB model. But most real-world deployments can’t assume those conditions. The right model isn’t the most impressive one; it’s the one that reliably completes the task within your specific constraints.
Our Three-Tier Approach
At DataSapien, we’ve tested dozens of local LLMs (also called Small Language Models, or SLMs, and often discussed under the banner of Edge AI) across real production use cases. Here’s what we’ve learned works:
For high-quality reasoning, we use Gemma-3n-e4b-it (Q4_K_M). When complex analysis matters, like our YouTube Persona journey that requires genuine understanding and nuanced reasoning, this model delivers strong instruction following and thoughtful outputs. The same holds for the multimodal models we have tested.
For fast, efficient inference, Qwen 2.5 SLM is our workhorse. It powers our Happier journey summarization and dynamic screen generation, delivering excellent performance at a fraction of the size (download the SandboxApp to play with this). When you need real-time responses without sacrificing quality, this is where we start.
For ultra-lightweight tasks, Gemma 3 270M (Q8_0) surprises everyone. Classification, structured data extraction, and straightforward summarization – all with minimal resource usage. Sometimes the smallest model is exactly what you need.
The principle? Start small and then scale when quality demands it. These aren’t theoretical choices pulled from benchmarks; they’re battle-tested in production, delivering private personalisation to real users on real devices.
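To make the three tiers concrete, here is a minimal sketch of a task-to-model mapping in Python. The task category names and the mapping itself are illustrative assumptions for this post, not part of any DataSapien SDK; the model names mirror the tiers described above.

```python
# Hypothetical three-tier model registry: each task category maps to the
# smallest model tier that has proven sufficient for it in testing.
TIERS = {
    "classification": "gemma-3-270m-q8_0",       # ultra-lightweight tier
    "extraction": "gemma-3-270m-q8_0",
    "summarization": "qwen-2.5-slm",             # fast/efficient tier
    "screen_generation": "qwen-2.5-slm",
    "complex_reasoning": "gemma-3n-e4b-it-q4_k_m",  # high-quality tier
}

def select_model(task_type: str) -> str:
    """Return the smallest model tier known to handle this task type.

    Unknown task types fall back to the most capable tier, on the
    assumption that it is safer to over-serve than to fail the task.
    """
    return TIERS.get(task_type, "gemma-3n-e4b-it-q4_k_m")
```

The point of the mapping is that routing happens by task category, not by picking one model for the whole app.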
This is an evolving approach, and we welcome your ideas, thoughts, and suggestions about improving it.
How We Actually Select Models
Our selection methodology is straightforward: we start from the task itself. Is it classification? Summarization? Complex reasoning? Then we test the smallest model that could plausibly handle it. If quality isn’t sufficient, we step up incrementally to the next size. The goal is always to successfully complete the task with the lowest possible model size.
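The loop above can be sketched in a few lines. Everything here is a hedged illustration: the model list, the download sizes, and the `run_task` / `quality_ok` callables are placeholders for your own inference and evaluation code, not real APIs.

```python
# Hypothetical "start small, step up" selection loop. Models are ordered
# smallest to largest; sizes (in GB) are illustrative, not measured.
MODELS_BY_SIZE = [
    ("gemma-3-270m-q8_0", 0.3),
    ("qwen-2.5-slm", 1.0),
    ("gemma-3n-e4b-it-q4_k_m", 4.0),
]

def pick_smallest_sufficient(task, run_task, quality_ok, max_size_gb=8.0):
    """Try models smallest-first; return the first one whose output passes
    the quality check while staying under the deployment size budget."""
    for name, size_gb in MODELS_BY_SIZE:
        if size_gb > max_size_gb:
            break  # everything larger is also over budget
        output = run_task(name, task)
        if quality_ok(output):
            return name, output
    return None, None  # nothing within budget passed; rethink the task
```

In practice the quality check is the expensive part: it encodes what “sufficient” means for your specific use case, which is exactly why no benchmark leaderboard can answer the question for you.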
Yes, 8GB flagship models exist and they’re genuinely powerful. But they’re rarely the right answer for real-world deployment where users expect instant responses, reasonable battery life, and apps that don’t monopolize their device’s resources.
The Real Paradigm Shift
Here’s what we’ve learned: on-device AI isn’t about cramming the biggest model onto a phone. It’s about intelligent orchestration: matching the right model to the right task for the right audience. Model fit, task fit, and audience fit working together.
This pragmatic, not dogmatic, approach is how we deliver up to 44x engagement improvements to app users. Private personalisation requires exactly this kind of thoughtful engineering: respecting device constraints while delivering genuine intelligence that serves users without compromising their privacy.
The best local LLM? It’s not a model. It’s an orchestration strategy.
We’d love to hear about your experiences with local LLMs. Have you found models or approaches that work particularly well for your use cases?
Arda Dogantemur is Co-Founder and VP of Development at DataSapien, where he leads the technical vision that makes Device Native AI accessible to developers worldwide. Based in Istanbul and serving as a key architect of DataSapien’s groundbreaking SDK, Arda is at the forefront of enabling developers to explore what’s now becoming possible when AI processing happens entirely on-device rather than in the cloud. His work focuses on democratizing sophisticated on-device intelligence, making it straightforward for development teams to embed multi-tiered AI capabilities into their applications without requiring deep expertise in edge computing or mobile optimization.
Through DataSapien’s platform, Arda is empowering developers to build experiences that were simply impossible before: real-time personalization that respects user privacy, personal and Device Native AI, intelligent apps that work offline, and contextual AI that understands users in the moment without needing to send their data to a server. If you have thoughts, comments, suggestions, would like some advice, or want to chat about the future of Device Native AI development, drop him a line on LinkedIn or Hugging Face:
https://www.linkedin.com/in/adtemur/
https://huggingface.co/Liandas

