My Disappointment with Local AI

Apr 17, 2026

I really love the idea of local AI: minimal latency, maximal privacy, and a $0 token budget.

When I started my AI deep dive, I had a clear ambition: make something work well locally. Having come up for air, I must admit that the platform is just not there yet, and that good off-the-shelf models will remain out of reach for typical consumer hardware for some time. Here is why I think so.

Small Models Sacrifice Accuracy

Small models are still general models, just with less accuracy, some reduction in scope, and, with quantization, a further reduction in quality, albeit an impressively small one. These models are impressive for their size, but ultimately always fall short of larger models.

Small models can serve as good pre-trained bases for fine-tuned variants, directly or using techniques like LoRA, but by themselves they are, in my opinion, not very useful.
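The appeal of LoRA-style fine-tuning is that you adapt a pre-trained base by training two small low-rank matrices instead of the full weight matrix. A minimal sketch of the idea in pure Python, with made-up toy sizes (the names and dimensions are illustrative, not any particular library's API):

```python
# LoRA sketch: instead of updating a full weight matrix W (d_out x d_in),
# train B (d_out x r) and A (r x d_in) with r << min(d_out, d_in), and
# use the effective weight W + (alpha / r) * B @ A at inference time.

def matmul(X, Y):
    # naive matrix multiply over lists of lists
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    # W' = W + (alpha / r) * B @ A
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# toy example: a 4x4 frozen base weight plus a rank-1 update means
# training 8 numbers (B: 4, A: 4) instead of 16
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]   # 4x1, trainable
A = [[0.0, 2.0, 0.0, 0.0]]        # 1x4, trainable
W_eff = lora_effective_weight(W, A, B, alpha=1.0, r=1)
```

The parameter savings scale with the rank: for realistic layer sizes, a small `r` can cut the trainable parameter count by orders of magnitude, which is exactly why it is the go-to technique for adapting a small base model on modest hardware.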

Fine-tuned, or more generally specialized, models are, in my view, a much more reasonable approach to small local models: instead of doing everything at low accuracy, do less but preserve accuracy. These are, however, much more expensive to distill, so I have not seen many good examples of this in open-source models. The business case for these models is also questionable today; everybody wants to build a SaaS, not ship standalone software.

One small exception here is perhaps embedding models, which can be useful to run locally, but only if tied into a larger repository of pre-embedded data. In these scenarios, the local work amounts to encoding a search key. This is, again, not something in general use today; it is more of a conceptual niche.
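The "encode a search key" pattern above can be sketched in a few lines: the device embeds only the query, while the corpus vectors were computed ahead of time. The vectors and document names below are made up for illustration:

```python
import math

# Local embedding lookup sketch: the corpus was embedded offline
# (e.g. server-side or at indexing time); at query time the device
# only needs to embed the search key and rank by cosine similarity.

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# pre-embedded corpus (illustrative 3-dimensional vectors)
corpus = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.2, 0.1],
    "doc_tax":  [0.0, 0.1, 0.95],
}

def search(query_vec, corpus, k=1):
    # rank documents by similarity to the query embedding
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, corpus[d]),
                    reverse=True)
    return ranked[:k]

best = search([0.85, 0.15, 0.05], corpus)
```

The point is that the expensive part, embedding the whole corpus, never has to happen on the device; only the cheap single-key encoding does.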

Economies of Scale

As usual, caching plays a huge role in delivering a service fast and cheaply. For LLMs, KV caching avoids recomputing attention keys and values for the prefix, offloading work from the GPU and shortening time to first token. Locally, this is hard to exploit: a single user generates too little traffic for high hit rates, and the caches themselves are large.
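A back-of-the-envelope sketch of why the cache matters: without one, generation step t has to reproject keys and values for all t tokens of the prefix; with one, only the newest token is projected. Counting projection operations (a simplification that ignores the attention itself) makes the gap visible:

```python
# Why KV caching matters, counted in per-token key/value projections.
# Without a cache, every generation step reprocesses the whole prefix;
# with a cache, each step projects exactly one new token.

def projections_without_cache(seq_len):
    # step t projects t tokens, so the total is 1 + 2 + ... + seq_len
    return sum(t for t in range(1, seq_len + 1))

def projections_with_cache(seq_len):
    # step t projects 1 token, so the total is just seq_len
    return seq_len

n = 1000
without = projections_without_cache(n)  # quadratic in sequence length
with_cache = projections_with_cache(n)  # linear in sequence length
```

The quadratic-versus-linear gap is what a data center amortizes across many users and long-lived caches, and what a single local session mostly cannot.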

LLM inference is bursty: given the computational resources involved, you want the highest possible performance for the shortest possible duration. This is exactly where data centers excel, and where local devices, with their slow and steady performance profile, fall significantly short.

Poor Resource Isolation

It turns out that my device actually does need its GPU for graphics.

I have been very focused on the Apple ecosystem (phones, laptops), where I clearly observed what I consider poor handling of high-utilization scenarios. While anecdotal, of course, I do not think this can all be attributed to me holding it wrong.

Low memory and congested GPU pipelines will leave your device useless despite idle CPU cores. While memory pressure can be managed somewhat, GPU time-slicing is handled automagically and, in my experience, quite poorly. Even if kernel panics seem to have been reduced over the last six months, the number of crashes in core systems is still staggering and, to me, shows a platform that is not yet ready to reliably handle this kind of resource pressure.

My conclusion is that any kind of intensive GPU usage scenario (like embedding files) will leave your device basically useless.

AI is Power Hungry

This is just not a great match for any battery-powered device.

The Future of Local AI

I still love the idea of local AI. I really want it to work out. I am currently pessimistic that it will work out any time soon.

Naturally, we will see hardware improvements. Isolated AI engines, like the ANE in Apple devices, will be more commonly available and less locked down to improve resource isolation and, to some extent, power usage. Maybe we will see a trend of specialized AIs rather than the current trend of instructing large models with skills. But ultimately, it will be a long time before we see large models cost-effectively integrated into consumer portable devices.

One idea I can’t easily shake is that maybe we should entirely rethink our approach to portable devices to meet the changing opportunities and demands in the AI age.

But that is a topic for another time.