A Month Without Frontier Models

There are a few housekeeping items I wanted to address. First, I moved my personal blog from https://randoneering.tech to https://blog.randoneering.dev. This was simply because I wanted to put a separation between Randoneering, LLC and my personal blog. Nothing wild. What that did for me was send me down a quest to understand ATProto and go all in with integrations on the site. It was genuinely a fun thing to do and, now that the bug has bitten me, I plan to continue to add as many lexicons as I can. (https://atproto.com/docs)

Second, I joined the Kaneo Core Team in June and I couldn't be more excited. Working on this project has been a learning experience for me; I was building on my DevOps-related skills and diving into building Helm charts (and working in k8s). Thank you to Andrej and Tin for trusting me to add more value to the infrastructure and deployment process of Kaneo!

Finally, there were a few new things added to pgFirstAid since the last post. One of those things was examples of CI workflows that utilize pgFirstAid. I have had a few users reach out about this and I hope it is what they were asking for! If not, let me know. There are a few projects around pgFirstAid that I will be working on soon.

"Surviving" on LLMs

Something that has been on my mind lately as price per token continues to rise is how I can be as cheap as possible and not give up using coding agents all together. Before January of this year, this wouldn't even have been something I would be discussing. Yet, after several rejections from companies I would have loved to work at, the clear message to me was I was not using "AI" enough in my workflow. Naturally, I went all in, spent countless hours tweaking my skills, using mcp servers, and eventually landing on using both opencode and pi for my harnesses. Yes, I went through the honeymoon stage, but I am happy that ended just as quickly as it started. The inner engineer in me feels the "responsible" way of implementing AI in my workflow is to keep reviewing code produced. After all, any PRs I open using code generated by one of these harnesses have my name and reputation on it. No co-author commits here.

Of course, I only ever paid for the cheapest plan for both Claude and Codex, but I wanted to prove to myself that we do not need to sell ourselves to Dario or Sammy Boy. We can absolutely take advantage of these open-weight models and use existing hardware to run them (if we were so lucky to have it already before the great rise of RAM prices). In the past, I have used ollama at work, but only ever through the chat interface. It was fine, for the most part, and the quality of code snippets I would get were decent enough to make it in dev but not always production.

In my situation, I already purchased a 3060 (12GB model) prior to all of the price increases and it was in my nix-wks. I just needed to configure something like ollama in my flake and test it out with opencode, zed, and pi.

Setup, Usage, Thoughts

With previous experience using Ollama, I immediately added it to my flake, configured gemma4:e2b, and started using the model in opencode. No dice. Working through various errors (chat template was incorrect, returning json artifacts in responses), I finally was able to get a response back; in about 2-3 minutes. Not great. I posted in a few community discords I am in and received an overwhelming response of "yup, that sounds about right with ollama." Some of these community members were generous enough to send me some config files on how they had set up llama.cpp with the same models and had much greater success. That was the route I went, and into the flake it went.

For about a week, I was using gemma4 to do some code review of my work, build some random projects I was prototyping, and generally was happy with the results. With some of the optimizations made (with the help of the same community members), I was getting about 50 tok/s on the model. What was clear, however, is how inconsistent tool usage was with any harness. It took several iterations to get pi or opencode to use the correct tooling. Additionally, I had to make sure none of the mcp servers were connected while using opencode as it would add to the wait for a response. Even when I was able to dial everything in, I was still not happy with the quality of the work coming from gemma4. I swapped through various models of qwen, gemma, gpt-oss, etc. I just was not impressed, but maybe that was because I was comparing it to a model (GPT-5.4) that responded quickly and, with the right series of skills and prompting, produced far better results.

So I came to a conclusion; I could probably spend more time getting things tuned or find another alternative to the Frontier models. I know opencode offers free models (accepting that they would go away at any point and were using your interactions to train the model itself). I had used Big Pickle and it was pretty good. It was pretty fast actually. Sometimes, too fast for me to follow along. This actually made me think "why not just freeload and use the free models?"

Free Tier

Honestly, I am really impressed with some of the free tier models you can get with opencode. If you haven't tried some of them, you really should. I started to use Big Pickle when I was building some POCs for tools I have been wanting to build. It was blazing fast, though it made mistakes, but so does Opus 4.8. After a stint with that, DeepSeek v4 Flash came out and that truly blew me away. It has been a solid workhorse for me for building this site and assisting me with learning Go. I honestly will be bummed when I cannot use this anymore for free. Not enough to purchase any subscription, of course.

Impressed

That is it. Did running LLMs with my existing hardware work well enough to keep me from paying for a Frontier model? No, but getting by using the free tier models in opencode absolutely did. I am going to stick with this plan as far as I can go. I don't feel like I am missing out on anything. Especially since the token price, in fact, is "too damn high."