In my last article, "Thoughts on AI Coding Tools," I shared which AI coding tools I currently use for daily work. What I omitted there was a discussion of the choice of LLM for those tools.
New and "better than ever!" models are announced every other day. Each shiny new model seems to ace a bunch of benchmarks in its sleep, and YouTubers and bloggers show how they built yet another snake game with a single prompt. But does any of this translate into real productivity gains on large, complex codebases?
I will go through my experience and current choice of models in this article.
Model Performance: Approaching a Strategic Plateau?
I have tested Claude Sonnet 3.7, Sonnet 3.5, and DeepSeek on medium-to-complex as well as large codebases. Sonnet 4 and Gemini 2.5 Pro I have used only on medium-complexity codebases, since my access to these models is through Bedrock.
Is it just me, or have we reached a performance plateau among top-tier models? While each model has distinct strengths, the gaps are narrowing considerably. Benchmark results aside, in daily use I can hardly find significant differences between most of these models.
However, I have observed some specialized strengths: Sonnet excels in precise implementation, Gemini in architectural planning, and DeepSeek in cost-effective prototyping and planning.
Security and Privacy Considerations
Your choice of model provider comes with data security and privacy implications. Amazon lets us use any model available on Bedrock, but different organizations have their own data privacy policies and concerns. Some people avoid DeepSeek entirely due to security concerns, or because of sentiments like, "The Chinese will take our data. If our data is to be stolen and misused, it's better to let OpenAI do it."
Childish jokes aside, a hybrid approach can come in handy for sensitive projects. Not every task you hand to an agent involves sensitive data. Want an MCP server to search the internet and an LLM to summarize the results? DeepSeek can do it for a sweatshop price.
Cost Optimization Strategies
Here is a cost comparison between Claude, Gemini Pro, and DeepSeek. Claude 3.5, 3.7, and 4 have the same pricing model.
Costs can climb really fast: I accrued $150 in two weeks of Claude usage on Bedrock. Subscriptions could have been an option; I would be willing to pay $100 per month, but that does not seem to be enough at this stage.
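To put rough numbers on how a bill like that accrues, here is a back-of-the-envelope cost estimator. The per-million-token prices below are the providers' published list prices at the time of writing (an assumption on my part; they change frequently, and caching or tiered discounts are ignored), and the session sizes are illustrative:

```python
# Rough per-million-token list prices (USD) at the time of writing.
# These are assumptions -- check each provider's pricing page.
PRICES = {
    "claude-sonnet":  {"input": 3.00, "output": 15.00},
    "gemini-2.5-pro": {"input": 1.25, "output": 10.00},  # <= 200K prompt tokens
    "deepseek-chat":  {"input": 0.27, "output": 1.10},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request, ignoring caching discounts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A single agentic coding session can easily consume 2M input / 200K output tokens:
print(round(estimate_cost("claude-sonnet", 2_000_000, 200_000), 2))  # 9.0
print(round(estimate_cost("deepseek-chat", 2_000_000, 200_000), 2))  # 0.76
```

At roughly $9 per heavy session, a $150 bill in two weeks is only a session or two per working day.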
Subscriptions like Claude Pro do not help much, as they do not include API usage. GitHub Copilot Pro can be an exception: the subscription can be used with Roo/Cline by choosing the 'VS Code LM API' provider option. This was a beta feature at the time of writing, and the subscription comes with somewhat opaque usage limits.
My Current Strategy
Now that Andy is not paying my model usage bills, I have created the following plan to avoid excessive costs. I will try to update this section as my strategy evolves.
Search mode: I am currently testing out both Brave and Google Search APIs through an MCP server for this mode. Google does seem to offer quite a lot of free searches.
Free tier with 100 search queries per day. Beyond that, additional requests are billed at $5 per 1,000 queries, up to 10,000 queries per day
Then DeepSeek does the summarization for me. I am also trying out locally run LLMs for this. However, the laptop I use for this server starts to sound like a vacuum cleaner when I use a reasonably decent local model through Ollama.
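Given the Google pricing quoted above (100 free queries per day, then $5 per 1,000, capped at 10,000 per day), the marginal cost of a search-heavy day is easy to sketch:

```python
def google_search_cost(queries_per_day: int) -> float:
    """Daily cost (USD) under Google's quoted Custom Search pricing:
    first 100 queries free, then $5 per 1,000, capped at 10,000/day."""
    if queries_per_day > 10_000:
        raise ValueError("Google caps usage at 10,000 queries per day")
    billable = max(0, queries_per_day - 100)
    return billable * 5 / 1_000

print(google_search_cost(100))    # 0.0 -- still inside the free tier
print(google_search_cost(1_100))  # 5.0 -- 1,000 billable queries
```

For a typical agent workflow that fires a few dozen searches a day, the free tier is effectively enough.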
Architect/Plan mode: Gemini 2.5 Pro provides a lower price point than Claude and does a similar (or even better) job at planning. Although the price becomes the same once we hit 200K prompt tokens, I have not reached that yet.
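The 200K prompt-token threshold mentioned above is where Gemini 2.5 Pro's list price jumps. A quick sketch of what a planning request costs on either side of that line (rates are my assumption of the published list prices at the time of writing):

```python
def gemini_plan_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one Gemini 2.5 Pro request.
    Assumed list prices: $1.25/$10 per 1M tokens up to 200K prompt
    tokens, $2.50/$15 beyond that threshold."""
    small = prompt_tokens <= 200_000
    in_rate = 1.25 if small else 2.50
    out_rate = 10.00 if small else 15.00
    return (prompt_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(gemini_plan_cost(150_000, 8_000))  # 0.2675 -- below the threshold
print(gemini_plan_cost(300_000, 8_000))  # 0.87   -- rates jump past 200K
```

So keeping planning prompts under 200K tokens keeps Gemini's price advantage intact.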
Code/Implement mode: For some reason that I cannot scientifically explain, I still prefer Claude for this mode and continue to use it. Turning off extended thinking and enabling prompt caching can help achieve better results in this mode.
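As a concrete illustration of those two tweaks: Anthropic's prompt caching works by tagging a stable prefix (typically the long system prompt an agent resends every turn) with a `cache_control` marker, and leaving out the `thinking` block keeps extended thinking off. A minimal sketch of the request body, built but never sent (the model ID and prompts are placeholders):

```python
# Sketch of an Anthropic Messages API request body for implement mode.
# The model ID and prompt text are placeholders; nothing is sent here.
system_prompt = "You are a coding agent. <long, stable project instructions...>"

request_body = {
    "model": "claude-sonnet-4-20250514",  # placeholder model ID
    "max_tokens": 1024,
    # No "thinking" block: extended thinking stays off in this mode.
    "system": [
        {
            "type": "text",
            "text": system_prompt,
            # Marks this prefix as cacheable, so repeated agent turns
            # reuse it at a reduced input-token price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Implement the parser module."}],
}
```

Agent frontends like Roo/Cline expose both settings as toggles, so you rarely need to build the payload by hand.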
Documentation mode: Gemini 2.5 Pro works great here as well, at a lower price than Claude. DeepSeek is not bad at this either; however, I only use it when I am not particularly concerned about data privacy.
Conclusion
Find your own sweet spot. Employers should probably start providing model access with pooled rate limits; $250 per month per employee should be more than enough, and that can buy (dare I say?) a 2-3x productivity boost.
