Only thing I can think of is for when someone is out of opus credits. Of course there are API billing use cases but I'd probably still just use opus on low.
I think the models are being optimized for wealth extraction from users and companies, instead of solving problems.
I don't know why Opus would try to create an entire library when I told it specifically to do something simple that would take 2-3 lines of Python.
YES! They introduced the new tokenizer to increase token generation by up to 33%.
On top of this, Anthropic are generating almost twice as much revenue per paid user as OpenAI, whilst their subscriptions have lower usage limits than OpenAI's:
I don't think so. I'd expect that in a market with high vendor lock-in, but that's not the case here. The market is extremely competitive and switching costs are near zero. Anthropic can't afford to pull shit like this and sacrifice quality.
Yeah, those are my thoughts as well. I feel it’s great for benchmarks and some tasks, while in others it tries to spend as many tokens as possible, overcomplicates the task, and needs a second or third round of steering, which costs money. With the scale Anthropic operates at, I bet it’s a huge amount of extra money just to make sure their model works.
Because it reasons in one direction. First it encounters some kind of issue with 2-3 lines of Python that might make it not work, and then it goes onto plan B, which is making a library, but it doesn't circle back and compare the effort of making the library to working around whatever might make the 2-3 lines not work. Except sometimes it does, because it's inscrutable.
Should I refer to those who are only realising this now as stupid? I believe so.
It's not wealth extraction btw - the correct economic term is capturing/extracting surplus. They have a wide range of schemes - quality discrimination being one very obvious one.
Swear most of you on here pretend to be soooo smart when you def are not.
You have to test each task obviously but it is not a bad model on its face.
From the system card: "On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5.
As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym"
A similar situation occurred with planning and coding. GLM-5.2 seems to be good “on paper” but the real usage results were different.
And I am not an attorney for Claude or GLM-5.2… :)
But as I’ve been using LLM models daily since Nov 2022 I have realized that all common tests have to be confirmed in your project - there is no “one model rules them all” - you need to dig out a specific model from that LLM haystack with thousands of models.
Benchmarks help but they start to be similar to fuel consumption specs in car ads - real consumption is different for everybody :)
You'd need to produce this like 20 times by each model and then do 2x20x20 cross comparisons by both models and ultimately distill the 2x20x20 comparison results into two reports of how they differ.
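To make the 2x20x20 count concrete, here is a minimal sketch of how the comparison jobs could be enumerated. The model names and the job layout are placeholders I made up; the actual generation and judging calls are left out entirely.

```python
from itertools import product


def cross_comparisons(model_a, model_b, runs=20):
    """List every (judge, output_a, output_b) comparison job.

    Each model produces `runs` outputs, every output of A is paired with
    every output of B, and each pair is judged once by each model.
    """
    outputs_a = [(model_a, i) for i in range(runs)]
    outputs_b = [(model_b, i) for i in range(runs)]
    return [
        {"judge": judge, "left": left, "right": right}
        for judge in (model_a, model_b)
        for left, right in product(outputs_a, outputs_b)
    ]


# "model-a" and "model-b" are hypothetical names.
jobs = cross_comparisons("model-a", "model-b")
print(len(jobs))  # 2 * 20 * 20 = 800
```

That is 800 judge calls on top of the 40 generations, before you even start distilling the results into the two reports.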
In this non deterministic computing future, everything else is voodoo, feelings and "vibes".
"Wow, X model is Y% better or worse than Claude Z model on T benchmark"
"That's irrelevant, they're just benchmaxing."
"Not useable for daily coding or agentic workloads, the vibes are totally wrong."
"It's almost as good, and costs a lot less, so I will absolutely use it."
"I cannot imagine justifying using these, as the step change means open models' lower costs do not make up for the productivity loss"
I'm an unhappy Anthropic customer and really rooting for open models and non-gatekept intelligence, but how do we move on from this now meme-like model release discourse rigamarole? I do not know what the alternative would be. I don't design LLMs nor benchmarks, and I genuinely appreciate that people do their best to provide information, even if non-perfect here. I'm sure most of you who actively read these comment pages on announcements must feel similarly, though, right?
20 minutes after the announcement there's no real useful statement that can be made about it.
I generally agree with this in spirit https://www.seangoedecke.com/are-new-models-good/ , but I think you can read Anthropic's results showing Sonnet 5 as almost strictly worse than Opus 4.8 as very credible/meaningful, and then draw comparisons from that
I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me positive, I have found that the more models are optimized for fully agentic development, the worse they get at assisted development and often start doing too much despite very strict/specific instructions.
I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap.
Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two.
Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models.
It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them.
This source claims that knowledge workers alone (probably because they are paid much more) account for 35 - 50 Trillion of that: https://github.com/danielmiessler/Substrate/blob/main/Data/K...
If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity.
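A quick back-of-the-envelope check of that range, as a sketch. The 35 - 50 trillion base comes from the source linked above; working in billions and integer percentages is just my choice to keep the arithmetic exact.

```python
def value_range_billions(base_low_bn, base_high_bn, boost_pct):
    """Annual value of a productivity boost, in billions of dollars."""
    return (base_low_bn * boost_pct // 100, base_high_bn * boost_pct // 100)


# Knowledge-worker base of 35 - 50 trillion, expressed in billions.
low, high = value_range_billions(35_000, 50_000, 5)
print(low, high)  # 1750 2500, i.e. roughly 1.75 - 2.5T at a 5% boost

# The ~30% figure from the 2024-era studies, for comparison.
print(value_range_billions(35_000, 50_000, 30))  # (10500, 15000)
```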
Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.)
Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI.
At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.)
However, these frontier labs are also making moves that could let them capture a disproportionate share of the upside. One possibility is a situation analogous to the smartphone manufacturing space, where there are dozens of players but just a handful (e.g. Apple, Samsung in smartphones) capture the lion's share of the revenue.
Same with Samsung. And it is the best Android device.
If a Nokia OS came out tomorrow, it would be dead in the water: it has no apps.
But with a new LLM that doesn’t matter. There is nothing sticky about typing Gemini, Claude or codex in a CLI.
The AI labs are also making moves to secure long-term enterprise presence, such as their Forward Deployed Engineer strategy. I think that is a trojan horse play that could make enterprises dependent on them forever, much like so many companies are still dependent on IBM's mainframes. As an extreme example, you could imagine a company's core business logic encoded in the weights of a proprietary model custom-trained and hosted by one of these model providers, something even more inscrutable and sticky than ancient COBOL codebases.
The frontier labs, on the other hand, are thinking about replacing all human labor, ending death, and the risk of it causing human extinction. Most of the apparatus we're talking about approach it very parochially; it's almost like they're embarrassed to take the grander ideas even a little seriously, for being too nerdy/sci-fi.
They'll show up after the fact and whinge endlessly about how they should have been involved.
Or maybe every cultural group has its own set of whiners and we always think the ones we disagree with are the loudest.
The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed, which for me would probably translate to something like a 3% increase in productivity. I spend a lot more time on things like getting agreement between teams, documenting approaches to things that don't exist on the wiki, etc, that LLMs are significantly less effective at. Or just can't do; no one will be happy if I send an LLM instead of me to meetings.
I suspect a lot of roles are like that. They give a 10-30% boost to the core role function, but that core role is still only 30-50% of what you do.
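As a rough sketch of that estimate: multiply the boost on the core task by the share of the job the core task takes up. This is the simple linear approximation used above, not a careful speedup model, and the function name is mine.

```python
def overall_gain_pct(core_boost_pct, core_share_pct):
    """Approximate whole-job gain from a boost that only applies to part of the job."""
    return core_boost_pct * core_share_pct / 100


print(overall_gain_pct(10, 30))  # 3.0  -> 10% faster coding, coding is 30% of the job
print(overall_gain_pct(30, 50))  # 15.0 -> the optimistic end of both ranges
```

So even the generous end of the 10-30% and 30-50% ranges lands at about 15% for the whole role.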
> that is ~1.5 - 2.5T in value annually
That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap. And this is before we consider that they need to do it for cheaper or why would anyone bother. Realistically, potential revenue is probably half that at best.
It's also before cutthroat pricing really kicks in. People are willing to pay for Claude right now; I still suspect that as time goes on people will start looking towards Deepseek/GLM/etc models that provide 95% of the performance at 10% of the price. That'll cut the market even further.
The question is how much demand for knowledge work swells as prices fall, and whether that's a soft landing or a crash.
> It's also before cutthroat pricing really kicks in.
Right, that's more of an estimate on the value proposition of the overall AI industry, rather than valuations of the industry or specific players. While I don't think OpenAI and Anthropic will capture all of the potential upside, I do suspect they will do much better than other players despite the competition (https://news.ycombinator.com/item?id=48740472)
> And this is before we consider that they need to do it for cheaper or why would anyone bother.
Typically yes, but there are reasons companies may be willing to pay the same amount or even more, such as "AI doesn't need sleep, holidays, insurance, or benefits" and "AI is easier to procure and replace than humans."
> The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed...
Curious to see which studies you're looking at, the studies I'm thinking of (some here: https://news.ycombinator.com/item?id=45379452) are from 2024 - 2025, so already old and before agents really took off.
However, your point about meetings and agreements and documenting is much more germane. My theory is that the largest productivity gains -- and subsequent labor displacement -- will come from reducing coordination overhead: https://news.ycombinator.com/item?id=48040999
Minus the cost of inference, that might not be the boon you're making it out to be. I hear what people around here are spending on their api and I'm skeptical that these tools are making me that much more productive.
Personally, for assisted development, I haven't seen much progress in a while.
Pre-bubble pricing: $1400 gets a 128GiB iGPU optimized for inference. Glm and kimi need 800-1000GiB. Call it 1TiB. The $1400 boxes could be ganged into sets of 4-8, with a switch. Call the switch $1000.
Each box has a TDP of 250W. 8 x 250W / 120V = 16.67A, which fits on one 20A household circuit in the US (though not a 15A one), so no new power infrastructure is needed.
$1400 x 8+1000=$12,200. Assuming standard five year depreciation, that’s $2440 a year. There are a billion knowledge workers alive today. So that’s $2.4T annual revenue. Average net profit margins on computer hardware are 4.3%. That works out to $105B net income, globally.
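The arithmetic above can be reproduced with a short sketch. All inputs are the figures stated in this comment (box price, switch price, TDP, five-year depreciation, one billion workers, 4.3% margin); the variable names are mine.

```python
BOX_PRICE = 1400          # dollars per 128GiB box
BOXES = 8
SWITCH_PRICE = 1000       # dollars
TDP_WATTS = 250           # per box
MAINS_VOLTS = 120         # US household voltage
DEPRECIATION_YEARS = 5
KNOWLEDGE_WORKERS = 1_000_000_000
NET_MARGIN_PER_MILLE = 43  # 4.3%, kept as an integer to avoid float rounding

cluster_cost = BOX_PRICE * BOXES + SWITCH_PRICE          # 12200
annual_cost = cluster_cost // DEPRECIATION_YEARS          # 2440 per worker per year
annual_revenue = annual_cost * KNOWLEDGE_WORKERS          # 2.44 trillion
net_income = annual_revenue * NET_MARGIN_PER_MILLE // 1000

# Current draw at full TDP; compare against the breaker rating
# (US household circuits are commonly 15 A or 20 A).
amps = BOXES * TDP_WATTS / MAINS_VOLTS

print(cluster_cost, annual_cost)   # 12200 2440
print(annual_revenue)              # 2440000000000
print(net_income)                  # 104920000000, about $105B
print(round(amps, 2))              # 16.67
```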
So, I guess the question is whether the (currently #2) open weight models provide $1.4-2.4T less value per year than the #1 and #3 models, and, if so, whether customers can measure this, or are willing to spend 2x more and deal with censorship, data theft, intentional enshittification, sabotage, ads, product placement, etc, to get the slightly “better” model.
Also, note that my numbers assume Moore’s law stopped for all time in 2024, but we’ve seen HW improvements since then.
I do think open weight and other competitor models, especially with better harnesses, will play a significant role in the equation and will result in less concentration in the market. However, I do also think the big AI companies will capture a lot of that value. Partially for the same reasons that the cloud industry has been growing like gangbusters, even pre-AI, despite on-prem being much cheaper: companies will outsource anything that is not deemed a "core competency" for their business.
A lot of the problems you mentioned will be relegated to the consumer market and won't apply to enterprise contracts -- which is where the real money is.
Pls stop posting you are creating noise.
I think this sort of thinking is a trap, because it presumes that all software has the same constraints.
There's a spectrum of requirements between "chuck this over the wall at Claude, it only has to work once" and "this is a literal rocket ship, formally verify the whole thing".
I've made some things with Claude I don't understand and don't control. It's fine, they're still useful to me. Things for the house that I wasn't going to build manually, some dashboarding stuff and scripts for work, stuff that can crash and burn and I'll be fine.
They won't justify trillions in investment, but they are useful.
Equally, I do agree with you on some things. Sometimes I hand-hold the LLM or forgo it entirely because I want to be 100% sure I know how something works, and can justify a decision if it causes a production outage.
I think the future is probably multiple different tools with different goals. Better IDE integration for some uses, an entirely separate "LLM herd controller" kind of thing for when you're okay with vibe-coding, and the most interesting is something in the middle where you're more in the loop than pure vibe-coding, but don't see the full context like in an IDE. Something where it surfaces changes to key components, but hides things like test changes.
As you said, building a script that only you use personally or a very simple thing that just accomplishes one task and it’s easy to test require almost no engineering, and an LLM can often build those with very little downsides.
That's a key point. Keeping knowledge and know-how inside the company is strategic. For most people, GPS did not result in a better sense of direction, spellchecking did not help them write without mistakes, and delegating translation to DeepL does not help them get better at a foreign language. I don't see the gain for an individual, a company, or a society if a technology reduces the ability to think, do stuff, understand complex problems, and work hard at something. Hiring juniors also matters: what is boring for a senior dev is useful for a junior, like the "wax on, wax off" in The Karate Kid. Then when the senior dev retires, the junior is not a junior anymore and the know-how is still there. I want to transfer my knowledge to a junior, not to Anthropic or Google or OpenAI.
Ideally, working hand in hand with an AI could be like riding a motorcycle vs riding a bicycle. Both are fine, but you go much faster with a motorcycle and you don't lose any ability. But prompting a motorcycle autopilot by voice sounds a bit stupid and boring. The insane use of energy rarely comes into the equation, which is a bit weird. Personally, it is why I am never tempted to use AI. However, I see value in AI for finding weaknesses in code (the inverse of flattery), writing tests with all the edge cases based on specs since tests are often sloppy, and asking for a fresh view on a very difficult problem. I'd love to hear about the equivalent of move 37 in game 2 (AlphaGo vs Lee Sedol) in a difficult programming task. But I think that massive delegation of code writing is how you lose the knowledge and the know-how: what keeps us sharp.
Final word: I once asked Claude for a review; the code involved a DB transaction. Nothing complicated, and Claude said everything was fine. However, the transaction isolation level was not set (I did it on purpose, as if I did not know about isolation levels). It did not ask me whether I intended to keep the default level. I would have preferred challenging feedback: Why did you choose the default isolation level? Is it on purpose? Do you know that the default depends on the DB? Do you know about isolation? Tell me about the business use case and I'll explain which one would be best.
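For illustration, here is a minimal sketch of the kind of check I would have liked the reviewer to make. The function and the dictionary are hypothetical helpers of mine, not part of any tool; the listed defaults are the documented out-of-the-box isolation levels for these engines.

```python
# Default transaction isolation level per engine (out-of-the-box settings).
DEFAULT_ISOLATION = {
    "postgresql": "READ COMMITTED",
    "mysql": "REPEATABLE READ",      # InnoDB
    "sqlserver": "READ COMMITTED",
    "oracle": "READ COMMITTED",
}


def review_isolation(db, explicit_level=None):
    """Return a review note about a transaction's isolation level.

    `explicit_level` is whatever the code under review sets, or None
    if it silently relies on the engine default.
    """
    if explicit_level is not None:
        return f"Isolation level explicitly set to {explicit_level}."
    default = DEFAULT_ISOLATION.get(db.lower())
    if default is None:
        return f"No isolation level set and the default for {db!r} is unknown here. Please check."
    return (
        f"No isolation level set: {db} will use its default ({default}). "
        "Is that intentional for this business case?"
    )


print(review_isolation("mysql"))
print(review_isolation("postgresql", "SERIALIZABLE"))
```

The same code gets REPEATABLE READ on MySQL and READ COMMITTED on PostgreSQL, which is exactly why the question is worth asking.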
Contrary to what some people suggest, I have not hit any maintenance or reliability dead ends. If something breaks, the agent fixes it.
If it cannot, I have the agent instrument the code and work through the logs to check hypotheses, until the source of the issue is found.
If even that failed, which has not happened yet, I can still do some old-fashioned digging and learning, like I always have.
This is for native mobile app development, and the code base is around 100k LOC.
Now, we can't know if this is true unfortunately, but it's not directly contradicted by anything that's known publicly at least. I thought it was an interesting way to frame it and makes the whole situation look marginally less bad.
FCFF = EBIT(1-t)-Reinvestment
I don't care about your gross profit - this kind of cash profit determines the value of operating assets.
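A tiny worked example of that formula, as a sketch. The input numbers are made up purely for illustration and do not describe any real company.

```python
def fcff(ebit, tax_rate, reinvestment):
    """Free cash flow to the firm: EBIT(1 - t) - Reinvestment."""
    return ebit * (1 - tax_rate) - reinvestment


# Hypothetical inputs: EBIT of 100, a 25% tax rate, 40 reinvested.
print(fcff(100.0, 0.25, 40.0))   # 35.0

# Same EBIT, but reinvestment eats more than the after-tax operating income.
print(fcff(100.0, 0.25, 90.0))   # -15.0
```

The second case is the point: a business can show a healthy operating profit and still burn cash once reinvestment is counted.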
Whether he's lying is another question, but seems unlikely.
Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage.
But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :)
Chinese models are quickly commodifying frontier inference, and the US Gov is blocking public access to domestic SOTA models. Without those models, why would consumers still spend $200/month to use the best models?
It’s such a mess and isn’t inspiring confidence as a non-investor.
It all comes down to whose prediction of the future is closer to correct. I think the most likely future is commodification of inference and "agent-assisted" rather than "agent-driven" workflows dominating the future of work. But insiders - who both know way more than me, and also have more skin in the game, both for better and worse - seem to really think I'm wrong about that.
So I dunno! Could go either way!
But is your impression that this is the strategy of people like Amodei? My impression is that it isn't, that they are actually true believers, and not just trying to hit the timing right and flip it.
What insiders are you talking about? They're going to be hot towards the possibilities so they can exit to a massive windfall. I don't know why they would want to be publicly critical of these technologies that could make millions on IPO.
My point is that actually it would be worse for these people if the valuations are only high during this period - which will last a while longer from now! - where their equity is not liquid, but crashes as the market figures out this commoditization thing.
But if we're wrong about how that's going to go, then this isn't a concern because there won't be any devaluation. And to me that seems to be what they honestly think is going to happen. And they know more than me (and I think they're a lot smarter than me), so this does temper my confidence in my own predictions.
https://www.cerebras.ai/blog/gemma-4-on-cerebras-the-fastest...
I think there is. Pair today doesn’t mean they’re locked into that forever.
go ahead m8 we are all waiting... the stage is yours. lets see your model.
Yup. I think we agree. These valuations aren’t made or unmade by whether their tools are being used as vibe agents or pair programmers.
Honestly I still don't see how they justify their valuations, period. If anything they're serious liabilities.
Open-weight models are improving and reaching "good enough" levels for more and more tasks. They're also known quantities; you know what you're getting with them and don't have to worry about the model silently (or not so silently) being switched out from under you (whether that's because Anthropic/OpenAI decides you're not worthy of their latest and greatest for one reason or another, or they switch you to a quantized model to save on compute, or they simply sunset the specific model you've been relying on).
And if the open-weight model doesn't run on your local hardware already, there are any number of hosting providers that will handle that for you (so you're back to just paying for colocation/cloud usage instead of nebulous tokens).
Closed models are improving as well, sure, but diminishing returns will eventually kick in (as they already have for various tasks, as I said).
So if not their models, where does their value come from? Just simple network effects/lock-in? "Normal" users will drift to other options if they start showing more and more ads, and enterprise customers will surely be looking for opportunities to avoid lock-in and reduce risk.
I think the last argument I've heard is that these valuations are basically a bet that Anthropic and/or OpenAI will achieve AGI that can fully replace human labor, so they'll essentially be able to sell that replacement labor to everyone. They haven't managed to pull that off, yet, however. Businesses that have tried to replace humans almost immediately realized either that the AI's capabilities were oversold or that they at least needed a human in the loop still, to some degree. And even if they do achieve AGI, that would surely become an issue of national security (they're already flirting with that today), so who's to say governments won't simply nationalize the best AI labs and either remove them from the economy entirely or perhaps even provide models as a public service to level the playing field?
That all sounds like a giant gamble, if anything. And it's incredibly frustrating to watch as someone that's been unemployed for a year because (a) budgets are being burned on tokens and (b) LLM-generated applications are flooding hiring teams and preventing real people from being seen. (Not to mention, as someone that spends a lot of time in gaming circles, the fact that DRAM and flash storage are quickly becoming inaccessible is just an additional frustration that means people can't even find temporary relief in entertainment.) I can only hope this bubble finally implodes before I lose my house.
Not the first one to come up with that likely outcome either. I mean, if you're being restricted from SOTA models now, how long do you expect before the FBI kicks in your door for using an 'illegal' open model?
Today's news that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring and gave companies an out to change course.
I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign.
For the non-bleeding edge they have a lot of competition with more competitors showing up every day.
The way this is playing out is not surprising, it's similar to any other technological breakthrough as it becomes commercialized. Eventually those means of production will become commoditized as well.
However the result is exactly the same, concentration of power.
And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week.
That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus.
I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al.
I currently don't see a world where it makes sense to run a local model that eats up 60% of my RAM and 20-30% of my disk space while providing worse quality output than a $20/month subscription.
https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4...
Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review and ask a lot more questions from the agent than I've seen others doing. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me.
I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve.
IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task.
I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then
OpenAI claims to have made their new Terra model as good as GPT 5.5, but with half the cost per intelligence. Hopefully, this will bring it closer to the price you're expecting (or even better considering GPT models have good acceptance/success rates according to benchmarks).
Imo MiniMax and MiMo are a lot more reliable (and cheap)
Not opus level, but close enough and cheap enough to get the job done
If this was the last model I could ever use I think I would be happy.
I give AI an image and just tell it what's wrong, and then it goes on to fix the bug in the codebase for me (and write the tests), is this agent-assist or agent-driven?
Sometimes I just give the AI my description, and mockup, and it creates a plan and implements the details for me, and I verify visually ( this is the weak spot of AI), is this agent-assist or agent-driven?
the incentives aren't there sadly
There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models.
People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.)
I can't help but feel this is intentional towards the 'Agentic' workflow.
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual Logic unless they program it out to test, this runs afoul and we get one or the other.
Feels like optimizing for either precision or recall, but can't have both
By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like special-education kid multiple times a day.
If you set off a classifier, that's how it looks to Claude.
IMO, they were quite good with checklists even a year ago, and tried to tick off each one.
A quite useful trick is to use /opusplan along with /codex:rescue (https://github.com/openai/codex-plugin-cc), which gets you quite a strongly reviewed plan using native claude + codex without having to implement the mostly useless trust-me-bro plugins and other bs.
Fable was amazing as a vibecoder but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon.
It's really grim if you're looking for assistance instead of an implementor.
GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information.
I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files.
The problem is obviously who will be left. There’s a lot of scifi to catch up on.
I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became.
I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention.
I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is to ensure the subagent follows instructions to the letter.
"I just cloned this repo, investigate how to set it up, don't install anything, just collect information"
_spews information_
I proceed with the setup, but get a Linux specific dependency in a bash script, so I want to evaluate whether it can be rewritten...
"There's this error on MacOS, I think it's because we need linux-utils from brew, verify whether the script can be written in bare posix"
_proceeds installing linux-utils and all the rest_
"Didn't I tell you to not install anything?"
_you're absolutely right_
F*k me..
I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails.
Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus.
Weak spots (categories it fails):
- Trivia — 0/3 - basically not much built-in knowledge
- Combined tool-calling tasks — score 45/100, sometimes makes invalid tool calls
- Puzzle Solving — score 77, flubs carwash-like tests
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
Still one of the most intelligent models overall, most likely to get any question you ask correctly (without tools).
And no (strong) programmer would jump to assuming other people are coding monkeys just because they disagree on what a strong LLM is: that's the kind of thinking reserved for the glorified coding monkeys who wasted their life getting better at writing CRUD apps and are now upset that someone's tooling is dropping the already very low bar there.
(ie. won't feel the need to downvote them just for having yet another crappy AI benchmark)
I only recognize it because I build a product that leaves me looking for information on every major release... and every major release a new crop of folks reply confused about the anomalies on top of anomalies that they're seeing, and they slowly learn this person is just way more unserious than the dogged distribution would imply.
z.ai doesn't always have the most reliable AI
but I don’t mind the party seeing my trade secrets and thoughts compared to an American corporation + the party seeing my trade secrets and thoughts. So that's not a functional difference to me, and the Chinese one won’t reply to subpoenas so that's a value add tbh
So I’ll consider all, fastest tokens/sec wins
That's not something that's definite. They are not quite like the Russians. A lot of the governments in Asia are overly pragmatic and will happily strong arm their companies to throw users under the bus for the sake of a trade deal. There's a reason why Snowden ran to the Russians and not China.
Also, if they have any subsidiaries in the US, they may not have a choice in the matter.
> Illustration of a white goose riding a bicycle, with one wing extended forward to grip the handlebar, set against a plain white background with a brown ground line.
Meanwhile GLM 5.2 drew a cool self-contained fully animated SVG pelican:
(I suspect that's more of an indication that Anthropic have chosen not to waste resources training on animals riding vehicles, personally.)
The reason I thought this was an interesting benchmark is because it’s a non-image generating model creating an image using SVG code, so it kinda spans capabilities.
If an AI lab trained a model specifically for animals riding bicycles, it seems trivial to modify the prompt and determine whether it was trained specifically for that or whether it has generalized a skill and can also generate a proper orangutan walking on stilts or an armadillo on a skateboard, this sort of thing?
Google Gemini have openly boasted about their animals on vehicles results! https://x.com/JeffDean/status/2024525132266688757
Stable Diffusion 3, an open weights model, was laughed at on release for not even being able to generate a woman lying in grass. The community attributed this to the heavy dataset filtering. Since then other open weights releases have been made with no NSFW capabilities, and the community claims they're not as good at anatomy either.
You can google "stable diffusion 3 woman in grass" and press the images tab to see how the model failed spectacularly.
Most recently Ideogram released an open weight model that will denoise into a grey image with the text "Blocked by safety filter" notice for certain prompts
Of course, because it's open weights people have found defeats
This may be the goal.
They changed the Sonnet 5 'Agentic search' benchmark graph overnight
As with any new model, you won't know the real impact until you start using it for your workload.
There was a fairly major regression in Claude Code performance for some time when they changed the system prompt to try and make it less verbose (saving tokens). And if I'm not misremembering, there were a lot of complaints when they changed the default effort from high to medium.
I've been using Sonnet instead of Opus for almost all coding tasks for a while now. A little elbow grease to break down tasks and you can spend a lot less money for just about the same output quality.
I'm a skilled senior (I'm 54 and been coding since I was about 8; I've been 100% AI-generated code for at least 6 months now and have produced a combination of speed and quality that has astonished me; my velocity is apparent at https://github.com/pmarreck/) and this has been a massive net gain, so your claim is now officially in sheer defiance of reality.
In a skilled senior's hands, this is like an expert power tool. In the hands of someone less-skilled, it is likely also... less-skilled. It's a magnifier.
> and the hidden cost in terms of technical debt and skill atrophy is just being swept under the rug.
Nope, no it's not. It's being reviewed, measured, and controlled against. Because... you WILL need more controls to take full advantage. Look, I even invented a whole new control methodology around it called MFIC: https://gist.github.com/pmarreck/b30aa3ca69cb70a5526f8a63ab8...
Not disagreeing that LLMs are a force multiplier, but I highly doubt there will be much value left to multiply in the next generation of seniors, at this rate. It's surreal to me that I have to point out that recognition AND recall are both necessary components of skill acquisition, because humans have largely known this since the dawn of education.
Seniors should be paid to actively introduce juniors to the trade over a couple of years. No more bootcamp entry.
And it would take significant $ for a senior to agree to expend their time and energy on software engineering apprentices. There would also be a very limited number of places with good seniors. Exactly like medicine for a long time now.
In fact it is already happening in some companies I know about - seniors getting their bonuses tied to juniors being under their wings.
Instead, let's train the contractors of an IT sourcing company, and then we don't need you.
If your "force" is above 1 then it's OK to have AI power up your force. 2.3 to the power of 3 is 12.167.
But if you're a beginner and your "force" is below 1, powering it up makes it worse. 0.2 to the power of 3 is 0.008.
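To make the arithmetic in this analogy concrete, here is a minimal sketch. It takes the comment's framing at face value, with AI modeled as raising your "force" to the power of 3; the function name `amplified` and the exponent default are just illustrative assumptions, not a real model of productivity.

```python
def amplified(force: float, exponent: int = 3) -> float:
    """Toy model from the comment: AI raises your 'force' to a power."""
    return force ** exponent


# The two values from the comment, plus the break-even point at 1.0.
for force in (2.3, 1.0, 0.2):
    print(f"force {force} -> {amplified(force):.3f}")
```

Under this toy model anything above 1 grows, anything below 1 shrinks, and exactly 1 stays put.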
I have yet to see it, but OK
Either measure it or it sounds like a conspiracy theory
If you don’t use a skill, it’s like a gene a species doesn’t need anymore, it will atrophy.
Is that bad and if yes, why? Skill atrophy is not intrinsically bad. I don’t know how to make tinted glass for church windows and I will never learn it because there are machines doing it now.
But I would for example think that critical thinking would be a catastrophic skill atrophy. As far as I know, there is no proven link though (and one would have to define what is “critical thinking” in the first place). Writing assembler without any autocomplete, I’m not so sure it’s such a problematic skill atrophy.
As far as I’m concerned, so long as we can be happy with AI we can run locally, AI is no different to the rise of scripting languages or the pocket calculator. It’s only problematic if the calculator is rented to you as a service.
Since automatic memory management became a thing, knowledge of memory management and pointers has atrophied[1] across the workforce (although not nearly to the same degree).
I think the pattern here is that compilers almost always output better machine code than humans, while automatic memory management doesn't output better machine code than skilled humans can (especially with modern languages that give you a lot of build-time safety checks).
And even then, there is still demand for assembly knowledge in the workforce, it is just very niche.
I don't think LLMs will ever be good enough to "almost always" output better code than humans. But, like automatic memory management, it will likely make some types of programming more niche.
The key thing here is that compilers are deterministic, and deterministic tools have way less variance in output quality. Automatic memory management is not as deterministic as a compiler because it happens at runtime. LLMs output build-time code, but the output can be drastically different if I sneeze too hard.
[1]: as in % of the workforce, not absolute numbers. Hard to get exact figures on this, but I think we have more experienced people actively using Assembly today than we had before compilers became the default (late 80s). We probably have more active C/C++ programmers today than before Java became popular (early 2000s).
I just did a big refactor with opus, it went ok, some bugs. The normal stuff. One of the bugs was in a part of the code no longer needed, which Opus had just filled with comments more or less. Asking it to fix the bug worked, but then I really looked at the code and realized just that: this is pointless now.
I’ve only been coding for 20+ years so I might be more susceptible than the author, but I’m quite terrified about losing skills in writing code, but also designing good structure, coherency and system overview. These are the things people claim you need more of with LLMs, but is what you outsource the most, even if you think you are describing it in detail.
We are all collectively growing the skill of complacency and laziness though, and those are not great ”skills” to have. And I’m just as guilty as anyone.
Yes, some skills will atrophy, but the learning curve for LLMs is also steep and you will acquire new skills that will pay off the costs many times over.
We see this in discussions like these where you have people running the gamut from using them as glorified auto-complete or babysitting them (usually a net loss in speed, though it'll feel less draining) via people running multiple agents in several different tabs (a gain) to people prompting for harnesses rather than tasks, and putting the agent in the resulting harness (where the multipliers come in) and even people at the peak of experience with them today are only scratching the surface.
I'm very aware that just as my assembler skills are not what they once were, my skills in the languages I'm now writing less will not be what they once were a few years down this line.
But I produced far more before I started using LLMs through the force multiplier of modern languages and frameworks than I did in assembler in the 80's and 90's, and I produce far more now with LLMs, and I will produce even more in the future by learning how to take advantage of new capabilities.
I have Claude refining a system that wasn't tractable a few years ago in another terminal as I'm writing this. I don't care if it would take me a bit longer to get back up to speed on a C codebase again if I was stripped of all access to LLMs any more than I care if it'd take me a bit longer to get up to speed on programming assembler on a Commodore 64.
I usually start a task with an LLM and then do small refactors using the LLM and then do some manual refactors before I am done. But often for more complex tasks the manual refactors are quite large.
Maybe it is because they can read walls of text so easily, so they output walls of text that are hard to read for humans.
I feel quite sad because a lot of my fellow colleagues are not putting this extra effort in to make things easier to understand by humans. PR review is basically me just doing this extra effort for them and their LLM implementing my comments.
And that is when I can even pinpoint the bad taste in the code structure, sometimes it is not something you can easily describe in a PR comment besides "no human would structure the code like this".
At the end of the day there are goals achieved with coding. Coding is a tool to reach either your business needs or some personal aspiration.
When it comes to businesses, I don't think a business cares whether you used the best stack possible or wrote it in assembly, as long as it works. Judging from the biggest coding drivers out there, most of the code produced globally and the biggest apps out there have had skilled engineers writing code, but it's not always perfect. As long as it works. Let's not forget that the web is built in PHP and JS.
So again, my question is: are you atrophying a skill that is still going to exist in the next 1 to 2 years, or is everything going to shift towards LLM code writing?
Personally I think that LLM code writing is the winner, whether we like it or not: it accelerates business objectives, which at the end of the day is the deciding factor.
And yes I do miss the days I was writing code and I was solving complex problems myself.
This is your opinion and I even share it, but there are many people here for whom writing the code was/is the whole deal. You would not have languages and heck - even editors! - holy wars otherwise.
Could you elaborate on this steep price that you have in mind? What does it consist of?
Technical debt due to accumulated excessively verbose, badly architected, often redundant, feature-bloated code which always looks good, even upon earnest review, but actually sucks and becomes extremely difficult to maintain in ways which are not obvious in code review. The issue is this: your tooling can help, and can make you feel better, and you might think you wrote all the prompts and made all the tools to mitigate these issues, but you haven't. If you're not consistently seeing it generate code that is very very close to the way a skilled senior dev such as yourself would have done it (with similar line count, etc), that is a red flag even if the code looks great and works.
I can only judge from my own experience, but with or without LLMs, these are the codebases that I have worked with during most of my career. To me, much of the question is whether LLMs produce worse code than my colleagues and I have in the past, and I don't think that's the case. It is however very common that people hold LLMs to a higher standard than human colleagues, and then it's not a useful comparison.
It came up with a correct LC-hard tier solution that involved dynamic programming, and was essentially an unreadable dense mess that was impossible to reason through as a human.
It worked, but it was so bad, that I sat down and realized after a bit that with maintaining a small cache, and being very particular about how the nodes are traversed, I reduced the solution to like a 10 line modified DFS, that I could understand.
I do the same with the LLM. I tell it that solution is convoluted and hard to understand, if I have a concrete suggestion I suggest one, otherwise I ask it for ideas. We get there just like I do with humans
This is in the interest of big AI companies: if they quasi-monopolize the skills entire sectors of the economy need in order to function, that will be great (for them).
Everyone keeps comparing this to compilers, but I don’t need a multiple-hundred dollar subscription to use LLVM. And people didn’t stop understanding how computers work either, just because they used C. And yeah, maybe local LLMs will become the norm, and I hope so. But market forces (hardware prices) certainly are working against that right now.
But we could build much better tooling around keeping the agents honest. The problems you are describing are absolutely real and I see them every day.
One friend of mine had almost a mental breakdown when he just went ahead and drilled a bug producing Claude to the point that it itself admitted it was “a piece of shit”. He knew that arguing with an LLM agent is more than useless, but it was cathartic for sure.
When I encounter a situation like this I always go down to - have I done everything I could to catch these errors in my automated validation, and update it as needed.
Agents are also more than happy to spend tokens refactoring. Once such a test harness is good enough, producing successively better and more general abstractions is quite easy.
The old rule of thumb of “make it work, make it fast, make it pretty” still applies, just with much much faster iteration speed.
It seems with agents people have forgotten the last 2 steps since they produce a _working_ solution, and it might be hard to justify spending time “cleaning it up”, but this still remains essential.
I hear what you're saying but I'm not sure I buy it in the context of this thread (a response to someone who is 54 and has been coding since they were eight).
I am in a similar boat, having been coding full-time for forty years. The way I use the current tools is that I own all architectural and design decisions but let Claude Code fill in the blanks. I reckon the quality of the output is about 90% of what it would have been had I done everything myself, but I get a lot more done (easily 3-5X).
Will I forget how to write a "for" loop just because I haven't been writing many of them by hand lately? Those skills are so deeply ingrained that I seriously doubt it. I can ride a bike after a multi-year break, or converse in a language I haven't regularly spoken for several decades. Or write using pen & paper even though I hardly ever do it. I don't see why coding would be any different.
I also am not about to forget how to for(;;), that said, as a result of some years invested in aligning old pre WGS84 mapping with modern GPS and improving digital mapping, there are fewer people per capita with the skills to navigate via paper maps in the absence of GPS.
Old farts coding since age 8 (in which I include myself with a decade+ over a sprightly young 54) will retain coding skills for as long as they apply them - the fear is that fewer and fewer others will develop and exercise such skills due to AI.
It remains to be seen if that's a bad thing long term.
What I am worried about is us becoming dependent on tools that we as individuals neither own nor fully control, and gradually losing our ability to function without those tools. This, I think, is a huge societal risk.
Seniors will be able to stay in the game much longer than before, mark my words.
When an LLM is making a bad design decision but the engineer doesn't have the experience to spot it AND the consequences don't become apparent until much later (which is often the case) -- it's kinda hard to learn.
But they take a lot longer to reach the same goal for complex tasks, so the difference is still very real, and the cost-savings are still very much a question of how well you manage to characterise the tasks they will do quickly and pick and choose what to use when.
I kind of agree that I think the cheap models will eat away at the moat very effectively, but if it doesn't seem more capable to you, you're not giving it complex enough tasks to see what they can do.
(FWIW, I've burned billions of tokens on each of Deepseek, Kimi, GLM5.2, GPT, Sonnet, Opus, Haiku using the same harness, and we've kept stats on cost per task)
Extraordinary thing to say about the fastest growing company in the history of capitalism. They will soon have access to public markets, essentially unlimited capital, and can build insanely large models that they don't have to make public... ever. They can just use those models to run their business, train better models, eat competitors, etc.
But maybe it's Anthropic that isn't thinking ahead enough - you clearly think you can see around corners with your proclamation. So why do you think they have "little to no chance" of surviving long term?
There is no such thing as unlimited capital. The faster they grow the faster they burn capital. Eventually it will run out.
I recently did a fleetwide upgrade to Zig 0.16. Do I remember every single change from 0.15? No. Do I have to? Also no. Both because I can look it up if I need to, but also because the LLM already does.
If I don't look at a codebase that I myself haven't looked at in a year, I will not recognize some things when I return to it. Is this sense of "atrophy" meaningful when this was a problem long before LLMs came on the scene?
On personal projects, where I wear all the hats (product development, UI, UX, backend, security, server admin, etc) -- absolutely crazy force multiplier. You get a nice suite of backend and e2e tests running, with full business scenarios layered on top of that, agents constantly running to do the coding, another agent on a higher level of reasoning to review that work, and occasionally popping into a competitor's model to review their work just for added comfort -- it feels like wizardry. I am not vibing it, but I wouldn't say I am carefully scrolling through every line. I review what's fundamentally important, especially when it comes to data, overall structure, and large, cross-cutting concerns, but I would be lying if I said no code lands that I haven't read. But I have the security of the test suites and validations, so I pour more effort into that.
It's a nice self-reinforcing loop.
All of this might sound like I agree with you, and to some extent I do, but I am realizing that the apps I built, which came out of the gate like a cannon shot out of hell with tremendous speed and polish, are starting to slow down. Feature adds are getting more complex. My memory is not what it used to be. Each run and pass through the code consumes more of my tokens and limits. I am starting to do less in the same amount of time. Codex did a vertical slice of a feature for me (well defined and well planned). It contained functionality that has historically plagued us developers -- the dreaded time. I used xHigh GPT 5.5. It had obvious bugs, but I wanted the robots to catch them. I popped it in claude (on the new sonnet 5! heyo!) -- Claude caught the bugs. Even said they "immediately stood out". I wondered how this happened. A frontier model from company A was evaluated by a workhorse model from company B. All of this again took massive amounts of usage. And time.
And this is -- best case scenario, perfect world, everything is in perfect alignment.
Now for the work reality.
Multiple product and experience owners. Multiple dev teams. Different enterprise teams support services you rely on. You don't have full unfettered access to frontier models. You have to use copilot, or some other enterprise harness, and if you run out of credits for the month, you are SOL. It's not as good as your claude, you think to yourself, but hey, it's familiar enough, and you have 5k credits left for the month for Opus 4.8, better make the best of it. But now you burned half of them working on that Transactional Bug that was mixing synchronous and asynchronous semantics that the other guy's model should have picked up on. What happened? Maybe he didn't use Opus, maybe he used Haiku, maybe his prompt was bad. Who knows. Gotta fix it. Oh, you gotta reach across the aisle and put in a request to get the Enterprise team to look at this caching inconsistency on user data that you need and is really the source of your race conditions. Tick tick tick. Model limits approaching. You start wondering: if you had just done all this by hand like "in the old days", would you have gotten it done correctly faster? Or at least, cheaper. You'll never know.
Scaling in this sense is not operational (“servers”), but conceptual (“features”).
I don’t want to be a downer but I find many devs are not great at this. Very clever folks, but they tend to not see these issues clearly. They’ll nod and recognize when you talk about separating content from form and the importance of various design principles like high cohesion and loose coupling but completely disregard them once in contact with reality.
Part of the problem, as you nicely showed, is that technology is only a single slice of this problematic pie. Organizations in general are systems as well and they tend to be either badly architected, badly maintained or often both. Some technological issues are downstream from organizational issues and IME those can become rather dominant variables in the equation and no amount of AI - save full AGI taking control of the company - is going to save us from those factors.
The distinction between personal projects and enterprise development is a big one. A severe bug in my personal projects, I fix it on the fly. A bug in a product we have rolled out: nightmare.
This is all to say, we as a company are using AI a lot in all possible corners, but thankfully our leadership isn't schizophrenic and isn't mandating everyone hit token limits or whatever, it's more of a "Let's see what works and what doesn't" type of thing, and we measure a lot of statistics. Nobody here really cares whether LLMs are the next coming of Christ or not, as a company there are many people (even in SLT) that are indifferent to LLMs, and many who are reasonably hyped.
I wish I could link to the actual document we were all shown since it has a beautiful breakdown of the methodology and a fine-grained breakdown of the stats and the categories measured, but in the grand scheme of things, ALL the AI tooling we have implemented (at least on the engineering side of the equation) has contributed to a total of... drum roll please... 7 (seven) Percent overall productivity increase! The most productive teams saw a productivity increase of around 20%, while some teams actually saw drops in productivity into the negative percentage points. My team, none of us really give a shit about AI and we're somewhere in the 3-5% range on certain categories of tasks, which I'd say is a fairly good assessment.
Productivity here is measured in many ways, including but not limited to speed of MR review and merge times, feature/ticket/roadmap closure/delivery, rollback/revert incidence rate, how often people interact with the MR review bots and implement their suggestions/fixes, how many times people check back on AI transcriptions/meeting notes (hint: Nobody looks back on any of it, it's all just noise that gets generated and never actually referenced outside a few extremely rare cases) and many more things I'm forgetting. It is an imperfect number of course, because measuring productivity in engineering is a sisyphean task, but in my opinion it is accurate to the reality on the ground and outside of all the hype and marketing bullshit.
So, I remain thoroughly unconvinced by these personal anecdotes of people being "massively" more productive. Given that we now have a 2000EUR budget/month/dev for all the AI tooling, those productivity numbers start looking pathetic once you factor in the costs (which are only increasing as the AI companies need to start recouping the gazillions they've burned). Some teams, ours included, have started begging to disable coderabbit and other similar tools in their MRs because they're producing nothing but walls of noise that make reviewing any MR a nightmare of slogging through endless slop of useless bullshit.
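A rough way to sanity-check the cost side of this, as a back-of-the-envelope sketch only. The big assumption (mine, not the comment's) is that a productivity gain is worth the same fraction of what a developer costs the company per month; `break_even_dev_cost` is a hypothetical helper, and only the 2000 EUR budget and the 3%, 7% and 20% gains come from the comment.

```python
def break_even_dev_cost(tool_budget_eur: float, productivity_gain: float) -> float:
    """Monthly developer cost at which the gain exactly pays for the tooling.

    Assumes (simplistically) that a gain of g is worth g * monthly developer cost.
    """
    return tool_budget_eur / productivity_gain


BUDGET_EUR = 2000  # per developer per month, from the comment

for gain in (0.03, 0.07, 0.20):
    cost = break_even_dev_cost(BUDGET_EUR, gain)
    print(f"{gain:.0%} gain pays for itself above ~{cost:,.0f} EUR/month per dev")
```

Under that assumption the tooling only pays for itself when a developer costs the company more per month than the printed figure, which is a high bar at single-digit gains.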
It's like a drug that will give you a few years of great high, and ruin the rest of your life afterwards. Use it by all means, I don't care about your output, nobody here does, you do you.
I do care about my long term skills, which aren't about piping some llm outputs. My employer ain't a dumb fuck who is pushing for llms at all costs as much as possible. Anyway, most of my day work is processes, discussions, pushing things through - llms can't do shit here, it's personal conversations, connections, often physical contact to get things done on time. Startup world would be different but I am as far from that unstable environment as I am from say gaming industry, just not worth my time outside SV area.
So I just use llms to verify my coding results, they are fine for that, but I do the creativity. It's by far the best part of my software dev work, why the heck would I be automating that away? It's like automating sex away so you can have more time... reading HN or some other way to just waste time, dumb approach from all angles.
Of course this changes if one is working on personal projects, self-employed, small startup etc but most folks here are not in that category.
I've had Claude Code running a /loop for the last week driving down complex crashing bugs in a prototype compiler entirely unilaterally. I occasionally glance over.
A few of those crashing test cases were ones I've spent more than a week trying to track down myself. I have 30 years of experience of doing this.
It's worked 24/7.
So far it has fixed over 500 of them.
Will there be technical debt? Yes. But nothing that remotely compares to the cost I'd have incurred of fixing all of those myself.
It is hard to reconcile those gains without thinking that if people are saying these are not a net gain, they haven't really tried learning how to get the full benefit. If you sit and watch a model work and keep intervening all the time, then sure, they're not going to be a net gain.
(And I say this as someone who agrees with you that it's garbage that these companies are trying to legislate their way into an oligopoly.)
Anthropic has gone past fearmongering and well into terrorism. I think people on Hacker News should not recommend working with terrorist orgs.
Or the largest ad company in the world (Google)?
Skill atrophy is a real thing though; we try to prevent this by having hackathons (for lack of a better word) without AI where I pick something extremely non-trivial and we implement it for fun and profit without AI (which would not matter much, as they are currently bad at these things); the last one was flex paxos for our in-house db with obvious metrics for the end result: data integrity (duh) under failure and performance better than or at least the same as our raft production version.
You’ll never guess what product your clients are looking to replace with their own next.
For now everyone is still sufficiently crap at using AI to need help. We had enough clients trying to build something themselves and then come crying to us.
What is your evidence?
Open weights models are responsible for enabling reams of research on interpretability methods that do just that. And they have facilitated so much collaboration on architecture, inference optimizations, training and steering methods, and other topics that were completely out of reach with closed models like Anthropic's. It's really staggering to me.
“His warns that once powerful models are released openly, companies lose the ability to monitor misuse, revoke access, or update safety guardrails.”
1) the company has device-level control to the degree that they can not only restrict which API endpoints people can connect to but which accounts they use to do so (in which case this already isn't an issue); or
2) they don't, and all bets are off anyway, open weights or not.
Did fearmongers like Amodei say, "Oops, we were wrong! It wasn't that dangerous after all"? No. Of course they didn't.
> "Once the weights of a model are public, they cannot be retrieved. If a model possesses dangerous capabilities, it is permanently out in the wild... We need to consider regulatory frameworks that account for the unique risks of open-source distribution of highly capable frontier models."
It definitely sounds like the kind of thing that ends the world in B sci-fi thrillers.
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
Juggling between all different models/agents is quite simple with Zed.
A caution about OpenCode Go though: the entire company seems to be run by AI, so there are a lot of billing-related issues with zero support. I subscribe anew every month, as I lost money due to a double payment with automatic subscription.
For non coding related tasks I use local models.
P.S. If anyone is interested to read more about my setup, let me know I'll publish a blog post.
Z.AI is a bit wonky, so now I'm moving to Openrouter for Qwen+Kimi+Deepseek+GLM
My summer project is to figure out a proper agentic system where a "big" model does the planning, but automatically uses a cheaper one for the grunt work. Having Opus do config edits is just stupid :)
My company pays for the tokens so I don’t care. Biggest model and max everything. The slight risk of a smaller model making a mistake is more expensive than just running the bigger model all the time.
But I'm playing the long game you see. The tokens will get expensive and the monthly subscriptions will either go away or also get too expensive.
Then companies want efficient token use and cost control - and I'll already know how to do that =)
What sort of hardware are you using to run local models? And how do you use them?
I might just be having fun with models, but I have actually noticed their capabilities vary somewhat, and so my (perhaps vain) hope is that by using both, one can catch each the other's blindspots. It's still unclear to me if that's consistently happening, but I am making substantial progress in my personal and professional projects, so something seems to be working.
I've done variants of this a number of times, but feel like it was generally a waste of my time to then have to compare them and write up which parts I liked or disliked: if the output is something substantial, each will have its pros and cons. Clear-cut wins aren't very common. Of course it could work well if we automated the whole thing with an orchestrator; you just need a model with actual good taste (according to your own preferences) ... so we'll have to compare all the models to find that one
At the same time, I’ve invested in tooling that prints and lints architecture I want, so which model is less of an interesting decision, because the results tend to be very close.
For Opus 4.8, trained with overblown internal dialogue and second opinions - Max effort just burns tokens and wastes time without much value. Spinning wheels.
Now that the ban is lifted, max effort Fable 5 is gonna solve this problem quite neatly. Fable to plan and review, Sonnet for the implementation.
Wait, never mind that. Subscribers will only have Fable for a week.
I am getting things done. I've made major progress on my projects, and even started new ones. My most requested tasks are: code review, brainstorming and research. The fixation on tiny details is exactly what I'm paying for.
I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.
They are often used for reading code though.
To expand on this, while the "big model to write a plan, small model to write the specific code" idea is quite common it trips up on edge cases.
In theory the flow works like this:
- small fast models read lots of code, and pass details to the large model to write a plan
- large model takes those details and writes a detailed plan
- medium models write the code
The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.
If it guesses, the plan usually starts to fall to bits.
If it passes back to the large model, inevitably the large model has to start reading lots of code. In that case you are paying the expensive tokens to read, so you might as well have it write the code too (far fewer tokens are written than are read).
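To put rough numbers on that last point, here is a minimal sketch. Every price and token count below is a made-up assumption for illustration, not a real price-list value, and the hybrid case is deliberately generous: it assumes the medium model only reads the revised plan and never re-reads the code.

```python
# Hypothetical per-million-token prices in dollars (assumptions, not real pricing).
LARGE = {"input": 5.00, "output": 25.00}
MEDIUM = {"input": 3.00, "output": 15.00}

# Hypothetical token counts for one "plan missed an edge case" round trip.
READ_TOKENS = 200_000   # code the large model must read to understand the edge case
PLAN_TOKENS = 2_000     # revised plan it writes
WRITE_TOKENS = 5_000    # code that actually gets written


def cost(prices, tokens_in, tokens_out):
    """Dollar cost of one call given input and output token counts."""
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1_000_000


# Hybrid: large model re-reads the code and re-plans, medium model implements the plan.
hybrid = cost(LARGE, READ_TOKENS, PLAN_TOKENS) + cost(MEDIUM, PLAN_TOKENS, WRITE_TOKENS)

# Direct: large model reads the code and writes the implementation itself.
direct = cost(LARGE, READ_TOKENS, WRITE_TOKENS)

# Share of the direct cost that is just the large model reading code.
read_share = cost(LARGE, READ_TOKENS, 0) / direct

print(f"hybrid=${hybrid:.3f} direct=${direct:.3f} read share={read_share:.0%}")
```

Under these assumed numbers the two options cost about the same, because reading dominates the bill either way. Different read/write ratios or prices would shift the result, so treat it as a way to plug in your own numbers rather than as evidence.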
It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.
I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.
That's not something I understand very well. The less expensive models will quite happily chug away at tasks, if the codebase is well-structured (small files help a lot) and your instructions are clear. In contrast, I've never seen a large model turn bad instructions (instructions that would cause a human to think before starting) into a result I liked. You can run small models almost 10-100x as long for the same price in dollars, which covers a lot of correction and adjustment.
Why does everyone say the trade-offs are rarely worth it?
I think the distinction is here.
I expect my agent to build from product level descriptions. This might include specific special cases that I call out, but will rarely highlight existing special cases or edge cases - they already exist in the code, and I'd expect a programmer to make sure that behavior continues to work.
If a feature hits lots of these edge cases, the weaker model that is reading the code (aka Haiku) won't understand their significance, and will report back to the planning model incomplete or incorrect information.
The planning model (Opus - which hasn't actually seen the code, remember!) will build a plan that is incorrect or incomplete and delegate coding to the mid-level model (Sonnet), which will do its best to make things work without understanding the overall picture.
This is how you end up with slop - for example Sonnet reimplements things that already exist because it found one of the edge cases, but Opus had never known about it because Haiku didn't understand it.
It's possible that the new "agent teams" feature in Claude code can help with this. That keeps each agent alive with its context so they can ask each other things, but I haven't tried that enough to be sure - let alone with the specific model mix like this.
In your case, you are giving the Sonnet model specific instructions for what to implement mindlessly. I'd expect that to work well!
But that's not the same as the agentic workflow many others are using.
I haven't used them in a while so my info may be out of date, but they tended to track whatever models were the best and auto-use them for each task (eg, one for planning, subagent for a code search, other frontier for implementing). Their CLI seemed very well thought out to make you do things "the correct way" -- for instance, `/handoff` instead of `/clear`.
I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.
Playing around to learn the differences is incredibly helpful. Schedule an hour or two for it on one's calendar each week, and save links throughout the week to try out.
Understandable frankly.
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
It makes some sense, as models are trained more and more with reasoning, than without.
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-non...
However, I am also confused about market positioning. It is too expensive for daily tasks (open source models are much cheaper) and it is not a frontier model that can address complex real-world problems.
Rarely used Sonnet btw.
The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.
Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.
There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.
Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.
The graphs show parts of the cost/performance pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 was strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.
So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
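For anyone who wants to check this kind of claim against the measured points only (no interpolation), here is a minimal sketch. The (cost, score) pairs are invented for illustration and are not read off the charts in the post; the function names are hypothetical.

```python
from typing import List, Tuple

# (cost in dollars, score). Lower cost is better, higher score is better.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Points that no other point dominates, sorted by cost."""
    return sorted(p for p in points if not any(dominates(q, p) for q in points))


def set_dominates(xs: List[Point], ys: List[Point]) -> bool:
    """True if every point in ys is dominated by at least one point in xs."""
    return all(any(dominates(x, y) for x in xs) for y in ys)


# Made-up effort-level points for three hypothetical models.
new_small = [(1.0, 50.0), (2.0, 60.0)]
large = [(1.5, 58.0), (3.0, 75.0)]
old_small = [(1.2, 40.0), (2.5, 55.0)]

frontier = pareto_frontier(new_small + large)
print(frontier)
print(set_dominates(large, new_small), set_dominates(new_small, old_small))
```

With these invented points, both `new_small` and `large` contribute to the frontier, so neither dominates the other, while `new_small` dominates `old_small` outright. That is the shape of the argument above, not a reproduction of the actual benchmark data.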
But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.
Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
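The interpolation argument can be sketched as follows. The curves are made-up numbers, the helper names are hypothetical, and the comparison is restricted to the cost range where both curves have measurements, which is the restriction this comment is making. It assumes each curve has at least two points with distinct costs.

```python
def interp(points, x):
    """Piecewise-linear score at cost x; points are (cost, score) sorted by cost.

    Raises ValueError outside the measured range instead of extrapolating.
    """
    if not points[0][0] <= x <= points[-1][0]:
        raise ValueError(f"cost {x} is outside the measured range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def dominates_on_overlap(a, b, steps=100):
    """True if curve a scores strictly higher than curve b at every sampled
    cost in the range where both curves are defined."""
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][0], b[-1][0])
    if lo > hi:
        return False  # no shared cost range, nothing to compare
    xs = [min(lo + (hi - lo) * i / steps, hi) for i in range(steps + 1)]
    return all(interp(a, x) > interp(b, x) for x in xs)


# Invented (cost, score) points per effort level.
big = [(2.0, 60.0), (4.0, 80.0), (8.0, 90.0)]
small = [(1.0, 40.0), (3.0, 62.0), (6.0, 75.0)]

print(dominates_on_overlap(big, small))  # compared only on costs from 2.0 to 6.0
```

For curves that rise with cost, being higher at every shared cost also means reaching any shared score at a lower cost. Note that below the cheapest `big` point only `small` has data, so this check says nothing about that region.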
If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show a "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.
(Anyway, the blog post now has an errata and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)
It was a claim that applies to a range of x-values where both curves are defined.
Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?
You could make it true by artificially dropping some of the data points, but, like, why?
(Again, this is moot given the updated graph.)
> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.
Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.
It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.
I guess I could get Sonnet 5 to do it.
Does anyone else have any review token saving measures?
Assume it to get deprecated sooner rather than later.
I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.
And what (available) model do you trust to go off on its own?
Why would they brag about something like this? It's like they know people want to use models to perform cybersecurity tasks yet knowingly deny them the ability.
And Opus 4.8 is still cheaper for a higher pass rate (much less open weight models like GLM 5.2) so not sure why I'd use Sonnet except on the low effort level for I suppose trivial tasks where I want it to work only 50% of the time judging by the graph. The pricing doesn't really make any sense.
It’s like telling a chef to cook without a knife because knives can kill people.
Dario and his lackeys at Anthropic aren’t visionaries.
I'm sure they're well-aware that this also will make it worse at building secure systems, but the gov't isn't restricting releases based on that.
thats true because their point of view makes no sense for us. dario is all in on lesswrong machine god theory and really believes they need to create a super intelligence before anyone else. that means doing as much as possible to slow down others progress and accelerate your own. but the fact that they believe its the only option doesnt make it true for the rest of us.
Are there some Less Wrong posts or similar I should read that probably explain it?
Fable is effectively not available to the general public in the US either
Everyone dislikes when these models are provided for use by the Department of Defense, but we can likely assume these newer, more capable models are being used by the NSA, FBI, CIA and other Five Eyes agencies to develop more backdoors, hack into more things to spy on us all.
We get drip fed the weaker models, but only once all the 0days have been used against us.
Also, I wouldn’t expect Mythos-class models to be allowed to be openly released by the CCP. Thinking otherwise is pure naivety.
Quite a lot of these models have "safety" (lol) filters in front of them, rather than it being heavily encoded into the weights.
After a certain level of capability you're proposing handing loaded nukes to everyone. There is an end of the road to the "open models are good" argument and that end is when they start turning into cyber super weapons.
Either you think model intelligence will continue to improve or you don't.
If you think it won't continue to improve, sure, open models are great.
If you think it will continue to improve, then we are all fucked if models continue to be open on release.
Also Heretic as it is does not work for GLM5.2 (at least as of 3 days ago when I tested it). You'll need some hybrid approaches.
I am planning to release the steering patch for the GLM 5.2 eliminating pro-CCP alignment in the next few days.
I suppose I shouldn't be surprised at how the Trump admin is approaching AI regulation; counter-productive is really all they do
>Our safety assessments found that Sonnet 5 shows an overall lower rate of undesirable behaviors than Sonnet 4.6, and is generally safer to use in agentic contexts.
which is obviously painting that as a good thing. So reading the next sentence as "in other good news" is reasonable.
This recent government interference is about trying to preserve US offensive cyberwarfare and cyberespionage capabilities. It’s not about “bad actors”. It’s about defensive capabilities becoming pervasive and cheap, which would kneecap US cyberoffensive capability.
It’s like making seatbelts illegal so that police chases can be more effective.
Gemini wouldn't do a security audit. But it came up with a great set of mitigations and identified an extant XSS flaw in the process of improving robustness.
There's an awful lot of good that can come from proactive, defensive use of LLMs. I realize there's also a lot of pain when the difficulty of exploit finding drops suddenly, but in the long term we may all benefit from the defensive side of this.
What exactly do you want Anthropic to say here? "This model, the one we are about to give to the entire world for cheap, is really good at hacking"? Saying Sonnet is terrible at cybersecurity is the most reasonable thing they can say, out of a lot of bad options.
Unless it spams as much as Opus, I doubt it. Opus 4.8 literally spams text like puke. On a longer run especially if you get cache misses here and there the bulk of the cost is all the extra context it adds.
This line as a selling point is also pretty funny:
> Evaluations also show that it has a much lower ability to perform cybersecurity tasks than our current Opus models.
It then hallucinated the submit button class...
In effect, high reasoning only makes sense when you're using the frontier model and need extra performance (higher levels of reasoning are never pareto optimal unless you're at the largest model size).
At least for Claude family models.
e.g.
{
  "reason": "<Describe why you picked this result>",
  "selection": "<The number of the value you selected>"
}
I'm sure native reasoning produces more accurate results, but for my use case the quality was about the same, and the model would reason for thousands of tokens in native reasoning vs just 1-200 with response level reasoning.
Again, to be clear, this is for deterministic/pipeline style workflows, not agentic/coding use.
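A minimal sketch of consuming that kind of response in a pipeline, assuming the model returns exactly one JSON object with the two fields shown above. The function name and the 1-based option numbering are assumptions for illustration.

```python
import json


def parse_selection(raw: str, n_options: int) -> int:
    """Extract the chosen option number from a response-level-reasoning reply.

    Assumes the reply is a single JSON object with "reason" and "selection"
    keys, and that options are numbered from 1 to n_options (an assumption).
    """
    data = json.loads(raw)
    if "reason" not in data or "selection" not in data:
        raise ValueError("reply is missing 'reason' or 'selection'")
    selection = int(data["selection"])  # the model may return "2" or 2
    if not 1 <= selection <= n_options:
        raise ValueError(f"selection {selection} is not between 1 and {n_options}")
    return selection


reply = '{"reason": "Option 2 matches the product name exactly.", "selection": "2"}'
print(parse_selection(reply, n_options=3))
```

Putting `reason` before `selection` in the schema means the short justification is generated before the answer, which is presumably what stands in for native reasoning here. Validating the range keeps a malformed reply from silently flowing downstream.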
I don't get what value you get out of this.
I don't know whether that comes out ahead compared to just staying with the better model in the first place.
I'm sure folks' mileage will vary though.
I keep having to correct 4.8, but 5.5 more often than not is correcting me.
Opus writes a bit nicer though, and it is easier to follow what it is doing/saying. Not too different an experience from talking to humans: 5.5 feels like a very smart 'nerd' that doesn't make a huge effort to communicate well, while Opus is a bit less intelligent but that makes its ideas easier to communicate
Went away on its own.
Sonnet 5's performance is comparable to GLM 5.2 in both one-shot coding and agentic ability. However, it's about 20% less verbose than GLM 5.2 in average code submission size, and it uses fewer reasoning tokens, which reduces the cost gap and suggests it writes cleaner code. In practice, Sonnet 5 ends up being 40% more expensive and ~2x faster than GLM 5.2 in our evaluations (not 300% more expensive as the per-token pricing would suggest). Granted, GLM 5.2 is an extremely reasoning-heavy model.
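A rough sanity check on those two percentages, assuming "300% more expensive" means a 4x blended per-token price and "40% more expensive" means a 1.4x cost per task. Both readings are assumptions, and a single blended price ignores the input/output split.

```python
def implied_token_ratio(price_ratio: float, cost_ratio: float) -> float:
    """Tokens used relative to the cheaper model, given
    cost_ratio = price_ratio * token_ratio."""
    return cost_ratio / price_ratio


# 4x the per-token price but only 1.4x the cost per task.
ratio = implied_token_ratio(price_ratio=4.0, cost_ratio=1.4)
print(f"implied token usage: {ratio:.0%} of the cheaper model's")
```

Under those assumptions the pricier model would be spending roughly a third of the tokens per task, which is at least consistent with the note about GLM 5.2 being very reasoning heavy.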
Overall, it's a solid release that gives Anthropic some standing in the price-conscious inference market.
Data at https://gertlabs.com/rankings
They released Sonnet 5 with a temporary price reduction until August. Everyone was excited, but in reality, the new tokenizer produces 50% more tokens. As a result, the actual cost went up by 50%, while they shifted everyone's attention to the decrease.
Thus, Anthropic is raising prices but not telling anyone about it. Nobody is really aware of it. You go to the pricing page, the price looks the same. Yet people are actually paying 50% more.
Very shady marketing.
And of course they lie about 35% again. In reality with coding it is 50%.
UPD: I run playcode.io, so it’s my job to test all models, their pricing, and their quality in order to provide the best price/quality/speed/reliability to non-technical users.
I keep specific branches in a state where they are ready to develop new features.
Claude is a series of models (Claude Sonnet X, Claude Opus X, etc.), Claude Code is their development CLI that uses their models, and Codex is the same as Claude Code but from OpenAI.
Ultimately the quota is linked to neither of those 3 directly, rather to which specific model you invoke.
I also like that the difference between low, medium, high, xhigh seems more spread, which is actually a good thing for people trying to tune applications. Running Sonnet 5 on low with the launch pricing makes this potentially a better fit than Haiku or open source models for some tasks. I don't think it will make sense at full price.
I'm talking back-end, with database models, classes, queries, accompanying front-end layouts, with real dynamic data, running. Stuff that takes days to weeks to spin up, with minimal errors or issues, having cut down on days or weeks of effort, you can focus on testing and making it all into better code.
In that, it seems sonnet 5 on high costs more than opus 4.8 at a lower pass rate. Am I reading this correctly?
Edit: It looks like the key value proposition of the updated model is that it is much better than Sonnet 4.6.
Whereas Sonnet 5 delivers great value (by browsercomp benchmarks and compared to opus) when running in low and medium.
So: Sonnet 4.6 should ~never have been run for low, medium or high when Opus 4.8 has been available. Whoops, I think I have some skills that delegate easy stuff to Sonnet.
---
I remember Anthropic pivoting everyone's default model to Opus but had not seen it put so starkly before.
I am a bit confused on the subscription `/usage` screen. It splits out sonnet usage, and I'd presumed that would have contributed to a lower use of subscription Quota.
But if this is correct, Sonnet usage was basically like smoking unfiltered cigarettes.
Sort of like, getting an automatic upgrade at a car rental or hotel if there is availability.
But isn’t Fable supposed to be another step change? I never used it, myself.
Tbh, at this point I think top tier models are smart “enough” (I’m sure this will look antiquated in a year), and the way to give me MORE noticeable improvement is to make them much faster rather than much smarter. Or even a way to automatically and accurately pick faster models when it makes sense. I know that IDEs have Auto modes, but it’s not something that I trust right now to pick smart+fast instead of picking “maybe smart enough”+”cheaper for harness owner”
It's also still just prone to the kind of "stupid" mistakes we see from all LLMs. Like it can write great code, but it doesn't really have common sense without enormous guidance.
I struggle to understand where this model fits in. If I need a cheap model for simple stuff (like, summarizing an email); I'd go Haiku (actually, I'd go Deepseek v4 Flash, but you catch my drift). I just can't think of many tasks where I'm like "yeah let me reach for Sonnet Low Reasoning so I can save a dollar but also seriously run the risk of it failing"; I'd just reach for Opus Low.
Low and maybe medium will save money on simpler tasks, but after that it just isn’t worth it compared to Opus.
I wish they would have explained in the blog post why they think anybody would ever want to use this above medium.
Maybe it works well on things that aren’t clear in the benchmarks.
In my early tests tonight, Sonnet 5 is a LOT better out of the box. It's one-shotting complex instructions. It also recovered independently from bad instructions that led to an uninformative 400 error by using its schema-fetching tool to figure out there was too much input.
If I have to gripe about something: it interpreted another impossible instruction by quietly discarding the input in question. But, the way it did it is... kinda exactly what anybody else would do, if they weren't in a position to change the implementation.
This is, obviously, early days but I'm impressed.
In other words, for certain tasks, Opus 4.8 is cheaper than Sonnet 5, and does better than Sonnet 5.
I've noticed this pattern on a lot of benchmarks. You can try to emulate a bigger model by ramping up the test time compute (max reasoning, more turns, model fusion etc.), but you can't reach the same quality level, and you often exceed the cost you would have paid by just using a bigger model.
tldr: if you're doing something hard, just use a bigger model.
Sonnet is dumber and more expensive than Opus.
The token efficiency improvements in Opus are missing in Sonnet. Sonnet generates more output tokens and more reasoning tokens.
Any price advantage per token disappears due to volume.
It doesn't make sense to use Sonnet if you have access to Opus.
or
The Dodge Charger is built to be the most Charger like car yet.
Sonnet is slower due to much higher output and reasoning token generation.
Bro that is financial engineering, not real revenue growth. They engineered the switch to usage based pricing and a price hike timed the quarter before they wanted to go public, long enough to juice their numbers but not long enough for them not to be able to manage backlash and have to walk things back. Then they tried to extrapolate that manufactured bump to make it look like they have record shattering revenue growth.
"They took my shit away!" -- 3-day Fable 5 addicts (me)
"How dare they tell Trump no?" -- US nationalist / "my country right or wrong" types
"Great to see a closed source company fail!" -- open source boosters
"Great to see an American company fail!" -- anti-US, and/or pro-China folks
"Great to see a successful company fail!" -- anti-capitalists and/or sour-grapes crab bucket types
"Serves you right for ripping off creators!" -- copyright warriors
"They keep silently nerfing the models!" -- secret downgrade conspiracy theorists
"Quit killing the planet!" -- anti-datacenter advocates
Which is a bit of a bummer considering they do genuinely make the best model that's most pleasant to work with in my opinion.
I don't agree with your framing that all negativity is from crazies
It feels like your analysis is mostly spot on, it's the confluence of several motivated parties pouring effort into social media.
Many of the posters are pro-foreign models/pro-open source, and most can't distinguish the difference between "open source" and open weight models like Qwen, Minimax, or GLM.
Reminds me of the old "free as in beer" vs "free as in speech" debate. Free beer means you don't pay, but you don't get to see the recipe or change it. Free speech means you get the actual source and the right to study it, modify it, and redistribute it.
Open weight models are basically the beer version. You can download the weights, run them locally, fine-tune them, quantize them, host them on your own boxes — but what you have is a finished product, not the blueprint for how it was built.
Qwen is also censored - although since it's open weight, there are completely uncensored versions available.
The owners of Qwen can't jack up the prices to something I'm unable to pay. They can't take it away.
The owners of Qwen can't log and train on my data.
Open weight models share far more in common with free speech than free beer.
If big daddy Dario and his company are getting pushback, it's not because of some motivated group trying to take them down. They brought it on themselves.
"Sonnet 5 is an upgrade to Sonnet 4.6, but it uses an updated tokenizer that changes how the model processes text to improve performance (this is similar to the tokenizer change we introduced with Claude Opus 4.7). The tradeoff is that the same input can map to more tokens: roughly 1.0–1.35× depending on the content type. The introductory pricing is set so that the transition to Sonnet 5 is roughly cost-neutral."
If we trust them, then it is roughly the same as sonnet 4.6
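The "roughly cost-neutral" claim is easy to check with arithmetic. The sketch below uses the 1.0–1.35× token range from the quoted announcement and the 1.5× figure claimed elsewhere in this thread. The size of the introductory discount is not given in the quote, so the code only computes the discount that would be needed to break even.

```python
def break_even_discount(token_multiplier: float) -> float:
    """Per-token price cut needed so the same input costs the same
    after the tokenizer change: m * (1 - d) = 1, so d = 1 - 1/m."""
    return 1.0 - 1.0 / token_multiplier


def cost_multiplier(token_multiplier: float, discount: float) -> float:
    """Change in cost for the same input, given token inflation and a price discount."""
    return token_multiplier * (1.0 - discount)


for m in (1.0, 1.35, 1.5):
    print(f"{m:.2f}x tokens -> break-even discount {break_even_discount(m):.1%}")
```

So a worst-case 1.35× inflation needs about a 26% price cut to stay neutral, and 1.5× would need about 33%. Once a temporary discount ends, the cost for the same input goes up by the full token multiplier.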
Today sonnet 5's med level effort is equivalent to sonnet 4.6 low level effort :/
Based upon the "Agentic Computer usage", Sonnet 5 Max was going to be off "Agentic Search results" chart. lol ...
In short, Sonnet 5 Low/Medium is more cost efficient if it's a task below Opus 4.8 Medium. For the rest it's expensive and you're better off using Opus 4.8.
Why even release this model?
You are reading too much into the graph and ignoring the threshold of usefulness for real world tasks. By that logic Sonnet 4.5 would have never been worth using.
For the rest the gap in pricing vs efficiency is so small, that there is no point in using Sonnet. I am looking at their own cost comparisons vs efficiency...
I use Haiku a lot for agent workflows, if I can get better output at similar prices, Sonnet 5 will replace it completely.
Unfortunately that means I won't be using it at work for now.
Claude Code generates more revenue than OpenAI...It appears to be a nice meme.
Not true
> model that's mostly worse while being more expensive
Not true
> they can ride a wave of misinformation.
Not true
cool to see, still waiting for models to get better at computer use.
It seems being incompetent is a feature now...
Okay.
And yet, the $2-$5 section is the widest, even though it only contains a single point.
I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD
- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement of the prior model?
- Would it be more useful to move toward a comparative rather than absolute ranking?
None of the other labs are doing this kind of long lived two model series.
Oh Claude you master of software engineering does it ever end? DO you have no bounds?
How may we further assist you oh Claude?
loads of trust me bro benchmarks
financially incentivized comments and upvote/downvoting patterns
it's all slop
I'd generously assume this is something about the specific category of agentic task presented in the chart... but it does raise the question "then why is that category the one they chose to highlight here".
Agentic search is a different story, but even there it still dominates 4.6 (as in, for everything Sonnet 4.6 can do, Sonnet 5 can do it as well or better at the same or lower cost).
Yes, Opus 4.8 dominates Sonnet 5 over its entire range in both categories, but Opus's lower range is limited and there is a valid regime on the lower end where Sonnet 5 use makes economic sense. This is not the case for Sonnet 4.6 where Opus 4.8 dominates it completely on both charts.
Edit -- reading your response closer I think we're saying the same things, maybe just disagreeing on whether that lower end is valuable or not.
Claude Sonnet 5
https://www.anthropic.com/news/claude-sonnet-5