Epistemic Status: Obviously highly speculative. I have no inside information. Opinions lightly held.

Claude Mythos was recently previewed, and emphatically not released due to safety concerns regarding its advanced cyberattack capabilities. Very plausibly, this is our first look at the next generation of ~10T+ models, whose training and serving at scale is enabled by the Blackwell, Trainium v3, and TPUv7 hardware generations. The narrative has focused strongly on the cyberattack capabilities, and we appear to have returned to the era of building hype by claiming models are too dangerous to release. Interestingly, this contrasts sharply with the recent GPT-5 and similar announcements, which flopped hype-wise because they were clearly incremental improvements on the existing paradigm, regardless of their technical merits.

Presumably, Claude Mythos does in fact bring dramatic improvements to cyberattack capabilities, and more generally strong improvements to agentic coding. The interesting question for me is where these capabilities come from. Although it is never, as far as I can tell, explicitly stated, the framing of the Claude Mythos release certainly wants you to think that this is due either to some massive research breakthrough or to the next era of pretraining scaling. My suspicion is that these are not the primary driving forces behind the dramatic improvement in cyber capabilities. Rather, my suspicion is that they are driven by Claude Mythos likely being the first model to be explicitly and seriously RL-trained to invent and defend against cyberattacks.

Specifically, my hypothesis is that the vast majority of these improvements came about in the post-training RL phase for Claude Mythos and that direct RL training on cyberattacks became part of Anthropic’s general post-training stack for this run.

Direct RL training to create cyberattacks is not crazy at all. In fact, it is almost the perfect RLVR task. There is an unimaginable diversity of existing open-source software, freely available on the web, on which to practice attacks. It is vastly easier to judge whether a cyberattack succeeded than to create one, so constructing the reward function is relatively straightforward. The agentic code harnesses and sandboxes already exist, having been rapidly built up over the last year or so for RL training on e.g. SWE tasks, making code diffs, implementing end-to-end code projects, and web browsing. Anthropic is already a clear leader here with their focus on code agents, vs e.g. OpenAI, who have focused much more on STEM and mathematics skills. Given their existing focus and where they stood infrastructure-wise after the Claude 4 model series, adding exactly this kind of cyberattack and cyberdefense RL training to their post-training stack is an extremely natural next step for Anthropic.
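To make the generator-verifier asymmetry concrete: in a CTF-style environment of the kind used in public cyber evals, the verifiable reward can be a few lines, because the grader only has to check whether a planted secret was exfiltrated from the sandbox, while the policy has to do all the hard work of actually finding the exploit. A minimal sketch (my own illustration; all names are hypothetical, not any lab's actual harness):

```python
def ctf_reward(agent_transcript: str, expected_flag: str) -> float:
    """Binary verifiable reward for a CTF-style RLVR episode.

    The environment plants a random secret flag inside a sandboxed
    target; success is scored by simply checking whether the flag
    appears anywhere in the agent's transcript. Judging success is
    trivial even though producing the exploit is hard.
    """
    return 1.0 if expected_flag in agent_transcript else 0.0

# Usage: score two hypothetical episode transcripts.
assert ctf_reward("... cat /root/flag.txt -> FLAG{3f2a} ...", "FLAG{3f2a}") == 1.0
assert ctf_reward("permission denied", "FLAG{3f2a}") == 0.0
```

Exactly this check-the-flag structure is why the reward side of the pipeline is cheap relative to the environment-building side.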

More generally, the eval profile from the system card somewhat supports this. There are obviously huge gains on cyber capabilities, pretty substantial gains on SWE-bench, lesser gains in most other areas compared with the existing SOTA, and some areas where it is beaten by existing models. My read here is that:

  • Mythos is in fact likely the first publicly previewed ~10-20T model and represents the next stage of pretraining scaling.
  • Pretraining scaling does what it always does: broad but modest (relative to RLVR) capability gains for exponentially increasing cost (i.e. the scaling laws still hold, which of course they do, since they are functions of the data).
  • For Mythos post-training, Anthropic built upon their already highly developed SWE-agent code and computer-environment stack, which led to strong gains on SWE skills as measured by e.g. SWE-bench and Terminal-Bench.
  • For the first (?) time, they also included a large number of cybersecurity environments, which led to a dramatic improvement on these capabilities compared to when they were not targeted for improvement.[1]
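The scaling-laws point in the second bullet can be made concrete with the standard Chinchilla-style parametric fit, under which loss decreases only polynomially in parameters $N$ and training tokens $D$, so each constant increment of capability costs exponentially more compute:

```latex
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

Here $E$ is the irreducible loss of the data distribution and $\alpha, \beta$ are small fitted exponents (roughly $0.3$ in the Chinchilla fit), which is why a ~10x pretraining scale-up delivers broad but modest gains rather than the kind of step change seen on the cyber evals.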

Generally, whenever we see huge spikes in some specific capability, this is usually driven not by pretraining or by e.g. architecture or optimizer breakthroughs, but by targeted data and environment work, using RLVR to hillclimb that specific capability. My general model of AI right now is that these models have a huge amount of excess capacity just sitting around, and that RLVR is extremely strong at eliciting these capabilities given focused effort and data/environment work. Thus, if a task can be turned into an RLVR environment, and the labs want to focus effort on hillclimbing it, then performance can improve dramatically in a very short amount of time. This happened for math and code olympiads as well as agentic SWE last year, and now Anthropic have also turned their focus to cybersecurity. More speculatively, it seems plausible to me that one driver here is that Anthropic have been regularly assessing their models' cyber capabilities as part of their safety audits; they already had a whole bunch of prebuilt knowledge and experience with cyberattack harnesses and environments, so converting these into RLVR environments ended up being very straightforward.

Generally, I think the discourse around AI still somewhat underestimates the primary role of RL post-training vs pretraining in shaping model performance, and the ridiculous gains you can get from RLVR on basically whichever tasks you choose. This has become super obvious to me only recently as our post-training stack has advanced: even with small models you can rather easily reach capability levels that previously required SOTA 1T+ models, and the primary bottleneck seems to be not model capacity or compute, but simply getting the environments set up and having sufficient cold-start SFT reasoning traces so the model can overcome the initial exploration barrier to RL learning.

More broadly, this means that the cat is already out of the bag regarding proliferation of cyberattack models as capable as Mythos. My strong suspicion is that the constraint is not Mythos-sized base models, but rather Anthropic's cyberattack environments, SFT data, and general sandbox infrastructure for agentic coding RL. Unfortunately, none of this is really gated by compute, and it can be relatively straightforwardly re-created both in other labs and in the open-source community. My very strong suspicion is that with a similar RLVR pipeline, existing open-source base models can likely achieve similar levels of cyberattack capability, and that this will likely happen within a year. If such cyberattack capabilities are indeed extremely dangerous, then the world will certainly experience significant damage and fallout as they are exploited by bad actors. In this sense, Anthropic's Project Glasswing is the correct response – using the brief window before others catch up to harden all existing mission-critical systems – and I commend them for this. However, at the same time, the really dangerous information is not Mythos's weights, but rather the fact that RLVR on cyberattacks works so dramatically well, which can be inferred simply from the Mythos announcement. Even then, this fact is not at all surprising a priori, since of course it works: it is an almost perfect exemplar of an RLVR task, and there is no particular reason to expect models to somehow be flummoxed by cyberattacks given their existing agentic SWE skills. Rather, those SWE skills provided the necessary cold-start capabilities to overcome the exploration barrier for exactly this kind of cyberattack RLVR. The danger instead is that cyberattacks become a general benchmark which labs compete to hillclimb. If this becomes widespread, then we may well be entering a new era of cybersecurity.

  1. Another interesting tidbit from the system card is Mythos's dramatic improvement on “GraphWalks BFS 256K-1M”, which tests the model's ability to use its context window to perform a breadth-first search over random hexadecimal hashes. There is Twitter speculation that this is because Mythos uses a looped architecture. I strongly doubt this. What I suspect instead is, again, that Anthropic started training on similar synthetic ‘context traversal and usage’ tasks in order to improve the model's general long-context understanding. That is, by training on such tasks you force the model to create circuits capable of general retrieval and algorithmic operations over its own context window, which has obvious benefits for agentic SWE-style tasks where you are handling repo-level code contexts. Most likely this was done in an early post-training phase to ‘warm up’ the model prior to the serious agentic SWE RL training, similar to how people train on puzzle and logical-reasoning tasks to ‘warm up’ a model for e.g. olympiad mathematics, which provides surprisingly strong transfer.
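Synthetic ‘context traversal’ tasks of this kind are trivially cheap to generate and grade, which is part of why I find this explanation more plausible than an architecture change. A toy sketch (my own construction, not Anthropic's actual data pipeline) that builds a random directed graph over hex-hash node names and computes the ground-truth BFS answer a model would be graded against:

```python
import random
from collections import deque

def make_graphwalks_example(n_nodes=8, n_edges=12, depth=2, seed=0):
    """Toy GraphWalks-style task: a random directed graph whose nodes are
    hex hashes, a prompt listing its edges, and the ground-truth set of
    nodes first reached at exactly `depth` BFS hops from the root."""
    rng = random.Random(seed)
    nodes = [f"{rng.getrandbits(32):08x}" for _ in range(n_nodes)]
    edges = [(rng.choice(nodes), rng.choice(nodes)) for _ in range(n_edges)]
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)

    # Standard BFS from the root to get shortest-path distances.
    root = nodes[0]
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    answer = sorted(v for v, d in dist.items() if d == depth)

    prompt = "Edges:\n" + "\n".join(f"{u} -> {v}" for u, v in edges)
    prompt += f"\nList all nodes exactly {depth} hops (BFS) from {root}."
    return prompt, answer
```

Grading a model's output is then just a set comparison against `answer`, making this another near-perfect verifiable-reward task at whatever context length you care to scale the graph to.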