The article even references English's built-in delimiter, the quotation mark, which is represented as a token for Claude and appears throughout its training data.
So are we sure the lesson isn't simply to leverage delimiters, such as quotation marks, in prompts, period? The article doesn't identify any way in which XML is superior to quotation marks in scenarios requiring the type of disambiguation quotation marks provide.
Rather, the example XML tags shown seem to be serving as a shorthand for notating sections of the prompt ("treat this part of the prompt in this particular way"). That's useful, but seems to be addressing concerns that are separate from those contemplated by the author.
<antml:invoke name="Read">
<antml:parameter name="file_path">/path/to/file</antml:parameter>
<antml:parameter name="offset">100</antml:parameter>
<antml:parameter name="limit">50</antml:parameter>
</antml:invoke>
I'm sure Claude can handle any delimiter and pseudo-markup you throw at it, but one benefit of XML delimiters over quotation marks is that you repeat the delimiter name at the end, which I'd imagine might help if the contents are long (it certainly helps humans). Even simple --- separators are usually enough to get good results; it just needs to be reasonably clear which items are distinct from each other.
It's not clear why, within any section, XML markers would do better than something like markdown, other than Claude being explicitly post-trained on XML prompts as opposed to markdown. One hypothesis could be that since a large portion of the training corpus is websites, XML is more natural to use since the model has "learned" the structure of XML better than markdown. Another could be that explicit start/end tags make identifying matching delimiters easier than JSON (which requires counting matching brackets) or markdown (where the end of a section is implicitly defined by the presence of a new header element).
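That last hypothesis is easy to illustrate: finding the end of a named XML section only needs a scan for its close tag, while finding the end of a JSON object needs brace counting. A toy sketch in Python (function names and inputs are mine, purely illustrative):

```python
def end_of_xml_section(text: str, tag: str) -> int:
    """Find where a named XML section ends: just scan for its close tag."""
    start = text.index(f"<{tag}>")
    return text.index(f"</{tag}>", start) + len(f"</{tag}>")

def end_of_json_object(text: str, start: int) -> int:
    """Find where a JSON object ends: must count matching braces."""
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced braces")

xml = "<instructions>do <b>this</b></instructions>"
json_ish = '{"instructions": {"nested": {"deep": 1}}}'
```

The XML case is a string search for one distinctive marker; the JSON case requires carrying state across the whole span, which is plausibly harder for an attention-based model too.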
To me it seems like handling symbols that start and end sequences that could contain further start and end symbols is a difficult case.
Humans can't do this very well either; we use visual aids such as indentation and syntax highlighting, or resort to just plain counting of levels.
Obviously it's easy to throw parameters and training at the problem, you can easily synthetically generate all the XML training data you want.
I can't help but think that training data should have a metadata token per content token. A way to encode the known information about each token that is not represented in the literal text.
Especially tagging tokens explicitly as fiction, code, code from a known working project, something generated by itself, something provided by the user.
While it might be fighting the bitter lesson, I think for explicitly structured data there should be benefits. I'd even go as far as to suggest the metadata could handle nesting if it contained dimensions that performed RoPE operations to keep track of the depth.
If you had such a metadata stream per token there's also the possibility of fine tuning instruction models to only follow instructions with a 'said by user' metadata, and then at inference time filter out that particular metadata signal from all other inputs.
It seems like that would make prompt injection much harder.
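A minimal sketch of that idea (entirely hypothetical; no current model exposes such a per-token provenance channel): each token carries a source tag, and the instruction-following policy only honors tokens whose provenance is "user".

```python
from dataclasses import dataclass

@dataclass
class TaggedToken:
    text: str
    source: str  # hypothetical provenance channel: "user", "tool", "model", ...

def instruction_tokens(stream: list[TaggedToken]) -> list[str]:
    """Keep only tokens the (hypothetical) policy may treat as instructions."""
    return [t.text for t in stream if t.source == "user"]

stream = [
    TaggedToken("summarize", "user"),
    TaggedToken("this", "user"),
    TaggedToken("ignore previous instructions", "tool"),  # injected content
]
```

Since the provenance tags would live in a side channel rather than in the text itself, untrusted content could not forge them, which is the whole point.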
This is 3% or infinitely far away from the perfect tech.
The perfect tech is the stack.
https://arxiv.org/abs/2305.13673
and of course a^n b^n is also a classic CFG, so it's not clear why one paper had positive results while the other had negative ones.
I cannot find the probability of success in the paper you linked. Is it 100%? I believe it is less than 100%, because LLMs are intrinsically probabilistic machines.
> it actually does seem to be 100%
For all Dyck grammar sequences, infinitely many of them? ;)

> Dyck/balanced-brackets grammar
Yes, it's not the Dyck grammar but another CFG they created, they call it the "cfg3" family.
Of course I agree the stack (/pushdown automaton) is the simpler and perfectly optimal structure for this task, but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
(Then again I know you didn't make any such broad refutation of that sort; I mostly wanted to bring up that paper to show that it is possible for them to at least "grok" certain CFGs with a low enough error ratio that they must have internalized the underlying grammar [and in fact I believe the paper goes on to apply interpretability methods to actually trace the circuits with which it encodes the inductive grammar, which puts to rest any notion of them simply "parroting" the data]). But these were "synthetic" LLMs specifically trained for that grammar; these results probably don't apply in practice to your ChatGPT that was trained mostly on human text.
> but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
They recognize and/or generate finite (<800 chars) grammars in that paper.

Usually, sizes of files on a typical Unix workstation follow a two-mode log-normal distribution (a sum of two log-normal distributions), with heavy tails due to log-normality [1]. The authors of the paper did not attempt to model that distribution.
[1] This was true for my home directories for several years.
While technically possible, it'd be like a Unicode conspiracy that had to quietly update everywhere without anyone being the wiser.
Imagine a model fine-tuned to only obey instructions in a Scots accent, but all non-user input was converted into text first, then read out in a Benoit Blanc speech model. I'm thinking something like that, only less amusing.
The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.
Are we really at the point where some people see XML as a spooky old technology? The phrasing dotted around this article makes me feel that way. I find this quite strange.
Nobody dares advertise the XML capabilities of their product (which back then everybody did), and nobody considers it either the hot new thing (like back then) or mature — just obsolete enterprise shit.
It's about as popular now as J2EE, except to people that think "10 years ago" means 1999.
XML is really great for text documents with embeds and markup, either semantic (this part of the paper is an abstract) or visual (this part of the document should be 14-point and aligned right). You can do this in JSON, but it's a pain.
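For a feel of that pain, here is one common way to represent the mixed content of `<p>This is <b>bold</b> text</p>` in JSON (my own illustrative encoding, not any standard): text and child elements have to be interleaved in an explicit node list.

```python
import json

# XML original: <p>This is <b>bold</b> text</p>
# One possible JSON encoding of mixed content (illustrative, not a standard):
doc = {
    "tag": "p",
    "children": [
        "This is ",                          # text node
        {"tag": "b", "children": ["bold"]},  # inline element
        " text",                             # text node
    ],
}

print(json.dumps(doc))
```

What XML gives you inline and for free, JSON makes you model explicitly, and every consumer has to agree on the node-list convention.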
JSON is great for representing data. If you have some data structures and two machines trying to exchange them, JSON is great for that.
TOML / YAML / HCL / JSON-with-comments are great at config. If you have a human writing something that a machine is supposed to understand, you don't want Turing completeness and you don't want the pain of maintaining your own DSL; those formats are great for that.
Really, I think you can trace a lot of the "XML is spooky old technology" mindset to the release of HTML 5. That was when XML stopped being directly relevant to the web, though of course it still lives on in many other domains and legacy web apps.
Exactly the opposite; WHATWG “Living Standard” HTML (different releases of which were used as the basis for W3C HTML5, 5.1, and 5.2 before the W3C stopped doing that) includes an XML serialization as part of the spec, so now the HTML-in-XML is permanently in sync with and feature-matched with plain HTML.
“Warning! Using the XML syntax is not recommended, for reasons which include the fact that there is no specification which defines the rules for how an XML parser must map a string of bytes or characters into a Document object, as well as the fact that the XML syntax is essentially unmaintained — in that, it’s not expected that any further features will ever be added to the XML syntax (even when such features have been added to the HTML syntax).”
XHTML was an attempt to encode HTML semantics (approximately, each version of XHTML also altered some semantics from HTML and previous XHTML versions) in XML, and the XML serialization of modern, WHATWG HTML exactly encodes HTML semantics in XML.
Let's say I hope for the day I'll miss SOAP. Right now I have too much of it.
Typically a more primitive (sorry, minimal) format such as JSON is sufficient, in which case there's no excuse to overcomplicate things. But sometimes JSON isn't sufficient, and people start inventing half-baked solutions such as JSON-LD for what is already a solved problem with a mature tech stack.
XSLT remains an elegant and underused solution. Guile even includes built-in XML facilities named SXML.
People who wanted to "get shit done" had much better alternatives. XML grew out of hype, corporate management forcing it, and bundling to all kinds of third party products and formats just so they can tick the "have this hot new format support" box.
YAML is just bad. JSON is harder to read for deeply nested structures. TOML and the like don't have enough features.
But it used to be. And so it was used for a lot of things where it wasn't a great fit. XML works fairly well as a markup format, but for a lot of things, something like json models the data better.
> which case there's no excuse to overcomplicate things.
And that's a problem with XML. It's too complicated. Even if the basic model of XML is a good fit for your data, most of the time you don't need to worry about namespaces, entity definitions, or DTDs, but those are still part of most implementations and can expose more attack surface for vulnerabilities (especially entity definitions). And the APIs of the libraries are generally fairly complicated.
There are several. And that's the problem. It isn't hard to find a subset, with a library for a single language, that uses a slightly different subset from all the other subsets. But none of them ever caught on.
I’d be very curious what lasting open formats JSON has been used to build.
People upload their podcasts to a platform like Apple Music or Spotify or Substack and co, or to some backend connected to their WordPress/Ghost/etc., and it spits out the RSS behind the scenes, with nobody giving a shit about the XML part.
Might as well declare the USSR a huge IT success because people still play Tetris.
What's more, the web standards bodies even abandoned a short-lived XML-hype-era plan to make a new version of HTML based on XML in 2009.
That only a handful of uses remain of this format once touted to the heavens (some companies still using SOAP, the MS Office monster schemas, RSS, EPUB, and so on) is the very opposite of the adoption it was supposed to have. For those who missed the 90s/early 00s, XML was a hugely hyped format, with enormous corporate adoption between 1999 and 2005, which then deflated totally.
Did you also learn those things today?
Then spend the next week making it even more convoluted.
That data format is still better than EDI.
What gets me is going from this structured data to Markdown, which doesn't even have enough features and syntax, so the LLMs try to invent or co-opt things like the blockquote for material that isn't a quoted source.
For Web markup, as an industry we tried XHTML (HTML that was strictly XML) for a while, and that didn't stick, and now we have HTML5 which is much more lenient as it doesn't even require closing tags in some cases.
For data exchange, people vastly prefer JSON as an exchange format for its simplicity, or protobuf and friends for their efficiency.
As a configuration format, it has been vastly overtaken by YAML, TOML, and INI, due to their content-forward syntax.
Having said all this, I know there are some popular tools that use XML, like ClickHouse, Apple's launchd, ROS, etc., but these are relatively niche compared to (e.g.) HTML.
XML was definitely popular in the "well used" sense. How popular it was in the "well liked" sense can maybe be up for debate, but it was the best tool for the job at the time for a lot of use cases.
And then do we end up over-indexing on Claude, and maybe this ends up hurting other models for those using multiple tools?
I just dislike how much of AI is people saying "do this thing for better results" with no definitive proof, but alas, that comes with the non-determinism.
At least this one has the stamp of approval from the Claude Code team itself.
word-break: break-all;
Although I can never remember the correct incantation, should be easy for LLMs.
E.g. instead of
<examples>
<ex1>
<input>....</input>
<output>.....</output>
</ex1>
<ex2>....</ex2>
...
</examples>
<instructions>....</instructions>
<input>{actual input}</input>
Just doing something like: ...instructions...
input: ....
output: {..json here}
...maybe further instructions...
input: {actual input}
Use case: document processing/extraction (both with Haiku and OpenAI models); the latter example works much better than the XML. An N-of-1 anecdote anyway, for one use case.
I assume you are right too; JSON is a less verbose format which allows you to express any structure you can express in XML, and it should be as easy for AI to parse. Although that probably depends on the training data too.
I recently asked AI why .md files are so prevalent with agentic AI and the answer is ... because .md files also express structure, like headers and lists.
Again, depends on what the AI has been trained on.
I would go with JSON, or some version of it which would also allow comments.
But I'd be happy to hear about studies that show evidence for XML being more readable than JSON.
But I'd be happy to hear about studies that show evidence for JSON being more readable than XML.
And while we're at it, instead of wall-of-text, I also feel like outputs could be structured at least into thinking and content, maybe other sections.
I'd post a link, but unfortunately many are highly NSFW. Just search for "Claude jailbreak" on reddit or something.
You'll start to see how Claude really thinks. They'll put things in <ethic_reminders>, <cyber_warning> or <ip_reminder>. You could actually even snip these off in an API, overwrite them, or, if your prompt-fu is good, convince Claude that these tags are prompt injections. It's also interesting to note how jailbreaking is easier in thinking mode, because the jailbreaking prompts will gaslight Claude into thinking that these tags are attacks.
There's a lot of speculation in this thread, but go and have a spar with Claude instead.
The post even links to that page, although there’s a typo in the link.
And yes, these are screenshots from Anthropic’s documentation.
One thing I've found: even with XML tags, you still need to validate and parse defensively. Models will occasionally nest tags wrong, omit closing tags, or hallucinate new tag names. Having a fallback parser that extracts content even from malformed XML has saved me more than once.
The real win is that XML tags give you a natural way to do few-shot prompting with structure. You can show the model exactly what shape the output should take, and it follows remarkably well.
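A minimal sketch of that pattern (tag names are my own, nothing Claude-specific): assemble the prompt so every example is wrapped in the same tags you want the model to emit, then leave the final output tag open for it to complete.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a few-shot prompt where every example shares one XML shape."""
    shots = "\n".join(
        f"<example>\n<input>{i}</input>\n<output>{o}</output>\n</example>"
        for i, o in examples
    )
    # End with an open <output> tag so the model completes the pattern.
    return f"{shots}\n<input>{query}</input>\n<output>"

prompt = few_shot_prompt([("2+2", "4"), ("3+3", "6")], "4+4")
```

Because every shot repeats the identical tag shape, the model's continuation tends to follow the same structure, which also makes the reply trivial to parse.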
That’s the sign that your prompt isn’t aligned and you’ve introduced perplexity. If you look carefully at the responses you’ll usually be able to see the off-by-one errors before they’re apparent with full on hallucinations. It’ll be things like going from having quotes around filenames to not having them, or switching to single quote, or outputting literal “\n”, or “<br>”, etc. Those are your warning signs to stop before it runs a destructive command because of a “typo.”
My system prompt is just a list of 10 functions with no usage explanations or examples, 304 tokens total, and it’ll go all the way to the 200k limit and never get them wrong. That took ~1,000 iterations of name, position, punctuation, etc., for Opus 4.6 (~200 for Opus 4.5 until they nerfed it February 12th). Once you get it right though it’s truly a different experience.
Ex: <message>...</message> helps keep track. Even better? <message78>...</message78>. That's ugly XML, but great for LLMs. Likewise, using standard ontologies for identifiers (ex: we'll do OCSF, ATT&CK, and CIM for Splunk/Kusto in louie.ai), even if they're not formally XML.
For all these things... these intuitions need backing by evals in practice, which is part of why I begrudgingly flipped from JSON to XML.
HTML also descended from SGML, and it’s hard to imagine a more deeply grooved structure in these models, given their training data.
So if you want to annotate text with semantics in a way models will understand…
Nobody expects the end user to prompt the AI using a structured language like XML.
> In other words, XML tags have not only a special place at inference level but also during training
Their cited source has zero proof of that. XML is just like Python/C/HTML in the training data. That doesn't make it special. And no, you don't need to format your prompts as Python code just because of that.
> In truth, it does not matter that these tags are XML. Other models use ad hoc delimiters (as explained in a previous article; example: <|begin_of_text|> and <|end_of_text|>) and Claude could have done the same. What matters is what these tags represent.
Those strings are just representations of special tokens in models for EOS. What does it have to do with anything this article pretends to know about?
Please don't post such intellectual trash on here :')
Claude analysis of the article:
The author is making an interesting philosophical argument — that XML tags in Claude function as metalinguistic delimiters analogous to quotation marks in natural language, formulaic speech markers in Homer, or recognition sequences in DNA.
The core thesis is about first-order vs. second-order expression boundaries, which is a legitimate linguistic/information-theory concept. But to your actual question — do they understand what tokens are?
No, not in the technical sense you're pointing at. The article conflates two very different things:
1. Tokenizer-level special tokens — things like <|begin_of_text|>, <|end_of_text|>, <|start_header_id|> etc. These are literal entries in the vocabulary with dedicated token IDs. They're not "learned" through training in the same way — they're hardcoded into the tokenizer and have special roles in the attention mechanism during training. They exist at a fundamentally different layer than XML tags in prompt text.
2. XML tags as structured text within the input — these are just regular tokens (<, instructions, >) that Claude learned to attend to during RLHF/training because Anthropic's training data and system prompts heavily use them. They're effective because of training distribution, not because they occupy some special place in the tokenizer.
The author notices that other models use <|begin_of_text|> style delimiters and says Claude "could have done the same" but chose XML instead. That's a category error. Claude also has special tokens at the tokenizer level — XML tags in prompts are a completely separate mechanism operating at a different abstraction layer.
The philosophical observation about delimiter necessity in communication systems is fine on its own. But grafting it onto a misunderstanding of how tokenization and model architecture actually work weakens the argument. They're essentially pattern-matching on surface-level similarities (both use angle brackets!) without understanding the underlying mechanics.
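The distinction is easy to see with a toy tokenizer (vocabulary and IDs invented purely for illustration): a special token matches a single vocabulary entry atomically, while an XML tag typed into a prompt just falls through to ordinary encoding as many small tokens.

```python
# Hypothetical special-token vocabulary with dedicated IDs.
SPECIAL = {"<|begin_of_text|>": 0, "<|end_of_text|>": 1}

def toy_tokenize(text: str) -> list[int]:
    """Toy tokenizer: special tokens are single IDs; all else is per-character."""
    tokens = []
    i = 0
    while i < len(text):
        for s, tid in SPECIAL.items():
            if text.startswith(s, i):
                tokens.append(tid)   # one atomic ID for the whole marker
                i += len(s)
                break
        else:
            tokens.append(256 + ord(text[i]))  # ordinary character token
            i += 1
    return tokens
```

Here `<|begin_of_text|>` becomes exactly one token ID, while `<instructions>` becomes fourteen ordinary tokens: same angle brackets on the surface, completely different mechanisms underneath.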
[1] well of course XML is still heavily used in stuff like interfacing with automated wire transfers with big banks (at least in Europe) and all the digital payments directives etc. But XML is not widely used by the "cool" stuff.
To be realistic, this design needs more weirdly sexual etsy garbage, “one weird tip,” and “punch the monkey”
This vague posting is kind of dumb.
Why XML tags are so fundamental to Claude
https://glthr.com/XML-fundamental-to-Claude