Why HTMLParser Becomes html_parser, Not h_t_m_l_parser

Converting HTMLParser to snake_case should produce html_parser, not h_t_m_l_parser. Many case converters get this wrong: they apply a naive regex that can't tell an acronym from a series of separately capitalized words.

This sounds like a small detail. It isn't. If you're refactoring a codebase, normalizing JSON keys, or migrating database column names, mis-tokenizing acronyms produces output that's not just ugly but actively broken — symbol names that don't match anywhere else in the system.

Here's why correct tokenization matters and how a good case converter handles it.

The naive approach

The simplest way to convert PascalCase or camelCase to snake_case is to insert an underscore before every uppercase letter (except the first), then lowercase everything:

// Naive approach (JavaScript)
function toSnakeCase(input) {
  return input
    .replace(/([A-Z])/g, '_$1')
    .toLowerCase()
    .replace(/^_/, '');
}

toSnakeCase('userName');     // → 'user_name' ✓
toSnakeCase('parseRequest'); // → 'parse_request' ✓
toSnakeCase('HTMLParser');   // → 'h_t_m_l_parser' ✗
toSnakeCase('getXMLData');   // → 'get_x_m_l_data' ✗

The naive approach works fine when every uppercase letter signals a word boundary. But acronyms (HTML, XML, URL, API) are runs of multiple uppercase letters that should be treated as a single word.

The correct tokenization rule

A correct tokenizer recognizes three patterns:

1. Lowercase → uppercase transition: word boundary. userName splits as user | Name.
2. Uppercase run followed by lowercase: all but the last letter of the run form one word (an acronym), and the last uppercase letter starts the next word. HTMLParser splits as HTML | Parser.
3. Existing separators (underscore, hyphen, dot, space): word boundaries.

Rule 2 is the tricky one. We don't want to split HTMLParser as HTMLP | arser. The tokenizer has to look ahead: when an uppercase letter is followed by a lowercase letter, that uppercase letter starts the new word.

The two regex passes

The standard implementation uses two substitution passes:

// Correct tokenization (JavaScript)
function smartTokenize(input) {
  return input
    // Pass 1: Insert space at lowercase-to-uppercase boundaries
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    // Pass 2: Insert space between ALLCAPS run and following Capitalized word
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2')
    .split(/\s+/)
    .filter(Boolean);
}

smartTokenize('userName');        // → ['user', 'Name']
smartTokenize('parseRequest');    // → ['parse', 'Request']
smartTokenize('HTMLParser');      // → ['HTML', 'Parser']
smartTokenize('getXMLHttpRequest'); // → ['get', 'XML', 'Http', 'Request']
smartTokenize('iPhone');          // → ['i', 'Phone']
smartTokenize('macOS');           // → ['mac', 'OS']

Pass 1 handles the simple case: a lowercase letter (or digit) followed by an uppercase letter means a word boundary. This gets us user | Name from userName.

Pass 2 handles the acronym case: an uppercase letter followed by another uppercase letter followed by a lowercase letter means the second uppercase letter belongs to the new word. This gets us HTML | Parser from HTMLParser.

Apply these in order. Pass 1 doesn't help with HTMLParser (no lowercase-to-uppercase transition exists). Pass 2 alone can't handle userName (no uppercase-uppercase-lowercase pattern exists). Both passes together handle all the common cases.
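To go from tokens to an actual snake_case string, the two passes can be combined with separator handling and a lowercase join. The following is a minimal sketch under those assumptions; tokenizeIdentifier and toSnakeCaseSmart are illustrative names, not part of any particular library.

```javascript
// Sketch: snake_case built on the two regex passes described above.
// Function names are hypothetical and chosen for this example only.
function tokenizeIdentifier(input) {
  return input
    // Existing separators (underscore, hyphen, dot, space) become boundaries
    .replace(/[_\-.\s]+/g, ' ')
    // Pass 1: lowercase letter or digit followed by an uppercase letter
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    // Pass 2: end of an uppercase run followed by a Capitalized word
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2')
    .split(/\s+/)
    .filter(Boolean);
}

function toSnakeCaseSmart(input) {
  return tokenizeIdentifier(input)
    .map((token) => token.toLowerCase())
    .join('_');
}

toSnakeCaseSmart('HTMLParser'); // → 'html_parser'
toSnakeCaseSmart('getXMLData'); // → 'get_xml_data'
toSnakeCaseSmart('parse-url');  // → 'parse_url'
```

Normalizing separators to spaces first keeps the two passes unchanged, so input that is already snake_case or kebab-case passes through the same pipeline.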

Even more edge cases

The two-pass approach handles most input correctly, but there are still edge cases:

Trailing acronyms

parseHTML should split as parse | HTML. Pass 1 handles this: lowercase e → uppercase H is a boundary. parse | HTML. Good.

Single-letter words

AClass should split as A | Class. Pass 2 catches this: A + Class means uppercase + (uppercase + lowercase) — boundary between A and C. Good.

Numbers

Should parseUtf8 split as parse | Utf | 8 or as parse | Utf8? Most conventions treat the number as a suffix of the preceding word: parse | Utf8. In regex terms, a lowercase-to-digit transition (/([a-z])([0-9])/) is not a boundary, but a digit-to-uppercase transition (/([0-9])([A-Z])/) is. The exact choice depends on the tokenizer's rules.

Our case converter includes [a-z\d] in pass 1, so digits act like lowercase letters for boundary detection. This produces parseUtf8 → parse | Utf8, which is the common expectation.
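The effect of including \d in pass 1 is easiest to see side by side. The sketch below compares two hypothetical variants of the tokenizer that differ only in that one character class; both function names are made up for illustration.

```javascript
// Variant A: digits behave like lowercase letters in pass 1.
function splitDigitsAsLowercase(input) {
  return input
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2')
    .split(' ');
}

// Variant B: pass 1 only looks at lowercase letters, so a digit
// followed by an uppercase letter is not treated as a boundary.
function splitIgnoringDigits(input) {
  return input
    .replace(/([a-z])([A-Z])/g, '$1 $2')
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2')
    .split(' ');
}

splitDigitsAsLowercase('parseUtf8');   // → ['parse', 'Utf8']
splitDigitsAsLowercase('utf8Decoder'); // → ['utf8', 'Decoder']
splitIgnoringDigits('utf8Decoder');    // → ['utf8Decoder']
```

Both variants keep the digit attached to the preceding word; only variant A also starts a new word after the digit.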

Unicode

The regex above uses ASCII [A-Z] and [a-z]. For names with diacritics or non-Latin scripts (résuméParser, καλήμέρα), you need Unicode-aware character classes. Our converter uses extended Unicode ranges to handle European Latin and common scientific characters.
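One way to make the two passes Unicode-aware in JavaScript is to swap the ASCII classes for Unicode property escapes (\p{Lu}, \p{Ll}, \p{Nd}) with the u flag, available since ES2018. This is a sketch of that alternative, not a description of how our converter is implemented; tokenizeUnicode is a hypothetical name.

```javascript
// Sketch: the same two passes using Unicode property escapes.
// \p{Lu} = uppercase letter, \p{Ll} = lowercase letter, \p{Nd} = decimal digit.
function tokenizeUnicode(input) {
  return input
    // Compose accents so "é" is one lowercase letter, not "e" plus a combining mark
    .normalize('NFC')
    // Pass 1: lowercase letter or digit followed by an uppercase letter
    .replace(/([\p{Ll}\p{Nd}])(\p{Lu})/gu, '$1 $2')
    // Pass 2: end of an uppercase run followed by a Capitalized word
    .replace(/(\p{Lu})(\p{Lu}\p{Ll})/gu, '$1 $2')
    .split(/\s+/)
    .filter(Boolean);
}

tokenizeUnicode('résuméParser');    // → ['résumé', 'Parser']
tokenizeUnicode('ÉtatHTTPClient');  // → ['État', 'HTTP', 'Client']
tokenizeUnicode('имяПользователя'); // → ['имя', 'Пользователя']
```

With the ASCII-only classes, résuméParser would stay a single token, because é is not in [a-z] and so the é → P transition is never seen as a boundary.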

The "ALLCAPS as word" convention in modern code

Google's Java Style Guide and Microsoft's .NET guidelines both recommend treating acronyms as words in identifiers (Microsoft makes an exception for two-letter acronyms such as IO):

HtmlParser, not HTMLParser, in new PascalCase code
htmlParser, not HTMLParser, in new camelCase code
parseHttp, not parseHTTP
readUrl, not readURL

Older JavaScript and browser APIs (XMLHttpRequest, JSON.parse) keep acronyms in all caps because they predate the modern guideline; XMLHttpRequest even mixes both styles, with XML in caps and Http as a word. New JavaScript code increasingly uses the modern convention.

If you're refactoring old code to the new convention, you can paste your old PascalCase names into our converter and switch to camelCase mode. Because the input is tokenized correctly first, HTMLParser comes out as htmlParser.
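The acronyms-as-words convention falls out naturally once tokenization is correct: lowercase every token, then capitalize the first letter of each token as needed. A minimal sketch, assuming the two-pass tokenizer from earlier; all function names here are illustrative.

```javascript
// Sketch: camelCase and PascalCase output that treats acronyms as words.
function tokenizeForRecasing(input) {
  return input
    .replace(/[_\-.\s]+/g, ' ')
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2')
    .split(/\s+/)
    .filter(Boolean);
}

// 'HTML' → 'Html', 'parser' → 'Parser'
function capitalizeToken(token) {
  const lower = token.toLowerCase();
  return lower.charAt(0).toUpperCase() + lower.slice(1);
}

function toPascalCaseAcronymsAsWords(input) {
  return tokenizeForRecasing(input).map(capitalizeToken).join('');
}

function toCamelCaseAcronymsAsWords(input) {
  return tokenizeForRecasing(input)
    .map((token, index) => (index === 0 ? token.toLowerCase() : capitalizeToken(token)))
    .join('');
}

toCamelCaseAcronymsAsWords('HTMLParser'); // → 'htmlParser'
toCamelCaseAcronymsAsWords('parseHTTP');  // → 'parseHttp'
toPascalCaseAcronymsAsWords('readURL');   // → 'ReadUrl'
```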

What our engine does

The transformcase engine uses the two-pass approach plus a few additional heuristics for edge cases. Specifically:

All standard separators (_, -, ., /, space) are word boundaries.
Lowercase-to-uppercase transition is a boundary (pass 1).
ALLCAPS-run followed by Capitalized-word is a boundary at the last uppercase (pass 2).
Digits act like lowercase characters for boundary detection.
Unicode letters (Latin Extended, Greek, Cyrillic) are recognized as letters.
Apostrophes and word-internal hyphens are preserved as part of a single token (so "don't" doesn't split into "don" and "t").

The result: you can paste any reasonable identifier (camelCase, PascalCase, snake_case, kebab-case, dot.case, or ALL_CAPS) and convert it to any of the other styles with the output a developer following that convention would expect.

Test cases worth running

If you're writing your own case converter (or evaluating one), test these inputs to see if it tokenizes correctly:

Input	Expected snake_case
userName	user_name
HTMLParser	html_parser
getXMLHttpRequest	get_xml_http_request
parseURL	parse_url
parse_url	parse_url
parse-url	parse_url
parseUtf8	parse_utf8
iPhone	i_phone
macOS	mac_os
IOError	io_error

If your converter returns h_t_m_l_parser for HTMLParser, or splits any other acronym in this table letter by letter, it's using the naive approach and will mangle real-world identifiers.
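The table above can be turned into a small harness that reports which inputs a converter gets wrong. The sketch below checks both the naive converter and a two-pass converter; the function names and the shape of the failure report are assumptions made for this example.

```javascript
// Test table from above: [input, expected snake_case]
const snakeCaseCases = [
  ['userName', 'user_name'],
  ['HTMLParser', 'html_parser'],
  ['getXMLHttpRequest', 'get_xml_http_request'],
  ['parseURL', 'parse_url'],
  ['parse_url', 'parse_url'],
  ['parse-url', 'parse_url'],
  ['parseUtf8', 'parse_utf8'],
  ['iPhone', 'i_phone'],
  ['macOS', 'mac_os'],
  ['IOError', 'io_error'],
];

// Naive converter: underscore before every uppercase letter.
function naiveSnakeCase(input) {
  return input.replace(/([A-Z])/g, '_$1').toLowerCase().replace(/^_/, '');
}

// Two-pass converter with separator handling.
function twoPassSnakeCase(input) {
  return input
    .replace(/[_\-.\s]+/g, ' ')
    .replace(/([a-z\d])([A-Z])/g, '$1 $2')
    .replace(/([A-Z])([A-Z][a-z])/g, '$1 $2')
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => token.toLowerCase())
    .join('_');
}

// Returns one entry per failing case; an empty array means all cases pass.
function findSnakeCaseFailures(convert) {
  return snakeCaseCases
    .map(([input, expected]) => ({ input, expected, actual: convert(input) }))
    .filter(({ expected, actual }) => actual !== expected);
}

console.log(findSnakeCaseFailures(naiveSnakeCase).map((f) => f.input));
console.log(findSnakeCaseFailures(twoPassSnakeCase).length);
```

Passing the converter in as a function makes it easy to point the same table at any implementation you are evaluating.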

The bigger lesson

Naming conventions only work if the tools we use respect them. A converter that splits acronyms into individual letters is producing technically-valid output that doesn't match what anyone actually wants. The same logic applies to other rule systems — title case that doesn't preserve proper nouns, sentence case that doesn't recognize abbreviations, slug generation that doesn't strip diacritics.

"Correct" is what the developers who follow the convention would write by hand. Anything else is a tool failing at its job.
