Language Tags: Unicode Extensions

01 Jun 2023

Just a curiosity - and probably quite obscure…

Java’s Locale class supports BCP 47 language tags. It also supports the Unicode locale extensions to BCP 47, which are defined in the Unicode LDML (Locale Data Markup Language) specification. LDML is the format used for locale data exchange in the CLDR (and, by extension, in Java).

I have often used BCP 47 language tags in Java (and elsewhere) - for example:

Locale locale = Locale.forLanguageTag("de-CH"); // German as used in Switzerland

But I have never used an LDML-specific tag - or even seen one used anywhere, until I saw this Stack Overflow question: How to use unsupported Locale in Java 11 and numbers in String.format().

First consider this more common language tag:

Locale locale = Locale.forLanguageTag("ar"); // Arabic language
System.out.println(String.format(locale, "Output: %d", 1234567890));

This prints the following:

Output: ١٢٣٤٥٦٧٨٩٠

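That is, it prints the number using Arabic-Indic digits (U+0660 to U+0669), not the Western digits 0-9 (which, confusingly, are themselves often called "Arabic numerals").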

Is that surprising? It is to me, because other language tags I tested such as Chinese (zh, cmn) do not have similar conversions - so, why Arabic, specifically? I assume because it’s a useful piece of locale-aware functionality.
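One way to probe this is to look at the zero digit each locale reports. Java's Formatter documentation describes %d localization as shifting every digit relative to the locale's zero digit, so the sketch below simply prints that zero digit for a few tags. The class name ZeroDigitProbe and the list of tags are my own choices for illustration, and what you see for plain ar may depend on the JDK version and the locale data it ships with.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class ZeroDigitProbe {

    // Returns the zero digit that locale-aware number formatting uses for the given tag.
    static char zeroDigit(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return DecimalFormatSymbols.getInstance(locale).getZeroDigit();
    }

    public static void main(String[] args) {
        // Illustrative tags only; the last two force a numbering system explicitly.
        String[] tags = {"ar", "zh", "de-CH", "ar-u-nu-arab", "ar-u-nu-latn"};
        for (String tag : tags) {
            char zero = zeroDigit(tag);
            // Print the code point so the result is readable in any console font.
            System.out.printf(Locale.ROOT, "%-14s zero digit U+%04X%n", tag, (int) zero);
        }
    }
}
```

A zero digit of U+0030 means plain 0-9 output. Anything else means the formatter will substitute digits from another numbering system.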

Now consider this - our LDML example:

Locale locale = Locale.forLanguageTag("ar-u-nu-Latn"); // what is this?!?
System.out.println(String.format(locale, "Output: %d", 1234567890));

This prints the following:

Output: 1234567890

What is ar-u-nu-Latn?

The -u- is one of the extensions provided as part of the LDML - see Unicode BCP 47 U Extension.

It is followed by nu which refers to numbering systems (see the table in the above link).

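The language tag ends with Latn. It is spelled like the standard BCP 47 subtag for the Latin alphabet (script), but in this position it is the value of the nu key: latn is the numbering system identifier for the Western digits 0-9. Language tags are case-insensitive, so Latn and latn mean the same thing. So this causes our Java number to be formatted using Western digits instead of Arabic-Indic digits.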

In other words, our Java code gives us back the digits we started with - as a formatted string.
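For anyone who wants to poke at the extension itself, here is a small sketch using the standard Locale API: Locale.Builder.setUnicodeLocaleKeyword to build the same locale without a tag string, and getUnicodeLocaleType to read the nu value back. The class and method names (UnicodeExtensionDemo, arabicWithLatinDigits) are made up for this example. It relies on the documented behavior that Locale normalizes extension keys and values to lower case.

```java
import java.util.Locale;

public class UnicodeExtensionDemo {

    // Builds the "ar" locale with the Unicode extension keyword nu=latn.
    static Locale arabicWithLatinDigits() {
        return new Locale.Builder()
                .setLanguage("ar")
                .setUnicodeLocaleKeyword("nu", "latn")
                .build();
    }

    public static void main(String[] args) {
        Locale built = arabicWithLatinDigits();
        Locale parsed = Locale.forLanguageTag("ar-u-nu-Latn");

        // The tag form of the locale built above.
        System.out.println(built.toLanguageTag());
        // The value stored under the nu key of the parsed tag.
        System.out.println(parsed.getUnicodeLocaleType("nu"));
        // Whether building and parsing produce the same locale.
        System.out.println(built.equals(parsed));
    }
}
```

The builder route can be handy when the numbering system comes from configuration, since it avoids assembling tag strings by hand.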

The curious (and somewhat obscure) thing - to me, anyway - is not the language tags themselves, but the fact that the formatter actually uses the ar locale to substitute digits… and that for ar-u-nu-Latn, it doesn’t!

Language tags indicate the language of text, the script it is written in, the dialect being spoken, and so on. They are metadata. Any extra behavior - such as locale-based formatting - is in addition to (and separate from) the tags themselves. I need to remember that.
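To make that separation concrete, here is a rough sketch that first reads a tag purely as metadata and then, as a separate step, hands a locale to the formatter. The class and helper names (TagMetadataDemo, describe, formatWith) and the sample tags are mine, chosen only for illustration. The formatting calls use explicit nu values so the result does not depend on a locale's default numbering system.

```java
import java.util.Locale;

public class TagMetadataDemo {

    // Reads the tag as plain metadata: no formatting behavior is involved here.
    static String describe(String languageTag) {
        Locale locale = Locale.forLanguageTag(languageTag);
        return "language=" + locale.getLanguage()
                + " script=" + locale.getScript()
                + " region=" + locale.getCountry();
    }

    // Formatting is a separate step: the locale only matters once a formatter receives it.
    static String formatWith(String languageTag, long value) {
        return String.format(Locale.forLanguageTag(languageTag), "%d", value);
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-Hant-TW")); // language=zh script=Hant region=TW
        System.out.println(describe("de-CH"));      // language=de script= region=CH

        // Same value, same language, different numbering system requested.
        System.out.println(formatWith("ar-u-nu-latn", 42));
        System.out.println(formatWith("ar-u-nu-arab", 42));
    }
}
```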

Resources: