Language Tags

27 Oct 2016


Some notes I keep misplacing - so I gathered them here. If I misplace this, then all bets are off.

Reference Guides

Language tags - what they are and how to construct them.

Make your tags as short as you need, not as long as you can.

Examples from the guide:

Code - Language - Subtags
en - English - language
mas - Masai - language
fr-CA - French as used in Canada - language+region
es-419 - Spanish as used in Latin America - language+region
zh-Hans - Chinese written with Simplified script - language+script

Overall structure of a tag:

language-extlang-script-region-variant-extension-privateuse

Some notes on each of these can also be found in the Locale javadoc. But note that Java’s use of these terms does not map exactly to the BCP 47 terminology.

A summary:

Tag - Javadoc Notes

language - ISO 639 alpha-2 or alpha-3 language code, or registered language subtags up to 8 alpha letters (for future enhancements). When a language has both an alpha-2 code and an alpha-3 code, the alpha-2 code must be used. The language field is case insensitive, but Locale always canonicalizes to lower case. Example: en (English), ja (Japanese), kok (Konkani)

script - ISO 15924 alpha-4 script code. The script field is case insensitive, but Locale always canonicalizes to title case (the first letter is upper case and the rest of the letters are lower case). Example: Latn (Latin), Cyrl (Cyrillic)

country (region) - ISO 3166 alpha-2 country code or UN M.49 numeric-3 area code. The country (region) field is case insensitive, but Locale always canonicalizes to upper case. Example: US (United States), FR (France), 029 (Caribbean)

variant - Any arbitrary value used to indicate a variation of a Locale. The variant field is case sensitive. Also BCP 47 subtags are strictly used to indicate additional variations that define a language or its dialects that are not covered by any combinations of language, script and region subtags. The variant field in Locale has historically been used for any kind of variation, not just language variations. For example, some supported variants available in Java SE Runtime Environments indicate alternative cultural behaviors such as calendar type or number script. In BCP 47 this kind of information, which does not identify the language, is supported by extension subtags or private use subtags. Example: polyton (Polytonic Greek), POSIX

extension (inc. private use) - A map from single character keys to string values, indicating extensions apart from language identification. The extensions in Locale implement the semantics and syntax of BCP 47 extension subtags and private use subtags. The extensions are case insensitive, but Locale canonicalizes all extension keys and values to lower case. Note that extensions cannot have empty values. Example: key="u"/value="ca-japanese" (Japanese Calendar), key="x"/value="java-1-7"
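
To see how a tag maps onto these Locale fields, here is a minimal sketch using java.util.Locale. The class name TagParts and the describe helper are my own, made up for illustration; the Locale calls themselves (forLanguageTag, getLanguage, getScript, getCountry, getVariant, toLanguageTag) are standard.

```java
import java.util.Locale;

final class TagParts {
    // Splits a BCP 47 tag into the fields that java.util.Locale exposes.
    // Subtags that are absent come back as empty strings.
    static String describe(String tag) {
        Locale locale = Locale.forLanguageTag(tag);
        return "language=" + locale.getLanguage()
                + " script=" + locale.getScript()
                + " region=" + locale.getCountry()
                + " variant=" + locale.getVariant()
                + " canonical=" + locale.toLanguageTag();
    }

    public static void main(String[] args) {
        // The last tag is deliberately mis-cased to show canonicalization.
        String[] tags = {"en", "mas", "fr-CA", "es-419", "zh-Hans", "SR-latn-rs"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + describe(tag));
        }
    }
}
```

Note that Java calls the region "country", and that forLanguageTag quietly drops anything it cannot parse rather than failing, so this is a way to inspect tags, not to validate them.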

Personal note: I’ve used these combinations without issue:

language
extlang
language-script
language-region
language-script-region

I’ve never needed variant, extension, or privateuse.
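
For the combinations I do use, a rough sketch of assembling them with Locale.Builder looks like this. TagBuilder and build are hypothetical names of mine; the assumption is that a null script or region simply means "leave that subtag out", which is how the Builder setters are documented to behave.

```java
import java.util.Locale;

final class TagBuilder {
    // Builds a tag from its parts. Pass null for script or region to omit it.
    // The Builder throws IllformedLocaleException if a part is ill-formed.
    static String build(String language, String script, String region) {
        return new Locale.Builder()
                .setLanguage(language)
                .setScript(script)
                .setRegion(region)
                .build()
                .toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(build("en", null, null));     // language
        System.out.println(build("zh", "Hans", null));   // language-script
        System.out.println(build("fr", null, "CA"));     // language-region
        System.out.println(build("sr", "Latn", "RS"));   // language-script-region
    }
}
```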

How to Choose

How to choose the right language tag?

Lots of helpful notes and guidelines there.

Tools

 
BCP-47 validator - Paste a list of tags into its input field, and it will validate them all for you.
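
If you want a quick local check as well, here is a small sketch that leans on Locale.Builder.setLanguageTag, which rejects ill-formed tags with an IllformedLocaleException. TagCheck and isWellFormed are names I made up. The big caveat: this only checks syntax (well-formedness). It does not look subtags up in the IANA registry, so a syntactically fine but meaningless tag will still pass.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

final class TagCheck {
    // True if the tag is syntactically well-formed according to java.util.Locale.
    // This is NOT a registry lookup: unregistered subtags are accepted.
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) {
            return false; // the Builder treats these as "reset", so reject them here
        }
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] tags = {"en-US", "zh-Hans", "en_US", "abcdefghij"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + isWellFormed(tag));
        }
    }
}
```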

Various Lookup Tools - nice.

The Debate

The BCP-47 “standard” is not prescriptive.  There can be different ways to construct reasonably unambiguous language tags which mean more-or-less the same thing.

cmn - Mandarin Chinese
zh-CN - Chinese as used in Mainland China

(In this case, cmn is much less ambiguous than zh-CN, so I’d want to use that. In fact, I’d probably want tags such as cmn-Hans, yue-Hant, and so on - and maybe even access to a panel of philology and linguistics experts.)

One seemingly never-ending debate which results from this flexibility is:

Can I just use the language subtag, and be done with it, for most of the languages I need?

Let’s say I use en for English. I may quickly decide that this is not good enough.  American English is sufficiently different from British English - and I happen to care about the difference† in my application - and in any other application which receives my data.

So, I will go with en-US and en-GB. Done!

Similarly, I’ll make a clear distinction between French (Parisian?) French and Canadian (Quebecois) French.  So: fr-FR and fr-CA. Otherwise, tabarnouche!

…and German vs Swiss German… you get the picture.

But should we just bite the bullet and standardize on language-region for all our tags? Or can we use just the language subtag, it, for Italian, for example? I just know that the day after we go live with a bare it for Italian, we’re going to wish we’d distinguished it-IT from some other flavor of Italian… The same goes for languages such as ja (Japanese) and he (Hebrew).

And then there are languages where it’s not uncommon to have different scripts for the same language - for example, Serbian: sr-Cyrl and sr-Latn.
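
One thing that may take a little heat out of the debate: a consumer does not have to match tags exactly. As a sketch only, Java's Locale.lookup will fall back from a more specific request to a less specific tag by trimming subtags from the right. TagFallback and bestMatch are hypothetical names, and whether this kind of fallback is acceptable depends entirely on your requirements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TagFallback {
    // Returns the best available tag for the requested one, or null if none matches.
    // Matching trims the request from the right: en-GB is tried first, then en.
    static String bestMatch(String requested, List<String> availableTags) {
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(requested);
        List<Locale> available = new ArrayList<>();
        for (String tag : availableTags) {
            available.add(Locale.forLanguageTag(tag));
        }
        Locale match = Locale.lookup(ranges, available);
        return match == null ? null : match.toLanguageTag();
    }

    public static void main(String[] args) {
        List<String> available = Arrays.asList("en", "fr-CA");
        System.out.println(bestMatch("en-GB", available)); // falls back to en
        System.out.println(bestMatch("fr-CA", available)); // exact match
        System.out.println(bestMatch("de-CH", available)); // nothing suitable
    }
}
```

The fallback only goes one way: data tagged en can serve a request for en-GB, but data tagged en-GB will not be found by a request for plain en.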

My Advice

You can’t predict the future - you can only do your reasonable best to anticipate it. So, assume that whatever you decide, it will need to be changed at some point.

For example, in the context of master reference data stored in a relational database table:

Never try to use language tags as primary keys in your DB.

Instead, assign synthetic primary keys (sequential integers, or UUIDs, or whatever). That way you will stand a fighting chance of managing changes, when they inevitably show up. By all means, add a unique index to the language tag column - just don’t make it the primary key.
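
To make the idea concrete without tying it to any particular database, here is a toy in-memory sketch. LanguageTable, insert, retag and tagOf are all invented for illustration; one map plays the role of the primary key, the other plays the role of the unique index on the tag column.

```java
import java.util.HashMap;
import java.util.Map;

final class LanguageTable {
    private final Map<Long, String> tagById = new HashMap<>(); // synthetic primary key -> tag
    private final Map<String, Long> idByTag = new HashMap<>(); // stands in for the unique index
    private long nextId = 1;

    // Adds a row and returns its synthetic key. Duplicate tags are rejected.
    long insert(String tag) {
        if (idByTag.containsKey(tag)) {
            throw new IllegalArgumentException("duplicate tag: " + tag);
        }
        long id = nextId++;
        tagById.put(id, tag);
        idByTag.put(tag, id);
        return id;
    }

    // Changes the tag of an existing row. The key, and anything that references it, stays put.
    void retag(long id, String newTag) {
        String oldTag = tagById.get(id);
        if (oldTag == null) {
            throw new IllegalArgumentException("no such row: " + id);
        }
        if (oldTag.equals(newTag)) {
            return;
        }
        if (idByTag.containsKey(newTag)) {
            throw new IllegalArgumentException("duplicate tag: " + newTag);
        }
        idByTag.remove(oldTag);
        tagById.put(id, newTag);
        idByTag.put(newTag, id);
    }

    String tagOf(long id) {
        return tagById.get(id);
    }

    public static void main(String[] args) {
        LanguageTable table = new LanguageTable();
        long mandarin = table.insert("zh-CN");
        table.retag(mandarin, "cmn-Hans"); // we changed our minds; the key is unaffected
        System.out.println(mandarin + " -> " + table.tagOf(mandarin));
    }
}
```

Had the tag itself been the key, that retag would have meant touching every row in every table that referenced it.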

† - Requirements, requirements, requirements.  What kind of data am I tagging? A piece of written text, where the script is important? The primary spoken language(s) in a TV show? And so on. How will the data be used?  Will it be passed around to other consumers? How will they need to use it…?