Language Tags
Some notes I keep misplacing - so I gathered them here. If I misplace this, then all bets are off.
Reference Guides
Language tags - what they are and how to construct them.
Make your tags as short as you need, not as long as you can.
Examples from the guide:
Code | Language | Subtags |
---|---|---|
en | English | language |
mas | Masai | language |
fr-CA | French as used in Canada | language+region |
es-419 | Spanish as used in Latin America | language+region |
zh-Hans | Chinese written with Simplified script | language+script |
Overall structure of a tag:
language-extlang-script-region-variant-extension-privateuse
Notes on each of these can also be found in the Locale javadoc. Be aware, though, that Java’s use of these terms does not map exactly onto the BCP 47 terminology.
A summary:
Tag | Javadoc Notes |
---|---|
language | ISO 639 alpha-2 or alpha-3 language code, or registered language subtags up to 8 alpha letters (for future enhancements). When a language has both an alpha-2 code and an alpha-3 code, the alpha-2 code must be used. The language field is case insensitive, but Locale always canonicalizes to lower case. Example: en (English), ja (Japanese), kok (Konkani) |
script | ISO 15924 alpha-4 script code. The script field is case insensitive, but Locale always canonicalizes to title case (the first letter is upper case and the rest of the letters are lower case). Example: Latn (Latin), Cyrl (Cyrillic) |
country (region) | ISO 3166 alpha-2 country code or UN M.49 numeric-3 area code. The country (region) field is case insensitive, but Locale always canonicalizes to upper case. Example: US (United States), FR (France), 029 (Caribbean) |
variant | Any arbitrary value used to indicate a variation of a Locale. The variant field is case sensitive. In BCP 47, variant subtags are used strictly to indicate additional variations that define a language or its dialects and that are not covered by any combination of language, script and region subtags. The variant field in Locale has historically been used for any kind of variation, not just language variations. For example, some supported variants available in Java SE Runtime Environments indicate alternative cultural behaviors such as calendar type or number script. In BCP 47 this kind of information, which does not identify the language, is supported by extension subtags or private use subtags. Example: polyton (Polytonic Greek), POSIX |
extension (inc. private use) | A map from single character keys to string values, indicating extensions apart from language identification. The extensions in Locale implement the semantics and syntax of BCP 47 extension subtags and private use subtags. The extensions are case insensitive, but Locale canonicalizes all extension keys and values to lower case. Note that extensions cannot have empty values. Example: key="u"/value="ca-japanese" (Japanese Calendar), key="x"/value="java-1-7" |
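To see the canonicalization rules from the table in action, here is a minimal sketch using only java.util.Locale. The class name TagParts and the helper canonical are my own, purely for illustration; the input tags are arbitrary examples.

```java
import java.util.Locale;

final class TagParts {
    // Hypothetical helper: parse a tag and return it in Locale's canonical casing.
    static String canonical(String tag) {
        return Locale.forLanguageTag(tag).toLanguageTag();
    }

    public static void main(String[] args) {
        // Deliberately mis-cased input: Locale fixes the case of each subtag.
        Locale locale = Locale.forLanguageTag("ZH-hans-cn");
        System.out.println(locale.getLanguage());   // zh   (lower case)
        System.out.println(locale.getScript());     // Hans (title case)
        System.out.println(locale.getCountry());    // CN   (upper case)
        System.out.println(locale.toLanguageTag()); // zh-Hans-CN

        // Extensions are exposed as a map from single-character keys to values.
        Locale withExtension = Locale.forLanguageTag("ja-JP-u-ca-japanese");
        System.out.println(withExtension.getExtension('u'));          // ca-japanese
        System.out.println(withExtension.getUnicodeLocaleType("ca")); // japanese
    }
}
```

Note that Java calls the region subtag "country", which is one of the places where its terminology drifts from BCP 47.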
Personal note: I’ve used these combinations without issue:
language
extlang
language-script
language-region
language-script-region
I’ve never needed variant, extension, or privateuse.
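The language, script and region combinations listed above can be assembled with Locale.Builder. This is only a sketch: the class TagBuilder and its build method are names I made up, and it does not cover extlang, which Java’s Locale does not model as a separate field.

```java
import java.util.Locale;

final class TagBuilder {
    // Hypothetical helper: pass "" for any subtag you want to leave out.
    static String build(String language, String script, String region) {
        return new Locale.Builder()
                .setLanguage(language) // e.g. "sr"
                .setScript(script)     // e.g. "Latn", or "" for none
                .setRegion(region)     // e.g. "RS" or "419", or "" for none
                .build()
                .toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(build("en", "", ""));       // language
        System.out.println(build("zh", "Hans", ""));   // language-script
        System.out.println(build("fr", "", "CA"));     // language-region
        System.out.println(build("sr", "Latn", "RS")); // language-script-region
    }
}
```

The builder throws IllformedLocaleException if a subtag is not well-formed, which is a useful early warning when tags are built from user input.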
How to Choose
How to choose the right language tag?
Lots of helpful notes and guidelines there.
Tools
BCP-47 validator - Paste a list of tags into its input field, and it will validate them all for you.
Various Lookup Tools - nice.
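If you only need a quick local sanity check, Locale.Builder can tell you whether a tag is well-formed. This is a rough sketch, not a replacement for the validator: the class TagCheck and the method isWellFormed are hypothetical names, and the check is purely syntactic. It does not consult the IANA subtag registry, so a syntactically correct tag with made-up subtags will still pass.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

final class TagCheck {
    // Returns true if the tag parses as a syntactically well-formed BCP 47 tag.
    // Assumption: null and empty strings are treated as not well-formed here.
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) {
            return false;
        }
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"en", "es-419", "zh-Hans", "en_US"}) {
            System.out.println(tag + " well-formed? " + isWellFormed(tag));
        }
    }
}
```

The underscore form (en_US) is the classic trap: it is a Java locale identifier, not a BCP 47 language tag.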
The Debate
The BCP-47 “standard” is not prescriptive. There can be different ways to construct reasonably unambiguous language tags which mean more-or-less the same thing.
cmn - Mandarin Chinese
zh-CN - Chinese as used in Mainland China
(In this case, cmn is much less ambiguous than zh-CN, so I’d want to use that. In fact, I’d probably want tags such as cmn-Hans, yue-Hant, and so on - and maybe even access to a panel of philology and linguistics experts.)
One seemingly never-ending debate which results from this flexibility is:
Can I just use the language subtag, and be done with it, for most of the languages I need?
Let’s say I use en for English. I may quickly decide that this is not good enough. American English is sufficiently different from British English - and I happen to care about the difference† in my application - and in any other application which receives my data.
So, I will go with en-US and en-GB. Done!
Similarly, I’ll make a clear distinction between the French of France (Parisian?) and Canadian (Québécois) French. So: fr-FR and fr-CA. Otherwise, tabarnouche!
…and German vs Swiss German… you get the picture.
But should we just bite the bullet and standardize on language-region for all our tags? Or can we use the bare language subtag for Italian, for example? I just know that the day after we go live with it for Italian, we’re going to wish we’d distinguished it-IT from some other flavor of Italian… The same goes for languages such as ja (Japanese) and he (Hebrew).
And then there are languages where it’s not uncommon to have different scripts for the same language - for example, Serbian: sr-Cyrl and sr-Latn.
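One thing that takes some of the heat out of this debate is that consumers can fall back from a specific tag to a more general one. The sketch below uses the RFC 4647 lookup support in java.util.Locale to show the idea. The class TagFallback, the method bestMatch, and the list of available tags are all assumptions for illustration, not part of any real data set.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TagFallback {
    // Hypothetical set of tags we hold content for.
    static final List<String> AVAILABLE = Arrays.asList("en", "en-GB", "fr", "sr-Latn");

    // Returns the best available tag for the request, or null if nothing matches.
    // Lookup progressively truncates the requested tag: en-US -> en.
    static String bestMatch(String requested) {
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(requested);
        return Locale.lookupTag(ranges, AVAILABLE);
    }

    public static void main(String[] args) {
        System.out.println(bestMatch("en-GB")); // exact match is available
        System.out.println(bestMatch("en-US")); // falls back to the bare language
        System.out.println(bestMatch("fr-CA")); // falls back to the bare language
        System.out.println(bestMatch("it-IT")); // nothing available: null
    }
}
```

Fallback only works in one direction, though: data tagged with a bare en can never be split into en-US and en-GB after the fact, which is why the choice of granularity still matters.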
My Advice
You can’t predict the future - you can only do your reasonable best to anticipate it. So, assume that whatever you decide, it will need to be changed at some point.
For example, in the context of master reference data stored in a relational database table:
Never try to use language tags as primary keys in your DB.
Instead, assign synthetic primary keys (sequential integers, or UUIDs, or whatever). That way you will stand a fighting chance of managing changes when they inevitably show up. By all means, add a unique index to the language tag column - just don’t make it the primary key.
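Here is a small in-memory sketch of that idea, standing in for a real database table. Everything in it is hypothetical: the class LanguageTable, its methods, and the example tags. The point it tries to show is that rows referencing a language by its synthetic id are untouched when the tag itself changes.

```java
import java.util.HashMap;
import java.util.Map;

final class LanguageTable {
    private final Map<Integer, String> tagById = new HashMap<>(); // primary key -> tag
    private final Map<String, Integer> idByTag = new HashMap<>(); // unique index on tag
    private int nextId = 1;

    // Insert a new row: the id is generated, the tag must be unique.
    int insert(String tag) {
        if (idByTag.containsKey(tag)) {
            throw new IllegalArgumentException("duplicate tag: " + tag);
        }
        int id = nextId++;
        tagById.put(id, tag);
        idByTag.put(tag, id);
        return id;
    }

    // Change the tag on an existing row; the id, and anything pointing at it, stays put.
    void retag(int id, String newTag) {
        String oldTag = tagById.get(id);
        if (oldTag == null) {
            throw new IllegalArgumentException("no such id: " + id);
        }
        if (newTag.equals(oldTag)) {
            return;
        }
        if (idByTag.containsKey(newTag)) {
            throw new IllegalArgumentException("duplicate tag: " + newTag);
        }
        idByTag.remove(oldTag);
        idByTag.put(newTag, id);
        tagById.put(id, newTag);
    }

    String tagOf(int id) {
        return tagById.get(id);
    }

    public static void main(String[] args) {
        LanguageTable table = new LanguageTable();
        int italian = table.insert("it");   // we go live with the bare language subtag
        table.retag(italian, "it-IT");      // the day after, we change our minds
        System.out.println(italian + " -> " + table.tagOf(italian));
    }
}
```

In a real schema the same shape would be a generated key column plus a unique constraint on the tag column, with foreign keys pointing at the key rather than the tag.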
† - Requirements, requirements, requirements. What kind of data am I tagging? A piece of written text, where the script is important? The primary spoken language(s) in a TV show? And so on. How will the data be used? Will it be passed around to other consumers? How will they need to use it…?
Author northCoder
LastMod 27-Oct-2016