This assumes a reasonable level of familiarity with Unicode.
The example we will mainly use here is the string of Japanese text:
…which roughly translates as “Initialize settings”. Inspired by this Stack Overflow question: How to convert hex string to Shift-JIS encoding in Java?
There are certainly libraries out there which can help with the job of translating between character encodings, but I wanted to take a closer look at how this happens using Java.
Review of Types
The following Java types are of most interest here:
|Type|Signed?|Size|Range|Notes|
|---|---|---|---|---|
|byte|yes|8 bits|-128 - 127||
|char|no|16 bits|0 - 65,535|The only unsigned number primitive.|
|int|yes|32 bits|-2.1bn - 2.1bn||
A char is stored as a 16-bit unsigned integer, representing a UTF-16 code unit. For characters in the Basic Multilingual Plane, that value is the same as the Unicode code point. More on that below.
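This is why you can assign a char to an int, or assign an integer constant that fits in 16 bits to a char, but you cannot assign an int variable to a char without a cast. That fails with a compile-time error (possible lossy conversion from int to char).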
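A minimal sketch of these assignment rules follows. The class and method names are only illustrative, and the line that would not compile is left as a comment.

```java
public class CharIntDemo {

    // Widening a char to an int is always safe, so no cast is needed.
    public static int toInt(char c) {
        return c;
    }

    // Narrowing an int to a char needs an explicit cast; the upper 16 bits are discarded.
    public static char toChar(int i) {
        return (char) i;
    }

    public static void main(String[] args) {
        char a = 65;        // allowed: a constant that fits in 16 bits
        int i = 'A';        // allowed: implicit widening from char to int
        // char bad = i;    // does not compile: possible lossy conversion from int to char
        System.out.println(a);          // A
        System.out.println(i);          // 65
        System.out.println(toChar(i));  // A
    }
}
```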
Prior to Java 9, a string was represented internally in Java as a sequence of UTF-16 code units, stored in a char[] array. In Java 9 that changed to a more compact format by default, as presented in JEP 254: Compact Strings:
Java changed its internal representation of the String class…
…from a UTF-16 char array to a byte array plus an encoding-flag field.
The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string.
These changes were purely internal. But it’s worth noting that from Java 9 onwards, Java uses a byte[] array to store strings internally. And Java has never used UTF-8 for its internal representation of strings. It used to use only UTF-16, and it now uses ISO-8859-1 and UTF-16 as noted above.
Early versions of Unicode defined 65,536 possible values, from U+0000 to U+FFFF. These are often referred to as the Basic Multilingual Plane (BMP). This was handled by earlier versions of Java by the char primitive. A single char represents a single BMP symbol.
Over time, Unicode has expanded significantly. It currently covers code points in the range U+0000 to U+10FFFF, which needs 21 bits of data (1,114,112 possible values, or roughly 1.1 million). Characters outside of the BMP range are referred to as “supplementary characters”.
Java handles Unicode supplementary characters using pairs of char values, in structures such as char arrays, Strings and StringBuffers. The first value in the pair is taken from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF). But, again, as noted above, the underlying storage used by Java is actually a byte array.
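To make the surrogate pair idea concrete, here is a small sketch using U+1F600, a code point outside the BMP that I picked for illustration (it is not from the original example). The class and method names are hypothetical.

```java
public class SurrogateDemo {

    // Number of 16-bit char values (UTF-16 code units) in the string.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it needs two char values.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(s));   // 2
        System.out.println(codePoints(s));  // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
        System.out.printf("%04x %04x%n", (int) s.charAt(0), (int) s.charAt(1));  // d83d de00
    }
}
```

Note that length() counts char values, not characters, which is why it reports 2 here.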
Taking the letter A, we know that it has a Unicode value of U+0041.
Consider the Java string
String str = "A";
We can see what bytes make up that string as follows:
We provide an explicit charset instead of relying on the default charset of the JVM. We can use a string for the charset name, instead:
In which case, we also need to handle the
UnsupportedEncodingException. And a list of charset names can be found in the IANA Charset Registry.
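A sketch of both approaches, side by side. The variable name bb follows the rest of this article; the class and method names are just for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {

    // Encode using a Charset constant: no checked exception to handle.
    public static byte[] encodeWithCharset(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encode using a charset name: the name is only validated at run time,
    // so the checked UnsupportedEncodingException has to be handled or declared.
    public static byte[] encodeWithName(String s, String charsetName)
            throws UnsupportedEncodingException {
        return s.getBytes(charsetName);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String str = "A";
        byte[] bb = encodeWithCharset(str);
        System.out.println(Arrays.toString(bb));                            // [65]
        System.out.println(Arrays.toString(encodeWithName(str, "UTF-8")));  // [65]
    }
}
```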
For the above example, our byte array contains the single decimal integer value 65.
We can convert that from an integer to a hex value as follows:
This gives us "41", which matches the Unicode value of U+0041, because a character encoded as a single byte in UTF-8 has the same numeric value as its Unicode code point (and its ASCII code).
We can also convert from the hex string back to the original integer (65) by parsing it as a base-16 number.
(If you were trying to convert hex values outside the int range, you would need to use the equivalent methods of the Long class.)
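The round trip between integer and hex string might look like the sketch below. The helper names are hypothetical, and the Long value is an arbitrary number just beyond the int range.

```java
import java.nio.charset.StandardCharsets;

public class HexRoundTrip {

    public static String toHex(int value) {
        return Integer.toHexString(value);
    }

    public static int fromHex(String hex) {
        return Integer.parseInt(hex, 16);  // radix 16 = hexadecimal
    }

    public static void main(String[] args) {
        byte[] bb = "A".getBytes(StandardCharsets.UTF_8);
        String hexString = toHex(bb[0]);
        System.out.println(hexString);           // 41
        System.out.println(fromHex(hexString));  // 65

        // Beyond the int range, the Long methods work the same way.
        System.out.println(Long.toHexString(4294967296L));    // 100000000
        System.out.println(Long.parseLong("100000000", 16));  // 4294967296
    }
}
```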
Consider the Java string
String str = "設"; - the first character in the Japanese string mentioned at the start of this article. This is Unicode character
U+8A2D. It has a UTF-8 encoding of
0xE8 0xA8 0xAD - and a Shift_JIS encoding of 0x90 0xDD.
Now, the following line…
…gives us a three-byte array:
[ -24, -88, -83 ]. Where did these numbers come from? Why are they negative values? How do they relate to the UTF-8 encoding of
0xE8 0xA8 0xAD?
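If we try our previous approach, String hexString = Integer.toHexString(bb[0]);, we get ffffffe8, which doesn’t look right at all. The byte value -24 is widened to a 32-bit int, with sign extension, before it is formatted as hex.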
Because a byte is a signed 8-bit integer, we first have to convert it to an unsigned integer.
And then we can do this:
And now we see our expected
e8 - the first byte of
0xE8 0xA8 0xAD.
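Here is one way to sketch the signed-to-unsigned step. I am assuming Byte.toUnsignedInt (available since Java 8); masking with & 0xFF does the same job. The character is written as the escape \u8A2D so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class UnsignedByteDemo {

    // Convert one signed byte to its two's complement unsigned value, then to hex.
    public static String byteToHex(byte b) {
        int unsigned = Byte.toUnsignedInt(b);  // equivalent to b & 0xFF
        return Integer.toHexString(unsigned);
    }

    public static void main(String[] args) {
        byte[] bb = "\u8A2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(bb[0]);                       // -24
        System.out.println(Integer.toHexString(bb[0]));  // ffffffe8 (sign-extended)
        System.out.println(Byte.toUnsignedInt(bb[0]));   // 232
        System.out.println(byteToHex(bb[0]));            // e8
        System.out.println(Byte.toUnsignedInt((byte) 65));  // 65 (non-negative bytes are unchanged)
    }
}
```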
Going back to our simple example for A, with its decimal value of 65, if we apply the same unsigned conversion, that also gives 65: the signed value 65 maps to the unsigned value 65. Two’s complement in action!
We can repeat this process for each of the three bytes in our
bb byte array, and build our
e8a8ad string representing the hex values of “設” in UTF-8.
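Repeating the conversion for every byte could be sketched as a small loop. The HexEncoder class and its toHex method are hypothetical names. The %02x format pads each byte to two hex digits, which matters for bytes below 0x10.

```java
import java.nio.charset.StandardCharsets;

public class HexEncoder {

    // Build a lowercase hex string, two digits per byte.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", Byte.toUnsignedInt(b)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bb = "\u8A2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(bb));  // e8a8ad
    }
}
```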
We can repeat this process using a different encoding - Shift_JIS instead of UTF-8:
And that gives us our Shift_JIS encoding of
0x90 0xDD, as expected.
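A sketch of the same steps with the charset passed in as a parameter. This assumes the JVM provides the Shift_JIS charset, which a full JDK does but a trimmed-down runtime may not; Charset.forName throws an unchecked exception if the name is not supported. The class and method names are illustrative.

```java
import java.nio.charset.Charset;

public class ShiftJisDemo {

    // Encode the string in the named charset and return the bytes as hex.
    public static String toHex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", Byte.toUnsignedInt(b)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u8A2D";
        System.out.println(toHex(str, "UTF-8"));      // e8a8ad
        System.out.println(toHex(str, "Shift_JIS"));  // 90dd
    }
}
```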
Writing Files with Different Encodings
The following method creates two files, both containing the same data but written using two different encodings.
In the above examples, we just need to write out the byte arrays and we are done.
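As a rough sketch of that idea, the method below encodes the text to bytes and writes the bytes out with Files.write. The out_utf_8.txt file name, the writeBoth method and the use of a temporary directory are my own assumptions for the demo; only out_shift_jis.txt is named in this article.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodedFileWriter {

    // Write the same text twice: once as UTF-8 bytes, once as Shift_JIS bytes.
    public static void writeBoth(Path dir, String text) throws IOException {
        Files.write(dir.resolve("out_utf_8.txt"),
                text.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("out_shift_jis.txt"),
                text.getBytes(Charset.forName("Shift_JIS")));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("encoding-demo");
        writeBoth(dir, "\u8A2D");
        System.out.println(Files.size(dir.resolve("out_utf_8.txt")));      // 3
        System.out.println(Files.size(dir.resolve("out_shift_jis.txt")));  // 2
    }
}
```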
To convert the contents of the
out_shift_jis.txt file to UTF-8, we can do this:
The above reads the input file one line at a time.
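A line-by-line conversion along these lines can be sketched with a reader that decodes from the source charset and a writer that encodes to the target charset. The transcode method and the converted_utf_8.txt name are hypothetical. Note that the readers and writers created by Files throw an exception on malformed or unmappable input rather than substituting a replacement character, and that newLine() writes the platform line separator after every line, including the last.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileTranscoder {

    // Decode each line from one charset and re-encode it in another.
    public static void transcode(Path in, Charset from, Path out, Charset to)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, from);
             BufferedWriter writer = Files.newBufferedWriter(out, to)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Charset shiftJis = Charset.forName("Shift_JIS");
        Path dir = Files.createTempDirectory("encoding-demo");
        Path in = dir.resolve("out_shift_jis.txt");
        Path out = dir.resolve("converted_utf_8.txt");
        Files.write(in, "\u8A2D".getBytes(shiftJis));
        transcode(in, shiftJis, out, StandardCharsets.UTF_8);
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8).trim());
    }
}
```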
You cannot safely convert from any character set to any other character set. The target character set must contain valid encodings for every character in the source character set, to be sure that data will not be lost. Or, at the very least, you need to be sure that the data in any given file contains only characters which are guaranteed to exist in the target character set, which is a difficult (impossible?) guarantee to enforce.
Consider this example, which is encoded using the Windows-1252 character set, and which contains Microsoft’s so-called “smart quotes”:
This sentence contains “smart quotes” not found in ISO-8859-1.
The ISO-8859-1 character set does not contain these custom double quote characters.
Converting from Windows-1252 to ISO-8859-1 will result in a silent loss of data:
This sentence contains ?smart quotes? not found in ISO-8859-1.
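The silent replacement can be reproduced, and detected in advance, with a sketch like the one below. It relies on String.getBytes(Charset) substituting the charset's replacement byte (? for ISO-8859-1) for unmappable characters, and on CharsetEncoder.canEncode as a pre-check. The class and method names are illustrative, and the smart quotes are written as the escapes \u201C and \u201D.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyConversionDemo {

    // Encode to the target charset and decode again, to show what survives.
    public static String roundTrip(String s, Charset target) {
        return new String(s.getBytes(target), target);
    }

    // Ask the charset's encoder whether every character can be represented.
    public static boolean canEncode(String s, Charset target) {
        return target.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String text = "\u201Csmart quotes\u201D";
        System.out.println(canEncode(text, Charset.forName("windows-1252")));  // true
        System.out.println(canEncode(text, StandardCharsets.ISO_8859_1));      // false
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));      // ?smart quotes?
    }
}
```

Checking canEncode before writing is one way to turn a silent loss into an explicit decision.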
Encoding Names for Java
You can see a list of names here: Supported Encodings