Converting Between Character Encodings with Java
Background
This article assumes a reasonable level of familiarity with Unicode.
The example we will mainly use here is the string of Japanese text:
|
|
…which roughly translates as “Initialize settings”. Inspired by this Stack Overflow question: How to convert hex string to Shift-JIS encoding in Java?
There are certainly libraries out there which can help with the job of translating between character encodings, but I wanted to take a closer look at how this happens using Java.
Review of Types
The following Java types are of most interest here:
Type | Signed? | Size | Range | Notes |
---|---|---|---|---|
byte | yes | 8 bits | -128 - 127 | |
char | no | 16 bits | 0 - 65,535 | The only unsigned number primitive. |
int | yes | 32 bits | -2.1bn - 2.1bn | |
String | n/a | n/a | n/a | See notes. |
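char Notes
Yes, a char is stored as a 16-bit unsigned integer. Strictly speaking it represents a UTF-16 code unit, which is the same as the Unicode code point only for characters in the Basic Multilingual Plane. More on that below.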
This is why you can do things like this:
|
|
or this:
|
|
but not this, due to a compile-time error (lossy conversion from int to char):
|
|
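The three cases above can be illustrated with a short sketch. The statements are assumptions chosen to match the descriptions, and CharIntDemo, toInt and toChar are hypothetical names used only for illustration.

```java
class CharIntDemo {
    // char -> int is a widening conversion, so no cast is needed.
    static int toInt(char c) {
        return c;
    }

    // int -> char is a narrowing conversion, so a non-constant int needs a cast.
    static char toChar(int i) {
        return (char) i;
    }

    public static void main(String[] args) {
        char a = 65;         // allowed: a constant that fits in 16 bits
        char b = 'A' + 1;    // allowed: a constant expression with the value 66
        int i = 65;
        // char c = i;       // does not compile: possible lossy conversion from int to char
        char c = toChar(i);  // compiles, because the cast is explicit

        System.out.println(a);          // prints A
        System.out.println(b);          // prints B
        System.out.println(c);          // prints A
        System.out.println(toInt('A')); // prints 65
    }
}
```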
String Notes
Prior to Java 9, a string was represented internally in Java as a sequence of UTF-16 code units, stored in a char[]. In Java 9 that changed to a more compact format by default, as presented in JEP 254: Compact Strings:
Java changed its internal representation of the String class…
…from a UTF-16 char array to a byte array plus an encoding-flag field.
And:
The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string.
These changes were purely internal. But it’s worth noting that from Java 9 onwards, Java uses a byte[] to store strings internally. Java has never used UTF-8 for its internal representation of strings: it used to use only UTF-16, and it now uses ISO-8859-1 or UTF-16, as noted above.
Unicode Ranges
Early versions of Unicode defined 65,536 possible values from U+0000 to U+FFFF. This range is now referred to as the Basic Multilingual Plane (BMP). Earlier versions of Java handled it with the char primitive: a single char represents a single BMP code point.
Over time, Unicode has expanded significantly. It currently covers code points in the range U+0000 to U+10FFFF, which needs 21 bits of data (approximately 1.1 million possible values). Characters outside of the BMP range are referred to as “supplementary characters”.
Java handles Unicode supplementary characters using pairs of char values, in structures such as char arrays, Strings and StringBuffers. The first value in the pair is taken from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF). But, again, as noted above, the underlying storage used by Java is actually a byte array.
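As a minimal sketch of a surrogate pair, the following builds a string from one supplementary code point and inspects its char values. The code point U+1F600 is an assumed example, and SurrogateDemo and fromCodePoint are hypothetical names.

```java
class SurrogateDemo {
    // Builds a String holding a single code point (BMP or supplementary).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600); // assumed example, outside the BMP

        System.out.println(s.length());                        // prints 2 (two char values)
        System.out.println(s.codePointCount(0, s.length()));   // prints 1 (one code point)
        System.out.println(Integer.toHexString(s.charAt(0)));  // prints d83d (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1)));  // prints de00 (low surrogate)
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // prints true
    }
}
```

Methods such as length() and charAt() count char values, so code that needs whole characters is usually better served by the code point methods.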
Single-byte example
Taking the letter A, we know that it has a Unicode value of U+0041.
Consider the Java string String str = "A";
We can see what bytes make up that string as follows:
|
|
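A minimal sketch of this step, assuming the getBytes(Charset) overload with StandardCharsets.UTF_8. The variable name bb matches the one used later in this article; Utf8BytesDemo and utf8Bytes are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8BytesDemo {
    // Encodes the string using an explicitly named charset.
    static byte[] utf8Bytes(String str) {
        return str.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "A";
        byte[] bb = utf8Bytes(str);
        System.out.println(Arrays.toString(bb)); // prints [65]
    }
}
```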
We provide an explicit charset instead of relying on the default charset of the JVM. We can use a string for the charset name, instead:
|
|
In which case, we also need to handle the UnsupportedEncodingException. A list of charset names can be found in the IANA Charset Registry.
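The string-based variant might look like the following sketch. Wrapping the checked exception in an IllegalArgumentException is just one assumed way of handling it, and CharsetNameDemo and encode are hypothetical names.

```java
import java.io.UnsupportedEncodingException;

class CharsetNameDemo {
    // Encodes using a charset looked up by name at run time.
    static byte[] encode(String str, String charsetName) {
        try {
            return str.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            // Thrown when the JVM does not know the named charset.
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("A", "UTF-8")[0]); // prints 65
    }
}
```

The StandardCharsets constants avoid the checked exception altogether, which is why they are usually preferred when the charset is known at compile time.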
For the above example, our byte array contains a single element: the decimal integer value 65.
We can convert that from an integer to a hex value as follows:
|
|
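As a sketch, the conversion can use Integer.toHexString on the first element of the byte array. HexDemo and firstByteHex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class HexDemo {
    // Returns the hex form of the first UTF-8 byte of the string.
    // Note: this is only adequate for byte values from 0 to 127.
    static String firstByteHex(String str) {
        byte[] bb = str.getBytes(StandardCharsets.UTF_8);
        return Integer.toHexString(bb[0]);
    }

    public static void main(String[] args) {
        System.out.println(firstByteHex("A")); // prints 41
    }
}
```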
This gives us "41", which matches the Unicode value of U+0041: for code points up to U+007F, the single-byte UTF-8 encoding has the same value as the Unicode code point (and as the ASCII code).
We can also convert from the hex string back to the original integer (65):
|
|
(If you were trying to convert hex values outside the int range, you would need to use the Long equivalents of toHexString and valueOf.)
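A short sketch of the reverse conversion, using the two-argument valueOf with a radix of 16, together with the Long variants for larger values. HexParseDemo and its method names are hypothetical.

```java
class HexParseDemo {
    // Parses a hex string that fits in the int range.
    static int parseHexInt(String hex) {
        return Integer.valueOf(hex, 16);
    }

    // Parses a hex string that may exceed the int range.
    static long parseHexLong(String hex) {
        return Long.valueOf(hex, 16);
    }

    public static void main(String[] args) {
        System.out.println(parseHexInt("41"));                // prints 65
        System.out.println(parseHexLong("ffffffffff"));       // prints 1099511627775
        System.out.println(Long.toHexString(1099511627775L)); // prints ffffffffff
    }
}
```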
Multi-byte example
Consider the Java string String str = "設"; - the first character in the Japanese string mentioned at the start of this article. This is Unicode character U+8A2D. It has a UTF-8 encoding of 0xE8 0xA8 0xAD and a Shift_JIS encoding of 0x90 0xDD.
Now, the following line…
|
|
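A sketch of the line in question, assuming the same getBytes call as in the single-byte example. The character is written as a Unicode escape so that the result does not depend on the source file encoding; MultiByteDemo is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MultiByteDemo {
    static byte[] utf8Bytes(String str) {
        return str.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String str = "\u8A2D"; // U+8A2D, the character discussed in the text
        byte[] bb = utf8Bytes(str);
        System.out.println(Arrays.toString(bb)); // prints [-24, -88, -83]
    }
}
```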
…gives us a three-byte array: [ -24, -88, -83 ]. Where did these numbers come from? Why are they negative values? How do they relate to the UTF-8 encoding of 0xE8 0xA8 0xAD?
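If we try our previous approach, String hexString = Integer.toHexString(bb[0]);, we get ffffffe8, which doesn’t look right at all.
Java’s byte is a signed 8-bit integer, so the value -24 is sign-extended to a 32-bit int before it is formatted, which is where the leading f digits come from. We first have to convert the byte to an unsigned integer.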
|
|
And then we can do this:
|
|
And now we see our expected e8 - the first byte of 0xE8 0xA8 0xAD.
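The two steps above can be sketched as follows. This assumes Byte.toUnsignedInt (available since Java 8) for the unsigned conversion; masking with b & 0xFF would give the same result. UnsignedByteDemo and byteToHex are hypothetical names.

```java
class UnsignedByteDemo {
    // Converts a signed byte to its unsigned value, then to hex.
    static String byteToHex(byte b) {
        int unsigned = Byte.toUnsignedInt(b); // -24 becomes 232
        return Integer.toHexString(unsigned);
    }

    public static void main(String[] args) {
        byte first = (byte) -24; // first UTF-8 byte of U+8A2D
        System.out.println(Integer.toHexString(first)); // prints ffffffe8
        System.out.println(byteToHex(first));           // prints e8
    }
}
```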
Going back to our simple example for A, with its decimal value of 65, if we do this…
|
|
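…that also gives 65. So 65 maps to 65! Two’s complement in action: byte values from 0 to 127 have the same bit pattern whether they are read as signed or unsigned, so only negative bytes change.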
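To make the mapping concrete, here is a small sketch which spells out the arithmetic by hand and compares it with the library method. The manualUnsigned helper and the class name are hypothetical, written only to show the rule.

```java
class TwosComplementDemo {
    // Non-negative bytes are unchanged; negative bytes have 256 added.
    static int manualUnsigned(byte b) {
        return b < 0 ? b + 256 : b;
    }

    public static void main(String[] args) {
        System.out.println(manualUnsigned((byte) 65));     // prints 65
        System.out.println(manualUnsigned((byte) -24));    // prints 232
        System.out.println(Byte.toUnsignedInt((byte) -24)); // prints 232
        System.out.println(((byte) -24) & 0xFF);            // prints 232
    }
}
```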
We can repeat this process for each of the three bytes in our bb byte array, and build our e8a8ad string representing the hex values of “設” in UTF-8.
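One way to write that loop is sketched below. It uses String.format with %02x rather than Integer.toHexString, an assumed substitution so that byte values below 0x10 keep their leading zero. BytesToHexDemo and toHex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class BytesToHexDemo {
    // Builds a lower-case hex string, two digits per byte.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", Byte.toUnsignedInt(b)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bb = "\u8A2D".getBytes(StandardCharsets.UTF_8); // U+8A2D
        System.out.println(toHex(bb)); // prints e8a8ad
    }
}
```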
We can repeat this process using a different encoding - Shift_JIS instead of UTF-8:
|
|
And that gives us our Shift_JIS encoding of 0x90 0xDD, as expected.
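A sketch of the same steps with the charset as a parameter. It assumes the running JVM provides Shift_JIS, which is not one of the StandardCharsets constants and so is looked up by name with Charset.forName. EncodingHexDemo and hexIn are hypothetical names.

```java
import java.nio.charset.Charset;

class EncodingHexDemo {
    // Encodes the string in the named charset and returns the bytes as hex.
    static String hexIn(String str, String charsetName) {
        byte[] bb = str.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bb) {
            sb.append(String.format("%02x", Byte.toUnsignedInt(b)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u8A2D"; // U+8A2D
        System.out.println(hexIn(str, "UTF-8"));     // prints e8a8ad
        System.out.println(hexIn(str, "Shift_JIS")); // prints 90dd
    }
}
```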
Writing Files with Different Encodings
The following method creates two files, both containing the same data but written using two different encodings, using java.nio.file.Files:
|
|
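A minimal sketch of such a method, assuming Files.write with pre-encoded byte arrays. The file name out_shift_jis.txt comes from this article; out_utf_8.txt, the temporary directory and the names WriteFilesDemo and writeBoth are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteFilesDemo {
    // Writes the same text twice, once per encoding.
    static void writeBoth(Path dir, String text) throws IOException {
        Files.write(dir.resolve("out_utf_8.txt"),
                text.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("out_shift_jis.txt"),
                text.getBytes(Charset.forName("Shift_JIS")));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("encodings");
        writeBoth(dir, "\u8A2D"); // U+8A2D
        System.out.println(Files.size(dir.resolve("out_utf_8.txt")));     // prints 3
        System.out.println(Files.size(dir.resolve("out_shift_jis.txt"))); // prints 2
    }
}
```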
In the above examples, we just need to write out the byte arrays and we are done.
To convert the contents of the out_shift_jis.txt file to UTF-8, we can do this:
|
|
The above reads the input file one line at a time.
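A line-by-line conversion might be sketched as follows, assuming Files.newBufferedReader and Files.newBufferedWriter with explicit charsets. The temporary file names and the names ConvertFileDemo and convert are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ConvertFileDemo {
    // Decodes each line with the source charset and re-encodes it with the target.
    static void convert(Path in, Charset from, Path out, Charset to) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, from);
             BufferedWriter writer = Files.newBufferedWriter(out, to)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // uses the platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Charset shiftJis = Charset.forName("Shift_JIS");
        Path in = Files.createTempFile("out_shift_jis", ".txt");
        Path out = Files.createTempFile("out_utf_8", ".txt");
        Files.write(in, "\u8A2D\n".getBytes(shiftJis)); // U+8A2D plus a newline
        convert(in, shiftJis, out, StandardCharsets.UTF_8);
        System.out.println("Converted " + in + " to " + out);
    }
}
```

Reading one line at a time keeps memory use low for large files, at the cost of normalizing line endings to whatever newLine() writes.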
Conversion Mismatches
Not every character set can be converted to every other character set without loss. To be sure that no data is lost, the target character set must contain a valid encoding for every character in the source character set. Or, at the very least, you need to be sure that the data in any given file contains only characters which are guaranteed to exist in the target character set - a difficult (impossible?) guarantee to enforce.
Consider this example, which is encoded using the Windows-1252 character set, and which contains Microsoft’s so-called “smart quotes”:
This sentence contains “smart quotes” not found in ISO-8859-1.
The ISO-8859-1 character set does not contain these custom double quote characters.
Converting from Windows-1252 to ISO-8859-1 will result in a silent loss of data:
This sentence contains ?smart quotes? not found in ISO-8859-1.
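The silent substitution can be reproduced with a short sketch. It assumes the JVM provides the windows-1252 charset, and it uses CharsetEncoder.canEncode as one possible way to detect the problem before converting. MismatchDemo and its method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    // Encodes to ISO-8859-1 and decodes again; unmappable characters become '?'.
    static String roundTripLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    // Reports whether every character in the text exists in ISO-8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Windows-1252 bytes: 0x93 and 0x94 are the curly double quotes, 0x68 0x69 is "hi".
        byte[] cp1252 = { (byte) 0x93, 0x68, 0x69, (byte) 0x94 };
        String text = new String(cp1252, Charset.forName("windows-1252"));

        System.out.println(fitsLatin1(text));      // prints false
        System.out.println(roundTripLatin1(text)); // prints ?hi?
    }
}
```

Checking with canEncode first (or configuring an encoder to report errors) turns a silent loss into a visible failure.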
Encoding Names for Java
You can see a list of names here: Supported Encodings
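The charsets available to a particular JVM can also be listed at run time, as in this sketch using Charset.availableCharsets() and Charset.isSupported(). The output varies between JDK builds; ListCharsetsDemo and isAvailable are hypothetical names.

```java
import java.nio.charset.Charset;

class ListCharsetsDemo {
    // Reports whether this JVM knows the given charset name or alias.
    static boolean isAvailable(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // Canonical names of every charset the running JVM supports.
        Charset.availableCharsets().keySet().forEach(System.out::println);

        System.out.println(isAvailable("UTF-8")); // prints true
        if (isAvailable("Shift_JIS")) {
            // Aliases accepted for the same charset.
            System.out.println(Charset.forName("Shift_JIS").aliases());
        }
    }
}
```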
Author northCoder
LastMod 01-Sep-2022