Converting Between Character Encodings with Java

01 Sep 2022


Background

This article assumes a reasonable level of familiarity with Unicode.

The example we will mainly use here is the string of Japanese text:

設定を初期化します

…which roughly translates as “Initialize settings”. The example was inspired by this Stack Overflow question: How to convert hex string to Shift-JIS encoding in Java?

There are certainly libraries out there which can help with the job of translating between character encodings, but I wanted to take a closer look at how this happens using Java.

Review of Types

The following Java types are of most interest here:

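Type   | Signed? | Size    | Range                           | Notes
-------|---------|---------|---------------------------------|--------------------------------------
byte   | yes     | 8 bits  | -128 to 127                     |
char   | no      | 16 bits | 0 to 65,535                     | The only unsigned integral primitive.
int    | yes     | 32 bits | -2,147,483,648 to 2,147,483,647 |
String | n/a     | n/a     | n/a                             | See notes.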

char Notes

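Yes, a char is stored as a 16-bit unsigned integer. Strictly speaking, it represents a UTF-16 code unit rather than a Unicode code point, although for characters in the Basic Multilingual Plane the two values are the same. More on that below.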

This is why you can do things like this:

Java
char ca = 'a';
char cb = ++ca; // 'b'

or this:

Java
char ca = 'a';
char cb = ca += 1; // 'b'

but not this, due to a compile time error (lossy conversion from int to char):

Java
char ca = 'a';
char cx = ca + 1; // COMPILATION ERROR
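
The difference is that ++ and += include an implicit narrowing cast back to char, whereas ca + 1 is a plain int expression. The usual fix is to add that cast explicitly. A minimal sketch follows; the class and method names are only illustrative and do not come from the original examples:

```java
public class CharArithmetic {

    // c + 1 is promoted to int, so the result must be cast back to char.
    static char next(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        char ca = 'a';
        char cx = next(ca);
        System.out.println(cx);       // b
        System.out.println((int) cx); // 98
    }
}
```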

String Notes

Prior to Java 9, a string was represented internally in Java as a sequence of UTF-16 code units, stored in a char[]. In Java 9 that changed to using a more compact format by default, as presented in JEP 254: Compact Strings:

Java changed its internal representation of the String class…

…from a UTF-16 char array to a byte array plus an encoding-flag field.

And:

The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string.

These changes were purely internal. But it’s worth noting that from Java 9 onwards, Java uses a byte[] to store strings internally. And Java has never used UTF-8 for its internal representation of strings. It used to only use UTF-16 - and it now uses ISO-8859-1 and UTF-16 as noted above.

Unicode Ranges

Early versions of Unicode defined 65,536 possible values from U+0000 to U+FFFF. This range is now known as the Basic Multilingual Plane (BMP). Earlier versions of Java handled it with the char primitive: a single char represents a single BMP code point.

Over time, Unicode has expanded significantly. It currently covers code points in the range U+0000 to U+10FFFF - which needs 21 bits to represent (1,114,112 possible values). Characters outside of the BMP range are referred to as “supplementary characters”.

Java handles Unicode supplementary characters using pairs of char values, in structures such as char arrays, Strings and StringBuffers. The first value in the pair is taken from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF). But, again, as noted above, the underlying storage used by Java since version 9 is actually a byte array.
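
To make the surrogate-pair behaviour concrete, here is a small sketch using U+1F600 as an arbitrary supplementary character (my choice, not one from the original examples). The class and helper names are illustrative only. It shows that String.length() counts char values (UTF-16 code units), not code points:

```java
public class SurrogateDemo {

    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it needs a surrogate pair.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(s));  // 2
        System.out.println(codePoints(s)); // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 1f600
    }
}
```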

Single-byte example

Taking the letter A, we know that has a Unicode value of U+0041.

Consider the Java string String str = "A";

We can see what bytes make up that string as follows:

Java
byte[] bb = str.getBytes(StandardCharsets.UTF_8);

We provide an explicit charset instead of relying on the default charset of the JVM. We can use a string for the charset name, instead:

Java
byte[] bb = str.getBytes("UTF-8");

In that case, we also need to handle the UnsupportedEncodingException. A list of charset names can be found in the IANA Charset Registry.

For the above example, our byte array contains the decimal integer value 65.

We can convert that from an integer to a hex value as follows:

Java
String hexString = Integer.toHexString(bb[0]);

This gives us "41" - which matches the Unicode value of U+0041, since UTF-8 encodes the code points U+0000 to U+007F (the ASCII range) as a single byte with the same numeric value.

We can also convert from the hex string back to the original integer (65):

Java
int value = Integer.valueOf(hexString, 16).intValue(); // 65

(If you were trying to convert hex values outside the int range, you would need to use the Long equivalent methods of toHexString and valueOf.)

Multi-byte example

Consider the Java string String str = "設"; - the first character in the Japanese string mentioned at the start of this article. This is Unicode character U+8A2D. It has a UTF-8 encoding of 0xE8 0xA8 0xAD - and a Shift_JIS encoding of 0x90 0xDD.

Now, the following line…

Java
byte[] bb = str.getBytes(StandardCharsets.UTF_8);

…gives us a three-byte array: [ -24, -88, -83 ]. Where did these numbers come from? Why are they negative values? How do they relate to the UTF-8 encoding of 0xE8 0xA8 0xAD?

If we try our previous approach String hexString = Integer.toHexString(bb[0]);, we get ffffffe8, which doesn’t look right at all.

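Java’s byte is a signed 8-bit integer, so 0xE8 is stored as -24. When that byte is widened to an int for Integer.toHexString, the sign bit is copied into the upper 24 bits, which is where the leading ffffff comes from. So we first have to convert the byte to an unsigned integer: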

Java
int uint = Byte.toUnsignedInt(bb[0]); // 232 (decimal integer)

And then we can do this:

Java
String hexString = Integer.toHexString(uint); // "e8"

And now we see our expected e8 - the first byte of 0xE8 0xA8 0xAD.
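
For comparison, the same unsigned value can be obtained by masking the byte with 0xFF, an idiom often seen in older code. The sketch below uses an illustrative class name and helper that are not part of the original examples. It shows the sign extension and the mask side by side:

```java
public class SignExtensionDemo {

    // Masking with 0xFF keeps only the low 8 bits after the byte is widened to int.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE8;

        System.out.println((int) b);                              // -24
        System.out.println(Integer.toHexString(b));               // ffffffe8
        System.out.println(unsigned(b));                          // 232
        System.out.println(Integer.toHexString(unsigned(b)));     // e8
        System.out.println(unsigned(b) == Byte.toUnsignedInt(b)); // true
    }
}
```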

Going back to our simple example for A, with its decimal value of 65, if we do this…

Java
int uinta = Byte.toUnsignedInt("A".getBytes(StandardCharsets.UTF_8)[0]); // 65

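…that also gives 65. Byte values from 0 to 127 have the same bit pattern whether they are read as signed or unsigned, so only values of 128 and above (which two’s complement stores as negative numbers) are changed by the conversion.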

We can repeat this process for each of the three bytes in our bb byte array, and build our e8a8ad string representing the hex values of “設” in UTF-8.
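
One way to sketch that loop is shown below. The HexDump class and toHex method are hypothetical names, and the string is written with a \u escape only so that the example does not depend on the source file's encoding. String.format with %02x is used rather than Integer.toHexString so that small values such as 0x0A keep their leading zero:

```java
import java.nio.charset.StandardCharsets;

public class HexDump {

    // Converts each byte to two lower-case hex digits, zero-padded.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", Byte.toUnsignedInt(b)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u8A2D" is the character 設.
        byte[] bb = "\u8A2D".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(bb)); // e8a8ad
    }
}
```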

The same approach works with a different encoding - Shift_JIS instead of UTF-8:

Java
byte[] bb2 = "設".getBytes("Shift_JIS"); // [ -112, -35 ]
int uinta = Byte.toUnsignedInt(bb2[0]); // decimal 144 (hex 90)
int uintb = Byte.toUnsignedInt(bb2[1]); // decimal 221 (hex dd)

And that gives us our Shift_JIS encoding of 0x90 0xDD, as expected.
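
Going in the other direction, from a hex string back to text (the subject of the Stack Overflow question), might look something like the sketch below. The class and method names are hypothetical, and it assumes the JVM provides the Shift_JIS charset, as the examples above already do:

```java
import java.nio.charset.Charset;

public class HexToText {

    // Parses a string of hex digit pairs such as "90dd" into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Expected an even number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // parseInt yields 0..255 here; the cast narrows it to a signed byte.
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes the bytes described by the hex string using the named charset.
    static String decode(String hex, String charsetName) {
        return new String(fromHex(hex), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String s = decode("90dd", "Shift_JIS");
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 8a2d
    }
}
```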

Writing Files with Different Encodings

The following method creates two files - both containing the same data, but written using two different encodings, using java.nio.file.Files:

Java
private void writeTwoCodePages() throws IOException {
    String str = "設定を初期化します";

    // UTF-8 output:
    byte[] bytes1 = str.getBytes(StandardCharsets.UTF_8);
    Files.write(Paths.get("/path/to/out_utf_8.txt"), bytes1);

    // Shift_JIS output (getBytes(String) throws UnsupportedEncodingException,
    // a subclass of IOException, if the charset name is unknown):
    byte[] bytes2 = str.getBytes("Shift_JIS");
    Files.write(Paths.get("/path/to/out_shift_jis.txt"), bytes2);
}

In the above examples, we just need to write out the byte arrays and we are done.

To convert the contents of the out_shift_jis.txt file to UTF-8, we can do this:

Java
private static void convertEncoding() throws IOException {
    String inFile = "C:/temp/out_shift_jis.txt";
    String outFile = "C:/temp/out_utf_8_converted.txt";

    // try-with-resources closes both streams, even if an exception is thrown
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                 new FileInputStream(inFile), "Shift_JIS"));
         OutputStream outputStream = new BufferedOutputStream(new FileOutputStream(outFile))) {

        String line;
        while ((line = reader.readLine()) != null) {
            outputStream.write(line.getBytes(StandardCharsets.UTF_8));
            // readLine() strips the line terminator, so write one back
            outputStream.write('\n');
        }
    }
}

The above reads the input file one line at a time.
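
Because readLine() strips line terminators, a line-based approach cannot reproduce the original line endings exactly. If the file comfortably fits in memory, one alternative is to decode and re-encode it in a single step. The following is only a sketch under that assumption; FileTranscoder and transcode are hypothetical names, and the file paths are taken from the command line:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileTranscoder {

    // Decodes the whole file with the source charset, then re-encodes it.
    // Line terminators pass through unchanged because the text is never split into lines.
    static void transcode(Path in, Charset from, Path out, Charset to) throws IOException {
        String text = new String(Files.readAllBytes(in), from);
        Files.write(out, text.getBytes(to));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java FileTranscoder <shift_jis_input> <utf_8_output>
        transcode(Paths.get(args[0]), Charset.forName("Shift_JIS"),
                  Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

The trade-off is memory: the whole file is held as a byte array and as a String at the same time, so very large files are better handled with a streaming approach.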

Conversion Mismatches

Not every character set can be converted to every other character set without loss. The target character set must contain valid encodings for every character in the source character set, to be sure that data will not be lost. Or, at the very least, you need to be sure that the data in any given file only contains characters which exist in the target character set - a difficult (impossible?) guarantee to enforce.

Consider this example, which is encoded using the Windows-1252 character set, and which contains Microsoft’s so-called “smart quotes”:

This sentence contains “smart quotes” not found in ISO-8859-1.

The ISO-8859-1 character set does not contain these curly double quote characters (U+201C and U+201D, stored as bytes 0x93 and 0x94 in Windows-1252).

Converting from Windows-1252 to ISO-8859-1 in Java (for example, with String.getBytes) silently replaces each unmappable character with a question mark:

This sentence contains ?smart quotes? not found in ISO-8859-1.
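
If silent substitution is not acceptable, the java.nio.charset API can report unmappable characters instead. Below is a minimal sketch; StrictEncoding and its two helper methods are illustrative names, not an established utility. It checks a string up front with CharsetEncoder.canEncode, and alternatively encodes with CodingErrorAction.REPORT so that a CharacterCodingException is thrown:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictEncoding {

    // True if every character in the text can be represented in the target charset.
    static boolean canEncode(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    // Encodes the text, throwing instead of substituting '?' for unmappable characters.
    static byte[] encodeStrict(String text, Charset target) throws CharacterCodingException {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String text = "\u201Csmart quotes\u201D";
        Charset latin1 = StandardCharsets.ISO_8859_1;

        System.out.println(canEncode(text, latin1)); // false

        // String.getBytes substitutes '?' without any warning.
        System.out.println(new String(text.getBytes(latin1), latin1)); // ?smart quotes?

        try {
            encodeStrict(text, latin1);
        } catch (CharacterCodingException e) {
            System.out.println("Cannot encode: " + e);
        }
    }
}
```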

Encoding Names for Java

You can see a list of names here: Supported Encodings