French Dictionary Word Order

There does not appear to be a canonical set of rules for how words are ordered in a French dictionary. Rather, there seems to be conventions and customs established over time. In addition to the standardized “A to Z” word ordering based on the alphabet (“a” comes before “b”, and so on), French has some additional subtleties regarding how accents affect lexicographical word ordering.

Take the following example - a screenshot from the online Dictionnaire de l’Académie française:

Here we can see the following words, in order:

cote
côte
coté
côté

The convention is that accents are evaluated from the end of the word to the start of the word - and that unaccented letters are sorted before accented letters.

(If accents were evaluated from the start of the word, then we would expect the word order to be cote coté côte côté.)

I asked about this on a French language web site. I did not get an explicit answer, but the following comment:

I don’t think there are of any “official” rules, it’s just usage set by the printing History (and I suppose a certain amount of logic). I would expect that the order unaccented, aigu, grave, circonflexe, tréma (very consistent and taught at school) is a mixture of alphabetical order and frequency. I’ve seen differences from one dictionary to another. Differences can be seen in proper names when they are compound, apostrophes, dashes etc.

I have a (very old) French-English dictionary (Gasc’s Concise), which shows the same rule:

I wanted to recreate this using Java. My first attempt was to use Collections::sort, which uses “natural ordering” which, for Java String classes, is based on its compareTo method, where:

The comparison is based on the Unicode value of each character in the strings.

So we can try this:

1
2
3
4
5
List<String> words = List.of("cote", "coté", "côte", "côté");
List<String> defaultList = new ArrayList<>(words);
System.out.println("unsorted\n" + new ArrayList<>(defaultList));
Collections.sort(defaultList);
System.out.println("sorted\n" + new ArrayList<>(defaultList));

The output:

1
2
3
4
unsorted
[cote, coté, côte, côté]
sorted
[cote, coté, côte, côté]

That certainly does not produce the French dictionary convention we want to recreate.

Instead, we can create a Collator specifying a French locale:

1
Collator french = Collator.getInstance(Locale.FRENCH);

And then we can use this collator:

1
2
3
4
5
List<String> words = List.of("cote", "coté", "côte", "côté");
List<String> defaultList = new ArrayList<>(words);
System.out.println("unsorted\n" + new ArrayList<>(defaultList));
Collections.sort(defaultList, french);
System.out.println("sorted\n" + new ArrayList<>(defaultList));

This gives:

1
2
3
4
unsorted
[cote, coté, côte, côté]
sorted
[cote, côte, coté, côté]

This is what we want.

Behind the scenes, the above Collator.getInstance code uses a RuleBasedCollator for the given French locale. This, in turn, uses a general-purpose “base” Java collator, using the rules shown in the following class: sun.util.locale.provider.CollationRules.java

And then, this base rules-based collator is customized for the French locale by appending a @ to the rules text. The @ symbol is defined in the RuleBasedCollator javadoc:

‘@’ : Indicates that accents are sorted backwards, as in French.

You can see where the @ is added here:

sun.text.resources.ext.CollationData_fr.java

1
2
3
4
5
protected final Object[][] getContents() {
    return new Object[][] {
        { "Rule", "@" }
    };
}

Interestingly, in this specific case, Java uses a hard-coded class to hold the collation rules for the en-US lexicographical rules and adds a hard-coded @ for the fr-FR locale.

Java does not use the CLDR database for this.

Other languages have different conventions for how letters (including accented letters) are sorted. For example, Norwegian needs to account for letters such as ø and å. And those rules may be different from how a language such as Danish may sort the same letters.

You can see what Java uses by looking at the different classes in the sun.text.resources.ext package.

This is just the tip of the iceberg. Take a look at Unicode Collation Case Study: Sorting French Topic Lists for a detailed discussion of additional sorting considerations for words and phrases.

Any time we want to sort text, we need to consider what would be the least surprising outcome for people from different locales who have to read what we have sorted.