12.06.06

Character set issues and guidelines

Posted in Java, Web Development at 3:04 am by skoobi

With globalization, character set problems are becoming more and more frequent, and are sometimes a real headache, as Mark Pilgrim and Scott Balmos highlight in their respective posts, Determining the Character encoding of a feed and String encodings - another thorn in interop. A character set is no more than a mapping between characters and numbers, and an encoding defines how those numbers are written out as bytes. Some encodings, such as Unicode's UTF-8, tackle the interoperability issue correctly.
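
To make the "mapping" idea concrete, here is a small sketch of how the same character turns into different bytes depending on the encoding. The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name and the helper are illustrative only.
final class CharsetMappingDemo {

    // Renders the encoded bytes of a string as upper-case hex, e.g. "C3 A9".
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent), written as an escape so that the
        // encoding of this source file does not matter.
        String eAcute = "\u00e9";

        System.out.println(hex(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println(hex(eAcute, StandardCharsets.UTF_8));      // C3 A9

        // Decoding UTF-8 bytes with the wrong charset yields two characters
        // (U+00C3 and U+00A9) instead of one: the classic garbled accent.
        byte[] utf8 = eAcute.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 2
    }
}
```

The last part is the kind of bug both posts complain about: the bytes are fine, but the reader guessed the wrong mapping.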

So the real solution to all these problems would be to use UTF-8 as the default encoding for every application. If you must communicate with a legacy system that does not support UTF-8, then whatever ISOxxx encoding it uses is acceptable, as long as it stays confined to a small wrapper that translates the stream to UTF-8. To accomplish this:

  1. Make sure the default locale on all your systems uses UTF-8. Luckily, recent Linux distributions such as Ubuntu default to that.
  2. When reading from or writing to a stream, Java (and I believe other languages too) falls back to the system's default encoding. Do NOT trust this value: only use the Reader/Writer constructors that take an explicit charset. For example, OutputStreamWriter provides a few constructors that take a Charset. Use these constructors at ANY COST, and possibly write Jalopy rules that prevent the use of the default ones.
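
As a rough sketch of both points, the helper below wraps a legacy stream and rewrites it as UTF-8, naming the charset on both the reader and the writer so the platform default is never used. Utf8Bridge and transcodeToUtf8 are hypothetical names, ISO-8859-1 is only an example of a legacy encoding, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the class and method names are illustrative only.
final class Utf8Bridge {

    // Reads bytes in the legacy charset and rewrites them as UTF-8.
    // Both the reader and the writer name their charset explicitly,
    // so the platform default encoding is never consulted.
    static void transcodeToUtf8(InputStream in, Charset legacy, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, legacy);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        // Flush only: the caller owns both streams and decides when to close them.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // "café" in ISO-8859-1 is 4 bytes (the accented e is the single byte 0xE9).
        byte[] legacyBytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcodeToUtf8(new ByteArrayInputStream(legacyBytes), StandardCharsets.ISO_8859_1, out);
        System.out.println(out.size()); // 5: the accented e takes two bytes in UTF-8
    }
}
```

The point of keeping this in one small class is that the legacy encoding is mentioned in exactly one place, and everything downstream only ever sees UTF-8.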

From a more general point of view, it would be desirable to have UTF-8 everywhere: the Domain Name System (which still uses ASCII), SMTP (which resorts to ugly hacks to let people write non-ASCII characters), etc.

The internet is an international place and, as such, should not be ASCII-centric. If the standards bodies (IETF, ..) do not realize this, we are going to see more and more forking, such as China’s reform of its DNS, which is obviously a bad thing for the community since it creates even more interoperability issues.
