The rest of my blog: https://www.catmonad.xyz/blog/

Published: September 24th, 2024. Last updated: October 15th, 2024.

Monadic Cat’s Unicode Reading List

Hey everyone.

I have spent a lot of time explaining Unicode to programmers who just wanted to move on with their projects. While I am considering writing a blog post covering common misconceptions, and simple approaches to getting text processing right, I am short on time.

Instead, take a look at these articles, which are all wonderful!

This is something of a historical piece, from just about 21 years ago now. It correctly conveys that you must know the particular encoding of a piece of text data in order to do anything at all with it. It also serves as a decent introduction to how wild text encodings have gotten, especially back when it was far from settled which one we should use.
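To make the "you must know the encoding" point concrete, here is a minimal Python sketch. The sample word and the choice of Latin-1 as the "wrong guess" are my own illustrative assumptions, not something taken from the article. It shows the same bytes producing different text depending on which encoding you assume, and a decode failing outright when the guess is wrong in the other direction.

```python
# Illustrative sample text (an assumption for this sketch).
text = "café"

# Encode once as UTF-8. The "é" becomes two bytes: 0xC3 0xA9.
utf8_bytes = text.encode("utf-8")

# Decoding with the right encoding round-trips the text.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoding the very same bytes as Latin-1 "succeeds", but yields mojibake,
# because Latin-1 maps every byte to exactly one character.
as_latin1 = utf8_bytes.decode("latin-1")

# The reverse mistake fails loudly: Latin-1 encodes "é" as the single byte
# 0xE9, which is not a complete UTF-8 sequence on its own.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(utf8_bytes)
print(as_utf8)
print(as_latin1)
print(decode_failed)
```

Nothing in the bytes themselves says which reading is correct. That information has to travel alongside the data, or be agreed on in advance.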

Nowadays, UTF-8 should be used for new data, and old systems should be migrated to use it; failing to do so risks legal problems.

I also feel compelled to note after this one that you can use <meta charset="utf-8" /> in the head of your HTML documents; since HTML5, it is equivalent to <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />. (Note that HTML5 specifies UTF-8 as the only valid encoding.)

Moving on…

There are other great posts on the subject I have read, and I wish I’d started keeping this list sooner.

Whenever I encounter, or rediscover, a post that I think should be on here, I’ll add it.


Update: October 15th, 2024

Added Henri Sivonen’s post, https://hsivonen.fi/string-length/. I located it again while reading ThePhD’s post “5 Years Later: The First Win”, where I happened to click a link to Henri Sivonen’s blog.