Fixing Unicode for Ruby Developers

Published in

Daftcode Blog

14 min readMay 10, 2017

We’ve already learned about the history of Unicode and how it works under the hood. In this final part, we’ll learn the theory and practice of handling Unicode properly.

Before we get into practice I’d like to talk a bit about Unicode support in Ruby language. If you’re not a Ruby developer, feel free to scroll past the first chapter.

Unicode + Ruby = 😢

This header suggests, that Ruby and Unicode don’t go well together. And it might seem surprising because after all Ruby allows you to write some declarations like this one.

Which then can be used for method calls.

“From new moon up to waning crescent cycle five times and with each slice of five moons print those moons” — obvious

And it gives an expected output.

Looks like pretty good Unicode support, doesn’t it? After all, we’re writing code with emojis, which are very Unicode’y. But there is something more about supporting Unicode than being able to just write it.

The beginnings

As usual, let’s start with history. How about going back to the times when Ruby was not even in version 1.9? Have any of you ever used Ruby pre-1.9? Honestly, I haven’t, but from what the elders of the Internet have told me, Strings didn’t have any special powers there. They were just simple arrays of bytes. And an array of bytes doesn’t care about character encoding at all. You give it some data and it will hold this data for you, but ask it to do something with it… and things can get strange.

Let’s start with Ruby 1.8.7 (the latest one before 1.9) and check how Strings work there.

Nothing surprising here. We’ve created a String without any problems. But what if you’re a developer from Poland, and you wanted to use Polish variable names (please don’t ever do this) and Polish strings? Let’s call our variable “treść” (which means content) and assign local version of Hello World to it.

Seriously, don’t name variables this way

(╯°□°)╯︵ ┻━┻

We can see that Ruby didn’t even manage to read the name of our variable. Actually, it didn’t even manage to properly recover from this Error and just shut down the irb. It’s a pretty obvious sign that we shouldn’t expect much Unicode support here.

But just to be sure, let’s remove special characters from variable name and try again.

We can see that instead of the letter “Ś” (U+015A) we got “\305\232”. Those numbers are octal form of UTF-8 representation of this letter (“U+015A in UTF-8 is C59A). No support for Unicode this time.

Well, actually there was a way to ask Ruby to use UTF-8 as the default encoding. Will it work for us?

“żółw” is Polish for turtle 🐢, our spirit animal in this article

Woah, it looks like everything is working now. Can you guess how long is this obviously-4-letters-long-text?

Obviously… 4 letters long text is 7 letters long. How about trying to make it all uppercase?

Hey, it could have been even worse. Right? Right…?

(╯°□°)╯︵ ┻━┻

But there were changes

That Ruby version should be actually considered a prehistory. What followed next, brought quite a few good changes. Everybody welcome Ruby 1.9.

We saw it’s release on 25th Dec 2007 — exactly 11 years after Ruby 1.0. One of the biggest changes was the improved String class. From now on, it would support many different encodings and support UTF-8 as a default one. Are you as excited as I am to check it out?

Let’s fire up irb for Ruby 1.9.3 and check if its default encoding really is UTF-8.

My Ruby build was somehow broken and decided not to show version number in prompt. But it seems that it indeed supports UTF-8 and we can use it out of the box.

Let’s do the turtle test once more. Can you guess how long will the turtle be this time?

Wooohoo 👏🎉🍾 Our turtle is 4 characters long. But wait. There was a second test. Can we make the turtle bigger?

(╯°□°)╯︵ ┻━┻

But it was all a long time ago

Ruby 1.9 is actually also a prehistory. After all, we’re using ruby 2.x right now. It’s not 1, it’s 2. The whole new version of a language. Let’s just open the prompt for Ruby 2.3.3, which I happened to have set as a default one.

Enough talking, let’s make a big turtle!

Actually, I’m not sure what I’ve expected.

But there are even more changes

As you may know, Ruby 2.3 is not the latest one. Not so long ago, on 25th Dec 2016, Ruby 2.4 had its release. Exactly 20 years after the first release, there was one important change:

“String supports Unicode case mappings”

Is this what we’ve all been waiting for? Can we finally make our turtle bigger?

Yes we can!

Let’s all take a moment to contemplate on this leap for humanity

We can all happily go home now. Our work is done for today. But in the very last moment, a new wild turtle appears.

It looks just like the previous one, but we have the feeling that something is not quite right with it. But just to be sure, let’s check it’s length. Can you guess how long can it be?

What happened here? If it was 7 characters long it would have been obvious. But it’s not. Let’s inspect what’s inside of it (sounds quite brutal when talking about turtles 😨).

For some reason, our “ż” and “ó” got separated into two characters. If you remember things I’ve talked about previously, you may recognise Combining Diacritical Marks here. But just to make things clear. In Unicode, there are two ways to write regional characters. The first one is to use precomposed letter, which is represented as one character. The other method is to use the “base” letter and follow it with a special combining character (such as dot, wave, slash etc.) which is displayed together with the preceding letter. But as we can see displaying is not the same as keeping in memory.

We can now verify whether or not Ruby handles such combined characters properly. Can we find a precomposed “ż” inside our “żółw”?

Both forms of “ż” look exactly the same, so you have to believe me which one is which

To be honest there was no reason at all for this to work.

We’re almost ready to talk about ways of fixing this. There is only one more thing left for us to check. What if we tried to reverse our turtle? Or rather have it reversed and try to make it “right” again?

“włóż” means “to insert” in Polish, and there is a joke about reversing this word, but it makes no sense in English

We’ve just created a monster.

Look at how our “ó” lost its upper thing (there is definitely some smart name for it) and became a plain boring “o”. And our “ż” became a “ź”. Not to mention the opening quotation mark of our string having a dot above. A dot above a quotation mark… 🙃

The reason for this is that deep inside, Strings are still like arrays. If you reverse it, you get what you ask for.

Read it backwards, it would totally compose into “żółw”

It’s high time we try to solve all those issues.

What can we do about it?

At this point, everyone skipping the Ruby part should join us. Hello 👋 And I’ve lied to you because the first solution I’ll show is Ruby only.

Let ActiveSupport help us

If you’re working with Ruby on Rails, you’d already have ActiveSupport installed. It extends String with mb_chars method. What this method does is wrap the string in “ActiveSupport::Multibyte::Chars” class. This class makes all String methods such as “reverse”, “upcase” etc. work as expected.

And indeed it does work. And just to make sure we’ve been reversing a string with composing characters let’s see all the characters.

As we can see, mb_chars made our string behave less like an array and more like an actual text.

But some of us would probably want to avoid using Ruby on Rails (or loading huge gems) just to work with text.

And that’s when we should talk about built-in ways of handling this.

Let’s normalize

Normalization is a way of turning many forms of data into a unified one. In the case of text strings, it means encoding them into one before-agreed form, called a normal form. The Ruby way of normalizing strings is “unicode_normalize” method. Other languages should have something similar (provided that they don’t do this by default).

To see it in action, first create two different turtles: one with precomposed letters, other using composable characters.

We can see that those two strings are indeed different. But when we normalize them before comparing…

To see what’s going on here, take a look at the string before and after normalization.

It looks like “unicode_normalize” changes all Unicode-magical characters into normal ones.

I’m not saying that composed characters aren’t normal or anything like this 😅

Sounds cool, right? Let’s do this in our production environment. We’re now ready for everything Unicode has for us. For example, someone sends us some text which looks like “a a a a…” but has all sorts of funky a-like characters.

“a”, italic “a”, fancy “a”, encircled “a”, “a” slightly above… All those characters are written with the same font

But we’ve got nothing to worry about. We’re normalizing our strings, so it should all get changed into some normal “a”.

But it was not.

The reason for this is, that there are many different normal forms you can use. Unicode Standard describes 4 ways to normalize a string. To see all of them let’s use a string which has all kinds of funky Unicode stuff inside.

“¼ węża” means “one fourth of a snake” 🐍

We have a “¼” character which everybody knows has 3 different characters inside: “1”, “/”, and “4”. Then is a space, or rather a nonbreaking space which looks like a space but cannot be broken across lines. “w” is quite normal here, but then you have a Polish precomposed letter “ę” and “z” with a composable dot above. The “a” is once again just a boring letter.

The first (default) normal form is called NFC. NF stands for Normal Form, while C means Compose. You can probably guess how it works.

All it does is taking letters with composable characters and turning it into one, precomposed letter. It composes “z” and a dot into “ż”.

The second normal form is NFD, with D meaning Decompose. This is opposite to what we just saw above and is usually used only for some specialised internal processing.

As expected, it turned the precomposed “ę” into “e” and a composable hook which is called “ogonek”.

The next form is the most common when turning strings into identifiers (such as database keys). It’s called NFKC, with K standing for Compatibility.

The C was already needed for Compose ¯\_(ツ)_/¯

Apart from doing the compose part, it additionally takes care of all those special characters such as “¼” and the nonbreaking space. It changes them into their compatible counterparts, which are “1”, “/”, “4” and a standard space.

As you might have guessed, the last form is NFKD.

It’s like the previous one, but decomposes characters instead of composing them.

Equipped with all this knowledge we can now try another take on normalizing our “a a a a…”. It looks like the default NFC form is missing the Compatibility part, which we need here. Let’s check if NFKC gives us better results.

It gives us an expected string of standard letters “a”. But at this point, we probably have some trust issues as far as Unicode is concerned. We’d better make sure all those “a” are the same. Can we turn them into turtles?

The ultimate method of proving things — turning them into turtles

Things to watch out for

Normalization is a great way to handle Unicode strings and turn them into more usable form. It solves many problems, but can also create new ones if used without special care.

One example of how Unicode can create security threats in your application happened to Spotify in 2013. There was a way to takeover any user account using Unicode-related stuff.

Spotify being all suspicious about Unicode

Spotify has supported Unicode in usernames for a long time (seriously, it’s possible to name your account something like “¼🐍🐢”). To handle this correctly they use two forms of your username. One is a verbatim form and is used purely for display purposes. The second one, called a canonical form, is used for all sorts of processing (comparing, checking the database etc). To turn a verbatim form into canonical one, it has to undergo a process called canonicalisation. And that’s where the problem came from.

Fun fact. In Polish, the name for canonicalisation is the same as canonisation — which is when the Pope declares that someone who has died was a saint

Let’s make an app

For the sake of this article we’ll simplify what happened to Spotify but will keep the main idea and an attack vector unchanged. Let’s see how easy it is to create a serious security issue.

Imagine you have an app, where usernames aren’t case sensitive. Your users are happy as they don’t have to remember whether they used capital letters or not. They can login as “GreatUser”, “greatuser”, “greatuSER” etc. and still get access to their account. To achieve this, you’ve created a very simple canonicalising method, which makes username all uppercase.

Now when you log in using “Admin” username to access all the secret admin stuff in your app, the username will be turned into uppercase “ADMIN”. It all works as expected.

But since we’re all modern and keeping up with latest trends, we allow users to use Unicode characters when logging in. It’s all working great until some evil hacker appears with the intention to mislead users on your message board (yes, we’re making a message board type application 😐). What they do is register as an “𝖠dmin” with some hacky magical Unicode “𝖠”.

Those evil hackers always hacking and breaking stuff 😢 (╯°□°）╯︵ ┻━┻

And they can do this because this “𝖠” is not the same as “A”.

Quick fix time

We cannot allow for such thing! Luckily, we remember how we’ve learned about Unicode normalization some time ago, and decide that it’s the way to go. We may even remember that the best normal form for things like identifiers is NFKC. So we proceed and fix out canonicalisation method.

Whew… That was actually quite easy 😅 ┬──┬ ノ( ゜-゜ノ)

Hooray! Problem solved. Now, none of those evil hackers can login as “𝖠dmin” anymore.

One important detail

But there is one problem with our canonicalisation method. In real life, things such as canonicalisation may be performed a lot of times on the same text. You can do this as many times as you like thanks to method idempotence (watch out, a wild hard word appeared). The idempotent method has a nice feature, that when you call it many times it gives the same result as when you call it only once. For example “upcase” is idempotent. You can’t make a text more uppercase even when you try it a million times.

On the first sight, we might think that our canonicalisation method is idempotent as well. After all, when we try “to_can” on those evil “𝖠dmin” thing, we get exactly the same result.

But keep in mind that hackers are clever. And they come up with a new attack variation. Now, they try to use “ᵃdmin” as a login. And to our dismay, it’s not the same after two method calls as it is after one.

Look at this cute little floating “a” ☺️

Does it mean that our canonicalisation method is not idempotent? Let’s check what’s happening here step by step.

Now it should get obvious. Even though we may expect it to do so, calling “upcase” on this little “ᵃ” doesn’t turn it into “ᴬ”. So after uppercasing the username, we’re left with “ᵃDMIN” which gets normalised into “aDMIN”, which is not taken yet. The hacker can sign up with such username and it will be stored in the database as “aDMIN”.

Then, they request a password reset for this account. We extract this user from the database, allow the user to set a new password and save the new password to the database. And if at this very moment, just before saving, we canonicalise the username again (not our fault, we thought our method is idempotent). We’ve just changed a password for “ADMIN” user 😱

And it’s more or less what happened to Spotify.

Fixing this for good

Do you have an idea for fixing it? It’s actually really easy. Our problem is in “upcase” not handling all Unicode characters as expected. To get rid of this we should normalise the string before making it uppercase.

Actually, normalising things is, pretty much always, the first thing you should do with them — never the last

Parting words

And that would be all for this wild 3-part Unicode adventure. Thanks for staying with me and I hope you’ve learned something useful here. Feel free to ask any questions, and I’ll try to do my best to help you out 🤓