ponyfoo.com

ES6 Strings (and Unicode, ❤) in Depth

Fix

Yo. Here’s another edition of ES6 – “I can’t believe they killed off Stringer Bell” – in Depth series. If you’ve never been around here before, start with A Brief History of ES6 Tooling. Then, make your way through destructuring, template literals, arrow functions, the spread operator and rest parameters, improvements coming to object literals, the new classes sugar on top of prototypes, let, const, and the “Temporal Dead Zone”, iterators, generators, Symbols, Maps, WeakMaps, Sets, and WeakSets, proxies, proxy traps, more proxy traps, reflection, Number, Math, Array, and Object. Today we’ll be serving updates to the String object coming in ES6.

Like I did in previous articles on the series, I would love to point out that you should probably set up Babel and follow along the examples with either a REPL or the babel-node CLI and a file. That’ll make it so much easier for you to internalize the concepts discussed in the series. If you aren’t the “install things on my computer” kind of human, you might prefer to hop on CodePen and then click on the gear icon for JavaScript – they have a Babel preprocessor which makes trying out ES6 a breeze. Another alternative that’s also quite useful is to use Babel’s online REPL – it’ll show you compiled ES5 code to the right of your ES6 code for quick comparison.

Before getting into it, let me shamelessly ask for your support if you’re enjoying my ES6 in Depth series. Your contributions will go towards helping me keep up with the schedule, server bills, keeping me fed, and maintaining Pony Foo as a veritable source of JavaScript goodies.

Thanks for reading that, and let’s go into updates to the String object.

Updates to String

We’ve already covered template literals earlier in the series, and you may recall that those can be used to mix strings and variables to produce string output.

function greet (name) {
  return `hello ${name}!`
}
greet('ponyfoo')
// <- 'hello ponyfoo!'

Besides template literals, strings are getting a numbre of new methods come ES6. These can be categorized as string manipulation methods and unicode related methods.

We’ll begin with the string manipulation methods and then we’ll take a look at the unicode related ones.

String.prototype.startsWith

A very common question in our code is “does this string start with this other string?”. In ES5 we’d ask that question using the .indexOf method.

'ponyfoo'.indexOf('foo')
// <- 4
'ponyfoo'.indexOf('pony')
// <- 0
'ponyfoo'.indexOf('horse')
// <- -1

If you wanted to check if a string started with another one, you’d compare them with .indexOf and check whether the “needle” starts at the 0 position – the beginning of the string.

'ponyfoo'.indexOf('pony') === 0
// <- true
'ponyfoo'.indexOf('foo') === 0
// <- false
'ponyfoo'.indexOf('horse') === 0
// <- false

You can now use the more descriptive and terse .startsWith method instead.

'ponyfoo'.startsWith('pony')
// <- true
'ponyfoo'.startsWith('foo')
// <- false
'ponyfoo'.startsWith('horse')
// <- false

If you wanted to figure out whether a string contains another one starting in a specific location, it would get quite verbose, as you’d need to grab a slice of that string first.

'ponyfoo'.slice(4).indexOf('foo') === 0
// <- true

The reason why you can’t just ask === 4 is that this would give you false negatives when the query is found before reaching that index.

'foo,foo'.indexOf('foo') === 4
// <- false, because result was 0

Of course, you could use the startIndex parameter for indexOf to get around that. Note that we’re still comparing against 4 in this case, because the string wasn’t split into smaller parts.

'foo,foo'.indexOf('foo', 4) === 4
// <- true

Instead of keeping all of these string searching implementation details in your head and writing code that worries too much about the how and not so much about the what, you could just use startsWith passing in the optional startIndex parameter as well.

'foo,foo'.startsWith('foo', 4)
// <- true

Then again, it’s kind of confusing that the method is called .startsWith but we’re starting at a non-zero index – that being said it sure beats using .indexOf when we actually want a boolean result.

String.prototype.endsWith

This method mirrors .startsWith in the same way that .lastIndexOf mirrors .indexOf. It tells us whether a string ends with another string.

'ponyfoo'.endsWith('foo')
// <- true
'ponyfoo'.endsWith('pony')
// <- false

Just like .startsWith, we have a position index that indicates where the lookup should end. It defaults to the length of the string.

'ponyfoo'.endsWith('foo', 7)
// <- true
'ponyfoo'.endsWith('pony', 0)
// <- false
'ponyfoo'.endsWith('pony', 4)
// <- true

Yet another method that simplifies a specific use case for .indexOf is .includes.

String.prototype.includes

You can use .includes to figure out whether a string contains another one.

'ponyfoo'.includes('ny')
// <- true
'ponyfoo'.includes('sf')
// <- false

This is equivalent to the ES5 use case of .indexOf where we’d compare its results with -1 to see if the search string was anywhere to be found.

'ponyfoo'.indexOf('ny') !== -1
// <- true
'ponyfoo'.indexOf('sf') !== -1
// <- false

Naturally you can also pass in a start index where the search should begin.

'ponyfoo'.includes('ny', 3)
// <- false
'ponyfoo'.includes('ny', 2)
// <- true

Let’s move onto something that’s not an .indexOf replacement.

String.prototype.repeat

This handy method allows you to repeat a string count times.

'na'.repeat(0)
// <- ''
'na'.repeat(1)
// <- 'na'
'na'.repeat(2)
// <- 'nana'
'na'.repeat(5)
// <- 'nanananana'

The provided count should be a positive finite number.

'na'.repeat(Infinity)
// <- RangeError
'na'.repeat(-1)
// <- RangeError

Non-numeric values are coerced into numbers.

'na'.repeat('na')
// <- ''
'na'.repeat('3')
// <- 'nanana'

Using NaN is as good as 0.

'na'.repeat(NaN)
// <- ''

Decimal values are floored.

'na'.repeat(3.9)
// <- 'nanana', count was floored to 3

Values in the (-1, 0) range are rounded to -0 becase count is passed through ToInteger, as documented by the specification. That step in the specification dictates that count be casted with a formula like the one below.

function ToInteger (number) {
  return Math.floor(Math.abs(number)) * Math.sign(number)
}

The above translates to -0 for any values in the (-1, 0) range. Numbers below that will throw, and numbers above that won’t behave surprisingly, as you can only take Math.floor into account for positive values.

'na'.repeat(-0.1)
// <- '', count was rounded to -0

A good example use case for .repeat may be your typical “padding” method. The method shown below takes a multiline string and pads every line with as many spaces as desired.

function pad (text, spaces) {
  return text.split('\n').map(line => ' '.repeat(spaces) + line).join('\n')
}
pad('a\nb\nc', 2)
// <- '  a\n  b\n  c'

In ES6, strings adhere to the iterable protocol.

String.prototype[Symbol.iterator]

Before ES6, you could access each code unit (we’ll define these in a second) in a string via indices – kind of like with arrays. That meant you could loop over code units in a string with a for loop.

var text = 'foo'
for (let i = 0; i < text.length; i++) {
  console.log(text[i])
  // <- 'f'
  // <- 'o'
  // <- 'o'
}

In ES6, you could loop over the code points (not the same as code units) of a string using a for..of loop, because strings are iterable.

for (let codePoint of 'foo') {
  console.log(codePoint)
  // <- 'f'
  // <- 'o'
  // <- 'o'
}

What is this codePoint variable? There is a not-so-subtle distinction between code units and code points. Let’s switch protocols and talk about Unicode.

Unicode

JavaScript strings are represented using UTF-16 code units. Each code unit can be used to represent a code point in the [U+0000, U+FFFF] range – also known as the “basic multilingual plane” (BMP). You can represent individual code points in the BMP plane using the '\u3456' syntax. You could also represent code units in the [U+0000, U+0255] using the \x00..\xff notation. For instance, '\xbb' represents '»', the 187 character, as you can verify by doing parseInt('bb', 16) – or String.fromCharCode(187).

For code points beyond U+FFFF, you’d represent them as a surrogate pair. That is to say, two contiguous code units. For instance, the horse emoji '🐎' code point is represented with the '\ud83d\udc0e' contiguous code units. In ES6 notation you can also represent code points using the '\u{1f40e}' notation (that example is also the horse emoji). Note that the internal representation hasn’t changed, so there’s still two code units behind that code point. In fact, '\u{1f40e}'.length evaluates to 2.

The '\ud83d\udc0e\ud83d\udc71\u2764' string found below evaluates to a few emoji.

'\ud83d\udc0e\ud83d\udc71\u2764'
// <- '🐎👱❤'

While that string consists of 5 code units, we know that the length should really be three – as there’s only three emoji.

'\ud83d\udc0e\ud83d\udc71\u2764'.length
// <- 5
'🐎👱❤'.length
// <- 5, still

Before ES6, JavaScript didn’t make any effort to figure out unicode quirks on your behalf – you were pretty much on your own when it came to counting cards (err, code points). Take for instance Object.keys, still five code units long.

console.log(Object.keys('🐎👱❤'))
// <- ['0', '1', '2', '3', '4']

If we now go back to our for loop, we can observe how this is a problem. We actually wanted '🐎', '👱', '❤', but we didn’t get that.

var text = '🐎👱❤'
for (let i = 0; i < text.length; i++) {
  console.log(text[i])
  // <- '?'
  // <- '?'
  // <- '?'
  // <- '?'
  // <- '❤'
}

Instead, we got some weird unicode boxes – and that’s if we were lucky and looking at Firefox.

Printing some emoji character by character on the Firefox console
Printing some emoji character by character on the Firefox console

That didn’t turn out okay. In ES6 we can use the string iterator to go over the code points instead. The iterators produced by the string iterable are aware of this limitation of looping by code units, and so they yield code points instead.

for (let codePoint of '🐎👱❤') {
  console.log(codePoint)
  // <- '🐎'
  // <- '👱'
  // <- '❤'
}

If we want to measure the length, we’d have trouble with the .length property, as we saw earlier. We can use the iterator to split the string into its code points – as seen in the for..of example we just went over. That means the unicode-aware length of a string equals the length of the array that contains the sequence of code points produced by an iterator. We could use the spread operator to place the code points in an array, and then pull that array’s .length.

[...'🐎👱❤'].length
// <- 3

Keep in mind that splitting strings into code points isn’t enough if you want to be 100% precise about string length. Take for instance the “combining overline” \u0305 unicode code unit. On its own, this code unit is just an overline.

'\u0305'
// <- ' ̅'

When preceded by another code unit, however, they are combined together into a single glyph.

'_\u0305'
// <- '_̅'
'foo\u0305'
// <- 'foo̅'

Attempts to figure out the actual length by counting code points prove insufficient – just like using .length.

'foo\u0305'.length
// <- 4
'foo̅'.length
// <- 4
[...'foo̅'].length
// <- 4

I was confused about this one as I’m no expert when it comes to unicode. So I went to someone who is an expert – Mathias Bynens. He promptly pointed out that – indeed – splitting by code points isn’t enough. Unlike surrogate pairs like the emojis we’ve used in our earlier examples, other grapheme clusters aren’t taken into account by the string iterator.

In these cases we’re out of luck, and we simply have to fall back to regular expressions to correctly calculate the string length. For a comprehensive discussion of the subject I suggest you read his excellent “JavaScript has a Unicode problem” piece.

Let’s look at the other methods.

String.prototype.codePointAt

You can use .codePointAt to get the base-10 numeric representation of a code point at a given position in a string. Note that the position is indexed by code unit, not by code point. In the example below we print the code points for each of the three emoji in our demo '🐎👱❤' string.

'\ud83d\udc0e\ud83d\udc71\u2764'.codePointAt(0)
// <- 128014
'\ud83d\udc0e\ud83d\udc71\u2764'.codePointAt(2)
// <- 128113
'\ud83d\udc0e\ud83d\udc71\u2764'.codePointAt(4)
// <- 10084

Figuring out the indices on your own may prove cumbersome, which is why you should just loop through the string iterator so that figures them out on your behalf. You can then just call .codePointAt(0) for each code point in the sequence.

for (let codePoint of '\ud83d\udc0e\ud83d\udc71\u2764') {
  console.log(codePoint.codePointAt(0))
  // <- 128014
  // <- 128113
  // <- 10084
}

Or maybe just use a combination of spread and .map.

[...'\ud83d\udc0e\ud83d\udc71\u2764'].map(cp => cp.codePointAt(0))
// <- [128014, 128113, 10084]

You could then take the hexadecimal (base-16) representation of those base-10 integers and render them on a string using the new unicode code point escape syntax of \u{codePoint}. This syntax allows you to represent unicode code points that are beyond the “basic multilingual plane” (BMP) – i.e, code points outside the [U+0000, U+FFFF] range that are typically represented using the \u1234 syntax.

Let’s start by updating our example to print the hexadecimal version of our code points.

for (let codePoint of '\ud83d\udc0e\ud83d\udc71\u2764') {
  console.log(codePoint.codePointAt(0).toString(16))
  // <- '1f40e'
  // <- '1f471'
  // <- '2764'
}

You can wrap those in '\u{codePoint}' and voilá – you’ll get the emoji out of the string once again.

'\u{1f40e}'
// <- '🐎'
'\u{1f471}'
// <- '👱'
'\u{2764}'
// <- '❤'

Yay!

String.fromCodePoint

This method takes in a number and returns a code point. Note how I can use the 0x prefix with the terse base-16 code points we got from .codePointAt moments ago.

String.fromCodePoint(0x1f40e)
// <- '🐎'
String.fromCodePoint(0x1f471)
// <- '👱'
String.fromCodePoint(0x2764)
// <- '❤'

Obviously, you can just as well use their base-10 counterparts to achieve the same results.

String.fromCodePoint(128014)
// <- '🐎'
String.fromCodePoint(128113)
// <- '👱'
String.fromCodePoint(10084)
// <- '❤'

You can pass in as many code points as you’d like.

String.fromCodePoint(128014, 128113, 10084)
// <- '🐎👱❤'

As an exercise in futility, we could map a string to their numeric representation of code points, and back to the code points themselves.

[...'\ud83d\udc0e\ud83d\udc71\u2764']
  .map(cp => cp.codePointAt(0))
  .map(cp => String.fromCodePoint(cp))
  .join('')
// <- '🐎👱❤'

Maybe you’re feeling like playing a joke on your fellow inmates – I mean, coworkers. You can now stab them to death with this piece of code that doesn’t really do anything other than converting the string into code points and then spreading those code points as parameters to String.fromCodePoint, which in turn restores the original string. As amusing as it is useless!

String.fromCodePoint(...[
  ...'\ud83d\udc0e\ud83d\udc71\u2764'
].map(cp => cp.codePointAt(0)))
// <- '🐎👱❤'

Since we’re on it, you may’ve noticed that reversing the string itself would cause issues.

'\ud83d\udc0e\ud83d\udc71\u2764'.split('').reverse().join('')

The problem is that you’re reversing individual code units as opposed to code points.

Reversing individual code units ends up breaking the code points
Reversing individual code units ends up breaking the code points

If we were to use the spread operator to split the string by its code points, and then reverse that, the code points would be preserved and the string would be properly reversed.

[...'\ud83d\udc0e\ud83d\udc71\u2764'].reverse().join('')
// <- '❤👱🐎'

This way we avoid breaking code points, but once again keep in mind that this won’t work for all grapheme clusters, as Mathias pointed out in his tweet.

[...'foo\u0305'].reverse().join('')
// <- ' ̅oof'

The last method we’ll cover today is .normalize.

String.prototype.normalize

There’s different ways to represent strings that look identical to humans, even though their code points differ. Mathias gives an example as follows.

'mañana' === 'mañana'
// <- false

What’s going on here? We have a combining tilde ̃ and an n on the right, while the left just has an ñ. These look alike, but if you look at the code points you’ll notice they’re different.

[...'mañana'].map(cp => cp.codePointAt(0).toString(16))
// <- ['6d', '61', 'f1', '61', '6e', '61']
[...'mañana'].map(cp => cp.codePointAt(0).toString(16))
// <- ['6d', '61', '6e', '303', '61', '6e', '61']

Just like with the 'foo̅' example, the second string has a length of 7, even though it is 6 glyphs long.

'mañana'.length
// <- 6
'mañana'.length
// <- 7

If we normalize the second version, we’ll get back the same code points we had in the first version.

var normalized = 'mañana'.normalize()
[...normalized].map(cp => cp.codePointAt(0).toString(16))
// <- ['6d', '61', 'f1', '61', '6e', '61']
normalized.length
// <- 6

Just for completeness’ sake, note that you can represent these code points using the '\x6d' syntax.

'\\x' + [...'mañana'].map(cp => cp.codePointAt(0).toString(16)).join('\\x')
// <- '\\x6d\\x61\\xf1\\x61\\x6e\\x61'
'\x6d\x61\xf1\x61\x6e\x61'
// <- 'mañana'

We could use .normalize on both strings to see if they’re really equal.

function compare (left, right) {
  return left.normalize() === right.normalize()
}
compare('mañana', 'mañana')
// <- true

Or, to prove the point in something that’s a bit more visible to human eyes, let’s use the \x syntax. Note that you can only use \x to represent code units with codes below 256 (\xff is 255). For anything larger than that we should use the \u escape. Such is the case of the U+0303 combining tilde.

compare(
  '\x6d\x61\xf1\x61\x6e\x61',
  '\x6d\x61\x6e\u0303\x61\x6e\x61'
)
// <- true

See you tomorrow? Modules are coming.

Many thanks to Mathias for reviewing drafts of this article, ❤!

Liked the article? Subscribe below to get an email when new articles come out! Also, follow @ponyfoo on Twitter and @ponyfoo on Facebook.
One-click unsubscribe, anytime. Learn more.

Comments (6)

Tarjei Huse wrote

Hi, this was a great article!

What I’m wondering about is how Babel handles this. That is, if I’m iterating over a string, let’s say

var mystring =" something with utf8";      
for (let c of mystring) {
 // is this handled?
}

If so, is there an overhead wrt to using the for x of y syntax in other contexts?

Dmitry Semigradsky wrote

You can see it on https://babeljs.io/repl/