Like I did in previous articles on the series, I would love to point out that you should probably set up Babel and follow along the examples with either a REPL or the
babel-node
CLI and a file. That’ll make it so much easier for you to internalize the concepts discussed in the series. If you aren’t the “install things on my computer” kind of human, you might prefer to hop on CodePen and then click on the gear icon for JavaScript – they have a Babel preprocessor which makes trying out ES6 a breeze. Another alternative that’s also quite useful is to use Babel’s online REPL – it’ll show you compiled ES5 code to the right of your ES6 code for quick comparison.
Before getting into it, let me shamelessly ask for your support if you’re enjoying my ES6 in Depth series. Your contributions will go towards helping me keep up with the schedule, server bills, keeping me fed, and maintaining Pony Foo as a veritable source of JavaScript goodies.
Thanks for reading that, and let’s go into updates to the String
object.
Updates to String
We’ve already covered template literals earlier in the series, and you may recall that those can be used to mix strings and variables to produce string output.
function greet (name) {
return `hello ${name}!`
}
greet('ponyfoo')
// <- 'hello ponyfoo!'
Besides template literals, strings are getting a numbre of new methods come ES6. These can be categorized as string manipulation methods and unicode related methods.
- String Manipulation
- Unicode
We’ll begin with the string manipulation methods and then we’ll take a look at the unicode related ones.
String.prototype.startsWith
A very common question in our code is “does this string start with this other string?”. In ES5 we’d ask that question using the .indexOf
method.
'ponyfoo'.indexOf('foo')
// <- 4
'ponyfoo'.indexOf('pony')
// <- 0
'ponyfoo'.indexOf('horse')
// <- -1
If you wanted to check if a string started with another one, you’d compare them with .indexOf
and check whether the “needle” starts at the 0
position – the beginning of the string.
'ponyfoo'.indexOf('pony') === 0
// <- true
'ponyfoo'.indexOf('foo') === 0
// <- false
'ponyfoo'.indexOf('horse') === 0
// <- false
You can now use the more descriptive and terse .startsWith
method instead.
'ponyfoo'.startsWith('pony')
// <- true
'ponyfoo'.startsWith('foo')
// <- false
'ponyfoo'.startsWith('horse')
// <- false
If you wanted to figure out whether a string contains another one starting in a specific location, it would get quite verbose, as you’d need to grab a slice of that string first.
'ponyfoo'.slice(4).indexOf('foo') === 0
// <- true
The reason why you can’t just ask === 4
is that this would give you false negatives when the query is found before reaching that index.
'foo,foo'.indexOf('foo') === 4
// <- false, because result was 0
Of course, you could use the startIndex
parameter for indexOf
to get around that. Note that we’re still comparing against 4
in this case, because the string wasn’t split into smaller parts.
'foo,foo'.indexOf('foo', 4) === 4
// <- true
Instead of keeping all of these string searching implementation details in your head and writing code that worries too much about the how and not so much about the what, you could just use startsWith
passing in the optional startIndex
parameter as well.
'foo,foo'.startsWith('foo', 4)
// <- true
Then again, it’s kind of confusing that the method is called .startsWith
but we’re starting at a non-zero index – that being said it sure beats using .indexOf
when we actually want a boolean result.
String.prototype.endsWith
This method mirrors .startsWith
in the same way that .lastIndexOf
mirrors .indexOf
. It tells us whether a string ends with another string.
'ponyfoo'.endsWith('foo')
// <- true
'ponyfoo'.endsWith('pony')
// <- false
Just like .startsWith
, we have a position index that indicates where the lookup should end. It defaults to the length of the string.
'ponyfoo'.endsWith('foo', 7)
// <- true
'ponyfoo'.endsWith('pony', 0)
// <- false
'ponyfoo'.endsWith('pony', 4)
// <- true
Yet another method that simplifies a specific use case for .indexOf
is .includes
.
String.prototype.includes
You can use .includes
to figure out whether a string contains another one.
'ponyfoo'.includes('ny')
// <- true
'ponyfoo'.includes('sf')
// <- false
This is equivalent to the ES5 use case of .indexOf
where we’d compare its results with -1
to see if the search string was anywhere to be found.
'ponyfoo'.indexOf('ny') !== -1
// <- true
'ponyfoo'.indexOf('sf') !== -1
// <- false
Naturally you can also pass in a start index where the search should begin.
'ponyfoo'.includes('ny', 3)
// <- false
'ponyfoo'.includes('ny', 2)
// <- true
Let’s move onto something that’s not an .indexOf
replacement.
String.prototype.repeat
This handy method allows you to repeat a string count
times.
'na'.repeat(0)
// <- ''
'na'.repeat(1)
// <- 'na'
'na'.repeat(2)
// <- 'nana'
'na'.repeat(5)
// <- 'nanananana'
The provided count
should be a positive finite number.
'na'.repeat(Infinity)
// <- RangeError
'na'.repeat(-1)
// <- RangeError
Non-numeric values are coerced into numbers.
'na'.repeat('na')
// <- ''
'na'.repeat('3')
// <- 'nanana'
Using NaN
is as good as 0
.
'na'.repeat(NaN)
// <- ''
Decimal values are floored.
'na'.repeat(3.9)
// <- 'nanana', count was floored to 3
Values in the (-1, 0)
range are rounded to -0
becase count
is passed through ToInteger
, as documented by the specification. That step in the specification dictates that count
be casted with a formula like the one below.
function ToInteger (number) {
return Math.floor(Math.abs(number)) * Math.sign(number)
}
The above translates to -0
for any values in the (-1, 0)
range. Numbers below that will throw, and numbers above that won’t behave surprisingly, as you can only take Math.floor
into account for positive values.
'na'.repeat(-0.1)
// <- '', count was rounded to -0
A good example use case for .repeat
may be your typical “padding” method. The method shown below takes a multiline string and pads every line with as many spaces
as desired.
function pad (text, spaces) {
return text.split('\n').map(line => ' '.repeat(spaces) + line).join('\n')
}
pad('a\nb\nc', 2)
// <- ' a\n b\n c'
In ES6, strings adhere to the iterable protocol.
String.prototype[Symbol.iterator]
Before ES6, you could access each code unit (we’ll define these in a second) in a string via indices – kind of like with arrays. That meant you could loop over code units in a string with a for
loop.
var text = 'foo'
for (let i = 0; i < text.length; i++) {
console.log(text[i])
// <- 'f'
// <- 'o'
// <- 'o'
}
In ES6, you could loop over the code points (not the same as code units) of a string using a for..of
loop, because strings are iterable.
for (let codePoint of 'foo') {
console.log(codePoint)
// <- 'f'
// <- 'o'
// <- 'o'
}
What is this codePoint
variable? There is a not-so-subtle distinction between code units and code points. Let’s switch protocols and talk about Unicode.
Unicode
JavaScript strings are represented using UTF-16 code units. Each code unit can be used to represent a code point in the [U+0000, U+FFFF]
range – also known as the “basic multilingual plane” (BMP). You can represent individual code points in the BMP plane using the '\u3456'
syntax. You could also represent code units in the [U+0000, U+0255]
using the \x00..\xff
notation. For instance, '\xbb'
represents '»'
, the 187
character, as you can verify by doing parseInt('bb', 16)
– or String.fromCharCode(187)
.
For code points beyond U+FFFF
, you’d represent them as a surrogate pair. That is to say, two contiguous code units. For instance, the horse emoji ''
code point is represented with the '\ud83d\udc0e'
contiguous code units. In ES6 notation you can also represent code points using the '\u{1f40e}'
notation (that example is also the horse emoji). Note that the internal representation hasn’t changed, so there’s still two code units behind that code point. In fact, '\u{1f40e}'.length
evaluates to 2
.
The '\ud83d\udc0e\ud83d\udc71\u2764'
string found below evaluates to a few emoji.
'\ud83d\udc0e\ud83d\udc71\u2764'
// <- ''
While that string consists of 5 code units, we know that the length should really be three – as there’s only three emoji.
'\ud83d\udc0e\ud83d\udc71\u2764'.length
// <- 5
''.length
// <- 5, still
Before ES6, JavaScript didn’t make any effort to figure out unicode quirks on your behalf – you were pretty much on your own when it came to counting cards (err, code points). Take for instance Object.keys
, still five code units long.
console.log(Object.keys(''))
// <- ['0', '1', '2', '3', '4']
If we now go back to our for
loop, we can observe how this is a problem. We actually wanted ''
, ''
, ''
, but we didn’t get that.
var text = ''
for (let i = 0; i < text.length; i++) {
console.log(text[i])
// <- '?'
// <- '?'
// <- '?'
// <- '?'
// <- ''
}
Instead, we got some weird unicode boxes – and that’s if we were lucky and looking at Firefox.
That didn’t turn out okay. In ES6 we can use the string iterator to go over the code points instead. The iterators produced by the string iterable are aware of this limitation of looping by code units, and so they yield code points instead.
for (let codePoint of '') {
console.log(codePoint)
// <- ''
// <- ''
// <- ''
}
If we want to measure the length, we’d have trouble with the .length
property, as we saw earlier. We can use the iterator to split the string into its code points – as seen in the for..of
example we just went over. That means the unicode-aware length of a string equals the length of the array that contains the sequence of code points produced by an iterator. We could use the spread operator to place the code points in an array, and then pull that array’s .length
.
[...''].length
// <- 3
Keep in mind that splitting strings into code points isn’t enough if you want to be 100% precise about string length. Take for instance the “combining overline” \u0305
unicode code unit. On its own, this code unit is just an overline.
'\u0305'
// <- ' ̅'
When preceded by another code unit, however, they are combined together into a single glyph.
'_\u0305'
// <- '_̅'
'foo\u0305'
// <- 'foo̅'
Attempts to figure out the actual length by counting code points prove insufficient – just like using .length
.
'foo\u0305'.length
// <- 4
'foo̅'.length
// <- 4
[...'foo̅'].length
// <- 4
I was confused about this one as I’m no expert when it comes to unicode. So I went to someone who is an expert – Mathias Bynens. He promptly pointed out that – indeed – splitting by code points isn’t enough. Unlike surrogate pairs like the emojis we’ve used in our earlier examples, other grapheme clusters aren’t taken into account by the string iterator.
In these cases we’re out of luck, and we simply have to fall back to regular expressions to correctly calculate the string length. For a comprehensive discussion of the subject I suggest you read his excellent “JavaScript has a Unicode problem” piece.
Let’s look at the other methods.
String.prototype.codePointAt
You can use .codePointAt
to get the base-10 numeric representation of a code point at a given position in a string. Note that the position is indexed by code unit, not by code point. In the example below we print the code points for each of the three emoji in our demo ''
string.
'\ud83d\udc0e\ud83d\udc71\u2764'.codePointAt(0)
// <- 128014
'\ud83d\udc0e\ud83d\udc71\u2764'.codePointAt(2)
// <- 128113
'\ud83d\udc0e\ud83d\udc71\u2764'.codePointAt(4)
// <- 10084
Figuring out the indices on your own may prove cumbersome, which is why you should just loop through the string iterator so that figures them out on your behalf. You can then just call .codePointAt(0)
for each code point in the sequence.
for (let codePoint of '\ud83d\udc0e\ud83d\udc71\u2764') {
console.log(codePoint.codePointAt(0))
// <- 128014
// <- 128113
// <- 10084
}
Or maybe just use a combination of spread and .map
.
[...'\ud83d\udc0e\ud83d\udc71\u2764'].map(cp => cp.codePointAt(0))
// <- [128014, 128113, 10084]
You could then take the hexadecimal (base-16) representation of those base-10 integers and render them on a string using the new unicode code point escape syntax of \u{codePoint}
. This syntax allows you to represent unicode code points that are beyond the “basic multilingual plane” (BMP) – i.e, code points outside the [U+0000, U+FFFF]
range that are typically represented using the \u1234
syntax.
Let’s start by updating our example to print the hexadecimal version of our code points.
for (let codePoint of '\ud83d\udc0e\ud83d\udc71\u2764') {
console.log(codePoint.codePointAt(0).toString(16))
// <- '1f40e'
// <- '1f471'
// <- '2764'
}
You can wrap those in '\u{codePoint}'
and voilá – you’ll get the emoji out of the string once again.
'\u{1f40e}'
// <- ''
'\u{1f471}'
// <- ''
'\u{2764}'
// <- ''
Yay!
String.fromCodePoint
This method takes in a number and returns a code point. Note how I can use the 0x
prefix with the terse base-16 code points we got from .codePointAt
moments ago.
String.fromCodePoint(0x1f40e)
// <- ''
String.fromCodePoint(0x1f471)
// <- ''
String.fromCodePoint(0x2764)
// <- ''
Obviously, you can just as well use their base-10 counterparts to achieve the same results.
String.fromCodePoint(128014)
// <- ''
String.fromCodePoint(128113)
// <- ''
String.fromCodePoint(10084)
// <- ''
You can pass in as many code points as you’d like.
String.fromCodePoint(128014, 128113, 10084)
// <- ''
As an exercise in futility, we could map a string to their numeric representation of code points, and back to the code points themselves.
[...'\ud83d\udc0e\ud83d\udc71\u2764']
.map(cp => cp.codePointAt(0))
.map(cp => String.fromCodePoint(cp))
.join('')
// <- ''
Maybe you’re feeling like playing a joke on your fellow inmates – I mean, coworkers. You can now stab them to death with this piece of code that doesn’t really do anything other than converting the string into code points and then spreading those code points as parameters to String.fromCodePoint
, which in turn restores the original string. As amusing as it is useless!
String.fromCodePoint(...[
...'\ud83d\udc0e\ud83d\udc71\u2764'
].map(cp => cp.codePointAt(0)))
// <- ''
Since we’re on it, you may’ve noticed that reversing the string itself would cause issues.
'\ud83d\udc0e\ud83d\udc71\u2764'.split('').reverse().join('')
The problem is that you’re reversing individual code units as opposed to code points.
If we were to use the spread operator to split the string by its code points, and then reverse that, the code points would be preserved and the string would be properly reversed.
[...'\ud83d\udc0e\ud83d\udc71\u2764'].reverse().join('')
// <- ''
This way we avoid breaking code points, but once again keep in mind that this won’t work for all grapheme clusters, as Mathias pointed out in his tweet.
[...'foo\u0305'].reverse().join('')
// <- ' ̅oof'
The last method we’ll cover today is .normalize
.
String.prototype.normalize
There’s different ways to represent strings that look identical to humans, even though their code points differ. Mathias gives an example as follows.
'mañana' === 'mañana'
// <- false
What’s going on here? We have a combining tilde ̃
and an n
on the right, while the left just has an ñ
. These look alike, but if you look at the code points you’ll notice they’re different.
[...'mañana'].map(cp => cp.codePointAt(0).toString(16))
// <- ['6d', '61', 'f1', '61', '6e', '61']
[...'mañana'].map(cp => cp.codePointAt(0).toString(16))
// <- ['6d', '61', '6e', '303', '61', '6e', '61']
Just like with the 'foo̅'
example, the second string has a length of 7
, even though it is 6
glyphs long.
'mañana'.length
// <- 6
'mañana'.length
// <- 7
If we normalize the second version, we’ll get back the same code points we had in the first version.
var normalized = 'mañana'.normalize()
[...normalized].map(cp => cp.codePointAt(0).toString(16))
// <- ['6d', '61', 'f1', '61', '6e', '61']
normalized.length
// <- 6
Just for completeness’ sake, note that you can represent these code points using the '\x6d'
syntax.
'\\x' + [...'mañana'].map(cp => cp.codePointAt(0).toString(16)).join('\\x')
// <- '\\x6d\\x61\\xf1\\x61\\x6e\\x61'
'\x6d\x61\xf1\x61\x6e\x61'
// <- 'mañana'
We could use .normalize
on both strings to see if they’re really equal.
function compare (left, right) {
return left.normalize() === right.normalize()
}
compare('mañana', 'mañana')
// <- true
Or, to prove the point in something that’s a bit more visible to human eyes, let’s use the \x
syntax. Note that you can only use \x
to represent code units with codes below 256 (\xff
is 255
). For anything larger than that we should use the \u
escape. Such is the case of the U+0303
combining tilde.
compare(
'\x6d\x61\xf1\x61\x6e\x61',
'\x6d\x61\x6e\u0303\x61\x6e\x61'
)
// <- true
See you tomorrow? Modules are coming.
Many thanks to Mathias for reviewing drafts of this article, !
Comments