String length in bytes in JavaScript
Solution 1:
Years passed and nowadays you can do it natively
(new TextEncoder().encode('foo')).length
Note that it's not supported by IE (you may use a polyfill for that).
MDN documentation
Standard specifications
Solution 2:
There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)
For historical reference or where TextEncoder APIs are still unavailable.
If you know the character encoding, you can calculate it yourself though.
encodeURIComponent
assumes UTF-8 as the character encoding, so if you need that encoding, you can do,
function lengthInUtf8Bytes(str) {
// Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
var m = encodeURIComponent(str).match(/%[89ABab]/g);
return str.length + (m ? m.length : 0);
}
This should work because of the way UTF-8 encodes multi-byte sequences. The first encoded byte always starts with either a high bit of zero for a single byte sequence, or a byte whose first hex digit is C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.
The table in wikipedia makes it clearer
Bits Last code point Byte 1 Byte 2 Byte 3
7 U+007F 0xxxxxxx
11 U+07FF 110xxxxx 10xxxxxx
16 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
...
If instead you need to understand the page encoding, you can use this trick:
function lengthInPageEncoding(s) {
var a = document.createElement('A');
a.href = '#' + s;
var sEncoded = a.href;
sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
var m = sEncoded.match(/%[0-9a-f]{2}/g);
return sEncoded.length - (m ? m.length * 2 : 0);
}
Solution 3:
Here is a much faster version, which doesn't use regular expressions, nor encodeURIComponent():
function byteLength(str) {
// returns the byte length of an utf8 string
var s = str.length;
for (var i=str.length-1; i>=0; i--) {
var code = str.charCodeAt(i);
if (code > 0x7f && code <= 0x7ff) s++;
else if (code > 0x7ff && code <= 0xffff) s+=2;
if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
}
return s;
}
Here is a performance comparison.
It just computes the length in UTF8 of each unicode codepoints returned by charCodeAt() (based on wikipedia's descriptions of UTF8, and UTF16 surrogate characters).
It follows RFC3629 (where UTF-8 characters are at most 4-bytes long).
Solution 4:
For simple UTF-8 encoding, with slightly better compatibility than TextEncoder
, Blob does the trick. Won't work in very old browsers though.
new Blob(["😀"]).size; // -> 4
Solution 5:
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
return encodeURI(s).split(/%..|./).length - 1;
}
Source