"wc -c" and "wc -m" command in linux

I have a text file, its content is:

i k k

When I use wc -m to count the characters in this file, the result is 7.

Question 1: Why did I get 7? Shouldn't I get 6, even supposing that it counts the end-of-line character?

Question 2: How exactly does wc -m work?

Question 3: When I use wc -c (to count bytes), I get the same result as with wc -m, so what is the point of having two different options? They do exactly the same job, don't they? If not, what's the difference, and how does wc -c work?


Solution 1:

You should indeed have only 6 characters there. Try running

cat -A filename

to see the non-printing characters of your file. You must have something extra. If I make a file just like yours, I see

i k k$

Did you put a space? That would make 7:

i k k $

Or maybe it has a newline:

i k k$
$

which is also 7
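
You can reproduce both counts from the shell, since printf lets you control the exact bytes written:

$ printf 'i k k\n' | wc -m      # just the line and its newline
6
$ printf 'i k k\n\n' | wc -m    # with an extra blank line
7
$ printf 'i k k \n' | wc -m     # with a trailing space
7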

As you say

wc -m

counts characters and

wc -c

counts bytes. If all your characters are part of the ASCII character set, then there will be only 1 byte per character so you will get the same count from both commands.
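
For example, with the pure-ASCII line from the question, both counts agree:

$ printf 'i k k\n' | wc -mc
      6       6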

Try it on a file with non-ASCII characters:

$ echo ك > testfile
$ wc -m testfile
2 testfile
$ wc -c testfile
3 testfile

Aha! More bytes than characters now.

Solution 2:

$ locale charmap
UTF-8

In my current environment, the character set is UTF-8, that is, characters are encoded with 1 to 4 bytes per character (though because the original definition of UTF-8 allowed character code points up to 0x7fffffff, most tools would recognise UTF-8 byte sequences of up to 6 bytes).
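
To see the whole 1-to-4-byte range in action, here's a sketch (the \u and \U escapes assume bash's printf builtin, available since bash 4.2, and a UTF-8 locale):

$ printf a | wc -c              # U+0061: 1 byte
1
$ printf '\u00e9' | wc -c       # é, U+00E9: 2 bytes
2
$ printf '\u4e55' | wc -c       # 乕, U+4E55: 3 bytes
3
$ printf '\U0001F600' | wc -c   # 😀, U+1F600: 4 bytes
4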

In that character set, all the characters from Unicode are available; an a is coded as byte value 97, a 乕 as the 3 bytes 228 185 149, and é as the 2-byte sequence 195 169, for instance.

$ printf 乕 | wc -mc
  1       3
$ printf a | wc -mc
  1       1
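
You can check the byte values quoted above by dumping them with od (unsigned decimal, one byte per column):

$ printf 'a乕é' | od -An -tu1
 97 228 185 149 195 169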

Now:

$ export LC_ALL=fr_FR.iso885915@euro
$ locale charmap
ISO-8859-15

I've modified my environment; the character set is now ISO-8859-15 (other things like the language, currency symbol and date format have also been modified; the collection of those regional settings is referred to as the locale). I need to start a new terminal emulator in that environment for it to adapt its character rendering to the new locale.

ISO-8859-15 is a single-byte character set, which means it has only 256 characters (and actually even fewer than that are covered). That particular character set is used for the languages of Western Europe, as it covers most of them (and the euro symbol).

It has the a character with byte value 97 like in UTF-8 or ASCII; it also has the é character (commonly used in French or Spanish, for instance) but with byte value 233; it doesn't have the 乕 character.
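
You can check this from a shell running in that locale (a sketch; \351 is byte value 233 in octal, and \303\251 are the two bytes UTF-8 uses for é):

$ printf '\351' | wc -mc        # the ISO-8859-15 é: 1 character, 1 byte
      1       1
$ printf '\303\251' | wc -mc    # UTF-8's é read as ISO-8859-15: 2 of each
      2       2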

In that environment, wc -c and wc -m will always give the same result.

On Ubuntu, as on most modern Unix-like systems, the default is usually UTF-8, as it's the only supported character set (and encoding) that covers the whole Unicode range.

Other multi-byte character encodings exist, but they're not as well supported on Ubuntu: you have to jump through hoops to be able to generate a locale with them, and if you manage it, you'll find that many things don't work properly.

So in effect on Ubuntu, character sets are either single-byte, or UTF-8.
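
If you want to see what your own system offers, locale can list both (output varies from system to system):

$ locale -m    # charmaps the system knows about
$ locale -a    # locales currently available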

Now, a few more notes:

In UTF-8, not all byte sequences form valid characters. For instance, all UTF-8 characters that are not ASCII are made of bytes that all have the 8th bit set, but where only the first one also has the 7th bit set.
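
You can see those bit patterns directly, for instance with xxd (assuming the xxd utility is installed; note that the first byte starts with bits 11 and the continuation byte with bits 10):

$ printf 'é' | xxd -b
00000000: 11000011 10101001                                      ..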

If you have a sequence of bytes with the 8th bit set, none of which has the 7th bit set, then that can't be translated to a character. And that's where you start to have problems and inconsistencies, as software doesn't know what to do with those bytes. For instance:

$ printf '\200\200\200' | wc -mc
      0       3
$ printf '\200\200\200' | grep -q . || echo no
no

wc and grep find no character in there, but:

$ x=$'\200\200\200' bash -c 'echo "${#x}"'
3

bash finds 3. When it can't map a sequence of bytes to a character, it considers each byte a character.

It can get even more complicated, as there are code points in Unicode that are invalid as characters, and some that are non-characters, and depending on the tool, their UTF-8 encoding may or may not be counted as a character.

Another thing to take into consideration is the difference between characters and graphemes, and how they are rendered.

$ printf 'e\u301\u20dd\n'
é⃝
$ printf 'e\u301\u20dd' | wc -mc
      3       6

There, we have 3 characters, coded on 6 bytes, but rendered as one grapheme, because the 3 characters combine (one base character, a combining acute accent, and a combining enclosing circle).
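
Compare with the precomposed form of é, U+00E9, which packs the base letter and the accent into a single character (again assuming bash's printf for the \u escapes):

$ printf '\u00e9\u20dd' | wc -mc
      2       5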

The GNU implementation of wc as found on Ubuntu has a -L switch to tell you the display width of the widest line in the input:

$ printf 'e\u301\u20dd\n' | wc -L
1

You'll also find that some characters occupy 2 cells in that width calculation, like our 乕 character from above:

$ echo 乕 | wc -L
2

In conclusion: in the wider world, byte, character and grapheme are not necessarily the same thing.

Solution 3:

The difference between wc -c and wc -m is that in a locale with multi-byte characters (say, UTF-8), the former counts bytes while the latter counts characters. Consider the following file:

$ hexdump -C dummy.txt 
00000000  78 79 cf 80 0a                                    |xy...|

(for those who don't speak UTF-8, that's the letters 'x', 'y', and 'π', followed by a newline). It is five bytes long:

$ wc -c dummy.txt 
5 dummy.txt

but only four characters long:

$ wc -m dummy.txt 
4 dummy.txt
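
If you'd like to reproduce that file, one way to create it (assuming bash's printf builtin, which understands \u escapes, in a UTF-8 locale):

$ printf 'xy\u03c0\n' > dummy.txt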