Counting repeated characters in a string in Python

import collections

thestring = 'i am checking this string'  # example input; substitute your own

d = collections.defaultdict(int)
for c in thestring:
    d[c] += 1

A collections.defaultdict is like a dict (it actually subclasses it), but when a key is looked up and not found, instead of raising KeyError it creates a default value by calling the 0-argument callable you supplied and inserts it under that key. The most popular are defaultdict(int), for counting (or, equivalently, for making a multiset AKA bag data structure), and defaultdict(list), which does away forever with the need for .setdefault(akey, []).append(avalue) and similar awkward idioms.
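For illustration, here is that defaultdict(list) idiom next to its setdefault equivalent (with made-up sample data):

import collections

pairs = [('a', 1), ('b', 2), ('a', 3)]  # made-up sample data

# The awkward idiom: setdefault constructs a fresh [] on every call.
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

# The defaultdict version: the list factory runs only for missing keys.
grouped = collections.defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

print(grouped)  # defaultdict(<class 'list'>, {'a': [1, 3], 'b': [2]})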

So once you've run the counting loop above, d is a dict-like container mapping every character to the number of times it appears, and you can emit it any way you like, of course. For example, most-popular character first:

for c in sorted(d, key=d.get, reverse=True):
  print('%s %6d' % (c, d[c]))

My first idea was to do this:

chars = "abcdefghijklmnopqrstuvwxyz"
check_string = "i am checking this string to see how many times each character appears"

for char in chars:
  count = check_string.count(char)
  if count > 1:
    print(char, count)

This is not a good idea, however! This is going to scan the string 26 times, so you're going to potentially do 26 times more work than some of the other answers. You really should do this:

count = {}
for s in check_string:
  if s in count:
    count[s] += 1
  else:
    count[s] = 1

for key in count:
  if count[key] > 1:
    print(key, count[key])

This ensures that you only go through the string once, instead of 26 times.

Also, Alex's answer is a great one - I was not familiar with the collections module. I'll be using that in the future. His answer is more concise than mine is and technically superior. I recommend using his code over mine.


Python 2.7+ includes the collections.Counter class:

import collections
results = collections.Counter(the_string)
print(results)
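Counter also sorts the report for you via its most_common method, which reproduces the "most popular character first" output from earlier:

import collections

the_string = 'i am checking this string'  # example input
for char, count in collections.Counter(the_string).most_common():
    print('%s %6d' % (char, count))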

Grand Performance Comparison

Scroll to the end for a TL;DR graph

Since I had "nothing better to do" (read: I just had a lot of work), I decided to hold a little performance contest. I assembled the most sensible or interesting answers and ran some simple timeit benchmarks on them in CPython 3.5.1. I tested them with only one string, which is a typical input in my case:

>>> s = 'ZDXMZKMXFDKXZFKZ'
>>> len(s)
16

Be aware that results might vary for different inputs, be it a different string length, a different number of distinct characters, or a different average number of occurrences per character.


Don't reinvent the wheel

Python has made it simple for us. The collections.Counter class does exactly what we want and a lot more. Its usage is by far the simplest of all the methods mentioned here.

taken from @oefe, nice find (this assumes from collections import Counter is in scope)

>>> timeit('Counter(s)', globals=locals())
8.208566107001388

Counter goes the extra mile, which is why it takes so long.

¿Dictionary, comprende?

Let's try using a simple dict instead. First, let's do it declaratively, using a dict comprehension.

I came up with this myself...

>>> timeit('{c: s.count(c) for c in s}', globals=locals())
4.551155784000002

This will go through s from beginning to end, and for each character it will count the number of its occurrences in s. Since s contains duplicate characters, the above method searches s several times for the same character. The result is naturally always the same. So let's count the number of occurrences just once for each character.

I came up with this myself, and so did @IrshadBhat

>>> timeit('{c: s.count(c) for c in set(s)}', globals=locals())
3.1484066140001232

Better. But we still have to search through the string to count the occurrences. One search for each distinct character. That means we're going to read the string more than once. We can do better than that! But for that, we have to get off our declarativist high horse and descend into an imperative mindset.

Exceptional code

AKA Gotta catch 'em all!

inspired by @anthony

>>> timeit('''
... d = {}
... for c in s:
...   try:
...     d[c] += 1
...   except KeyError:
...     d[c] = 1
... ''', globals=locals())
3.7060273620008957

Well, it was worth a try. If you dig into the Python source (I can't say with certainty because I have never really done that), you will probably find that when you do except ExceptionType, Python has to check whether the exception raised is actually of ExceptionType or some other type. Just for the heck of it, let's see how long it will take if we omit that check and catch all exceptions.

made by @anthony

>>> timeit('''
... d = {}
... for c in s:
...   try:
...     d[c] += 1
...   except:
...     d[c] = 1
... ''', globals=locals())
3.3506563019982423

It does save some time, so one might be tempted to use this as some sort of optimization.
Don't do that! Or actually do. Do it now:

INTERLUDE 1

import time
while True:
  try:
    time.sleep(1)
  except:
    print("You're trapped in your own trap!")

You see? It catches KeyboardInterrupt, besides other things. In fact, it catches all the exceptions there are. Including ones you might not have even heard about, like SystemExit.

INTERLUDE 2

import sys
try:
  print("Goodbye. I'm going to die soon.")
  sys.exit()
except:
  print('BACK FROM THE DEAD!!!')
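If you genuinely need a broad catch, except Exception is the safer spelling: KeyboardInterrupt and SystemExit derive directly from BaseException, not from Exception, so they pass through. A minimal illustration:

import sys

try:
  sys.exit()
except Exception:
  print('never printed: SystemExit is not an Exception subclass')
except BaseException as exc:
  print('caught %r only because we asked for BaseException' % exc)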

Now back to counting letters and numbers and other characters.

Playing catch-up

Exceptions aren't the way to go. You have to try hard to catch up with them, and when you finally do, they just throw up on you and then raise their eyebrows like it's your fault. Luckily brave fellows have paved our way so we can do away with exceptions, at least in this little exercise.

The dict class has a nice method – get – which allows us to retrieve an item from a dictionary, just like d[k], except that when the key k is not in the dictionary, it returns a default value instead of raising KeyError. Let's use that method instead of fiddling with exceptions.

credit goes to @Usman

>>> timeit('''
... d = {}
... for c in s:
...   d[c] = d.get(c, 0) + 1
... ''', globals=locals())
3.2133633289995487

Almost as fast as the set-based dict comprehension. On larger inputs, this one would probably be even faster.

Use the right tool for the job

For an at least mildly knowledgeable Python programmer, the first thing that comes to mind is probably defaultdict. It does pretty much the same thing as the version above, except instead of a value, you give it a value factory. That might cause some overhead, because the value has to be "constructed" for each missing key individually. Let's see how it performs.

hope @AlexMartelli won't crucify me for from collections import defaultdict

>>> timeit('''
... dd = defaultdict(int)
... for c in s:
...   dd[c] += 1
... ''', globals=locals())
3.3430528169992613

Not that bad. I'd say the increase in execution time is a small tax to pay for the improved readability. However, we also favor performance, and we will not stop here. Let's take it further and prepopulate the dictionary with zeros. Then we won't have to check every time if the item is already there.

hats off to @sqram

>>> timeit('''
... d = dict.fromkeys(s, 0)
... for c in s:
...   d[c] += 1
... ''', globals=locals())
2.6081761489986093

That's good. Over three times as fast as Counter, yet still simple enough. Personally, this is my favorite in case you don't want to add new characters later. And even if you do, you can still do it. It's just less convenient than it would be in other versions:

d.update({ c: 0 for c in set(other_string) - d.keys() })
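For example, a hypothetical continuation that folds a second string into the same counts, using the line above to admit the unseen characters first:

s = 'ZDXMZKMXFDKXZFKZ'
d = dict.fromkeys(s, 0)
for c in s:
    d[c] += 1

other_string = 'HELLO'  # hypothetical second input with new characters
d.update({c: 0 for c in set(other_string) - d.keys()})
for c in other_string:
    d[c] += 1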


Practicality beats purity (except when it's not really practical)

Now for a bit different kind of counter. @IdanK has come up with something interesting. Instead of using a hash table (a.k.a. dictionary a.k.a. dict), we can use a list, avoiding the risk of hash collisions and the consequent overhead of their resolution, the overhead of hashing the key, and the extra unoccupied table space.

The ASCII values of characters will be indices and their counts will be values. As @IdanK has pointed out, this list gives us constant-time access to a character's count. All we have to do is convert each character from str to int using the built-in function ord. That gives us an index into the list, which we then use to increment the count of the character.

So what we do is this: we initialize the list with zeros, do the job, and then convert the list into a dict. This dict will only contain those characters which have non-zero counts, in order to make it compliant with the other versions.

As a side note, this technique is used in a linear-time sorting algorithm known as count sort or counting sort. It's very efficient, but the range of values being sorted is limited, since each value has to have its own counter. To sort a sequence of 32-bit integers, 4.3 billion counters would be needed.
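As an aside, here is a minimal counting sort sketch over bytes (values 0–255), just to illustrate the technique; the function name is made up:

def counting_sort_bytes(data: bytes) -> bytes:
    # One counter per possible value; only practical for small value ranges.
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    # Emit each value as many times as it was seen, in ascending order.
    return bytes(b for b in range(256) for _ in range(counts[b]))

print(counting_sort_bytes(b'counting sort'))  # b' cginnoorsttu'

Now back to our list-based counter: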

>>> timeit('''
... counts = [0 for _ in range(256)]
... for c in s:
...   counts[ord(c)] += 1
... d = {chr(i): count for i,count in enumerate(counts) if count != 0}
... ''', globals=locals())
25.438595562001865

Ouch! Not cool! Let's try and see how long it takes when we omit building the dictionary.

>>> timeit('''
... counts = [0 for _ in range(256)]
... for c in s:
...   counts[ord(c)] += 1
... ''', globals=locals())
10.564866792999965

Still bad. But wait, what's [0 for _ in range(256)]? Can't we write it more simply? How about [0] * 256? That's cleaner. But will it perform better?

>>> timeit('''
... counts = [0] * 256
... for c in s:
...   counts[ord(c)] += 1
... ''', globals=locals())
3.290163638001104

Considerably. Now let's put the dictionary back in.

>>> timeit('''
... counts = [0] * 256
... for c in s:
...   counts[ord(c)] += 1
... d = {chr(i): count for i,count in enumerate(counts) if count != 0}
... ''', globals=locals())
18.000623562998953

Almost six times slower. Why does it take so long? Because when we enumerate(counts), we have to check every one of the 256 counts and see if it's zero. But we already know which counts are zero and which are not.

>>> timeit('''
... counts = [0] * 256
... for c in s:
...   counts[ord(c)] += 1
... d = {c: counts[ord(c)] for c in set(s)}
... ''', globals=locals())
5.826531438000529

It probably won't get much better than that, at least not for such a small input. Plus it's only usable for 8-bit extended ASCII characters. Oh, damn!

And the winner is...

>>> timeit('''
... d = {}
... for c in s:
...   if c in d:
...     d[c] += 1
...   else:
...     d[c] = 1
... ''', globals=locals())
1.8509794599995075

Yep. Even if you have to check every time whether c is in d, for this input it's the fastest way. No pre-population of d will make it faster (again, for this input). It's a lot more verbose than Counter or defaultdict, but also more efficient.


That's all folks

This little exercise teaches us a lesson: when optimizing, always measure performance, ideally with your expected inputs. Optimize for the common case. Don't presume something is actually more efficient just because its asymptotic complexity is lower. And last but not least, keep readability in mind. Try to find a compromise between "computer-friendly" and "human-friendly".
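A minimal sketch of such a measurement with the standard-library timeit module (the input and run count here are arbitrary stand-ins for your real workload):

import timeit

setup = "s = 'ZDXMZKMXFDKXZFKZ' * 100"  # approximate your expected input
stmt = """
d = {}
for c in s:
    if c in d:
        d[c] += 1
    else:
        d[c] = 1
"""

# Total seconds for 10,000 runs; divide by the run count for per-call cost.
print(timeit.timeit(stmt, setup=setup, number=10_000))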



UPDATE

I have been informed by @MartijnPieters of the function collections._count_elements available in Python 3.

Help on built-in function _count_elements in module _collections:

_count_elements(...)
    _count_elements(mapping, iterable) -> None

    Count elements in the iterable, updating the mappping

This function is implemented in C, so it should be faster, but this extra performance comes at a price. The price is incompatibility with Python 2 and possibly even future versions, since we're using a private function.

From the documentation:

[...] a name prefixed with an underscore (e.g. _spam) should be treated as a non-public part of the API (whether it is a function, a method or a data member). It should be considered an implementation detail and subject to change without notice.
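A guarded import keeps the code working even if the private name disappears in some future version; this mirrors what collections/__init__.py itself does:

try:
    from _collections import _count_elements  # fast C implementation
except ImportError:
    def _count_elements(mapping, iterable):
        # Pure-Python fallback with the same behavior.
        get = mapping.get
        for element in iterable:
            mapping[element] = get(element, 0) + 1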

That said, if you still want to save those 620 nanoseconds per iteration:

>>> timeit('''
... d = {}
... _count_elements(d, s)
... ''', globals=locals())
1.229239897998923



UPDATE 2: Large strings

I thought it might be a good idea to re-run the tests on some larger input, since a 16-character string is such a small input that all the possible solutions were quite comparably fast (1,000 iterations in under 30 milliseconds).

I decided to use the complete works of Shakespeare as a testing corpus, which turned out to be quite a challenge (since it's over 5MiB in size 😅). I just used the first 100,000 characters of it, and I had to limit the number of iterations from 1,000,000 to 1,000.

import urllib.request
url = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
s = urllib.request.urlopen(url).read(100_000)

Note that read returns bytes, so this s is a bytes object; the NumPy appendix below relies on exactly that.

collections.Counter was really slow on a small input, but the tables have turned

Counter(s)

=> 7.63926783799991

Naïve Θ(n²) time dictionary comprehension simply doesn't work

{c: s.count(c) for c in s}

=> 15347.603935000052 (tested on 10 iterations; adjusted for 1,000)

Smart Θ(n) time dictionary comprehension works fine

{c: s.count(c) for c in set(s)}

=> 8.882608592999986

Exceptions are clumsy and slow

d = {}
for c in s:
  try:
    d[c] += 1
  except KeyError:
    d[c] = 1

=> 21.26615508399982

Omitting the exception type check doesn't save time (since the exception is only thrown a few times)

d = {}
for c in s:
  try:
    d[c] += 1
  except:
    d[c] = 1

=> 21.943328911999743

dict.get looks nice but runs slow

d = {}
for c in s:
  d[c] = d.get(c, 0) + 1

=> 28.530086210000007

collections.defaultdict isn't very fast either

dd = defaultdict(int)
for c in s:
  dd[c] += 1

=> 19.43012963199999

dict.fromkeys requires reading the (very long) string twice

d = dict.fromkeys(s, 0)
for c in s:
  d[c] += 1

=> 22.70960557699999

Using list instead of dict is neither nice nor fast

counts = [0 for _ in range(256)]
for c in s:
  counts[ord(c)] += 1

d = {chr(i): count for i,count in enumerate(counts) if count != 0}

=> 26.535474792000002

Leaving out the final conversion to dict doesn't help

counts = [0 for _ in range(256)]
for c in s:
  counts[ord(c)] += 1

=> 26.27811567400005

It doesn't matter how you construct the list, since it's not the bottleneck

counts = [0] * 256
for c in s:
  counts[ord(c)] += 1

=> 25.863524940000048


counts = [0] * 256
for c in s:
  counts[ord(c)] += 1

d = {chr(i): count for i,count in enumerate(counts) if count != 0}

=> 26.416733378000004

If you convert list to dict the "smart" way, it's even slower (since you iterate over the string twice)

counts = [0] * 256
for c in s:
  counts[ord(c)] += 1

d = {c: counts[ord(c)] for c in set(s)}

=> 29.492915620000076

The dict.__contains__ variant may be fast for small strings, but not so much for big ones

d = {}
for c in s:
  if c in d:
    d[c] += 1
  else:
    d[c] = 1

=> 23.773295123000025

collections._count_elements is about as fast as collections.Counter (which uses _count_elements internally)

d = {}
_count_elements(d, s)

=> 7.5814381919999505


Final verdict: Use collections.Counter unless you cannot or don't want to :)



Appendix: NumPy

The numpy package provides a function numpy.unique which accomplishes (almost) precisely what we want.

The way this method works is very different from all the above methods:

  • It first sorts a copy of the input using Quicksort, which is an O(n²) time operation in the worst case, albeit O(n log n) on average and in the best case.

  • Then it creates a "mask" array containing True at indices where a run of the same values begins, viz. at indices where the value differs from the previous value. Repeated values produce False in the mask. Example: [5,5,5,8,9,9] produces a mask [True, False, False, True, True, False].

  • This mask is then used to extract the unique values from the sorted input ‒ unique_chars in the code below. In our example, they would be [5, 8, 9].

  • Positions of the True values in the mask are taken into an array, and the length of the input is appended at the end of this array. For the above example, this array would be [0, 3, 4, 6].

  • For this array, differences between consecutive elements are calculated, e.g. [3, 1, 2]. These are the respective counts of the elements in the sorted array ‒ char_counts in the code below.

  • Finally, we create a dictionary by zipping unique_chars and char_counts: {5: 3, 8: 1, 9: 2}.
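Here is a step-by-step sketch of those bullets on the example values (just an illustration, not the actual implementation of numpy.unique):

import numpy as np

a = np.sort(np.array([9, 5, 8, 5, 9, 5]))  # sorted copy: [5 5 5 8 9 9]
mask = np.empty(len(a), dtype=bool)
mask[0] = True                     # a run always starts at index 0
mask[1:] = a[1:] != a[:-1]         # True where the value changes
unique_chars = a[mask]             # [5 8 9]
starts = np.concatenate((np.flatnonzero(mask), [len(a)]))  # [0 3 4 6]
char_counts = np.diff(starts)      # [3 1 2]
print(dict(zip(unique_chars.tolist(), char_counts.tolist())))  # {5: 3, 8: 1, 9: 2}

And now the actual implementation in terms of numpy.unique: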


import numpy as np

def count_chars(s):
  # The following statement needs to be changed for different input types.
  # Our input `s` is actually of type `bytes`, so we use `np.frombuffer`.
  # For inputs of type `str`, change `np.frombuffer` to `np.fromstring`
  #  or transform the input into a `bytes` instance.
  arr = np.frombuffer(s, dtype=np.uint8)

  unique_chars, char_counts = np.unique(arr, return_counts=True)

  return dict(zip(unique_chars, char_counts))

For the test input (first 100,000 characters of the complete works of Shakespeare), this method performs better than any other tested here. But note that on a different input, this approach might yield worse performance than the other methods. Pre-sortedness of the input and number of repetitions per element are important factors affecting the performance.

count_chars(s)

=> 2.960809530000006


If you are thinking about using this method because it's over twice as fast as collections.Counter, consider this:

  • collections.Counter has linear time complexity. numpy.unique sorts first, so it is O(n log n) at best and quadratic at worst.

  • The speedup is not really that significant ‒ you save ~4.7 milliseconds per iteration on an input of length 100,000 (7.6 ms for Counter vs 2.9 ms here).

  • Using numpy.unique obviously requires numpy.

That considered, it seems reasonable to use Counter unless you need to be really fast. And in that case, you'd better know what you're doing, or else you'll end up slower with numpy than without it.



Appendix 2: A somewhat useful plot

I ran the 13 different methods above on prefixes of the complete works of Shakespeare and made an interactive plot. Note that in the plot, both prefix lengths and durations are displayed on a logarithmic scale (the prefixes used are of exponentially increasing length). Click on the items in the legend to show/hide them in the plot.

[Interactive plot of the benchmark results]


This is the shortest, most practical solution I can come up with without importing extra modules.

text = "hello cruel world. This is a sample text"
d = dict.fromkeys(text, 0)
for c in text: d[c] += 1

print(d['a']) would output 2

And it's also fast.