Counting repeated characters in a string in Python
import collections
d = collections.defaultdict(int)
for c in thestring:
    d[c] += 1
A collections.defaultdict is like a dict (it subclasses dict, actually), but when an entry is looked up and not found, instead of raising a KeyError it creates the entry by calling the supplied zero-argument callable and inserts the result. The most popular are defaultdict(int), for counting (or, equivalently, for making a multiset AKA bag data structure), and defaultdict(list), which does away forever with the need for .setdefault(akey, []).append(avalue) and similar awkward idioms.
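As an illustration of that last point, here is a minimal sketch (the pairs/grouped names are made up for the example, not taken from the question) showing how defaultdict(list) replaces the setdefault idiom:

import collections

pairs = [('a', 1), ('b', 2), ('a', 3)]

# the awkward idiom with a plain dict
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

# the same grouping with defaultdict(list): missing entries are created automatically
grouped = collections.defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# either way, grouped now maps 'a' -> [1, 3] and 'b' -> [2]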
So once you've run the counting loop at the top, d is a dict-like container mapping every character to the number of times it appears, and you can emit it any way you like, of course. For example, most-popular character first:
for c in sorted(d, key=d.get, reverse=True):
    print('%s %6d' % (c, d[c]))
My first idea was to do this:
chars = "abcdefghijklmnopqrstuvwxyz"
check_string = "i am checking this string to see how many times each character appears"
for char in chars:
    count = check_string.count(char)
    if count > 1:
        print(char, count)
This is not a good idea, however! This is going to scan the string 26 times, so you're going to potentially do 26 times more work than some of the other answers. You really should do this:
count = {}
for s in check_string:
    if s in count:
        count[s] += 1
    else:
        count[s] = 1

for key in count:
    if count[key] > 1:
        print(key, count[key])
This ensures that you only go through the string once, instead of 26 times.
Also, Alex's answer is a great one - I was not familiar with the collections module. I'll be using that in the future. His answer is more concise than mine is and technically superior. I recommend using his code over mine.
Python 2.7+ includes the collections.Counter class:
import collections
results = collections.Counter(the_string)
print(results)
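As a side note (not from the original answer), Counter can also reproduce the most-popular-first listing from above via its most_common() method; the string here is just an illustrative example:

import collections

results = collections.Counter("abracadabra")
print(results)  # e.g. Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
for char, count in results.most_common():
    print('%s %6d' % (char, count))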
Grand Performance Comparison
Scroll to the end for a TL;DR graph
Since I had "nothing better to do" (read: I actually had a lot of work), I decided to run a little performance contest. I assembled the most sensible or interesting answers and ran some simple timeit benchmarks on them in CPython 3.5.1. I tested them with only one string, which is a typical input in my case:
>>> s = 'ZDXMZKMXFDKXZFKZ'
>>> len(s)
16
Be aware that results might vary for different inputs, be it a different string length, a different number of distinct characters, or a different average number of occurrences per character.
Don't reinvent the wheel
Python has made it simple for us. The collections.Counter class does exactly what we want, and a lot more. Its usage is by far the simplest of all the methods mentioned here.
taken from @oefe, nice find
>>> timeit('Counter(s)', globals=locals())
8.208566107001388
Counter goes the extra mile, which is why it takes so long.
¿Dictionary, comprende?
Let's try using a simple dict instead. First, let's do it declaratively, using a dict comprehension.
I came up with this myself...
>>> timeit('{c: s.count(c) for c in s}', globals=locals())
4.551155784000002
This will go through s from beginning to end, and for each character it will count the number of its occurrences in s. Since s contains duplicate characters, the above method searches s several times for the same character. The result is naturally always the same. So let's count the number of occurrences just once for each character.
I came up with this myself, and so did @IrshadBhat
>>> timeit('{c: s.count(c) for c in set(s)}', globals=locals())
3.1484066140001232
Better. But we still have to search through the string to count the occurrences. One search for each distinct character. That means we're going to read the string more than once. We can do better than that! But for that, we have to get off our declarativist high horse and descend into an imperative mindset.
Exceptional code
AKA Gotta catch 'em all!
inspired by @anthony
>>> timeit('''
... d = {}
... for c in s:
...     try:
...         d[c] += 1
...     except KeyError:
...         d[c] = 1
... ''', globals=locals())
3.7060273620008957
Well, it was worth a try. If you dig into the Python source (I can't say with certainty because I have never really done that), you will probably find that when you do except ExceptionType, Python has to check whether the exception raised is actually of ExceptionType or some other type. Just for the heck of it, let's see how long it will take if we omit that check and catch all exceptions.
made by @anthony
>>> timeit('''
... d = {}
... for c in s:
...     try:
...         d[c] += 1
...     except:
...         d[c] = 1
... ''', globals=locals())
3.3506563019982423
It does save some time, so one might be tempted to use this as some sort of optimization.
Don't do that! Or actually do. Do it now:
INTERLUDE 1
import time
while True:
    try:
        time.sleep(1)
    except:
        print("You're trapped in your own trap!")
You see? It catches KeyboardInterrupt, besides other things. In fact, it catches all the exceptions there are, including ones you might not have even heard about, like SystemExit.
INTERLUDE 2
import sys
try:
    print("Goodbye. I'm going to die soon.")
    sys.exit()
except:
    print('BACK FROM THE DEAD!!!')
Now back to counting letters and numbers and other characters.
Playing catch-up
Exceptions aren't the way to go. You have to try hard to catch up with them, and when you finally do, they just throw up on you and then raise their eyebrows like it's your fault. Luckily brave fellows have paved our way so we can do away with exceptions, at least in this little exercise.
The dict class has a nice method – get – which allows us to retrieve an item from a dictionary, just like d[k]. Except when the key k is not in the dictionary, it can return a default value. Let's use that method instead of fiddling with exceptions.
credit goes to @Usman
>>> timeit('''
... d = {}
... for c in s:
...     d[c] = d.get(c, 0) + 1
... ''', globals=locals())
3.2133633289995487
Almost as fast as the set-based dict comprehension. On larger inputs, this one would probably be even faster.
Use the right tool for the job
For an at least mildly knowledgeable Python programmer, the first thing that comes to mind is probably defaultdict. It does pretty much the same thing as the version above, except instead of a value, you give it a value factory. That might cause some overhead, because the value has to be "constructed" for each missing key individually. Let's see how it performs.
hope @AlexMartelli won't crucify me for from collections import defaultdict
>>> timeit('''
... dd = defaultdict(int)
... for c in s:
...     dd[c] += 1
... ''', globals=locals())
3.3430528169992613
Not that bad. I'd say the increase in execution time is a small tax to pay for the improved readability. However, we also favor performance, and we will not stop here. Let's take it further and prepopulate the dictionary with zeros. Then we won't have to check every time if the item is already there.
hats off to @sqram
>>> timeit('''
... d = dict.fromkeys(s, 0)
... for c in s:
...     d[c] += 1
... ''', globals=locals())
2.6081761489986093
That's good. Over three times as fast as Counter, yet still simple enough. Personally, this is my favorite in case you don't want to add new characters later. And even if you do, you can still do it; it's just less convenient than it would be in other versions:
d.update({ c: 0 for c in set(other_string) - d.keys() })
Practicality beats purity (except when it's not really practical)
Now for a bit different kind of counter. @IdanK has come up with something interesting. Instead of using a hash table (a.k.a. dictionary a.k.a. dict), we can avoid the risk of hash collisions and the consequent overhead of their resolution. We can also avoid the overhead of hashing the key, and the extra unoccupied table space. We can use a list. The ASCII values of the characters will be indices and their counts will be values. As @IdanK has pointed out, this list gives us constant-time access to a character's count. All we have to do is convert each character from str to int using the built-in function ord. That gives us an index into the list, which we then use to increment the count of the character. So what we do is this: we initialize the list with zeros, do the job, and then convert the list into a dict. This dict will only contain those characters which have non-zero counts, in order to make it compliant with the other versions.
As a side note, this technique is used in a linear-time sorting algorithm known as count sort or counting sort. It's very efficient, but the range of values being sorted is limited, since each value has to have its own counter. To sort a sequence of 32-bit integers, 4.3 billion counters would be needed.
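For illustration only (this is not one of the benchmarked answers), a minimal counting-sort sketch built on the same one-counter-per-value idea could look like this:

def counting_sort(values, max_value):
    # one counter per possible value, just like the counts list below
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)  # emit each value as many times as it was counted
    return result

counting_sort([5, 1, 3, 1, 2], 5)  # -> [1, 1, 2, 3, 5]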
>>> timeit('''
... counts = [0 for _ in range(256)]
... for c in s:
...     counts[ord(c)] += 1
... d = {chr(i): count for i,count in enumerate(counts) if count != 0}
... ''', globals=locals())
25.438595562001865
Ouch! Not cool! Let's try and see how long it takes when we omit building the dictionary.
>>> timeit('''
... counts = [0 for _ in range(256)]
... for c in s:
...     counts[ord(c)] += 1
... ''', globals=locals())
10.564866792999965
Still bad. But wait, what's [0 for _ in range(256)]? Can't we write it more simply? How about [0] * 256? That's cleaner. But will it perform better?
>>> timeit('''
... counts = [0] * 256
... for c in s:
...     counts[ord(c)] += 1
... ''', globals=locals())
3.290163638001104
Considerably. Now let's put the dictionary back in.
>>> timeit('''
... counts = [0] * 256
... for c in s:
...     counts[ord(c)] += 1
... d = {chr(i): count for i,count in enumerate(counts) if count != 0}
... ''', globals=locals())
18.000623562998953
Almost six times slower. Why does it take so long? Because when we enumerate(counts), we have to check every one of the 256 counts and see if it's zero. But we already know which counts are zero and which are not.
>>> timeit('''
... counts = [0] * 256
... for c in s:
...     counts[ord(c)] += 1
... d = {c: counts[ord(c)] for c in set(s)}
... ''', globals=locals())
5.826531438000529
It probably won't get much better than that, at least not for such a small input. Plus it's only usable for 8-bit extended-ASCII (EASCII) characters. Oh, damn!
And the winner is...
>>> timeit('''
... d = {}
... for c in s:
...     if c in d:
...         d[c] += 1
...     else:
...         d[c] = 1
... ''', globals=locals())
1.8509794599995075
Yep. Even if you have to check every time whether c is in d, for this input it's the fastest way. No pre-population of d will make it faster (again, for this input). It's a lot more verbose than Counter or defaultdict, but also more efficient.
That's all folks
This little exercise teaches us a lesson: when optimizing, always measure performance, ideally with your expected inputs. Optimize for the common case. Don't presume something is actually more efficient just because its asymptotic complexity is lower. And last but not least, keep readability in mind. Try to find a compromise between "computer-friendly" and "human-friendly".
UPDATE
I have been informed by @MartijnPieters of the function collections._count_elements available in Python 3.
Help on built-in function _count_elements in module _collections:

_count_elements(...)
    _count_elements(mapping, iterable) -> None
    Count elements in the iterable, updating the mapping.
This function is implemented in C, so it should be faster, but this extra performance comes at a price. The price is incompatibility with Python 2 and possibly even future versions, since we're using a private function.
From the documentation:
[...] a name prefixed with an underscore (e.g. _spam) should be treated as a non-public part of the API (whether it is a function, a method or a data member). It should be considered an implementation detail and subject to change without notice.
That said, if you still want to save those 620 nanoseconds per iteration:
>>> timeit('''
... d = {}
... _count_elements(d, s)
... ''', globals=locals())
1.229239897998923
UPDATE 2: Large strings
I thought it might be a good idea to re-run the tests on some larger input, since a 16 character string is such a small input that all the possible solutions were quite comparably fast (1,000 iterations in under 30 milliseconds).
I decided to use the complete works of Shakespeare as a testing corpus, which turned out to be quite a challenge (since it's over 5MiB in size 😅). I just used the first 100,000 characters of it, and I had to limit the number of iterations from 1,000,000 to 1,000.
import urllib.request
url = 'https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt'
s = urllib.request.urlopen(url).read(100_000)
collections.Counter was really slow on a small input, but the tables have turned
Counter(s)
=> 7.63926783799991
Naïve Θ(n²) time dictionary comprehension simply doesn't work
{c: s.count(c) for c in s}
=> 15347.603935000052s (tested on 10 iterations; adjusted for 1000)
Smart Θ(n) time dictionary comprehension works fine
{c: s.count(c) for c in set(s)}
=> 8.882608592999986
Exceptions are clumsy and slow
d = {}
for c in s:
    try:
        d[c] += 1
    except KeyError:
        d[c] = 1
=> 21.26615508399982
Omitting the exception type check doesn't save time (since the exception is only thrown a few times)
d = {}
for c in s:
    try:
        d[c] += 1
    except:
        d[c] = 1
=> 21.943328911999743
dict.get looks nice but runs slow
d = {}
for c in s:
    d[c] = d.get(c, 0) + 1
=> 28.530086210000007
collections.defaultdict isn't very fast either
dd = defaultdict(int)
for c in s:
    dd[c] += 1
=> 19.43012963199999
dict.fromkeys requires reading the (very long) string twice
d = dict.fromkeys(s, 0)
for c in s:
    d[c] += 1
=> 22.70960557699999
Using list instead of dict is neither nice nor fast
counts = [0 for _ in range(256)]
for c in s:
    counts[ord(c)] += 1
d = {chr(i): count for i,count in enumerate(counts) if count != 0}
=> 26.535474792000002
Leaving out the final conversion to dict doesn't help
counts = [0 for _ in range(256)]
for c in s:
    counts[ord(c)] += 1
=> 26.27811567400005
It doesn't matter how you construct the list, since it's not the bottleneck
counts = [0] * 256
for c in s:
    counts[ord(c)] += 1
=> 25.863524940000048
counts = [0] * 256
for c in s:
    counts[ord(c)] += 1
d = {chr(i): count for i,count in enumerate(counts) if count != 0}
=> 26.416733378000004
If you convert list to dict the "smart" way, it's even slower (since you iterate over the string twice)
counts = [0] * 256
for c in s:
    counts[ord(c)] += 1
d = {c: counts[ord(c)] for c in set(s)}
=> 29.492915620000076
The dict.__contains__
variant may be fast for small strings, but not so much for big ones
d = {}
for c in s:
    if c in d:
        d[c] += 1
    else:
        d[c] = 1
=> 23.773295123000025
collections._count_elements is about as fast as collections.Counter (which uses _count_elements internally)
d = {}
_count_elements(d, s)
=> 7.5814381919999505
Final verdict: Use collections.Counter unless you cannot or don't want to :)
Appendix: NumPy
The numpy package provides a function numpy.unique which accomplishes (almost) precisely what we want.
The way this method works is very different from all the above methods:
1. It first sorts a copy of the input using Quicksort, which is an O(n²) time operation in the worst case, albeit O(n log n) on average and O(n) in the best case.
2. Then it creates a "mask" array containing True at indices where a run of the same values begins, viz. at indices where the value differs from the previous value. Repeated values produce False in the mask. Example: [5, 5, 5, 8, 9, 9] produces the mask [True, False, False, True, True, False].
3. This mask is then used to extract the unique values from the sorted input ‒ unique_chars in the code below. In our example, they would be [5, 8, 9].
4. Positions of the True values in the mask are taken into an array, and the length of the input is appended at the end of this array. For the above example, this array would be [0, 3, 4, 6].
5. For this array, differences between its elements are calculated, e.g. [3, 1, 2]. These are the respective counts of the elements in the sorted array ‒ char_counts in the code below.
6. Finally, we create a dictionary by zipping unique_chars and char_counts: {5: 3, 8: 1, 9: 2}.
import numpy as np
def count_chars(s):
    # The following statement needs to be changed for different input types.
    # Our input `s` is actually of type `bytes`, so we use `np.frombuffer`.
    # For inputs of type `str`, change `np.frombuffer` to `np.fromstring`
    # or transform the input into a `bytes` instance.
    arr = np.frombuffer(s, dtype=np.uint8)
    unique_chars, char_counts = np.unique(arr, return_counts=True)
    return dict(zip(unique_chars, char_counts))
For the test input (first 100,000 characters of the complete works of Shakespeare), this method performs better than any other tested here. But note that on a different input, this approach might yield worse performance than the other methods. Pre-sortedness of the input and number of repetitions per element are important factors affecting the performance.
count_chars(s)
=> 2.960809530000006
If you are thinking about using this method because it's over twice as fast as collections.Counter, consider this:

- collections.Counter has linear time complexity.
- numpy.unique is linear at best, quadratic at worst.
- The speedup is not really that significant ‒ you save ~3.5 milliseconds per iteration on an input of length 100,000.
- Using numpy.unique obviously requires numpy.
That considered, it seems reasonable to use Counter unless you need to be really fast. And in that case, you'd better know what you're doing, or else you'll end up being slower with numpy than without it.
Appendix 2: A somewhat useful plot
I ran the 13 different methods above on prefixes of the complete works of Shakespeare and made an interactive plot. Note that in the plot, both prefixes and durations are displayed on a logarithmic scale (the prefixes used are of exponentially increasing length). Click on the items in the legend to show/hide them in the plot.
This is the shortest, most practical solution I can come up with without importing extra modules.
text = "hello cruel world. This is a sample text"
d = dict.fromkeys(text, 0)
for c in text: d[c] += 1
print(d['a'])  # would output 2
And it's also fast.