How to count each digit in a range of integers?

Solution 1:

There's a clear mathematical solution to a problem like this. Let's assume the value is zero-padded to the maximum number of digits (it's not, but we'll compensate for that later), and reason through it:

  • From 0-9, each digit occurs once
  • From 0-99, each digit occurs 20 times (10x in position 1 and 10x in position 2)
  • From 0-999, each digit occurs 300 times (100x in P1, 100x in P2, 100x in P3)

The obvious pattern for any given digit, if the range is from 0 to a power of 10, is N * 10N-1, where N is the power of 10.

What if the range is not a power of 10? Start with the lowest power of 10, then work up. The easiest case to deal with is a maximum like 399. We know that for each multiple of 100, each digit occurs at least 20 times, but we have to compensate for the number of times it appears in the most-significant-digit position, which is going to be exactly 100 for digits 0-3, and exactly zero for all other digits. Specifically, the extra amount to add is 10N for the relevant digits.

Putting this into a formula, for upper bounds that are 1 less than some multiple of a power of 10 (i.e. 399, 6999, etc.) it becomes: M * N * 10N-1 + iif(d <= M, 10N, 0)

Now you just have to deal with the remainder (which we'll call R). Take 445 as an example. This is whatever the result is for 399, plus the range 400-445. In this range, the MSD occurs R more times, and all digits (including the MSD) also occur at the same frequencies they would from range [0 - R].

Now we just have to compensate for the leading zeros. This pattern is easy - it's just:

10N + 10N-1 + 10N-2 + ... + **100

Update: This version correctly takes into account "padding zeros", i.e. the zeros in middle positions when dealing with the remainder ([400, 401, 402, ...]). Figuring out the padding zeros is a bit ugly, but the revised code (C-style pseudocode) handles it:

function countdigits(int d, int low, int high) {
    return countdigits(d, low, high, false);
}

function countdigits(int d, int low, int high, bool inner) {
    if (high == 0)
        return (d == 0) ? 1 : 0;

    if (low > 0)
        return countdigits(d, 0, high) - countdigits(d, 0, low);

    int n = floor(log10(high));
    int m = floor((high + 1) / pow(10, n));
    int r = high - m * pow(10, n);
    return
        (max(m, 1) * n * pow(10, n-1)) +                             // (1)
        ((d < m) ? pow(10, n) : 0) +                                 // (2)
        (((r >= 0) && (n > 0)) ? countdigits(d, 0, r, true) : 0) +   // (3)
        (((r >= 0) && (d == m)) ? (r + 1) : 0) +                     // (4)
        (((r >= 0) && (d == 0)) ? countpaddingzeros(n, r) : 0) -     // (5)
        (((d == 0) && !inner) ? countleadingzeros(n) : 0);           // (6)
}

function countleadingzeros(int n) {
      int tmp= 0;
      do{
         tmp= pow(10, n)+tmp;
         --n;
         }while(n>0);
         return tmp;
         }

function countpaddingzeros(int n, int r) {
    return (r + 1) * max(0, n - max(0, floor(log10(r))) - 1);
}

As you can see, it's gotten a bit uglier but it still runs in O(log n) time, so if you need to handle numbers in the billions, this will still give you instant results. :-) And if you run it on the range [0 - 1000000], you get the exact same distribution as the one posted by High-Performance Mark, so I'm almost positive that it's correct.

FYI, the reason for the inner variable is that the leading-zero function is already recursive, so it can only be counted in the first execution of countdigits.

Update 2: In case the code is hard to read, here's a reference for what each line of the countdigits return statement means (I tried inline comments but they made the code even harder to read):

  1. Frequency of any digit up to highest power of 10 (0-99, etc.)
  2. Frequency of MSD above any multiple of highest power of 10 (100-399)
  3. Frequency of any digits in remainder (400-445, R = 45)
  4. Additional frequency of MSD in remainder
  5. Count zeros in middle position for remainder range (404, 405...)
  6. Subtract leading zeros only once (on outermost loop)

Solution 2:

I'm assuming you want a solution where the numbers are in a range, and you have the starting and ending number. Imagine starting with the start number and counting up until you reach the end number - it would work, but it would be slow. I think the trick to a fast algorithm is to realize that in order to go up one digit in the 10^x place and keep everything else the same, you need to use all of the digits before it 10^x times plus all digits 0-9 10^(x-1) times. (Except that your counting may have involved a carry past the x-th digit - I correct for this below.)

Here's an example. Say you're counting from 523 to 1004.

  • First, you count from 523 to 524. This uses the digits 5, 2, and 4 once each.
  • Second, count from 524 to 604. The rightmost digit does 6 cycles through all of the digits, so you need 6 copies of each digit. The second digit goes through digits 2 through 0, 10 times each. The third digit is 6 5 times and 5 100-24 times.
  • Third, count from 604 to 1004. The rightmost digit does 40 cycles, so add 40 copies of each digit. The second from right digit doers 4 cycles, so add 4 copies of each digit. The leftmost digit does 100 each of 7, 8, and 9, plus 5 of 0 and 100 - 5 of 6. The last digit is 1 5 times.

To speed up the last bit, look at the part about the rightmost two places. It uses each digit 10 + 1 times. In general, 1 + 10 + ... + 10^n = (10^(n+1) - 1)/9, which we can use to speed up counting even more.

My algorithm is to count up from the start number to the end number (using base-10 counting), but use the fact above to do it quickly. You iterate through the digits of the starting number from least to most significant, and at each place you count up so that that digit is the same as the one in the ending number. At each point, n is the number of up-counts you need to do before you get to a carry, and m the number you need to do afterwards.

Now let's assume pseudocode counts as a language. Here, then, is what I would do:

convert start and end numbers to digit arrays start[] and end[]
create an array counts[] with 10 elements which stores the number of copies of
     each digit that you need

iterate through start number from right to left. at the i-th digit,
    let d be the number of digits you must count up to get from this digit
        to the i-th digit in the ending number. (i.e. subtract the equivalent
        digits mod 10)
    add d * (10^i - 1)/9 to each entry in count.
    let m be the numerical value of all the digits to the right of this digit,
        n be 10^i - m.
    for each digit e from the left of the starting number up to and including the
        i-th digit, add n to the count for that digit.
    for j in 1 to d
        increment the i-th digit by one, including doing any carries
        for each digit e from the left of the starting number up to and including
            the i-th digit, add 10^i to the count for that digit
    for each digit e from the left of the starting number up to and including the
        i-th digit, add m to the count for that digit.
    set the i-th digit of the starting number to be the i-th digit of the ending
        number.

Oh, and since the value of i increases by one each time, keep track of your old 10^i and just multiply it by 10 to get the new one, instead of exponentiating each time.