Randomly Generate Letters According to their Frequency of Use?
How can I randomly generate letters according to their frequency of use in common speech?
Any pseudo-code appreciated, but an implementation in Java would be fantastic. Otherwise just a poke in the right direction would be helpful.
Note: I don't need to generate the frequencies of usage - I'm sure I can look that up easily enough.
Solution 1:
I am assuming that you store the frequencies as floating point numbers between 0 and 1 that total to make 1.
First you should prepare a table of cumulative frequencies, i.e. the sum of the frequency of that letter and all letters before it.
To simplify, if you start with this frequency distribution:
A 0.1
B 0.3
C 0.4
D 0.2
Your cumulative frequency table would be:
A 0.1
B 0.4 (= 0.1 + 0.3)
C 0.8 (= 0.1 + 0.3 + 0.4)
D 1.0 (= 0.1 + 0.3 + 0.4 + 0.2)
Now generate a random number between 0 and 1 and see where in this list that number lies. Choose the letter that has the smallest cumulative frequency larger than your random number. Some examples:
Say you randomly pick 0.612. This lies between 0.4 and 0.8, i.e. between B and C, so you'd choose C.
If your random number was 0.039, that comes before 0.1, i.e. before A, so choose A.
I hope that makes sense, otherwise feel free to ask for clarifications!
Solution 2:
One quick way to do it would be to generate a list of letters, where each letter appeared in the list in accordance with its frequency. Say, if "e" was used 25.6% of the time, and your list had length 1000, it would have 256 "e"s.
Then you could just randomly pick spots from the list by using (int) (Math.random() * 1000)
to generate random numbers between 0 and 999.
Solution 3:
What I would do is scale the relative frequencies as floating point numbers such that their sum is 1.0. Then I would create an array of the cumulative totals per letter, i.e. the number that must be topped to get that letter and all those "below" it. Say the frequency of A is 10%, b is 2% and z is 1%; then your table would look something like this:
0.000 A ; from 0% to 10% gets you an A
0.100 B ; above 10% is at least a B
0.120 C ; 12% for C...
...
0.990 Z ; if your number is >= 99% then you get a Z
Then you generate yourself a random number between 0.0 and 1.0 and do a binary search in the array for the first number smaller than your random number. Then pick the letter at that position. Done.
Solution 4:
Not even a pseudo-code, but a possible approach is as follows:
Let p1, p2, ..., pk be the frequencies that you want to match.
- Calculate the cumulative frequencies: p1, p1+p2, p1+p2+p3, ... , 1
- Generate a random uniform (0,1) number x
- Check which interval of the cumulative frequencies x belongs to: if it is between, say, p1+..+pi and p1+...+pi+p(i+1), then output the (i+1)st letter
Depending on how you implement the interval-finding, the procedure is usually more efficient if the p1,p2,... are sorted in decreasing order, because you will usually find the interval containing x sooner.