How do 20 questions AI algorithms work?

Simple online games of 20 questions powered by an eerily accurate AI.

How do they guess so well?


You can think of it as the Binary Search Algorithm. In each iteration, we ask a question, which should eliminate roughly half of the possible word choices. If there are total of N words, then we can expect to get an answer after log2(N) questions.

With 20 question, we should optimally be able to find a word among 2^20 = 1 million words.

One easy way to eliminate outliers (wrong answers) would be to probably use something like RANSAC. This would mean, instead of taking into account all questions which have been answered, you randomly pick a smaller subset, which is enough to give you a single answer. Now you repeat that a few times with different random subset of questions, till you see that most of the time, you are getting the same result. you then know you have the right answer.

Of course this is just one way of many ways of solving this problem.


I recommend reading about the game here: http://en.wikipedia.org/wiki/Twenty_Questions

In particular the Computers section:

The game suggests that the information (as measured by Shannon's entropy statistic) required to identify an arbitrary object is about 20 bits. The game is often used as an example when teaching people about information theory. Mathematically, if each question is structured to eliminate half the objects, 20 questions will allow the questioner to distinguish between 220 or 1,048,576 subjects. Accordingly, the most effective strategy for Twenty Questions is to ask questions that will split the field of remaining possibilities roughly in half each time. The process is analogous to a binary search algorithm in computer science.


A decision tree supports this kind of application directly. Decision trees are commonly used in artificial intelligence.

A decision tree is a binary tree that asks "the best" question at each branch to distinguish between the collections represented by its left and right children. The best question is determined by some learning algorithm that the creators of the 20 questions application use to build the tree. Then, as other posters point out, a tree 20 levels deep gives you a million things.

A simple way to define "the best" question at each point is to look for a property that most evenly divides the collection into half. That way when you get a yes/no answer to that question, you get rid of about half of the collection at each step. This way you can approximate binary search.

Wikipedia gives a more complete example:

http://en.wikipedia.org/wiki/Decision_tree_learning

And some general background:

http://en.wikipedia.org/wiki/Decision_tree