Objective metric to describe skill vs luck in games that include randomness

There are many games that, even though they include some random component (for example dice rolls or the dealing of cards), are nevertheless games of skill. In other words, there is definite skill in how one plays the game, taking the random component into account. Think of backgammon or poker as good examples of such games.

Moreover, novice players or outsiders might fail to recognise the skill involved and attribute wins purely to luck. As someone gains experience, they usually start to appreciate the skill involved and concede that there is more to the game than just luck. They might even concede that there is "more" skill than luck. How do we quantify this? How "much" luck vs skill? People can have very subjective feelings about this balance. Recently, I read someone arguing that backgammon is $9/10$ luck, while someone else said it's $6/10$. These numbers mean very little other than expressing gut feelings.

Can we do better? Can we have an objective metric that gives us a good sense of the skill vs luck component of a game?

I was thinking along these lines: Given a game and a lot of empirical data on matches between players with different skills, a metric could be:

How many games, on average, does a top-level player need to play against a novice in order to have an overall win (a positive win-loss balance) with probability $P$ (let's use $P=0.95$)?

For the game of chess this metric would be $1$ (or very close to $1$). For the game of scissors-paper-rock it would be $\infty$.
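
To make this concrete, here is a minimal sketch in Python of how such a metric behaves under a toy model: games are independent and the stronger player wins each individual game with a fixed probability $p$. The values of $p$ below are invented purely for illustration, not measured from any real game.

```python
from math import comb

def games_needed(p, target=0.95, max_games=999):
    """Smallest (odd) number of games n such that a player who wins each game
    independently with probability p ends up with strictly more wins than
    losses with probability >= target.  Returns None if no n <= max_games
    works (e.g. p = 0.5, the rock-paper-scissors case)."""
    for n in range(1, max_games + 1, 2):          # odd n rules out ties
        prob_overall_win = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                               for k in range(n // 2 + 1, n + 1))
        if prob_overall_win >= target:
            return n
    return None

print(games_needed(0.99))   # near-certain single-game win: 1 (the chess case)
print(games_needed(0.65))   # hypothetical 65% per-game edge: a few dozen games
print(games_needed(0.50))   # no edge at all: None (the metric is infinite)
```

Under this toy model the metric is driven entirely by the per-game edge $p$: it equals $1$ when $p$ is near certainty and grows without bound as $p$ approaches $1/2$.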

This is an objective measure (we can calculate it based on empirical data) and it is intuitive. There is, however, an ambiguity in what top-level and novice players mean. Empirical data alone does not suffice to classify players as novices or experts. For example, imagine we have the results of 10,000 chess games between 20 chess grandmasters. Some will be better, some worse, but analysing the data with the criterion I defined, we would conclude that chess has a certain (significant) element of luck. Can we make this more robust? Also, given a set of empirical data (match outcomes), how do we know we have enough data?

What other properties do we want to include? Maybe a rating in $[0, 1]$, zero meaning no luck and one meaning all luck, would be easier to talk about.

I am happy to hear completely different approaches too.


A natural place to start, depending upon your mathematical background, might be the modern cardinal generalizations of the voting literature. Consider the following toy model, and feel free to adapt any parts which don't seem appropriate to the problem at hand.

Data: Say we have a fixed, finite population of players $N$ whom we observe playing. A single observation consists of the result of a single match between two players, $\{i \succ j\}$ for some $i,j \in N$. Let the space of datasets $\mathcal{D}$ consist, for fixed $N$, of all finite collections of such observations.

Solution Concept: Our goal here is twofold: mathematically, we're going to try to get a vector in $\mathbb{R}^N$ whose $i^\textrm{th}$ component is the 'measure of player $i$'s caliber relative to all the others.' A player $i$ is better than player $j$ by our rule if the $i^\textrm{th}$ component of this score vector is higher than the $j^\textrm{th}$. We might also like the magnitude of the differences in these components to be increasing in some measure of pairwise skill difference (the score vector is cardinal, to use economics lingo).

Edit: This score vector should not be seen as a measure of luck versus skill. But what we will be able to do is use this to 'clean' our data of the component of it generated by differences in skill, leaving behind a residual which we may then interpret as 'the component of the results attributable to luck.' In particular, this 'luck component' lives in a normed space, and hence comes with a natural means of measuring its magnitude, which seems to be what you are after.

Our approach is going to use a cardinal generalization of something known as the Borda count from voting theory.

Aggregation: Our first step is to aggregate our dataset. Given a dataset, consider the following $N\times N$ matrix 'encoding' it. For all $i,j \in N$ define:

$$ D_{ij} = \frac{\textrm{Number of times $i$ has won over $j$}- \textrm{Number of times $j$ has won over $i$}}{\textrm{Number of times $i$ and $j$ have played}} $$

if the denominator is non-zero, and $0$ otherwise. The ${ij}^\textrm{th}$ element of this matrix encodes the relative frequency with which $i$ has beaten $j$. Moreover, $D_{ij} = -D_{ji}$. Thus we have identified this dataset with a skew-symmetric, real-valued $N\times N$ matrix.
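
As a concrete (if minimal) sketch of this aggregation step, assuming each observation is stored as an ordered pair `(i, j)` meaning '$i$ beat $j$', and using numpy purely for convenience:

```python
import numpy as np

def aggregate(results, n_players):
    """Build the skew-symmetric matrix D from a list of observations,
    each an ordered pair (i, j) meaning 'player i beat player j'.
    Pairs of players who never met are left at 0."""
    wins = np.zeros((n_players, n_players))
    for i, j in results:
        wins[i, j] += 1
    played = wins + wins.T                      # games played between i and j
    diff = wins - wins.T                        # net wins of i over j
    return np.divide(diff, played, out=np.zeros_like(diff), where=played > 0)
```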

An equivalent way of viewing this data is as a flow on a graph whose vertices correspond to the $N$ players (every flow on a graph has a representation as a skew-symmetric matrix and vice versa). In this language, a natural candidate for a score vector is a potential function: a function from the vertices to $\mathbb{R}$ (i.e. a vector in $\mathbb{R}^N$) such that the flow along any edge is the difference of the potentials at its endpoints. In other words, we ask whether there exists some vector $s \in \mathbb{R}^N$ such that, for all $i,j \in N$:

$$ D_{ij} = s_i - s_j. $$

This could very naturally be read as a metric of 'talent' given the dataset. If such a vector existed, it would mean that differences in skill could 'explain away all variation in the data.' Generally, however, for a given aggregate data matrix such a potential function does not exist (as we would hope, in line with our interpretation).

Edit: It should be noted that the way we are aggregating the data (counting relative wins) will generally preclude such a function from existing, even when the game is 'totally determined by skill.' In such cases our $D$ matrix will take values exclusively in $\{-1, 0, 1\}$. Following the approach outlined below, one will get out a score function which rationalizes this ordering, but the residual will not necessarily be zero (it will generally take a specific form: an $N$-cycle conservative flow that passes through each vertex). If one were to, say, make use of the scores of games, the aggregated data would have a cardinal interpretation.
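
To illustrate that last remark, here is a hypothetical variant of the aggregation that records average point margins instead of relative win counts; the observation format `(i, j, points_i, points_j)` is my own assumption, not part of the setup above:

```python
import numpy as np

def aggregate_margins(results, n_players):
    """D[i, j] = average point margin of i over j across their games.
    Each observation is (i, j, points_i, points_j) -- a hypothetical format
    assuming the game reports a score for each side."""
    margin = np.zeros((n_players, n_players))
    played = np.zeros((n_players, n_players))
    for i, j, pi, pj in results:
        margin[i, j] += pi - pj
        margin[j, i] += pj - pi
        played[i, j] += 1
        played[j, i] += 1
    return np.divide(margin, played, out=np.zeros_like(margin), where=played > 0)
```

The resulting matrix is still skew-symmetric, so everything below applies unchanged.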

Construction of a score: The good news is that the mathematical tools exist to construct a scoring function that is, in a rigorous sense, the 'best fit for the data,' even if it is not a perfect match (think of how a cloud of data points rarely falls perfectly on a line, yet we still find it instructive to find the line of best fit). Since this has been tagged a soft question, I'll just give a sketch of how and why such a score can be constructed. The space of all such aggregate data matrices admits a decomposition into the linear subspace of flows that admit a potential function and its orthogonal complement. Formally, this is a combinatorial version of the Hodge decomposition from de Rham cohomology (but if those words mean nothing, don't worry about looking them up). Then, loosely speaking, in the spirit of classical regression, we can solve a least-squares minimization problem to orthogonally project our aggregate data matrix onto the linear subspace of flows that admit a potential function:

$$ \min_{s \in \mathbb{R}^N} \|D - (-\textrm{grad}(s))\|^2 $$ where $-\textrm{grad}(s)$ is an $N\times N$ matrix whose $ij^\textrm{th}$ element is $s_i - s_j$.
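Here is a minimal sketch of this projection, taking the objective exactly as written above (unplayed pairs enter as zeros and the norm is the plain Frobenius norm, in which case the minimizer, normalized to sum to zero, is just the row mean of $D$, a Borda-like score). The `luck_index` at the end is my own hypothetical way of turning the residual's magnitude into a number in $[0, 1]$, in the spirit of the rating asked for in the question:

```python
import numpy as np

def hodge_score(D):
    """Least-squares fit of a score vector s to the flow D:
        minimize over s:  sum_{i,j} (D[i, j] - (s[i] - s[j]))**2
    With the plain Frobenius norm the minimizer (with sum(s) = 0) is the
    row mean of D."""
    s = D.mean(axis=1)
    skill_part = s[:, None] - s[None, :]        # the fitted component -grad(s)
    residual = D - skill_part                   # the part skill cannot explain
    return s, skill_part, residual

def luck_index(D):
    """Share of the data's squared magnitude left unexplained by the score:
    roughly 0 when skill differences explain almost everything, close to 1
    when they explain almost nothing."""
    _, _, residual = hodge_score(D)
    denom = np.linalg.norm(D)**2
    return 0.0 if denom == 0 else np.linalg.norm(residual)**2 / denom
```

Because the fitted component is an orthogonal projection of $D$, the index always lands in $[0, 1]$.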

If you're interested in seeing this approach used to construct a ranking for college football teams, see:

http://www.ams.org/samplings/feature-column/fc-2012-12

If you'd like to read a bit more about the machinery underlying this including some brief reading about connections to the mathematical voting literature, see:

https://www.stat.uchicago.edu/~lekheng/meetings/mathofranking/ref/jiang-lim-yao-ye.pdf


Found this: How to Measure Luck vs Skill in Games? at
boardgames.stackexchange.com

The following links might also be helpful and/or of general interest:

Kelly criterion
Edward O. Thorp
Daniel Kahneman


The Success Equation: Untangling Skill and Luck in Business, Sports, and Investing
by Michael J. Mauboussin
NOTES ON THE SUCCESS EQUATION

Estimated true skill = grand average + shrinkage factor × (observed average − grand average)
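
For instance (numbers invented purely for illustration): with a grand average win rate of $0.50$, an observed win rate of $0.70$ over a small sample, and a shrinkage factor of $0.4$, the estimate would be $0.50 + 0.4 \times (0.70 - 0.50) = 0.58$.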

Michael Mauboussin: Three Things to Consider in Order To Make an Effective Prediction


Bayesian inference


I hope the OP can forgive me for not focusing on his questions. Ten years ago I might have been more enthusiastic, but now all I could muster up was to provide links that anyone interested in games should find interesting.

So what is my problem? Two letters:

AI

Ten years ago it was thought that it would take AI 'bots' several decades to surpass the best human players in two games:

Go (a game of perfect information)

OR

Poker (a game of imperfect information/randomness)

Today, it is a fait accompli.

The researchers matching their bots against human players in poker can claim victory whenever the out-performance is statistically significant.

Go is a different animal. Any win against a top professional (9 dan) player is 'off the charts' significant (throw that statistics book out the window).