What is the most general formalism for machine learning?

I'm not sure why, but the other answers ignore the rich and well-studied theory in ML that goes back decades, research that still continues today. As ML has blown up in terms of practical applications, many more people are aware of the practical side (download a library, preprocess the data, and click go), but are unfortunately unaware of the theoretical side.

The one comment (from 4 years ago) has the right idea. Probably approximately correct (PAC) learning theory is the classical framework for describing learning in general terms. Another core pillar in learning theory is empirical risk minimization (ERM). A standard basic reference is Understanding Machine Learning: From Theory to Algorithms by Shalev-Shwartz and Ben-David.

The comment also mentions Vapnik–Chervonenkis (VC) theory. The VC dimension is a method of quantifying the capacity of a model (i.e., how expressive or complicated it can become, if needed). There are other methods of measuring this, like covering numbers and Rademacher complexities. Basically, if you want to bound the generalization error (i.e., guarantee you are not overfitting, given some assumptions on the data distribution [see the domain adaptation PAC theory]), then the complexity of your model plays a role in that. As you might expect, a more powerful model will tend to increase the difference between the training and testing error (i.e., have more overfitting), and in certain cases this amount can actually be quantified using measures like the VC dimension $\mathfrak{D}[H]$ of some hypothesis space $H$.

As an example, let $T$ be the set of IID training data points for binary classification, drawn from a distribution $P$, with $|T| \gg \mathfrak{D}[H]$, and let $0 < \delta < 1$. Then $\forall\;h\in H$, we have that $$ P\left( R[h] \leq R_E[h] + 2\sqrt{\frac{2}{|T|}\left[ \mathfrak{D}[H]\left[\ln\left(\frac{2|T|}{\mathfrak{D}[H]}\right) + 1\right] - \ln\left(\frac{\delta}{4}\right) \right]}\; \right) \geq 1 - \delta $$ where the empirical risk (training error, or in-sample loss) is $R_E[h] = \frac{1}{|T|} \sum_{(x,y) \in T} L(y, h(x))$ and the true risk (generalization error, or out-of-sample loss) is $R[h] = \iint_{X,Y} P(x,y)\, L(y, h(x))\, dx\,dy$. There are a few other assumptions; there was a nice summary here, and the Wikipedia article on VC theory also covers it.
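To make the bound concrete, here is a minimal sketch (plain Python; the VC dimension, sample size, and empirical risk values are all made up purely for illustration) that just evaluates the confidence term from the inequality above:

```python
import math

def vc_confidence_term(n_samples, vc_dim, delta):
    """Width of the generalization gap in the bound above:
    2 * sqrt( (2/n) * [ d*(ln(2n/d) + 1) - ln(delta/4) ] )."""
    inner = vc_dim * (math.log(2 * n_samples / vc_dim) + 1) - math.log(delta / 4)
    return 2 * math.sqrt(2 * inner / n_samples)

# Hypothetical numbers: VC dimension 50, 100k IID samples, confidence 95%.
gap = vc_confidence_term(n_samples=100_000, vc_dim=50, delta=0.05)
empirical_risk = 0.08  # measured training error (made up)
print(f"R[h] <= {empirical_risk + gap:.3f} with probability >= 0.95 (gap ~ {gap:.3f})")
```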

Another popular example is the sample complexity of learning a binary function. Let $H$ be a hypothesis class of binary functions with VC dimension $\mathfrak{D}[H]$. Then (in the realizable setting) it is $(\epsilon,\delta)$-learnable with a sample of size $N$ on the order of $$ N = \mathcal{O}\left( \frac{\mathfrak{D}[H] - \ln\delta}{\epsilon} \right). $$ This tells us how many samples we require to learn a function from that class.
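Reading that the other way around gives a quick back-of-the-envelope calculation; a tiny sketch (the constants hidden by the $\mathcal{O}$ are dropped, so treat the output as an order of magnitude only, with hypothetical inputs):

```python
import math

def pac_sample_complexity(vc_dim, epsilon, delta):
    # Realizable-case PAC bound up to constants: N ~ (d + ln(1/delta)) / epsilon
    return math.ceil((vc_dim + math.log(1 / delta)) / epsilon)

# Hypothetical: VC dimension 50, target error 1%, failure probability 5%.
print(pac_sample_complexity(vc_dim=50, epsilon=0.01, delta=0.05))  # ~5300 samples
```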

I will say that PAC theory tends to be hard to apply in practice (e.g., to deep learning models and modern datasets). There is still useful and interesting theory in some cases, such as results concerning domain adaptation, which tell us how close two datasets need to be for, e.g., transfer learning to work. Nevertheless, as a general, unified mathematical/theoretical framework for studying classes of functions (hypotheses) and their ability to minimize an abstract notion of risk/error, this is the best I know of.


As for your questions (which, quite honestly, should each be their own question):

My question is, is there a single unifying formalism or theoretical framework in which one can think about problems of inferring patterns and conclusions from data?

Sure, the standard one I mentioned above. Define a function space (hypothesis class) to search in and an error/risk function to minimize, and then use PAC theory to, e.g., define notions of sample complexity (how much data is needed to get $\epsilon$-level accuracy with probability $1-\delta$?).
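As a toy instance of that recipe (everything here, the data and the class of 1-D threshold classifiers, is an assumption purely for illustration), ERM looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x in [0, 1], true label 1[x > 0.6], with 10% label noise.
x = rng.uniform(0, 1, size=200)
y = (x > 0.6).astype(int)
flip = rng.random(200) < 0.1
y[flip] = 1 - y[flip]

# Hypothesis class H: threshold classifiers h_t(x) = 1[x > t].
thresholds = np.linspace(0, 1, 101)

def empirical_risk(t):
    return np.mean((x > t).astype(int) != y)  # 0-1 loss on the training sample

# Empirical risk minimization: pick the hypothesis with the lowest training error.
best_t = min(thresholds, key=empirical_risk)
print(f"ERM threshold: {best_t:.2f}, training error: {empirical_risk(best_t):.3f}")
```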

In such a framework I'd be interested in exploring these kinds of questions: What is the correct number of free parameters to use for a given data set? How do we know when we've over-fit? How do we express the trade-off between the accuracy of low-resolution fitting and the precision of high-resolution fitting?

Good questions, but these are very applied/practical questions, and hard to answer in general. These days we usually overfit on purpose with a "deep network", and then regularize it until it stops overfitting. We know when we have overfit based on estimates of the generalization error.
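In practice, "estimate of the generalization error" usually just means a held-out split or cross-validation; a minimal sketch with synthetic data and a generic scikit-learn model (none of this is specific to any particular method):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset; in practice these would be your real X and y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

train_err = 1 - accuracy_score(y_tr, model.predict(X_tr))
test_err = 1 - accuracy_score(y_te, model.predict(X_te))
print(f"train error {train_err:.3f}, held-out error {test_err:.3f}, gap {test_err - train_err:.3f}")
```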

Nevertheless, questions about overfitting are well studied in PAC theory (e.g., the bound I gave above expresses it rather concisely as the gap between train and test error, and bounds it for a particular case). Questions like "the correct number of free parameters" are also answerable: you can relate the number of free parameters to measures of complexity and capacity (like the VC dimension), and deduce how complex the model should be. Of course, this is very tough in practice.

Separately, Bayesian learning approaches give us a very different way to detect overfitting, via a natural Occam's razor. See e.g. Occam's Razor by Rasmussen and Ghahramani, 2001, for some simple but interesting examples. Bayesian approaches allow for natural regularization in this way, and non-parametric Bayesian methods (e.g., Gaussian processes) naturally adapt their capacity as a function of the amount of available data. Nevertheless, the arbitrariness of the choice of prior does cause some difficulty (though this is well studied; see e.g. Radford Neal's thesis, Bayesian Learning for Neural Networks, which has a neat section on priors and their implications).
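One simple way to see the Bayesian Occam effect numerically is to compare the (log) marginal likelihood, i.e. the evidence, of models of different complexity on the same data. A sketch using scikit-learn's `BayesianRidge` on polynomial features (the data and degrees are invented; `scores_[-1]` is the fitted log marginal likelihood):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50).reshape(-1, 1)
y = 1.0 - 2.0 * x.ravel() + 0.5 * x.ravel() ** 3 + 0.1 * rng.normal(size=50)  # cubic truth + noise

for degree in [1, 3, 6, 10]:
    X = PolynomialFeatures(degree, include_bias=False).fit_transform(x)
    model = BayesianRidge(compute_score=True).fit(X, y)
    # scores_[-1]: log marginal likelihood (evidence) after fitting; higher is better.
    print(f"degree {degree:2d}: log evidence ~ {model.scores_[-1]:.1f}")
```

The point is only the mechanism: the evidence trades data fit against model complexity automatically, so you do not need a separate validation set to penalize the larger models (how strong the effect is depends on the prior).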

As an aside, the bias-variance tradeoff is a rather simple instantiation of a much more general set of theoretical principles regarding tradeoffs. This includes the "No Free Lunch Theorem", which shows there can be no "universal learner" (for every learning algorithm/model, there is a task on which it fails), and the Bias-Complexity Tradeoff, which shows that minimizing risk (generalization error) involves a balance: choosing a more powerful hypothesis class reduces the approximation error (caused by a too-weak model) but increases the estimation error (caused by overfitting). This notion can be made very concrete and exact (see the book I linked for an intro).
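The bias-complexity tradeoff is easy to see numerically by sweeping the capacity of the hypothesis class; a sketch with polynomial regression of increasing degree on noisy 1-D data (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + 0.3 * rng.normal(size=60)           # small noisy training set
x_te = rng.uniform(-1, 1, size=1000)
y_te = np.sin(3 * x_te) + 0.3 * rng.normal(size=1000)   # large held-out set

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x, y, degree)                    # hypothesis class: degree-d polynomials
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Typically the training error keeps dropping as the degree grows (lower approximation error), while the held-out error eventually turns back up (higher estimation error).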

Can we define machine learning techniques in general terms (instead of specific ones like neurons or forests or deep learning or graphical models or...)? How can we objectively compare these techniques? Are any of them actually identical?

As mentioned before, you can view ML as searching through a function space to find a member (function/model) that minimizes some functional called the risk or loss. This is a very general formulation.

I do think it is interesting to consider merging probabilistic models with deep learning via Bayesian methods. For instance, infinitely wide neural networks are known to be equivalent to non-parametric Bayesian Gaussian processes (e.g. [a], [b]).

What kinds of signals are possible to extract from data?

Again, too many ways to tackle this question.

For instance, in generative modelling, we want to approximate the marginal data distribution $p_\text{data}(x)$. Lots of theoretical work on GANs, VAEs, etc. discusses what different models are able to handle. (For instance, the choice of latent prior may determine which data topologies can be represented well.)

The more classic answer is to ask what functions can be learned. This corresponds to the view of machine learning as the study of constructing function approximators. Most people are familiar with universal approximation theorems, showing capabilities on, e.g., continuous functions. This kind of work is still common: more recent work on set-theoretic neural layers shows that certain layers are universal approximators over permutation-invariant functions, and other recent work looks at universal approximation theorems for signals defined on graphs (via graph convolutional networks).
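As a toy illustration of the function-approximation view (not tied to any of the cited papers; just a small one-hidden-layer network fitting a continuous function on a compact interval):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=2000).reshape(-1, 1)
y = np.sin(x).ravel()  # target: a continuous function on a compact set

# A modest single hidden layer; universal approximation says more width buys more accuracy.
net = MLPRegressor(hidden_layer_sizes=(100,), max_iter=5000, random_state=0).fit(x, y)

x_grid = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
max_err = np.max(np.abs(net.predict(x_grid) - np.sin(x_grid).ravel()))
print(f"max approximation error on the grid: {max_err:.3f}")
```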

What kinds of noise could prevent successful learning?

Depends where the noise is. Assuming you want to learn a mapping $f: X\rightarrow Y$ from a dataset $T=\{(x_i,y_i)\}_i$, if there is intrinsic noise in the data (i.e., in the $x$ and $y$ values), then aleatoric uncertainty arises. In the simple case, this manifests as an additional error term in the bias-variance decomposition (with bias added if the noise has non-zero mean, and variance dependent on the spread of the noise). See e.g. here. There are other types of "noise", however, such as domain mismatch between the train and test sets, which much domain adaptation research focuses on fighting.
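A small simulation of that decomposition, under the usual assumption $y = f(x) + \varepsilon$ with zero-mean noise $\varepsilon$ (everything here is synthetic, and the learner is just a cubic polynomial fit):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)       # true function
sigma = 0.3                       # aleatoric (intrinsic) noise level
x_grid = np.linspace(-1, 1, 50)   # points at which we evaluate the learned models

# Train many models on independent noisy datasets and look at their predictions.
preds = []
for _ in range(500):
    x = rng.uniform(-1, 1, size=30)
    y = f(x) + sigma * rng.normal(size=30)
    coeffs = np.polyfit(x, y, deg=3)
    preds.append(np.polyval(coeffs, x_grid))
preds = np.array(preds)

bias_sq = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}, irreducible noise ~ {sigma**2:.4f}")
```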


Related: [1], [2], [3], [4], [5]


I agree with the other respondent. The focus in machine learning is geared more towards ready-to-implement algorithms. However, there are some simple results, like the bias-variance tradeoff, and other discussions of "limiting" performance. From what I've seen, your question might be most practically addressed by the area of model selection (i.e., how to pick one of the algorithms you mentioned to solve a problem). In some cases there is no single best algorithm, other than to use several algorithms together in an ensemble method, e.g. boosting, which takes advantage of the fact that different learning algorithms are best at classifying different types of examples.
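For completeness, a minimal boosting example (synthetic data, scikit-learn's AdaBoost with its default depth-1 tree as the weak learner; nothing here is specific to your problem):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data; AdaBoost combines many weak learners, reweighting the
# examples so later learners focus on the points earlier ones got wrong.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
score = cross_val_score(AdaBoostClassifier(n_estimators=200, random_state=0), X, y, cv=5).mean()
print(f"5-fold CV accuracy: {score:.3f}")
```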


One general formalism for using ensemble methods for classification was developed by Yuri Zhuravlev and his followers (https://en.m.wikipedia.org/wiki/Yuri_Zhuravlev); it is called the algebraic theory of algorithms. I haven't seen any book or course on it in English, maybe some papers; most of the material is in Russian.