What is inductive bias in machine learning?
What is inductive bias in machine learning? Why is it necessary?
Every machine learning algorithm with any ability to generalize beyond the training data it sees has some type of inductive bias: the set of assumptions the model makes in order to learn the target function and to generalize to unseen data.
For example, in linear regression, the model assumes that the output (dependent variable) is related to the independent variables linearly, i.e., linearly in the weights. This is an inductive bias of the model.
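As a minimal sketch of that bias (plain NumPy; the toy data are invented for illustration), ordinary least squares only ever searches over functions of the form y ≈ Xw + b, even when the data were generated by something else:

```python
import numpy as np

# Toy data: y actually depends on x quadratically, but the model cannot know that.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

# Linear regression's inductive bias: only hypotheses of the form y ≈ Xw + b are
# considered, so the result is the best *line*, not the best function.
X = np.hstack([x, np.ones((100, 1))])      # add a bias/intercept column
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
print("learned slope and intercept:", w)
```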
What is inductive bias?
Pretty much every design choice in machine learning signifies some sort of inductive bias. "Relational inductive biases, deep learning, and graph networks" (Battaglia et al., 2018) is an amazing 🙌 read, which I will be referring to throughout this answer.
An inductive bias allows a learning algorithm to prioritize one solution (or interpretation) over another, independent of the observed data. [...] Inductive biases can express assumptions about either the data-generating process or the space of solutions.
Examples in deep learning
Concretely speaking, the very composition of layers 🍰 in deep learning provides a type of relational inductive bias: hierarchical processing. The type of layer imposes further relational inductive biases, for example:
- fully connected layers assume all-to-all relations (a weak relational bias),
- convolutional layers assume locality and translation invariance,
- recurrent layers assume sequentiality and invariance across time steps.
More generally, non-relational inductive biases used in deep learning include:
- activation non-linearities,
- weight decay (sketched in code right after this list),
- dropout,
- batch and layer normalization,
- data augmentation,
- training curricula,
- optimization algorithms,
- anything that imposes constraints on the learning trajectory.
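To pick one item from the list, here is a minimal sketch (plain NumPy, with made-up data and settings) of how weight decay acts as an inductive bias: penalizing the squared norm of the weights expresses a preference for small-weight solutions regardless of what the data say:

```python
import numpy as np

# Weight decay as an inductive bias: penalizing the norm of the weights expresses
# a preference for "simple" (small-weight) solutions, independent of the data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=50)

def fit(X, y, weight_decay=0.0, lr=0.05, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + weight_decay * w  # MSE gradient + decay term
        w -= lr * grad
    return w

w_plain = fit(X, y)
w_decayed = fit(X, y, weight_decay=1.0)
print("weight norm without decay:", np.linalg.norm(w_plain))
print("weight norm with decay:   ", np.linalg.norm(w_decayed))  # noticeably smaller
```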
Examples outside of deep learning
In a Bayesian model, inductive biases are typically expressed through the choice and parameterization of the prior distribution. Adding a Tikhonov (L2) regularization penalty to your loss function amounts to assuming that simpler hypotheses are more likely.
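As a worked version of that claim, for the standard special case of a Gaussian likelihood and a zero-mean Gaussian prior on the weights, maximizing the posterior is the same as minimizing a squared loss with a Tikhonov (L2) penalty:

```latex
% MAP estimation with a Gaussian likelihood  y ~ N(Xw, \sigma^2 I)
% and a zero-mean Gaussian prior             w ~ N(0, \tau^2 I):
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w}\ \bigl[ \log p(y \mid X, w) + \log p(w) \bigr]
  = \arg\min_{w}\ \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_2^2,
  \qquad \lambda = \sigma^2 / \tau^2 .
```

The prior variance τ² controls how strongly the bias toward small weights is enforced: a tighter prior corresponds to a larger λ.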
Conclusion
The stronger the inductive bias, the better the sample efficiency, provided the assumed structure actually matches the problem; this can be understood in terms of the bias-variance tradeoff. Many modern deep learning methods follow an "end-to-end" design philosophy that emphasizes minimal a priori representational and computational assumptions, which explains why they tend to be so data-intensive. On the other hand, there is a lot of research into baking stronger relational inductive biases into deep learning architectures, e.g. with graph networks.
An aside about the word "inductive"
In philosophy, inductive reasoning refers to generalizing from specific observations to a broader conclusion. It is the counterpart of deductive reasoning, which derives specific conclusions from general premises.
Inductive bias is the set of assumptions a learner uses to predict results given inputs it has not yet encountered.
According to Tom Mitchell's definition,
an inductive bias of a learner is the set of additional assumptions sufficient to justify its inductive inferences as deductive inferences.
I couldn't quite understand the above definition, so I searched Wikipedia and summarized the definition in layman's terms.
Given a dataset, which learning model (i.e., which inductive bias) should be chosen?
An inductive bias encodes prior assumptions about the task. No single bias is best on all problems, and there has been a lot of research effort into discovering a suitable inductive bias automatically.
The following is a list of common inductive biases in machine learning algorithms; a short code sketch contrasting two of them follows the list.
- Maximum conditional independence: if the hypothesis can be cast in a Bayesian framework, try to maximize conditional independence. This is the bias used in the Naive Bayes classifier.
- Minimum cross-validation error: when trying to choose among hypotheses, select the hypothesis with the lowest cross-validation error. Although cross-validation may seem to be free of bias, the "no free lunch" theorems show that cross-validation must be biased.
- Maximum margin: when drawing a boundary between two classes, attempt to maximize the width of the boundary. This is the bias used in support vector machines. The assumption is that distinct classes tend to be separated by wide boundaries.
- Minimum description length: when forming a hypothesis, attempt to minimize the length of the description of the hypothesis. The assumption is that simpler hypotheses are more likely to be true. See Occam's razor.
- Minimum features: unless there is good evidence that a feature is useful, it should be deleted. This is the assumption behind feature selection algorithms.
- Nearest neighbors: assume that most of the cases in a small neighborhood in feature space belong to the same class. Given a case for which the class is unknown, guess that it belongs to the same class as the majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm. The assumption is that cases that are near each other tend to belong to the same class.
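As a rough illustration of how two of these biases differ in practice (a sketch using scikit-learn; the dataset and parameters are arbitrary choices), k-nearest neighbors assumes that nearby points share a class, while a linear SVM assumes the classes are separated by a wide margin, so the two models trained on the same data can disagree over much of the input space:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Same training data, two different inductive biases.
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # "nearby points share a class"
svm = SVC(kernel="linear").fit(X, y)                 # "classes are separated by a wide margin"

# The two models disagree wherever their assumptions lead to different extrapolations.
grid = np.random.default_rng(0).uniform(-1.5, 2.5, size=(1000, 2))
disagreement = np.mean(knn.predict(grid) != svm.predict(grid))
print(f"fraction of sampled input space where the two biases disagree: {disagreement:.2f}")
```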
More information is here: Inductive Bias - How to generalize on novel data
Inductive bias can be thought of as the set of assumptions we make about the domain we are trying to learn.
Technically, when we are trying to learn Y from X, the hypothesis space (the set of candidate functions mapping X -> Y) is initially infinite. To learn anything at all, we need to reduce its scope. We do this by encoding our beliefs/assumptions about the hypothesis space, which is what we call inductive bias.
By introducing these assumptions, we constrain the hypothesis space, and many of them surface as hyperparameters that we can test and refine incrementally on the data.
Examples of inductive bias (a short code sketch follows this list):
- Linear regression: Y is a linear function of the features X (more precisely, linear in the parameters/weights).
- Logistic regression: there exists a hyperplane (a linear decision boundary) that separates negative from positive examples.
- Neural networks: crudely speaking, Y is some non-linear function of X (where the non-linearity depends on the activation functions, the topology, etc.).
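To make the hypothesis-space restriction concrete, here is a rough sketch (a non-authoritative example using scikit-learn; the data and settings are arbitrary): logistic regression can only ever produce a hyperplane, so it struggles on XOR-style data, while a small neural network searches a much larger, non-linear hypothesis space:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# XOR-style data: no single hyperplane separates the two classes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Logistic regression's inductive bias (a linear decision boundary) is too strong here.
linear = LogisticRegression().fit(X, y)
# A small MLP searches a much larger hypothesis space of non-linear functions.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

print("logistic regression accuracy:", linear.score(X, y))  # near chance (~0.5)
print("small MLP accuracy:          ", mlp.score(X, y))     # much higher
```

Neither bias is "better" in general; the linear model simply cannot represent this particular target, which is exactly the trade-off a strong inductive bias makes.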