Is there a connection between topological mixing and squashing functions used in neural networks?

Sigmoid, ReLU, tanh, logistic and similar "squashing" functions are popular in neural networks because they introduce nonlinearity into the transformations of the input vector, allowing the network to fit complex input-output surfaces.

Deeper networks (stacking more compositions of these nonlinear activation functions) are known to produce "more nonlinear" approximations of the target input-output mapping.

If viewed as iterated maps, these nonlinear activation functions ($\mathbb{R} \rightarrow \mathbb{R}$) perform a "squashing/folding" operation.

Forward propagation from layer to layer acts like a "stretching" operation: the incoming activations are multiplied by weights and summed, which can produce larger values. Those values are then clamped/folded again by the activation function, and so on across layers. This reminds me of iterated maps in chaotic systems, which exhibit exactly this stretch-and-fold behavior and, with it, mixing.
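To make the analogy concrete, here is a minimal sketch (my own toy construction, with a hand-picked gain `w`, not any particular network) comparing the iteration $x \mapsto \tanh(wx)$ with the logistic map. Multiplying by $w > 1$ stretches the line and tanh folds it back into $(-1, 1)$; note, though, that because tanh is monotone this 1-D iteration just settles onto a fixed point, whereas the logistic map's non-monotone fold is what produces mixing.

```python
import numpy as np

def iterate(f, x0, n):
    """Apply f to x0 a total of n times and return the whole orbit."""
    orbit = [x0]
    for _ in range(n):
        orbit.append(f(orbit[-1]))
    return np.array(orbit)

w = 3.0                                       # gain > 1: the "stretching" step
squash_fold = lambda x: np.tanh(w * x)        # stretch by w, then fold into (-1, 1)
logistic    = lambda x: 3.9 * x * (1.0 - x)   # logistic map in its chaotic regime

print(iterate(squash_fold, 0.1, 10))   # converges to a fixed point near 0.995
print(iterate(logistic,    0.1, 10))   # wanders chaotically through (0, 1)
```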

Granted, the mapping from layer to layer is not exactly an iterated map because layers can have different sizes, but suppose each layer is associated with a bijective function that carries its representation to a common space $\mathbb{R}^a$; then forward propagation becomes an iterated map on that space.
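As a rough sketch of that formalisation (my own construction; the names and the assumption that every layer shares the same weights are purely illustrative), each layer can be written as a map $\mathbb{R}^a \rightarrow \mathbb{R}^a$ that stretches into a wider hidden representation, squashes, and projects back down. With shared weights, forward propagation through $n$ layers is literally the $n$-fold iterate of this one map; with per-layer weights it is a composition of different but similarly shaped maps.

```python
import numpy as np

rng = np.random.default_rng(0)
a, hidden = 4, 16                        # common width a, wider intermediate layer

W_in  = rng.normal(size=(hidden, a))     # "stretch" into the wider layer
W_out = rng.normal(size=(a, hidden))     # map back down to the common space R^a
                                         # (plays the role of the projection above)

def layer_as_map(x):
    """One layer viewed as a map R^a -> R^a: stretch, squash, project back."""
    return W_out @ np.tanh(W_in @ x)

x = rng.normal(size=a)
for _ in range(5):                       # forward propagation as iteration of the map
    x = layer_as_map(x)
print(x)
```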

I noticed that each nonlinear activation function is quite distinctive in how it squashes its input space: tanh "pinches" $\mathbb{R}$ from both the positive and negative directions into $(-1, 1)$, while ReLU holds the non-negative half-line fixed and collapses everything to the left of zero onto zero.
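A tiny numerical check of that difference (nothing beyond NumPy, and the loop count is arbitrary): tanh compresses both tails into $(-1, 1)$ and under repeated application pulls every point toward its fixed point at 0, while ReLU leaves the non-negative half-line untouched and is idempotent.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
relu = lambda t: np.maximum(t, 0.0)

print(np.tanh(x))                            # both tails pinched into (-1, 1)
print(relu(x))                               # negatives collapsed to 0, positives unchanged
print(np.allclose(relu(relu(x)), relu(x)))   # True: ReLU o ReLU == ReLU (idempotent)

# Iterating tanh drains every point toward 0 (slowly, since tanh'(0) = 1).
y = x.copy()
for _ in range(1000):
    y = np.tanh(y)
print(y)                                     # all entries have shrunk toward 0
```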

Is there any literature that discusses these nonlinear activation functions, or artificial neural networks more broadly, in terms of how they pull apart and push together their input space?


I'm afraid this probably won't answer your question exactly, but the closest thing that came to mind is this blog post. The idea is that the nonlinear units in a neural network deform the input space in such a way that the inputs become linearly separable.
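For a concrete toy illustration of that idea (hand-picked weights, not taken from the post): two classes on a line that no single threshold can separate become separable by one linear rule after a two-unit tanh layer bends the line into $\mathbb{R}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on a line that no single threshold separates:
# class 0 sits in the middle, class 1 on both outer sides.
x_mid   = rng.uniform(-0.5, 0.5, size=50)                     # class 0
x_outer = np.concatenate([rng.uniform(-2.0, -1.0, size=25),
                          rng.uniform( 1.0,  2.0, size=25)])  # class 1

def hidden(x):
    """A 2-unit tanh layer with hand-picked weights that bends the line into R^2."""
    h1 = np.tanh( 4.0 * (x - 0.75))
    h2 = np.tanh(-4.0 * (x + 0.75))
    return np.stack([h1, h2], axis=-1)

H_mid, H_outer = hidden(x_mid), hidden(x_outer)

# In the deformed (hidden) space, the single line h1 + h2 = -1 separates the classes.
print((H_mid.sum(axis=1)   < -1.0).all())    # True: class 0 lies below the line
print((H_outer.sum(axis=1) > -1.0).all())    # True: class 1 lies above the line
```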