Mathematical explanation behind a picture posted (lifted from facebook)

In the image given below, there is a picture of the famous south Indian actor Rajinikanth which can be seen only if you shake your head! I lifted this from Facebook.

[image: the striped picture that hides the actor's face]

I am just curious to know if there is any mathematical explanation for it. Is there any way to know how this image was created in the first place?

PS : If this question (although interesting) is inappropriate here, that is okay, and I hope it could be migrated to some other Stack Exchange site.

ADDED

(An experiment to show that it is a physical phenomenon)

After some comments expressing doubt as to whether this is a physical phenomenon, I did a small experiment using a simple camera. I shot photos of the picture displayed on an LCD monitor in two different cases. In Case-1 the camera was still, and in Case-2 the camera was shaking to and fro along a circular arc about an axis in the vertical plane passing through the centre of the camera, just as we shake our head. I did the shaking by hand. I have given the photos below.

[image: Case-1 photo]

**Case-1** (still camera)

[image: Case-2 photo]

**Case-2** (shaking camera)

It can be observed that the face of Rajini is clearly visible in Case-2 (shaking camera), whereas in Case-1 (still camera) it is not.

PS : Now there is really no need to shake our head.

ADDED 2

After a recommendation by Willie here, I have added Case-3, where the camera is shaking vertically (parallel to the stripes in the picture).

It can be observed that there is not much of an effect when the camera is shaking parallel to the stripes.

[image: Case-3 photo]

**Case-3** (shaking camera vertically)


What you are seeing is a physical manifestation of the mathematical operation known as convolution.

First let me show you some pictures; we'll get into the mathematics afterwards. We start with the original

[image: the original picture]

I took the image, desaturated the colours, duplicated it onto another layer, and added the two layers pixel-wise after translating one of them (a code sketch of this procedure is given after the pictures). With a horizontal translation that is half the "wavelength" of the black bars, we get

[image: horizontal translation]

With a translation of the same number of pixels, but vertically, we get

[image: vertical translation]

and finally, a diagonal translation at -45 degrees.

[image: diagonal translation]
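
For anyone who wants to reproduce these pictures digitally, here is a minimal sketch of the procedure in Python. It is only a sketch: the file name `rajini.png` and the 8-pixel shift are hypothetical placeholders (the shift should be half the wavelength of the black bars in your copy of the image), and `np.roll` wraps pixels around the edges rather than truly translating them.

```python
import numpy as np
from PIL import Image

# Minimal sketch of the manipulation described above: desaturate,
# then average the image with a translated copy of itself.
# "rajini.png" and the 8-pixel shift are hypothetical placeholders;
# the shift should be half the wavelength of the black bars.
img = np.asarray(Image.open("rajini.png").convert("L"), dtype=float)

def average_with_translate(image, dy, dx):
    """Pixel-wise average of an image with a copy shifted by (dy, dx)."""
    shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))  # note: wraps at the edges
    return (image + shifted) / 2.0

horizontal = average_with_translate(img, 0, 8)   # second picture (horizontal shift)
vertical   = average_with_translate(img, 8, 0)   # third picture (vertical shift)
diagonal   = average_with_translate(img, 8, 8)   # fourth picture (-45 degrees)

Image.fromarray(horizontal.astype(np.uint8)).save("rajini_horizontal.png")
```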


So what is going on? Why did I say that this is a manifestation of convolution?

Recall that the convolution of two functions defined on (say) the real line $\mathbb{R}$ is defined to be

$$ (f * g)(x) = \int_{\mathbb{R}} f(y)\, g(x-y)\, dy $$

In a course on Fourier analysis, one is taught that this is the dual operation to multiplication: convolution in physical space corresponds to (point-wise) multiplication in Fourier space. This immediately gives the following interpretation of convolution in signal processing:

Convolving a signal $f$ by a function $\psi$ is the same as applying a frequency dependent filter $\hat{\psi}$ to the signal $f$.
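
This duality is easy to check numerically. The sketch below is a toy verification with random 1-D signals on the discrete circle (nothing specific to the image): it computes a circular convolution directly from the definition and again by multiplying discrete Fourier transforms point-wise.

```python
import numpy as np

# Numerical check of "convolution in physical space = point-wise
# multiplication in Fourier space", on the discrete circle Z/NZ.
N = 256
rng = np.random.default_rng(0)
f = rng.standard_normal(N)
g = rng.standard_normal(N)

# Circular convolution computed directly from the definition.
direct = np.zeros(N)
for n in range(N):
    for m in range(N):
        direct[n] += f[m] * g[(n - m) % N]

# The same convolution computed on the Fourier side.
via_fourier = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.allclose(direct, via_fourier))  # prints True
```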

Another way of looking at convolution, however, after staring at the above definition for a bit, is that

A convolution is a way of taking a weighted average of a signal with its translates, where the weight depends on the amount of translation.

It is in this second sense that we will first look at the phenomenon you asked about. In the second image of this post, I averaged the original with its horizontal translation by half the wavelength of the black bars; hence it is a convolution. Similarly, in the third/fourth image, I averaged the original with a vertical/diagonal translation, which are also convolutions. And you see that this reproduces your observation that the direction in which you shake your head/camera has an effect on the image seen/captured.
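
As a simplified one-dimensional model of why the half-wavelength translation works so well: think of the bars as an oscillation $\cos(kx)$ with wavelength $2\pi/k$, and of the hidden face as a slowly varying function $F(x)$. Averaging each of them with its translate by half a wavelength, $\pi/k$, gives

$$\tfrac12\Bigl[\cos(kx)+\cos\bigl(k(x-\tfrac{\pi}{k})\bigr)\Bigr]=\tfrac12\bigl[\cos(kx)-\cos(kx)\bigr]=0, \qquad \tfrac12\bigl[F(x)+F(x-\tfrac{\pi}{k})\bigr]\approx F(x),$$

so this two-point weighted average wipes out the bars while leaving the slowly varying content essentially untouched; it is exactly the operation used for the second picture above.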

So how is the process of shaking your head or shaking a camera a process of convolution? The idea is that the images you see with your eyes and capture with a camera do not come from photons all emitted at the same instant in time (special relativity notwithstanding). In your vision, there is the well-known phenomenon of persistence of vision, which posits that the perceived image is actually made up of photons arriving over a roughly 40 millisecond interval. Similarly, the shutter speed of a camera determines how long the camera registers light, so a camera with the shutter speed set to 1/25 will "open its eye" for 40 milliseconds, and the image registered on the CCD or on film will be made up of photons arriving in that window.

Now, if you shake your head or camera so that the retina, the CCD, or the film moves significantly during those 40 milliseconds, each of your retinal cells, each of the photoelements on the CCD, or each of the dye pigments on the film will be exposed to photons originating from different spatial positions. (I am grossly simplifying here, but that's the moral of the story.)

To summarise: your eyes and cameras already take a convolution of the incoming signal in time when they compose the image. By shaking the apparatus you convert the temporal convolution into a spatial convolution, which means that you are taking a weighted average of the image and its spatial translates. This is why what you see and capture on camera can be described analogously by digitally manipulating the image via an averaging/convolution procedure.
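
If you want to play with this, here is a rough simulation of a horizontal camera shake in Python. It is only a sketch under simplifying assumptions: uniform horizontal motion, a made-up 16-pixel amplitude, the placeholder file name `rajini.png`, and `np.roll` wrapping at the edges. The recorded frame is modelled as the average of the scene over all the translations visited during the exposure, i.e. a spatial convolution with a one-pixel-high motion-blur kernel.

```python
import numpy as np
from PIL import Image

# Crude simulation of a horizontal camera shake during one exposure:
# the recorded frame is the average of the scene over many small
# horizontal translations.  File name and amplitude are placeholders.
img = np.asarray(Image.open("rajini.png").convert("L"), dtype=float)

amplitude = 16                               # total shake, in pixels (hypothetical)
frames = [np.roll(img, shift=d, axis=1)      # scene as seen at each instant
          for d in range(-amplitude // 2, amplitude // 2 + 1)]
exposure = np.mean(frames, axis=0)           # what the sensor integrates over time

Image.fromarray(exposure.astype(np.uint8)).save("rajini_shaken.png")
```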

Note that this corresponds somewhat with Henning's comment to your question. The "eye's edge detection" he mentions is, roughly speaking, a description of how the eye is sensitive to different spatial frequencies of a signal (not to be confused with the actual electromagnetic frequencies, which determine the colour). By shaking your head you apply a convolution operator, which in frequency space introduces a cut-off for high spatial frequency components. By reducing the high spatial frequency components, your eye is forced to get its information from the lower-frequency components in which the image of the Indian actor hides. (There are some technical inaccuracies in this paragraph about how human physiology works and how it interacts with the shaking of the head, but I think this simpler picture illustrates the idea better.)
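
To make the frequency-space statement concrete in the one-dimensional model used earlier: averaging with the half-wavelength translate is convolution with the kernel $\psi = \tfrac12(\delta_0 + \delta_{\pi/k})$, whose Fourier transform (with the convention $\hat{f}(\xi) = \int f(x)\, e^{-i x \xi}\, dx$) is

$$\hat{\psi}(\xi) = \tfrac12\left(1 + e^{-i\pi\xi/k}\right) = e^{-i\pi\xi/(2k)} \cos\!\left(\frac{\pi\xi}{2k}\right),$$

so $|\hat{\psi}(0)| = 1$ while $|\hat{\psi}(k)| = 0$: low spatial frequencies pass through essentially unchanged, and the bar frequency $k$ is suppressed completely. (The kernel produced by an actual head shake is an average over a whole path of translations rather than just two points, but the qualitative effect, attenuation of the high spatial frequencies along the direction of motion, is the same.)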


At this point I should mention that the idea of taking spatial convolutions of images, and the exchange between temporal and spatial convolutions under motion of the camera, is not only useful for optical illusions. It actually has industrial applications in automatic image deblurring.


(This is in response to the answer by Willie. I have added it here as I couldn't insert an image in the comments; I hope that is OK in this special case.)

Willie uses the notion of spatial frequency in his answer. He mentions that the information pertaining to the face of the actor is present in the lower spatial frequency components, while the stripes correspond to high spatial frequency components. According to his answer, when we shake the head/camera, we are convolving the picture with a low-pass filter with some cut-off frequency, and this operation allows only the lower spatial frequency components into the final image, thereby forcing the eye to interpret the information in the lower spatial frequency components, which is nothing but the face of the actor.

Here I argue that the notion of spatial frequency is not useful in all circumstances. For example, consider the image shown below. Its upper half is taken from the original image, and its lower half is taken from the second image in Willie's answer. That second image is the result of convolving the original image with a low-pass filter, hence it does not contain high spatial frequency components.

The new image is thus a combination of two halves. The upper half contains high spatial frequency components in the form of stripes; the lower half does not. If we consider the entire image as one signal, then it contains high spatial frequency components. Yet there are no stripes in the lower half, and the lower half of the actor's face is clearly visible without any head shaking; to view the upper half we still need to shake our head. In this case spatial frequency is of no use in characterizing the stripes: if there are high spatial frequency components we can say that there could be stripes, but we cannot say whether they are present only in the upper half, only in the lower half, or everywhere. Simply speaking, the notion of spatial frequency cannot give any information about where in the image the stripes are present. But the human eye does so well that it can see the face in the lower half of the image, where the stripes are not present!

[image: composite picture, striped upper half from the original and filtered lower half from Willie's answer]
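
A one-dimensional toy version of this point can be checked numerically with the sketch below (all parameters are made up for illustration): a signal that is striped everywhere and a signal that is striped only on its left half both have their dominant Fourier peak at the stripe frequency, so the global spectrum records that the stripe frequency is present but not where.

```python
import numpy as np

# A 1-D illustration of the point above: the global Fourier spectrum
# records *that* a stripe frequency is present, not *where*.
N = 512
x = np.arange(N)
stripes = np.cos(2 * np.pi * x / 16)          # "bars" with a 16-sample period

everywhere = stripes.copy()                   # striped over the whole signal
half_only = stripes.copy()
half_only[N // 2:] = 0.0                      # striped only on the left half

for name, sig in [("everywhere", everywhere), ("left half only", half_only)]:
    spectrum = np.abs(np.fft.rfft(sig))
    print(name, "-> dominant frequency bin:", np.argmax(spectrum))
# Both signals have their dominant peak at the stripe frequency (bin 32),
# even though only one of them is striped everywhere.
```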