Solution 1:

The pooling and convolutional ops slide a "window" across the input tensor. Using tf.nn.conv2d as an example: If the input tensor has 4 dimensions: [batch, height, width, channels], then the convolution operates on a 2D window on the height, width dimensions.

strides determines how much the window shifts by in each of the dimensions. The typical use sets the first (the batch) and last (the depth) stride to 1.

Let's use a very concrete example: Running a 2-d convolution over a 32x32 greyscale input image. I say greyscale because then the input image has depth=1, which helps keep it simple. Let that image look like this:

00 01 02 03 04 ...
10 11 12 13 14 ...
20 21 22 23 24 ...
30 31 32 33 34 ...
...

Let's run a 2x2 convolution window over a single example (batch size = 1). We'll give the convolution an output channel depth of 8.

The input to the convolution has shape=[1, 32, 32, 1].

If you specify strides=[1,1,1,1] with padding=SAME, then the output of the filter will be [1, 32, 32, 8].
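
To make the shape arithmetic concrete, here is a minimal pure-Python sketch of how the output shape follows from the input shape, window size, strides, and padding mode. The helper names conv_output_size and conv2d_output_shape are made up for this illustration and are not part of TensorFlow. The formulas are the usual SAME/VALID rules: SAME gives ceil(in / stride), VALID gives ceil((in - window + 1) / stride).

```python
import math


def conv_output_size(in_size, window, stride, padding):
    """Output length along one spatial dimension (height or width)."""
    if padding == "SAME":
        # SAME pads the input, so only the stride shrinks the output.
        return math.ceil(in_size / stride)
    if padding == "VALID":
        # VALID adds no padding, so the window must fit entirely inside.
        return math.ceil((in_size - window + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


def conv2d_output_shape(input_shape, window, strides, out_channels, padding):
    """Hypothetical helper: shape of a conv2d result for an NHWC input."""
    batch, height, width, _ = input_shape
    _, stride_h, stride_w, _ = strides
    win_h, win_w = window
    return [
        batch,
        conv_output_size(height, win_h, stride_h, padding),
        conv_output_size(width, win_w, stride_w, padding),
        out_channels,
    ]


# The example from the text: 32x32 greyscale image, 2x2 window, 8 output channels.
print(conv2d_output_shape([1, 32, 32, 1], (2, 2), [1, 1, 1, 1], 8, "SAME"))
```

Passing strides=[1, 2, 2, 1] to the same helper halves the height and width, since each output position then covers a step of two input rows and columns.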

The filter will first create an output for:

F(00 01
  10 11)

And then for:

F(01 02
  11 12)

and so on. Then it will move to the second row, calculating:

F(10, 11
  20, 21)

then

F(11, 12
  21, 22)

If you specify a stride of [1, 2, 2, 1], the 2x2 windows no longer overlap, and with padding=SAME the output shape becomes [1, 16, 16, 8]. It will compute:

F(00, 01
  10, 11)

and then

F(02, 03
  12, 13)
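
The window walk above can be reproduced with a small pure-Python sketch. It uses the same row/column labels as the example image (on a 4x5 corner of it, to keep the output short) and ignores padding, so it only lists windows that fit entirely inside the grid. The function name window_positions is an assumption for this illustration, not a TensorFlow API.

```python
def window_positions(rows, cols, window, stride):
    """List the labels covered by each window, scanning row by row (no padding)."""
    windows = []
    for top in range(0, rows - window + 1, stride):
        for left in range(0, cols - window + 1, stride):
            windows.append(
                [
                    [f"{r}{c}" for c in range(left, left + window)]
                    for r in range(top, top + window)
                ]
            )
    return windows


# Spatial stride 1: consecutive windows overlap by one column (or one row).
stride1 = window_positions(4, 5, window=2, stride=1)
print(stride1[0], stride1[1])

# Spatial stride 2: a 2x2 window never revisits a pixel.
stride2 = window_positions(4, 5, window=2, stride=2)
print(stride2[0], stride2[1])
```

The stride is simply the step of the two range() loops, which is why a stride equal to the window size gives non-overlapping windows.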

The stride operates similarly for the pooling operators.
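
As a rough illustration of the pooling case, here is a pure-Python sketch of 2-D max pooling over a single-channel image stored as a list of lists. It assumes no padding, and max_pool_2d is a made-up name for this example rather than a TensorFlow function. The only difference from the convolution walk is that each window is reduced with max() instead of a learned filter.

```python
def max_pool_2d(image, window, stride):
    """Max-pool a 2-D list of numbers with a square window (no padding)."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for top in range(0, rows - window + 1, stride):
        pooled_row = []
        for left in range(0, cols - window + 1, stride):
            # Reduce the window to its largest value.
            pooled_row.append(
                max(
                    image[r][c]
                    for r in range(top, top + window)
                    for c in range(left, left + window)
                )
            )
        pooled.append(pooled_row)
    return pooled


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Stride 2 with a 2x2 window: four non-overlapping windows.
print(max_pool_2d(image, window=2, stride=2))
```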

Question 2: Why strides [1, x, y, 1] for convnets

The first 1 is the batch: You don't usually want to skip over examples in your batch, or you shouldn't have included them in the first place. :)

The last 1 is the depth of the convolution: You don't usually want to skip inputs, for the same reason.

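The strides argument of conv2d is written generally, with one entry per input dimension, so in principle you could slide the window along other dimensions, but that's not a typical use in convnets, and the implementation may reject strides other than 1 for the batch and depth dimensions. The typical use is to use them spatially.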

Question 3: Why reshape to -1

-1 is a placeholder that says "adjust as necessary to match the size needed for the full tensor." It's a way of making the code independent of the input batch size, so that you can change your pipeline and not have to adjust the batch size everywhere in the code.
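
To show what "adjust as necessary" means in numbers, here is a small pure-Python sketch of how a -1 entry can be resolved from the total element count. The function resolve_shape is hypothetical and only mimics the arithmetic; the 28x28 single-channel image size is also just an assumption for the example.

```python
import math


def resolve_shape(num_elements, shape):
    """Replace a single -1 in shape so the element count matches num_elements."""
    unknown = [i for i, dim in enumerate(shape) if dim == -1]
    if len(unknown) > 1:
        raise ValueError("at most one dimension may be -1")
    known = math.prod(dim for dim in shape if dim != -1)
    if not unknown:
        if known != num_elements:
            raise ValueError("shape does not match the number of elements")
        return list(shape)
    if known == 0 or num_elements % known != 0:
        raise ValueError("cannot infer the -1 dimension")
    resolved = list(shape)
    # The inferred dimension is whatever is left after the known ones.
    resolved[unknown[0]] = num_elements // known
    return resolved


# A batch of 50 flattened 28x28 images reshaped to [batch, height, width, channels].
print(resolve_shape(50 * 28 * 28, [-1, 28, 28, 1]))
```

Changing the batch from 50 to 128 needs no change to the target shape, which is the point of using -1.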

Solution 2:

The inputs are 4 dimensional and are of form: [batch_size, image_rows, image_cols, number_of_colors]

Strides, in general, define the step between successive applications of an operation, and therefore how much those applications overlap. In the case of conv2d, the stride specifies the distance between consecutive applications of the convolutional filters. A value of 1 in a specific dimension means that we apply the operator at every row/col, a value of 2 means every second one, and so on.

Re 1) The values that matter for convolutions are the 2nd and 3rd, and they set the step between applications of the convolutional filters along rows and columns. The value of [1, 2, 2, 1] says that we want to apply the filters on every second row and column.

Re 2) I don't know the technical limitations (it might be a cuDNN requirement), but typically people use strides along the row or column dimensions. It doesn't necessarily make sense to stride over the batch dimension. I'm not sure about the last dimension.

Re 3) Setting -1 for one of the dimensions means "set the value for that dimension so that the total number of elements in the tensor is unchanged". In our case, the -1 will be equal to the batch_size.