CNN - Image Resizing VS Padding (keeping aspect ratio or not?)

While people usually just resize any image into a square when training a CNN (for example, ResNet takes a 224x224 square image), that looks ugly to me, especially when the aspect ratio is far from 1.

(In fact, that might change the ground truth: the label an expert would give the distorted image could differ from the one they would give the original.)

So now I resize the image to, say, 224x160, keeping the original aspect ratio, and then I pad the image with zeros (I paste it at a random location inside a completely black 224x224 image).
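Roughly, what I do looks like this (a minimal sketch with PIL; the function name and details are just illustrative):

```python
import random
from PIL import Image

def resize_and_pad(img: Image.Image, target: int = 224) -> Image.Image:
    """Resize so the longer side fits `target`, keeping the aspect ratio,
    then paste at a random location inside a black target x target canvas."""
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    img = img.resize((new_w, new_h), Image.BILINEAR)

    canvas = Image.new("RGB", (target, target))   # all zeros (black)
    left = random.randint(0, target - new_w)      # random placement
    top = random.randint(0, target - new_h)
    canvas.paste(img, (left, top))
    return canvas
```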

My approach doesn't seem original to me, and yet I cannot find any information whatsoever comparing it with the "usual" approach. Funky!

So, which approach is better? Why? (If the answer is data dependent, please share your thoughts on when one is preferable over the other.)


Solution 1:

According to Jeremy Howard, padding a big piece of the image (a 64x224 strip, in your example) will have the following effect: the CNN will have to learn that the black part of the image is not relevant and does not help to distinguish between the classes (in a classification setting), as there is no correlation between the pixels in the black part and membership in a given class. Because you are not hard-coding this, the CNN has to learn it by gradient descent, and that may take some epochs. For this reason, you can do it if you have lots of images and computational power, but if you are on a budget for either, resizing should work better.

Solution 2:

Sorry, this is late, but this answer is for anyone facing the same issue.

First, if scaling that changes the aspect ratio would distort important features, then you have to use zero-padding.

Zero-padding doesn't make the network take longer to learn because of the large black area itself, but because of the many possible locations the unpadded image can occupy inside the padded image, since you can pad an image in many ways.

For areas with zero pixels, the output of the convolution operation is zero (assuming the convolution has no bias term). The same holds for max and average pooling. You can also show that a weight receives no update from backpropagation when its input is zero, since the weight's gradient is proportional to that input; with a zero-preserving activation such as ReLU, the zero region stays zero through the network. So in this sense the large black area does not change any weights.
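To make this concrete, here is a quick sanity check in PyTorch (my own toy example, not from the original answer), assuming a bias-free convolution:

```python
import torch
import torch.nn as nn

# A bias-free convolution over an all-zero region outputs zeros,
# and the weight gradient contributed by that region is zero too.
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)

zeros = torch.zeros(1, 1, 8, 8)              # the black padded region
out = conv(zeros)
print(out.abs().sum().item())                # 0.0 -- all outputs are zero

out.sum().backward()
print(conv.weight.grad.abs().sum().item())   # 0.0 -- no weight update from zeros
```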

However, the relative position of the unpadded image inside the padded image does affect training. This is due not to the convolution or pooling layers but to the final fully connected layer(s). For example, if the unpadded image sits on the left inside the padded image and flattening the last convolution or pooling layer gives [1, 0, 0], while the same unpadded image on the right gives [0, 0, 1], then the fully connected layer(s) must learn that [1, 0, 0] and [0, 0, 1] mean the same thing for a classification problem.
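A hypothetical miniature of that situation (my own illustration, not from the original answer): the same "object" at two positions flattens to different vectors, which an untrained linear layer maps to different outputs:

```python
import torch

# The same single-pixel "object" at the left vs. the right of a tiny feature map.
left  = torch.tensor([[1.0, 0.0, 0.0]])
right = torch.tensor([[0.0, 0.0, 1.0]])

# After flattening, the fully connected layer sees two unrelated inputs,
# so it must learn from data that both mean the same class.
fc = torch.nn.Linear(3, 2, bias=False)
print(fc(left))    # logits for the left placement
print(fc(right))   # generally different logits for the right placement
```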

Therefore, learning invariance to the different possible positions of the image is what makes training take more time. If you have 1,000,000 images, then after resizing you still have the same number of images; on the other hand, if you pad and want to cover the different possible locations (say 10 random ones per image), then you effectively have 10,000,000 images, and training will take roughly 10 times longer.

That said, it depends on your problem and what you want to achieve. Also, testing both methods will not hurt.