How can I use a pre-trained neural network with grayscale images?

I have a dataset containing grayscale images and I want to train a state-of-the-art CNN on them. I'd very much like to fine-tune a pre-trained model (like the ones here).

The problem is that almost all models I can find the weights for have been trained on the ImageNet dataset, which contains RGB images.

I can't use one of those models because their input layer expects a batch of shape (batch_size, height, width, 3) or (64, 224, 224, 3) in my case, but my images batches are (64, 224, 224).

Is there any way that I can use one of those models? I've thought of dropping the input layer after I've loaded the weights and adding my own (like we do for the top layers). Is this approach correct?


The model's architecture cannot be changed because the weights have been trained for a specific input configuration. Replacing the first layer with your own would pretty much render the rest of the weights useless.

-- Edit: elaboration suggested by Prune--
CNNs are built so that as they go deeper, they can extract high-level features derived from the lower-level features that the previous layers extracted. By removing the initial layers of a CNN, you are destroying that hierarchy of features because the subsequent layers won't receive the features that they are supposed to as their input. In your case the second layer has been trained to expect the features of the first layer. By replacing your first layer with random weights, you are essentially throwing away any training that has been done on the subsequent layers, as they would need to be retrained. I doubt that they could retain any of the knowledge learned during the initial training.
--- end edit ---

There is an easy way, though, which you can make your model work with grayscale images. You just need to make the image to appear to be RGB. The easiest way to do so is to repeat the image array 3 times on a new dimension. Because you will have the same image over all 3 channels, the performance of the model should be the same as it was on RGB images.

In numpy this can be easily done like this:

print(grayscale_batch.shape)  # (64, 224, 224)
rgb_batch = np.repeat(grayscale_batch[..., np.newaxis], 3, -1)
print(rgb_batch.shape)  # (64, 224, 224, 3)

The way this works is that it first creates a new dimension (to place the channels) and then it repeats the existing array 3 times on this new dimension.

I'm also pretty sure that keras' ImageDataGenerator can load grayscale images as RGB.


Converting grayscale images to RGB as per the currently accepted answer is one approach to this problem, but not the most efficient. You most certainly can modify the weights of the model's first convolutional layer and achieve the stated goal. The modified model will both work out of the box (with reduced accuracy) and be finetunable. Modifying the weights of the first layer does not render the rest of the weights useless as suggested by others.

To do this, you'll have to add some code where the pretrained weights are loaded. In your framework of choice, you need to figure out how to grab the weights of the first convolutional layer in your network and modify them before assigning to your 1-channel model. The required modification is to sum the weight tensor over the dimension of the input channels. The way the weights tensor is organized varies from framework to framework. The PyTorch default is [out_channels, in_channels, kernel_height, kernel_width]. In Tensorflow I believe it is [kernel_height, kernel_width, in_channels, out_channels].

Using PyTorch as an example, in a ResNet50 model from Torchvision (https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py), the shape of the weights for conv1 is [64, 3, 7, 7]. Summing over dimension 1 results in a tensor of shape [64, 1, 7, 7]. At the bottom I've included a snippet of code that would work with the ResNet models in Torchvision assuming that an argument (inchans) was added to specify a different number of input channels for the model.

To prove this works I did three runs of ImageNet validation on ResNet50 with pretrained weights. There is a slight difference in the numbers for run 2 & 3, but it's minimal and should be irrelevant once finetuned.

  1. Unmodified ResNet50 w/ RGB Images : Prec @1: 75.6, Prec @5: 92.8
  2. Unmodified ResNet50 w/ 3-chan Grayscale Images: Prec @1: 64.6, Prec @5: 86.4
  3. Modified 1-chan ResNet50 w/ 1-chan Grayscale Images: Prec @1: 63.8, Prec @5: 86.1
def _load_pretrained(model, url, inchans=3):
    state_dict = model_zoo.load_url(url)
    if inchans == 1:
        conv1_weight = state_dict['conv1.weight']
        state_dict['conv1.weight'] = conv1_weight.sum(dim=1, keepdim=True)
    elif inchans != 3:
        assert False, "Invalid number of inchans for pretrained weights"
    model.load_state_dict(state_dict)

def resnet50(pretrained=False, inchans=3):
    """Constructs a ResNet-50 model.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = ResNet(Bottleneck, [3, 4, 6, 3], inchans=inchans)
    if pretrained:
        _load_pretrained(model, model_urls['resnet50'], inchans=inchans)
    return model