Why is training-set accuracy during fit() different from the accuracy calculated right after using predict() on the same data?

I have written a basic deep learning model in TensorFlow/Keras.

Why does the training-set accuracy reported at the end of training (0.4097) differ from the accuracy calculated directly afterwards on the same training data with the predict function (or with evaluate, which gives the same number), namely 0.6463?

MWE below; output directly after.

from extra_keras_datasets import kmnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization
import numpy as np


# Model configuration
no_classes = 10


# Load KMNIST dataset
(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist')

# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape 

# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)



# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)


# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize input data
input_train = input_train / 255
input_test = input_test / 255


# Create the model
model = Sequential()
model.add(Flatten(input_shape=input_shape))
model.add(Dense(no_classes, activation='softmax'))


# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])


# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=2000,
            epochs=1,
            verbose=1)

prediction = model.predict(input_train)
print("Prediction accuracy = ", np.mean( np.argmax(prediction, axis=1) == target_train))

model.evaluate(input_train, target_train, verbose=2)

Last couple of lines of output:

30/30 [==============================] - 0s 3ms/step - loss: 1.8336 - accuracy: 0.4097
Prediction accuracy =  0.6463166666666667
1875/1875 - 1s - loss: 1.3406 - accuracy: 0.6463

Edit:

The initial answers below have solved my first problem by pointing out that the batch size matters when you only run one epoch. With small batch sizes (or batch size = 1), or with more epochs, you can push the post-fitting prediction accuracy pretty close to the final accuracy reported by fit() itself. Which is good!
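For instance, here is a minimal sketch (my own illustration, not part of the original MWE; the batch size and epoch count are arbitrary) that continues training the model from the MWE with smaller batches and more epochs, then compares fit()'s last reported accuracy with evaluate():

# Continue training the model from the MWE with smaller batches and more epochs,
# then compare fit()'s last reported (epoch-averaged) accuracy with evaluate().
history = model.fit(input_train, target_train,
                    batch_size=32,   # smaller batches -> many more updates per epoch
                    epochs=5,
                    verbose=1)

print("Last fit() accuracy :", history.history['accuracy'][-1])
print("evaluate() accuracy :", model.evaluate(input_train, target_train, verbose=0)[1])

With enough updates per epoch the two numbers end up very close.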

I originally asked this question because I was having trouble with a more complex model.

I'm still having trouble understanding what's happening in this case (and yes, it involves batch normalisation). To get my MWE, replace everything below 'Create the model' above with the code below, which implements a few fully connected layers with batch normalisation.

When you run two epochs of this, you'll see really stable accuracies across all 30 mini-batches (30 because the 60,000 training examples are split into batches of 2,000). I consistently see around 83% accuracy across the whole second epoch of training.

But after fitting, the prediction accuracy is an abysmal 10% or thereabouts. Can anyone explain this?

model = Sequential()
model.add(Dense(50, activation='relu', input_shape = input_shape))
model.add(BatchNormalization())
model.add(Dense(20, activation='relu'))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(no_classes, activation='softmax'))


# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])


# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=2000,
            epochs=2,
            verbose=1)

prediction = model.predict(input_train)

print("Prediction accuracy = ", np.mean( np.argmax(prediction, axis=1) == target_train))

model.evaluate(input_train, target_train, verbose=2, batch_size=2000)

Last couple of lines of output:

30/30 [==============================] - 46s 2s/step - loss: 0.5567 - accuracy: 0.8345
Prediction accuracy =  0.10098333333333333

Solution 1:

One reason this can happen is that the last reported accuracy is an average over the entire epoch, during which the parameters were not constant and were still being optimised.

When you evaluate the model, the parameters stop changing and remain in their final (hopefully, most optimised) state. During the last epoch, by contrast, the parameters were in all kinds of (hopefully, less optimised) states, especially at the start of the epoch.
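One way to check this is a minimal sketch like the following (my own addition, not from the original answer; FullTrainsetEval is a hypothetical helper): evaluate the full training set with the weights frozen at the end of every epoch, and compare that with the epoch-averaged accuracy that fit() prints.

import tensorflow

# Evaluate the full training set with the weights fixed at the end of each
# epoch, for comparison with the epoch-averaged 'accuracy' printed by fit().
class FullTrainsetEval(tensorflow.keras.callbacks.Callback):
    def __init__(self, x, y):
        super().__init__()
        self.x, self.y = x, y

    def on_epoch_end(self, epoch, logs=None):
        loss, acc = self.model.evaluate(self.x, self.y, verbose=0)
        print(f" epoch {epoch}: full training-set accuracy with fixed weights = {acc:.4f}")

# Usage with the model from the question:
# model.fit(input_train, target_train, batch_size=2000, epochs=1, verbose=1,
#           callbacks=[FullTrainsetEval(input_train, target_train)])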


(I initially deleted the section below because the model in the original question doesn't use batch norm, but it is relevant to the model added in the edit.)


I am assuming this is due to BatchNormalization.

See for example here

During training, each batch is normalised with its own batch statistics, while a moving average of the mean and variance is updated.

During inference, those stored moving averages are used as the normalisation parameters instead.

This is likely to be the cause of the difference.

Please try without it, and see if such drastic differences still exist.
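To see the two modes side by side, here is a minimal sketch (my own check, not part of the original answer; accuracy_in_mode is a hypothetical helper) that runs the batch-norm model from the edit on the same training batches with training=True (batch statistics) and training=False (stored moving averages):

import numpy as np

# Hypothetical helper: accuracy of the model over x/y in a fixed training mode.
def accuracy_in_mode(model, x, y, training, batch_size=2000):
    correct = 0
    for i in range(0, len(x), batch_size):
        probs = model(x[i:i + batch_size], training=training).numpy()
        correct += np.sum(np.argmax(probs, axis=1) == y[i:i + batch_size])
    return correct / len(x)

# Calling the model with training=True also nudges the moving averages,
# so the inference-mode pass is done first.
acc_inference = accuracy_in_mode(model, input_train, target_train, training=False)
acc_training = accuracy_in_mode(model, input_train, target_train, training=True)
print("training=False (moving averages): ", acc_inference)
print("training=True  (batch statistics):", acc_training)

If batch normalisation is the culprit, the training=True number should look like the ~83% printed during fitting, while the training=False number should match the low predict()/evaluate() accuracy.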

Solution 2:

Just adding to @Gulzar's answer: this effect can be very pronounced here because the OP used only one epoch (a lot of parameters change at the very beginning of training), the batch size in the evaluate method (which defaults to 32) is not the same as in the fit method, and the batch size is much smaller than the whole dataset (meaning a lot of weight updates during each epoch).

Just adding more epochs to the same experiment attenuates this effect.

# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=2000,
            epochs=40,
            verbose=1)

Result

Epoch 40/40
30/30 [==============================] - 0s 11ms/step - loss: 0.5663 - accuracy: 0.8339
Prediction accuracy =  0.8348
1875/1875 - 2s - loss: 0.5643 - accuracy: 0.8348 - 2s/epoch - 1ms/step
[0.5643048882484436, 0.8348000049591064]