Keras: How to save model and continue training?

Solution 1:

As it's quite difficult to clarify where the problem is, I created a toy example from your code, and it seems to work alright.

import numpy as np
from numpy.testing import assert_allclose
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

# load the model
new_model = load_model(filepath)
assert_allclose(model.predict(x_train),
                new_model.predict(x_train),
                1e-5)

# fit the model
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

The loss continues to decrease after model loading. (restarting python also gives no problem)

Using TensorFlow backend.
Epoch 1/5
500/500 [==============================] - 2s - loss: 0.3216     Epoch 00000: loss improved from inf to 0.32163, saving model to model.h5
Epoch 2/5
500/500 [==============================] - 0s - loss: 0.2923     Epoch 00001: loss improved from 0.32163 to 0.29234, saving model to model.h5
Epoch 3/5
500/500 [==============================] - 0s - loss: 0.2542     Epoch 00002: loss improved from 0.29234 to 0.25415, saving model to model.h5
Epoch 4/5
500/500 [==============================] - 0s - loss: 0.2086     Epoch 00003: loss improved from 0.25415 to 0.20860, saving model to model.h5
Epoch 5/5
500/500 [==============================] - 0s - loss: 0.1725     Epoch 00004: loss improved from 0.20860 to 0.17249, saving model to model.h5

Epoch 1/5
500/500 [==============================] - 0s - loss: 0.1454     Epoch 00000: loss improved from inf to 0.14543, saving model to model.h5
Epoch 2/5
500/500 [==============================] - 0s - loss: 0.1289     Epoch 00001: loss improved from 0.14543 to 0.12892, saving model to model.h5
Epoch 3/5
500/500 [==============================] - 0s - loss: 0.1169     Epoch 00002: loss improved from 0.12892 to 0.11694, saving model to model.h5
Epoch 4/5
500/500 [==============================] - 0s - loss: 0.1097     Epoch 00003: loss improved from 0.11694 to 0.10971, saving model to model.h5
Epoch 5/5
500/500 [==============================] - 0s - loss: 0.1057     Epoch 00004: loss improved from 0.10971 to 0.10570, saving model to model.h5

BTW, redefining the model followed by load_weight() definitely won't work, because save_weight() and load_weight() does not save/load the optimizer.

Solution 2:

I compared my code with this example http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/ by carefully block out line-by-line and run again. After a whole day, finally, I found what was wrong.

When making char-int mapping, I used

# title_str_reduced is a string
chars = list(set(title_str_reduced))
# make char to int index mapping
char2int = {}
for i in range(len(chars)):
    char2int[chars[i]] = i    

A set is an unordered data structure. In python, when a set is converted to a list which is ordered, the order is randamly given. Thus my char2int dictionary is randomized everytime when I reopen python. I fixed my code by adding a sorted()

chars = sorted(list(set(title_str_reduced)))

This forces the conversion to a fixed order.

Solution 3:

The checkmarked Answer is not correct; the real problem is more subtle.

When you create a ModelCheckpoint() , check the best:

cp1 = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
print(cp1.best)

you will see that this is set to np.inf, which unfortunately is not your last best when you stopped training. So when you re-train and recreate the ModelCheckpoint(), if you call fit and if the loss is less than previously known value, then it seems to work, but in more complex problems you will end up saving a bad model and lose the best.

You can fix this by overwriting the cp.best parameter as shown below:

import numpy as np
from numpy.testing import assert_allclose
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
cp1= ModelCheckpoint(filepath=filepath, monitor='loss',     save_best_only=True, verbose=1, mode='min')
callbacks_list = [cp1]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, shuffle=True, validation_split=0.1, callbacks=callbacks_list)

# load the model
new_model = load_model(filepath)
#assert_allclose(model.predict(x_train),new_model.predict(x_train), 1e-5)
score = model.evaluate(x_train, y_train, batch_size=50)
cp1 = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
cp1.best = score # <== ****THIS IS THE KEY **** See source for  ModelCheckpoint

# fit the model
callbacks_list = [cp1]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)