TensorFlow TextVectorization producing Ragged Tensor with no padding after loading it from pickle

I have a TensorFlow TextVectorization layer named "eng_vectorization":

vocab_size = 15000
sequence_length = 20

eng_vectorization = TextVectorization(max_tokens = vocab_size,
                                  output_mode = 'int',
                                  output_sequence_length = sequence_length)

train_eng_texts = [pair[0] for pair in text_pairs]  # text_pairs is my English-Spanish text data.
eng_vectorization.adapt(train_eng_texts)

I saved it to a pickle file using this code:

pickle.dump({'config': eng_vectorization.get_config(),
             'weights': eng_vectorization.get_weights()},
             open("english_vocab.pkl", "wb"))

Then I load that pickle file as new_eng_vectorization:

from_disk = pickle.load(open("english_vocab.pkl", "rb"))

new_eng_vectorization = TextVectorization.from_config(from_disk['config'])
new_eng_vectorization.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_eng_vectorization.set_weights(from_disk['weights'])

I expected the original vectorization, eng_vectorization, and the newly loaded one, new_eng_vectorization, to behave the same, but they do not.

The output of the original vectorization, eng_vectorization(['Hello people']), is a Tensor:

<tf.Tensor: shape=(1, 20), dtype=int64, numpy=
array([[1800,  110,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0]])>

And the output of the pickled vectorization, new_eng_vectorization(['Hello people']), is a RaggedTensor:

<tf.RaggedTensor [[1800, 110]]>

Both eng_vectorization and new_eng_vectorization have the same config:

{'batch_input_shape': (None,),
 'dtype': 'string',
 'idf_weights': None,
 'max_tokens': 15000,
 'name': 'text_vectorization',
 'ngrams': None,
 'output_mode': 'int',
 'output_sequence_length': 20,
 'pad_to_max_tokens': False,
 'ragged': False,
 'sparse': False,
 'split': 'whitespace',
 'standardize': 'lower_and_strip_punctuation',
 'trainable': True,
 'vocabulary': None}

I think there is a problem with the way I saved the vectorization. How do I fix this? I am using it for deployment, which is why I need the pickled vectorization to behave like the original one.

Here is a Google Colab link to reproducible code - [CLICK HERE]


Solution 1:

The problem is related to a very recent bug in which output_mode is not applied correctly when its value comes from a saved configuration.

This works:

pickle.dump({'config': eng_vectorization.get_config(),
             'weights': eng_vectorization.get_weights()},
             open("english_vocab.pkl", "wb"))

from_disk = pickle.load(open("english_vocab.pkl", "rb"))

new_eng_vectorization = TextVectorization(max_tokens=from_disk['config']['max_tokens'],
                                          output_mode='int',
                                          output_sequence_length=from_disk['config']['output_sequence_length'])

new_eng_vectorization.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_eng_vectorization.set_weights(from_disk['weights'])
new_eng_vectorization(['Hello people'])
<tf.Tensor: shape=(1, 20), dtype=int64, numpy=
array([[1800,  110,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0]])>
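
The working snippet can also be wrapped in a pair of small helpers. This is only a sketch: save_vectorizer and load_vectorizer are hypothetical names, not part of TensorFlow, and the layer class is passed in as layer_cls so that the import matching your TensorFlow version stays in your own code. It assumes, like the snippet above, that only max_tokens and output_sequence_length need to be restored from the config, and that adapt accepts a plain Python list, as it does in the question.

```python
import pickle


def save_vectorizer(layer, path):
    """Pickle the layer's config and weights; the file is closed on exit."""
    payload = {"config": layer.get_config(), "weights": layer.get_weights()}
    with open(path, "wb") as f:
        pickle.dump(payload, f)


def load_vectorizer(layer_cls, path):
    """Rebuild a layer of type layer_cls (e.g. TextVectorization) from path."""
    with open(path, "rb") as f:
        from_disk = pickle.load(f)
    config = from_disk["config"]
    layer = layer_cls(
        max_tokens=config["max_tokens"],
        # Literal 'int' on purpose: passing the unpickled output_mode value
        # is what produced the RaggedTensor output.
        output_mode="int",
        output_sequence_length=config["output_sequence_length"],
    )
    # Dummy adapt call before restoring the weights, as in the snippet above.
    layer.adapt(["xyz"])
    layer.set_weights(from_disk["weights"])
    return layer
```

With these in place, the load step would read new_eng_vectorization = load_vectorizer(TextVectorization, "english_vocab.pkl"). Any other non-default settings (standardize, split, ngrams) would have to be passed through in the same way.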

This currently does not work correctly:

pickle.dump({'config': eng_vectorization.get_config(),
             'weights': eng_vectorization.get_weights()},
             open("english_vocab.pkl", "wb"))

from_disk = pickle.load(open("english_vocab.pkl", "rb"))
new_eng_vectorization = TextVectorization(max_tokens=from_disk['config']['max_tokens'],
                                          output_mode=from_disk['config']['output_mode'],
                                          output_sequence_length=from_disk['config']['output_sequence_length'])

new_eng_vectorization.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_eng_vectorization.set_weights(from_disk['weights'])
new_eng_vectorization(['Hello people'])
<tf.RaggedTensor [[1800, 110]]>

This happens even though 'int' and from_disk['config']['output_mode'] are equal and of the same data type. In any case, you can use the workaround above for now.
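
Since the two layers report identical configs, the config cannot be used to detect the problem before deployment. A check on the output shape is one option. The sketch below uses a hypothetical helper, is_padded, and assumes the layer output exposes a .shape whose last entry is the padded length for a dense Tensor and None for a RaggedTensor.

```python
def is_padded(output, sequence_length):
    """Return True if the last dimension of output.shape equals sequence_length."""
    last_dim = tuple(output.shape)[-1]
    return last_dim == sequence_length
```

For example, is_padded(new_eng_vectorization(['Hello people']), 20) should be True for the working version and False when a RaggedTensor comes back.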