Adapt a numerical TensorFlow dataset as a textvector

If I understood you correctly, you can use your existing dataset with the TextVectorization layer like this:

import tensorflow as tf

# `ds` is your existing tf.data.Dataset yielding (input, target) pairs.

input_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20,
    output_mode="int",
    output_sequence_length=6,
)
target_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20,
    output_mode="int",
    output_sequence_length=6 + 1,
)

# Keep only the inputs and flatten each element to 1-D
inputs = ds.map(lambda x, y: tf.reshape(x, [-1]))

# Keep only the targets and flatten each element to 1-D
targets = ds.map(lambda x, y: tf.reshape(y, [-1]))

# Build one vocabulary per layer from the flattened data
input_vectorization.adapt(inputs)
target_vectorization.adapt(targets)
print(input_vectorization.get_vocabulary())
print(target_vectorization.get_vocabulary())

['', '[UNK]', '7', '6', '5', '4', '8', '3', '9', '2', '10', '1']
['', '[UNK]', '9', '8', '7', '6', '11', '10', '5', '12']
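
To make the meaning of those vocabulary lists concrete, here is a minimal, self-contained sketch of what an adapted layer does when you call it. The toy strings are made up for illustration and are not your data; the layer settings mirror the ones above.

```python
import tensorflow as tf

# Toy corpus (hypothetical): "7" appears 3 times, "6" twice, "5" once.
toy = tf.data.Dataset.from_tensor_slices(["7 7 7 6 6 5"]).batch(1)

layer = tf.keras.layers.TextVectorization(
    max_tokens=20,
    output_mode="int",
    output_sequence_length=6,
)
layer.adapt(toy)

# Index 0 is the padding token '', index 1 is '[UNK]';
# the remaining tokens are ordered by descending frequency.
vocab = layer.get_vocabulary()
print(vocab)  # ['', '[UNK]', '7', '6', '5']

# Each token is replaced by its vocabulary index. "42" was never seen,
# so it maps to 1 ([UNK]); the row is padded with 0 up to length 6.
ids = layer(tf.constant(["7 6 5 42"])).numpy()
print(ids)  # [[2 3 4 1 0 0]]
```

So the integers the layer produces are positions in the vocabulary, not the numeric values of the original tokens.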

Note that adapt simply builds a vocabulary from the inputs and maps each token in that vocabulary to a unique integer. Also, because the TextVectorization layer defaults to standardize='lower_and_strip_punctuation', minus signs are stripped from the tokens when you call adapt (and again whenever the layer is applied). If you want to keep them, set, for example, standardize='lower'.
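
A small sketch of the minus-sign behaviour, using made-up strings rather than your dataset. It compares the default standardization with standardize=None, which turns standardization off entirely; for purely numeric tokens that should give the same vocabulary as standardize='lower', since there is nothing to lowercase.

```python
import tensorflow as tf

# Toy data (hypothetical): "-1" appears 3 times, "2" twice, "-3" once.
toy = tf.data.Dataset.from_tensor_slices(["-1 -1 -1 2 2", "-3"]).batch(2)

# Default: standardize='lower_and_strip_punctuation' removes the '-'.
default_layer = tf.keras.layers.TextVectorization(output_mode="int")
default_layer.adapt(toy)
vocab_default = default_layer.get_vocabulary()
print(vocab_default)  # ['', '[UNK]', '1', '2', '3']

# No standardization: tokens are kept exactly as written.
raw_layer = tf.keras.layers.TextVectorization(
    standardize=None,
    output_mode="int",
)
raw_layer.adapt(toy)
vocab_raw = raw_layer.get_vocabulary()
print(vocab_raw)  # ['', '[UNK]', '-1', '2', '-3']
```

With the default setting, "-1" and "1" would collapse into the same token, which is usually not what you want for signed numbers.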
