What does keras.preprocessing.sequence.pad_sequences do?
The Keras documentation could be improved here. After reading through it, I still do not understand exactly what this function does: keras.preprocessing.sequence.pad_sequences
Could someone illuminate what this function does, and ideally provide an example?
pad_sequences is used to ensure that all sequences in a list have the same length. By default this is done by padding 0 at the beginning of each sequence until each sequence has the same length as the longest sequence.
For example:
>>> pad_sequences([[1, 2, 3], [3, 4, 5, 6], [7, 8]])
array([[0, 1, 2, 3],
       [3, 4, 5, 6],
       [0, 0, 7, 8]], dtype=int32)
[3, 4, 5, 6] is the longest sequence, so 0 is padded onto the front of the other sequences until their length matches it.
If you would rather pad at the end of the sequences, you can set padding='post'.
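To make the difference between the two padding modes concrete, here is a minimal pure-Python sketch. It is not the Keras implementation: pad_to_longest is a hypothetical helper written only for illustration, and it returns plain lists rather than a NumPy array.

```python
def pad_to_longest(sequences, padding="pre", value=0):
    """Hypothetical helper: pad every sequence with `value` up to the longest length."""
    if padding not in ("pre", "post"):
        raise ValueError("padding must be 'pre' or 'post'")
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (longest - len(seq))
        # 'pre' puts the fill in front of the data, 'post' puts it behind.
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded


a = [[1, 2, 3], [3, 4, 5, 6], [7, 8]]
pre_padded = pad_to_longest(a)                   # zeros in front (the default)
post_padded = pad_to_longest(a, padding="post")  # zeros at the end
print(pre_padded)
print(post_padded)
```

With padding="post", the rows become [1, 2, 3, 0], [3, 4, 5, 6] and [7, 8, 0, 0], which mirrors what pad_sequences(a, padding='post') produces.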
If you want to specify the maximum length of each sequence, you can use the maxlen argument. Sequences longer than maxlen are truncated, and shorter ones are padded up to maxlen.
>>> pad_sequences([[1, 2, 3], [3, 4, 5, 6], [7, 8]], maxlen=3)
array([[1, 2, 3],
       [4, 5, 6],
       [0, 7, 8]], dtype=int32)
Now each sequence has length 3 instead.
According to the documentation, you can control the truncation with the truncating argument of pad_sequences. By default truncating is set to 'pre', which removes elements from the beginning of a sequence. If you would rather remove elements from the end, set it to 'post'.
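The interaction between maxlen, truncating and padding can be sketched for a single sequence as follows. Again, this is an illustrative reimplementation in plain Python, not the Keras source; fit_to_length is a hypothetical name.

```python
def fit_to_length(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Hypothetical helper: truncate or pad one sequence to exactly `maxlen` items."""
    seq = list(seq)
    if len(seq) > maxlen:
        if truncating == "pre":
            # Drop elements from the front, keep the last `maxlen`.
            seq = seq[len(seq) - maxlen:]
        else:
            # Drop elements from the end, keep the first `maxlen`.
            seq = seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


a = [[1, 2, 3], [3, 4, 5, 6], [7, 8]]
truncated_pre = [fit_to_length(s, 3) for s in a]
truncated_post = [fit_to_length(s, 3, truncating="post") for s in a]
print(truncated_pre)
print(truncated_post)
```

Only the second sequence is longer than 3, so it is the only row that differs: truncating='pre' keeps [4, 5, 6], while truncating='post' keeps [3, 4, 5]. Padding and truncation are independent, so padding='post' can be combined with either truncating mode.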
Some examples:
>>> from keras.preprocessing.sequence import pad_sequences
>>> a = [[1, 2, 3], [3, 4, 5, 6], [7, 8]]
>>> # pad with 0s at the beginning of each sequence (the default)
>>> pad_sequences(a)
array([[0, 1, 2, 3],
       [3, 4, 5, 6],
       [0, 0, 7, 8]])
>>> # pad with 0s at the end of each sequence
>>> pad_sequences(a, padding="post")
array([[1, 2, 3, 0],
       [3, 4, 5, 6],
       [7, 8, 0, 0]])
>>> # limit the length to 3; longer sequences lose their leading elements
>>> pad_sequences(a, maxlen=3)
array([[1, 2, 3],
       [4, 5, 6],
       [0, 7, 8]])
>>> # limit the length to 3 and pad at the end (truncation still removes leading elements)
>>> pad_sequences(a, maxlen=3, padding="post")
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 0]])