What does Keras.io.preprocessing.sequence.pad_sequences do?

The Keras documentation could be improved here. After reading through this, I still do not understand what this does exactly: Keras.io.preprocessing.sequence.pad_sequences

Could someone illuminate what this function does, and ideally provide an example?


pad_sequences is used to ensure that all sequences in a list have the same length. By default this is done by padding 0 in the beginning of each sequence until each sequence has the same length as the longest sequence.

For example

>>> pad_sequences([[1, 2, 3], [3, 4, 5, 6], [7, 8]])
array([[0, 1, 2, 3],
       [3, 4, 5, 6],
       [0, 0, 7, 8]], dtype=int32)

[3, 4, 5, 6] is the longest sequence, so 0 will be padded to the other sequences so their length matches [3, 4, 5, 6].

If you rather want to pad to the end of the sequences you can set padding='post'.

If you want to specify the maximum length of each sequence you can use the maxlen argument. This will truncate all sequences longer than maxlen.

>>> pad_sequences([[1, 2, 3], [3, 4, 5, 6], [7, 8]], maxlen=3)
array([[1, 2, 3],
       [4, 5, 6],
       [0, 7, 8]], dtype=int32)

Now each sequence have the length 3 instead.

According to the documentation one can control the truncation with the pad_sequences. By default truncating is set to pre, which truncates the beginning part of the sequence. If you rather want to truncate the end part of the sequence you can set it to post.


some examples:

>>> from keras.preprocessing.sequence import pad_sequences
>>> a = [[1, 2, 3], [3, 4, 5, 6], [7, 8]]

>>> # add the 0's on the beginning of sequences
>>> pad_sequences(a)
array([[0, 1, 2, 3],
       [3, 4, 5, 6],
       [0, 0, 7, 8]])

>>> # add the 0's on the end of sequences
>>> pad_sequences(a, padding="post")
array([[1, 2, 3, 0],
       [3, 4, 5, 6],
       [7, 8, 0, 0]])

>>> # add a limit length of sequences
>>> pad_sequences(a, maxlen=3)
array([[1, 2, 3],
       [4, 5, 6],
       [0, 7, 8]])


>>> # add a limit length on the end of sequences
>>> pad_sequences(a, maxlen=3, padding="post")
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 0]])