How to use Bert for long text classification?

Solution 1:

You have basically three options:

  1. You cut the longer texts off and only use the first 512 Tokens. The original BERT implementation (and probably the others as well) truncates longer sequences automatically. For most cases, this option is sufficient.
  2. You can split your text in multiple subtexts, classifier each of them and combine the results back together ( choose the class which was predicted for most of the subtexts for example). This option is obviously more expensive.
  3. You can even feed the output token for each subtext (as in option 2) to another network (but you won't be able to fine-tune) as described in this discussion.

I would suggest to try option 1, and only if this is not good enough to consider the other options.

Solution 2:

This paper compared a few different strategies: How to Fine-Tune BERT for Text Classification?. On the IMDb movie review dataset, they actually found that cutting out the middle of the text (rather than truncating the beginning or the end) worked best! It even outperformed more complex "hierarchical" approaches involving breaking the article into chunks and then recombining the results.

As another anecdote, I applied BERT to the Wikipedia Personal Attacks dataset here, and found that simple truncation worked well enough that I wasn't motivated to try other approaches :)