How to reduce the execution time for translation using mBART-50 and Hugging Face?

I was able to find two solutions to optimize the process, plus one approach that did not work. A summary is below.

  • 1. Executing it on a GPU, if available:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt").to(device)
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="hi_IN")

text = "मैं ठीक हूँ।"  # "I am fine."
# The input tensors must live on the same device as the model,
# so move them before calling generate(), not after
model_inputs = tokenizer(text, return_tensors="pt").to(device)
generated_tokens = model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])

translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)
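
To confirm the GPU actually helps, a rough timing comparison can be run. This is a minimal sketch assuming the model and model_inputs from above are already loaded; time_translation and n_runs are names I made up, and torch.cuda.synchronize() is needed because CUDA kernels run asynchronously:

import time

def time_translation(n_runs=5):
    # Warm-up run so one-time initialization costs are not measured
    model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**model_inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

print(f"avg seconds per translation: {time_translation():.3f}")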
  • 2. Limiting the number of tokens to be translated. Note that max_length counts tokens after tokenization, not words, and mbart-large-50 cannot handle sequences beyond its 1024-token position limit anyway:

model_inputs = tokenizer(text, return_tensors="pt", max_length=500, truncation=True)
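
To see the cap in action, here is a quick check (a minimal sketch; the repeated sentence is just an artificially long placeholder input):

long_text = "मैं ठीक हूँ। " * 400  # artificially long input
capped = tokenizer(long_text, return_tensors="pt", max_length=500, truncation=True)
print(capped["input_ids"].shape)  # the sequence dimension is capped at 500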
  • 3. Multiprocessing failed in this case. Since the pretrained model is a neural network that already processes its inputs in batches, every multiprocessing attempt I tried failed; batching multiple sentences into a single generate() call is the better way to parallelize, as shown in the sketch below.
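
Building on point 3, here is a minimal sketch of batched translation instead of multiprocessing. It assumes the model, tokenizer, and device from point 1 are already loaded; the sentence list is placeholder data, and padding=True is needed so inputs of different lengths fit in one tensor:

sentences = [
    "मैं ठीक हूँ।",   # "I am fine."
    "आप कैसे हैं?",   # "How are you?"
]
# Tokenize the whole list at once; padding aligns the sentences into a single batch tensor
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True).to(device)
generated = model.generate(**batch, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))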