Apache Beam Performance: Python vs. Java Running on GCP Dataflow

Solution 1:

Yes, a performance gap of this size between Python and Java is normal. In fact, for many programs the gap can be 10x or much more.

The details of the program can radically change the relative performance, so it is worth profiling the job to see where the time actually goes. Here are some resources to consider:

  • Profiling the Dataflow job (official docs)
  • Profiling a Dataflow pipeline (Medium blog)
  • Profiling Apache Beam Python pipelines (another Medium blog)
  • Profiling Python (general Cloud Profiler docs)
  • How can I profile a Python Dataflow job? (previous Stack Overflow question on profiling a Python job)
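Before profiling on Dataflow itself, it can help to profile the per-element logic locally on a small sample. The following is a minimal sketch using Python's standard `cProfile` and `pstats` modules. `parse_record` and `profile_sample` are hypothetical names standing in for the body of your own DoFn, and this does not replace the Dataflow-side profiling described in the links above.

```python
import cProfile
import io
import pstats


def parse_record(line):
    # Hypothetical per-element function, standing in for the body of a DoFn.
    fields = line.split(",")
    return sum(int(f) for f in fields)


def profile_sample(lines, top=5):
    """Run parse_record over sample input under cProfile and return (results, report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    results = [parse_record(line) for line in lines]
    profiler.disable()

    # Render the statistics into a string, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return results, buffer.getvalue()


sample = ["1,2,3", "4,5,6"] * 1000
results, report = profile_sample(sample)
print(report)
```

The report lists the functions with the highest cumulative time, which usually points at the part of the per-element code worth optimizing first.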

If you prefer Python for its concise syntax or library ecosystem, the way to get speed is to push the core processing into optimized C libraries or Cython, for example by using pandas or NumPy, so that the hot loop does not run in the Python interpreter. If you use Beam's new pandas-compatible DataFrame API, you get this benefit automatically.
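As a rough illustration of moving the hot loop into compiled code, the sketch below computes the same result twice: once with a pure-Python loop and once with NumPy over a whole batch. The function names and the arithmetic are made up for the example. In a Beam pipeline, one way to obtain such batches is to group elements first (for instance with `beam.BatchElements`) and then call the vectorized function once per batch; that wiring is assumed here and not shown.

```python
import numpy as np


def score_per_element(values):
    # Pure-Python loop: the interpreter executes the arithmetic for each element.
    return [v * v + 1.0 for v in values]


def score_batch(values):
    # Same arithmetic, but run inside NumPy's compiled loops over the whole batch.
    arr = np.asarray(values, dtype=np.float64)
    return arr * arr + 1.0


batch = [float(i) for i in range(1000)]
slow = score_per_element(batch)
fast = score_batch(batch)

# Both paths produce the same numbers; only where the loop runs differs.
print(np.allclose(slow, fast))
```

How much this helps depends on the batch size and on how much of the work can actually be expressed as array operations, so it is worth measuring on your own data.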