Get Dataflow job status using Java/Scala code
I am trying to programmatically find out the status of a Dataflow job while it is running. That is, I would like to poll a batch or streaming job and, once it has completed, trigger the next job. Is there a Java/Scala API that I can use to accomplish this? I have already tried using com.google.api.services.dataflow.Dataflow$Projects$Locations$Jobs and am able to retrieve the JobMetrics, but I am trying to get more information, such as the job's input and output metrics, to determine whether a streaming job has processed all data or not.
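For reference, this is roughly how I am building the client and pulling the metrics today (a minimal sketch assuming application-default credentials; the project, region, and job IDs are placeholders):

```java
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import java.util.Collections;

public class DataflowPoller {
    public static void main(String[] args) throws Exception {
        // Build the Dataflow client with application-default credentials.
        Dataflow dataflow = new Dataflow.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                GsonFactory.getDefaultInstance(),
                new HttpCredentialsAdapter(GoogleCredentials.getApplicationDefault()
                    .createScoped(Collections.singletonList(
                        "https://www.googleapis.com/auth/cloud-platform"))))
            .setApplicationName("dataflow-status-poller")
            .build();

        // Placeholders: replace with your own project, region, and job ID.
        JobMetrics metrics = dataflow.projects().locations().jobs()
            .getMetrics("my-project", "us-central1", "my-job-id")
            .execute();
        System.out.println(metrics.getMetrics().size() + " metrics returned");
    }
}
```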
[EDITED - Based on follow-up question] These are the signals I am trying to get from the Dataflow jobs while they are running:

- For batch jobs: only whether the job is running, completed, or failed. Based on these, I can decide whether to wait, trigger the next job, or not proceed with the next job, respectively.
- For streaming jobs: since I want to decide whether the job has processed all necessary data, maybe I can try to get the following information:
  1. Data freshness
  2. Number of elements ingested/read (input metrics)
  3. Elements written (output metrics)
When checking streaming jobs manually, we conclude that a job is "done" when we see that its throughput has dropped to zero and hasn't changed for a few minutes. I would like to apply the same rule when automating this, as sketched below.
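Roughly, I want to automate that manual check with something like the helper below (a sketch only; the LongSupplier would read the "elements written" figure from JobMetrics, and the quiet window and poll interval are arbitrary):

```java
import java.util.function.LongSupplier;

public class QuiescenceCheck {
    /** Blocks until the supplied counter has not changed for quietMillis. */
    static void awaitQuiescence(LongSupplier elementsWritten,
                                long quietMillis, long pollMillis)
            throws InterruptedException {
        long lastCount = elementsWritten.getAsLong();
        long unchangedSince = System.currentTimeMillis();
        while (System.currentTimeMillis() - unchangedSince < quietMillis) {
            Thread.sleep(pollMillis);
            long count = elementsWritten.getAsLong();
            if (count != lastCount) { // throughput is not yet zero
                lastCount = count;
                unchangedSince = System.currentTimeMillis();
            }
        }
        // No new elements for quietMillis: treat the streaming job as "done".
    }
}
```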
Is there anything available that can help me?
Apologies, I'm Java illiterate, but as regards the details you need, I can point you to the Dataflow API endpoints that provide them. My examples are done by sending HTTP requests to the Dataflow API using curl.
- Job status (running, completed, failed)
  - Using the Jobs endpoint, you can submit a request and it should return the currentState. See currentState for all the states available for a job; these include JOB_STATE_DONE, JOB_STATE_RUNNING, JOB_STATE_FAILED, etc.
  - This can be used for both batch and streaming jobs.
  - In Java, if I'm not mistaken, you can use the Job class and its getCurrentState method.
  - Example: see the sketch below.
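I can't verify the Java myself, but based on the generated client it should look something like this (a sketch assuming a Dataflow client built as in the question; project, region, and job ID are placeholders):

```java
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.Job;
import java.io.IOException;

public class JobStatus {
    /** Returns true once the job has reached a terminal state. */
    static boolean isFinished(Dataflow dataflow, String projectId,
                              String region, String jobId) throws IOException {
        Job job = dataflow.projects().locations().jobs()
            .get(projectId, region, jobId)
            .execute();
        String state = job.getCurrentState(); // e.g. "JOB_STATE_RUNNING"
        System.out.println("Job " + jobId + " is in state " + state);
        return "JOB_STATE_DONE".equals(state)
            || "JOB_STATE_FAILED".equals(state)
            || "JOB_STATE_CANCELLED".equals(state);
    }
}
```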
- Elements ingested / written
  - You mentioned that you already retrieved JobMetrics.
  - The JobMetrics endpoint returns a list of MetricUpdate objects, each carrying a MetricStructuredName that identifies the source (the pipeline step) that generated the metric. In that list you can find the metrics shown under the job's "Steps" in the UI.
  - For example, if you have a ReadFromPubSub step, you can find the number of messages that it read. You just need to filter the data properly to see the details you need.
  - In Java, this should come from the JobMetrics class and its getMetrics method.
  - Example: scalar contains the "Elements Added" value (the numbers won't match the UI exactly because the job is continuously streaming); see the sketch below.
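Again untested Java on my part, but filtering the returned metrics would look roughly like this (a sketch; "ElementCount" is the metric name for per-step element counts in my tests, so check what your own job actually returns):

```java
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.JobMetrics;
import com.google.api.services.dataflow.model.MetricUpdate;
import java.io.IOException;

public class ElementCounts {
    /** Prints the "Elements Added" style counters for every step output. */
    static void printElementCounts(Dataflow dataflow, String projectId,
                                   String region, String jobId) throws IOException {
        JobMetrics jobMetrics = dataflow.projects().locations().jobs()
            .getMetrics(projectId, region, jobId)
            .execute();
        for (MetricUpdate metric : jobMetrics.getMetrics()) {
            // getName() is the MetricStructuredName; its context map
            // identifies the step/output that produced the metric.
            if ("ElementCount".equals(metric.getName().getName())) {
                System.out.println(metric.getName().getContext()
                    + " -> " + metric.getScalar()); // scalar = elements added
            }
        }
    }
}
```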
- Data freshness
  - Data freshness cannot be pulled using the Dataflow API.
  - The metric is produced by Cloud Monitoring, so if you need this data you will have to use the Monitoring API; see its sample code for how to call it.
  - To view the metric, go to GCP Console > Monitoring > Dashboards > GCP > Dataflow. You should see a list of the Dataflow jobs and their metrics.
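For completeness, reading data freshness through the Monitoring API client would look roughly like this (a sketch; I am assuming the dataflow.googleapis.com/job/data_watermark_age metric type, the job_id resource label, and an INT64 value in seconds, so verify all three in Metrics Explorer first):

```java
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class DataFreshness {
    /** Prints recent watermark-age samples for one Dataflow job. */
    static void printDataFreshness(String projectId, String jobId) throws Exception {
        try (MetricServiceClient client = MetricServiceClient.create()) {
            long now = System.currentTimeMillis();
            TimeInterval interval = TimeInterval.newBuilder()
                .setStartTime(Timestamps.fromMillis(now - 10 * 60 * 1000)) // last 10 min
                .setEndTime(Timestamps.fromMillis(now))
                .build();
            ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
                .setName(ProjectName.of(projectId).toString())
                // Assumed metric type and label; verify in Metrics Explorer.
                .setFilter("metric.type=\"dataflow.googleapis.com/job/data_watermark_age\""
                    + " AND resource.labels.job_id=\"" + jobId + "\"")
                .setInterval(interval)
                .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
                .build();
            for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
                series.getPointsList().forEach(point ->
                    System.out.println(point.getInterval().getEndTime()
                        + " watermark age (s): " + point.getValue().getInt64Value()));
            }
        }
    }
}
```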
I hope this information points you in the right direction and that you'll be able to implement it in Java.