Where to find spark log in dataproc when running job on cluster mode
I am running the following code as job in dataproc. I could not find logs in console while running in 'cluster' mode.
import sys
import time
from datetime import datetime
from pyspark.sql import SparkSession
start_time = datetime.utcnow()
spark = SparkSession.builder.appName("check_confs").getOrCreate()
all_conf = spark.sparkContext.getConf().getAll()
print("\n\n=====\nExecuting at {}".format(datetime.utcnow()))
print(all_conf)
print("\n\n======================\n\n\n")
incoming_args = sys.argv
if len(incoming_args) > 1:
sleep_time = int(incoming_args[1])
print("Sleep time is {} seconds".format(sleep_time))
if sleep_time > 0:
time.sleep(sleep_time)
end_time = datetime.utcnow()
time_taken = (end_time - start_time).total_seconds()
print("Script execution completed in {} seconds".format(time_taken))
If I trigger the job using the deployMode
as cluster
property, I could not see corresponding logs.
But if the job is triggered in default mode which is client
mode, able to see the respective logs.
I have given the dictionary used for triggering the job.
"spark.submit.deployMode": "cluster"
{
'placement': {
'cluster_name': dataproc_cluster
},
'pyspark_job': {
'main_python_file_uri': "gs://" + compute_storage + "/" + job_file,
'args': trigger_params,
"properties": {
"spark.submit.deployMode": "cluster",
"spark.executor.memory": "3155m",
"spark.scheduler.mode": "FAIR",
}
}
}
21/12/07 19:11:27 INFO org.sparkproject.jetty.util.log: Logging initialized @3350ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.Server: Started @3467ms
21/12/07 19:11:27 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:40389}
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:11:28 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:10200
21/12/07 19:11:29 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:11:29 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:11:30 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0014
21/12/07 19:11:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8030
21/12/07 19:11:33 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
=====
Executing at 2021-12-07 19:11:35.100277
[....... ('spark.yarn.historyServer.address', '****-m:18080'), ('spark.ui.proxyBase', '/proxy/application_1638554180947_0014'), ('spark.driver.appUIAddress', 'http://***-m.c.***-123456.internal:40389'), ('spark.sql.cbo.enabled', 'true')]
======================
Sleep time is 1 seconds
Script execution completed in 9.411261 seconds
21/12/07 19:11:36 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@18528bea{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
Logs not coming in console while running in client mode
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ******-m/0.0.0.5:8032
21/12/07 19:09:04 INFO org.apache.hadoop.yarn.client.AHSProxy: Connecting to Application History server at ******-m/0.0.0.5:8032
21/12/07 19:09:05 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/07 19:09:05 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/07 19:09:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1638554180947_0013
Solution 1:
We can access the logs using query in Logs explorer in google cloud.
resource.type="cloud_dataproc_cluster" resource.labels.cluster_name="my_cluster_name"
resource.labels.cluster_uuid="aaaaa-123435-bbbbbb-ccccc"
severity=DEFAULT
jsonPayload.container_logname="stdout"
jsonPayload.message!=""
log_name="projects/my-project_id/logs/yarn-userlogs"
Solution 2:
When running jobs in cluster mode, the driver logs are in the Cloud Logging yarn-userlogs
. See the doc:
By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained, below. However, if the user creates the Dataproc cluster by setting cluster properties to
--properties spark:spark.submit.deployMode=cluster
or submits the job in cluster mode by setting job properties to--properties spark.submit.deployMode=cluster
, driver output is listed in YARN userlogs, which can be accessed in Logging.