Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion
Class ProdsTransformer:
def __init__(self):
self.products_lookup_hmap = {}
self.broadcast_products_lookup_map = None
def create_broadcast_variables(self):
self.broadcast_products_lookup_map = sc.broadcast(self.products_lookup_hmap)
def create_lookup_maps(self):
// The code here builds the hashmap that maps Prod_ID to another space.
pt = ProdsTransformer ()
pt.create_broadcast_variables()
pairs = distinct_users_projected.map(lambda x: (x.user_id,
pt.broadcast_products_lookup_map.value[x.Prod_ID]))
I get the following error:
"Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."
Any help with how to deal with the broadcast variables will be great!
Solution 1:
By referencing the object containing your broadcast variable in your map
lambda, Spark will attempt to serialize the whole object and ship it to workers. Since the object contains a reference to the SparkContext, you get the error. Instead of this:
pairs = distinct_users_projected.map(lambda x: (x.user_id, pt.broadcast_products_lookup_map.value[x.Prod_ID]))
Try this:
bcast = pt.broadcast_products_lookup_map
pairs = distinct_users_projected.map(lambda x: (x.user_id, bcast.value[x.Prod_ID]))
The latter avoids the reference to the object (pt
) so that Spark only needs to ship the broadcast variable.