Although I have already set `spark.sql.crossJoin.enabled` to `true` and `spark.sql.autoBroadcastJoinThreshold` to `-1`, I still got an error:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('my_spark')
    .config("spark.sql.crossJoin.enabled", "true")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)
```
```
org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 8 GB
```
This error comes from Spark's broadcast table size limit, which is hard-coded at 8 GB and is separate from `spark.sql.autoBroadcastJoinThreshold`. Setting that threshold to `-1` only stops Spark from broadcasting tables automatically; a join with an explicit `broadcast()` hint is still broadcast, and still hits the 8 GB cap.
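For context, this is the kind of pattern that can still hit the hard limit even with the threshold disabled. This is a hypothetical sketch, not the original code; the paths, table names, and the `id` join key are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName('my_spark').getOrCreate()

big_df = spark.read.parquet("/data/big_table")      # hypothetical, several GB
small_df = spark.read.parquet("/data/small_table")  # hypothetical

# An explicit broadcast() hint is honored regardless of
# spark.sql.autoBroadcastJoinThreshold, so Spark still tries to
# broadcast big_df and fails once the serialized table exceeds
# the hard-coded 8 GB cap.
joined = broadcast(big_df).join(small_df, on="id")
```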
There is no specific code or config we can set to raise this limit (at least I didn't find one). What we can do is optimize our code so the table being broadcast stays small, or is not broadcast at all. Here are some of my ideas:
- Use `selectExpr(cols)` to choose only the columns we actually need before the `join`, to reduce the dataframe's size.
- Use `filter(expr)` to filter out the data we don't need.
- Use a normal `df1.join(df2)` instead of an explicit `broadcast(df1).join(df2)`, so Spark can fall back to a shuffle-based join. A combined sketch is shown after this list.
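As a rough sketch of how these ideas fit together (the table paths, column names, and filter condition below are hypothetical placeholders, not from the original code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_spark').getOrCreate()

# Hypothetical inputs; substitute your own tables and columns.
df1 = spark.read.parquet("/data/big_table")    # e.g. 7 columns
df2 = spark.read.parquet("/data/other_table")

# 1. Keep only the columns the downstream logic actually needs.
df1_small = df1.selectExpr("id", "col_a", "col_b")

# 2. Drop the rows we don't need before the join.
df1_small = df1_small.filter("col_a IS NOT NULL")

# 3. Plain join with no broadcast() hint, so Spark is free to pick
#    a shuffle-based join instead of broadcasting a huge table.
joined = df1_small.join(df2, on="id", how="inner")
```

You can check which strategy Spark actually chose with `joined.explain()`; a `SortMergeJoin` (or `ShuffledHashJoin`) node in the plan means nothing is being broadcast.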
In my case, selecting fewer columns from my dataframe (going from 7 columns down to 3) was enough to solve the problem.