
Spark Performance Tuning Summary
2019, Aug 28
- Configuration tuning
spark-submit
    # Override a jar that is already on the Spark cluster's classpath
    # (see --packages and the extraClassPath settings below).
    spark-submit \
      --conf spark.master=yarn \
      --conf spark.submit.deployMode=cluster \
      --conf spark.yarn.maxAppAttempts=1 \
      --conf spark.yarn.queue=${YARN_QUEUE} \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.maxExecutors=20 \
      --conf spark.dynamicAllocation.minExecutors=1 \
      --conf spark.dynamicAllocation.initialExecutors=1 \
      --conf spark.task.cpus=1 \
      --conf spark.executor.cores=1 \
      --conf spark.driver.memory=1g \
      --conf spark.yarn.tags=${USER_TAGS} \
      --conf spark.files.maxPartitionBytes=128m \
      --conf spark.sql.files.maxPartitionBytes=128m \
      --conf spark.sql.files.minPartitionNum=1 \
      --conf spark.executor.instances=5 \
      --conf spark.executor.memory=4g \
      --conf spark.executor.memoryOverhead=1g \
      --conf spark.sql.files.mergeSmallFile.enabled=true \
      --conf spark.sql.files.mergeSmallFile.maxBytes=268435456 \
      --packages com.google.protobuf:protobuf-java:3.6.0 \
      --conf spark.driver.extraClassPath=com.google.protobuf_protobuf-java-3.6.0.jar \
      --conf spark.executor.extraClassPath=com.google.protobuf_protobuf-java-3.6.0.jar \
      --class ${MAIN_CLASS} \
      ${SPARK_APP_JAR}
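To get a feel for what the memory settings in this command ask of YARN, here is a rough sketch. It assumes each executor container is sized as `spark.executor.memory` plus `spark.executor.memoryOverhead`, and that dynamic allocation may grow to `maxExecutors`. The helper names `parse_size_mb` and `executor_footprint` are hypothetical and not part of Spark. The driver is not counted, and YARN may round each container up to its own allocation increment.

```python
import re

# MiB per unit for the two suffixes used in the command above.
_UNITS_MB = {"m": 1, "g": 1024}


def parse_size_mb(size: str) -> int:
    """Convert a size string such as '4g' or '512m' to MiB (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([mg])", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size string: {size!r}")
    value, unit = match.groups()
    return int(value) * _UNITS_MB[unit]


def executor_footprint(executor_memory: str, memory_overhead: str, max_executors: int) -> dict:
    """Estimate per-container and peak executor memory, in MiB.

    Assumption: container size = executor memory + memory overhead.
    """
    container_mb = parse_size_mb(executor_memory) + parse_size_mb(memory_overhead)
    return {
        "container_mb": container_mb,
        "peak_total_mb": container_mb * max_executors,
    }


# Values taken from the spark-submit command: 4g heap, 1g overhead, up to 20 executors.
footprint = executor_footprint("4g", "1g", 20)
print(footprint)
```

With these settings each executor requests roughly 5 GiB, so the queue needs about 100 GiB of headroom for the job to reach its maximum executor count.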
- Code tuning
- JVM tuning
- Command tips
- sync_partition_metadata
CALL system.sync_partition_metadata(schema_name => 'data_base', table_name => 'table_name', mode => 'DROP', case_sensitive => false)
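The call above can also be generated from a script. The sketch below assumes this is the Hive connector procedure found in Presto/Trino, whose `mode` argument accepts `ADD` (register partitions found on the file system but missing from the metastore), `DROP` (remove metastore partitions whose directories no longer exist), or `FULL` (both). The function name `sync_partition_metadata_sql` is hypothetical, and no SQL escaping is done, so it should only be fed trusted identifiers.

```python
VALID_MODES = {"ADD", "DROP", "FULL"}


def sync_partition_metadata_sql(schema_name: str, table_name: str,
                                mode: str, case_sensitive: bool = False) -> str:
    """Build the CALL statement text; it does not execute anything."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return (
        "CALL system.sync_partition_metadata("
        f"schema_name => '{schema_name}', "
        f"table_name => '{table_name}', "
        f"mode => '{mode}', "
        f"case_sensitive => {str(case_sensitive).lower()})"
    )


statement = sync_partition_metadata_sql("data_base", "table_name", "DROP")
print(statement)
```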
- pyspark related
- virtual env
    # Create a virtual env under /path/to/your/pyspark_venv
    python3 -m venv /path/to/your/pyspark_venv
    # Activate the virtual env
    source /path/to/your/pyspark_venv/bin/activate
    # Install the libraries the job needs
    pip3 install pyarrow pandas venv-pack
    # Package the venv
    venv-pack -o pyspark_venv.tar.gz
- pyspark submit
    # cluster mode (spark-submit)
    spark-submit --queue {YOUR_QUEUE} \
      --deploy-mode cluster \
      --archives "/path/to/your/pyspark_venv.tar.gz#environment" \
      --conf spark.pyspark.python=./environment/bin/python3 \
      user_job.py

    # client mode (spark-submit)
    spark-submit --queue {YOUR_QUEUE} \
      --deploy-mode client \
      --archives "/path/to/your/pyspark_venv.tar.gz#environment" \
      --conf spark.pyspark.python=./environment/bin/python3 \
      --conf spark.pyspark.driver.python=/path/to/your/pyspark_venv/bin/python3 \
      user_job.py

    # client mode (interactive pyspark shell; scripts go through spark-submit)
    pyspark --queue {YOUR_QUEUE} \
      --conf spark.pyspark.driver.python=/path/to/your/pyspark_venv/bin/python3 \
      --conf spark.pyspark.python=./environment/bin/python3 \
      --archives "/path/to/your/pyspark_venv.tar.gz#environment"
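The setting `spark.pyspark.python=./environment/bin/python3` only resolves if the archive has `bin/python3` at its root, since `#environment` is the directory name the archive is unpacked under. A minimal pre-submit check could look like the following sketch. The function name `archive_has_interpreter` is hypothetical, and the check assumes a gzip-compressed tarball such as the one written by `venv-pack -o pyspark_venv.tar.gz`.

```python
import tarfile


def archive_has_interpreter(archive_path: str, interpreter: str = "bin/python3") -> bool:
    """Return True if the packed venv lists the interpreter at the archive root."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Normalize a leading "./" so both "bin/python3" and "./bin/python3" match.
        names = {
            name[2:] if name.startswith("./") else name
            for name in archive.getnames()
        }
    return interpreter in names
```

Calling `archive_has_interpreter("/path/to/your/pyspark_venv.tar.gz")` before `spark-submit` catches the common mistake of packing the venv's parent directory, in which case the interpreter would sit one level deeper than the executors expect.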