Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications ›› 2020, Vol. 43 ›› Issue (2): 116-121. doi: 10.13190/j.jbupt.2019-092

• REPORTS •

A Shuffle Partition Optimization Scheme Based on Data Skew Model in Spark

YAN Yi-fei, WANG Zhi-li, QIU Xue-song, WANG Jia-lu   

  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2019-05-28 Published: 2020-04-28

Abstract: To address the uneven data distribution that arises during the shuffle phase on the Spark distributed platform, the reasons for Spark's low efficiency in processing skewed data are analyzed, and a skew model that uniformly quantifies the skew degree of key-value data after shuffle is proposed. Based on this skew model, a shuffle partitioning scheme that can handle various data skew problems on the Spark platform is proposed. First, the output data of the Map stage is sampled and the size of the intermediate data is predicted. Then the sampled data is pre-partitioned using the Hash-based best fit algorithm. Finally, all the intermediate data is partitioned according to the pre-partition result. Experimental results under both key skew and value skew show that the scheme is general and efficient, and that it handles both kinds of skew effectively.
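The abstract does not give the skew model or the details of the Hash-based best fit algorithm, so the following Python sketch only illustrates the general idea of the three steps (sample, pre-partition, partition) under assumptions of our own. The skew measure is assumed to be the coefficient of variation of partition loads. The pre-partitioning step is assumed to be a classic best-fit placement of sampled keys by descending weight. Keys absent from the sample are assumed to fall back to plain hash partitioning. All function names are hypothetical and are not taken from the paper or from the Spark API.

```python
from collections import Counter
from statistics import mean, pstdev
import zlib


def skew_degree(loads):
    """Assumed skew measure: coefficient of variation of partition loads.

    Returns 0.0 when every partition carries the same load.
    """
    avg = mean(loads)
    return pstdev(loads) / avg if avg else 0.0


def hash_partition(key, num_partitions):
    """Plain hash partitioning with a stable hash (reproducible across runs)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def best_fit_prepartition(sample, num_partitions):
    """Pre-partition sampled (key, value_size) pairs with a best-fit heuristic.

    Returns (assignment, loads), where assignment maps key -> partition index
    and loads holds the sampled weight placed in each partition.
    """
    # Aggregate the sampled weight per key; this covers both key skew
    # (many records per key) and value skew (large records per key).
    weights = Counter()
    for key, size in sample:
        weights[key] += size

    capacity = sum(weights.values()) / num_partitions  # ideal load per partition
    loads = [0.0] * num_partitions
    assignment = {}

    # Place heavy keys first; ties are broken by key for determinism.
    for key, w in sorted(weights.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        fitting = [p for p in range(num_partitions) if loads[p] + w <= capacity]
        if fitting:
            # Best fit: the partition left with the least spare capacity.
            target = min(fitting, key=lambda p: capacity - loads[p] - w)
        else:
            # Nothing fits (oversized key): use the least-loaded partition.
            target = min(range(num_partitions), key=lambda p: loads[p])
        assignment[key] = target
        loads[target] += w
    return assignment, loads


def partition_for(key, assignment, num_partitions):
    """Final partitioning: follow the pre-partition plan, else fall back to hash."""
    return assignment.get(key, hash_partition(key, num_partitions))
```

In this sketch a single key heavier than the ideal per-partition load cannot be split, so it still dominates one partition. How the paper treats such keys is not stated in the abstract.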

Key words: data skew, Spark, shuffle, partitioning algorithm, load balancing
