Journal of Beijing University of Posts and Telecommunications

  • EI core journal

Journal of Beijing University of Posts and Telecommunications ›› 2020, Vol. 43 ›› Issue (2): 116-121. doi: 10.13190/j.jbupt.2019-092

• Research Report •

A Shuffle Partition Optimization Scheme Based on Data Skew Model in Spark

YAN Yi-fei, WANG Zhi-li, QIU Xue-song, WANG Jia-lu

  1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2019-05-28 Published: 2020-04-28
  • Corresponding author: WANG Zhi-li (1975-), male, associate professor, E-mail: zlwang@bupt.edu.cn
  • About the author: YAN Yi-fei (1993-), male, master's student.

Abstract: To address the uneven distribution of data across partitions during the shuffle phase on the Spark distributed platform, the causes of data skew in Spark are first analyzed, and a skew model is established that uniformly quantifies the degree of skew of key-value data after shuffle. Based on this skew model, a shuffle partitioning scheme is proposed that can handle multiple kinds of data skew in Spark. The scheme first samples the output data of the Map stage and predicts the size of the global intermediate data; it then pre-partitions the sampled data using a hash-based best-fit algorithm to obtain a pre-partition table; finally, it partitions all of the intermediate data according to this table. Experimental results under two different skew conditions, key skew and value skew, show that the scheme is general and efficient and handles both kinds of skew effectively.

Key words: data skew, Spark, shuffle, partitioning algorithm, load balancing
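
The sample, pre-partition, and partition steps summarized in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the skew model or the exact best-fit rule. The sketch therefore assumes the coefficient of variation of partition loads as a stand-in skew measure, the length of each value as its data size, a greedy best-fit placement of keys sorted by estimated size, and a plain hash fallback for keys that never appeared in the sample. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
import zlib


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() is salted for strings."""
    return zlib.crc32(repr(key).encode("utf-8"))


def skew_degree(loads):
    """Assumed skew measure: coefficient of variation of partition loads."""
    avg = mean(loads)
    return pstdev(loads) / avg if avg else 0.0


def estimate_key_sizes(sample, fraction):
    """Scale per-key sizes seen in the sample up to the full data set.

    `sample` is an iterable of (key, value) pairs; len(value) is the
    assumed size of a record, `fraction` the sampling rate in (0, 1].
    """
    sizes = Counter()
    for key, value in sample:
        sizes[key] += len(value)
    return {key: size / fraction for key, size in sizes.items()}


def build_prepartition_table(key_sizes, num_partitions):
    """Greedy best fit: place keys heaviest first into the fullest
    partition that stays within the average load; if none fits, use
    the currently lightest partition."""
    capacity = sum(key_sizes.values()) / num_partitions
    loads = [0.0] * num_partitions
    table = {}
    ordered = sorted(key_sizes.items(), key=lambda kv: (-kv[1], repr(kv[0])))
    for key, size in ordered:
        fitting = [p for p in range(num_partitions) if loads[p] + size <= capacity]
        if fitting:
            target = max(fitting, key=lambda p: loads[p])
        else:
            target = min(range(num_partitions), key=lambda p: loads[p])
        table[key] = target
        loads[target] += size
    return table


def partition_of(key, table, num_partitions):
    """Look the key up in the pre-partition table; hash unseen keys."""
    if key in table:
        return table[key]
    return stable_hash(key) % num_partitions
```

In a Spark job, a table like this would typically be consulted from a custom partitioner during the shuffle. Here `partition_of` plays that role for any key, and `skew_degree` can be applied to the resulting partition loads to compare the table against plain hash partitioning.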

CLC number: