A Shuffle Partition Optimization Scheme Based on Data Skew Model in Spark

doi:10.13190/j.jbupt.2019-092

Journal of Beijing University of Posts and Telecommunications, 2020, Vol. 43, Issue (2): 116-121.

• REPORTS •

YAN Yi-fei, WANG Zhi-li, QIU Xue-song, WANG Jia-lu

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received:2019-05-28 Published:2020-04-28

Abstract

To address the uneven distribution of data that arises during the shuffle phase on the Spark distributed platform, this paper analyzes why Spark processes skewed data inefficiently and proposes a skew model that uniformly quantifies the degree of skew of key-value data after shuffle. Based on this skew model, a shuffle partitioning scheme is proposed that can handle the various data skew problems on the Spark platform. First, the output data of the Map stage is sampled and the size of the intermediate data is predicted; the sampled data is then pre-partitioned with a hash-based best-fit algorithm. Finally, all of the intermediate data is partitioned according to the pre-partitioning result. Experimental results under both key skew and value skew show that the scheme is general and efficient, and that it handles both kinds of skew effectively.
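
The three steps named in the abstract (sample, pre-partition, then partition everything by the pre-partition result) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes that a key's load is the total size of its sampled values multiplied by a hypothetical `scale` factor, that "best fit" means placing each key in the fullest partition that still has room under an even-share capacity, and that keys missing from the sample fall back to plain hash partitioning. The names `pre_partition`, `partition_for`, and `stable_hash` are invented for illustration.

```python
import zlib
from collections import Counter


def stable_hash(key):
    """Deterministic hash of a key (Python's built-in hash() is salted for str)."""
    return zlib.crc32(repr(key).encode("utf-8"))


def pre_partition(sample, num_partitions, scale=1.0):
    """Assign sampled keys to partitions with a best-fit-decreasing heuristic.

    sample: iterable of (key, value) pairs drawn from the map output.
    scale:  assumed ratio of full data size to sample size (hypothetical).
    Returns (assignment, loads): key -> partition index, and estimated loads.
    """
    # Estimated load per key. Summing value sizes covers both key skew
    # (many records for one key) and value skew (a few very large values).
    estimated = Counter()
    for key, value in sample:
        estimated[key] += len(value) * scale

    capacity = sum(estimated.values()) / num_partitions  # even share
    loads = [0.0] * num_partitions
    assignment = {}

    # Place the heaviest keys first; ties are broken by the key's repr.
    ordered = sorted(estimated.items(), key=lambda kv: (-kv[1], repr(kv[0])))
    for key, size in ordered:
        fitting = [p for p in range(num_partitions)
                   if loads[p] + size <= capacity]
        if fitting:
            # Best fit: the fullest partition that can still take the key.
            target = max(fitting, key=lambda p: loads[p])
        else:
            # Nothing fits: fall back to the least loaded partition.
            target = min(range(num_partitions), key=lambda p: loads[p])
        assignment[key] = target
        loads[target] += size
    return assignment, loads


def partition_for(key, assignment, num_partitions):
    """Partition a record of the full data set using the pre-partition table."""
    if key in assignment:
        return assignment[key]
    # Keys never seen in the sample use ordinary hash partitioning.
    return stable_hash(key) % num_partitions


if __name__ == "__main__":
    sample = ([("hot", "x" * 10)] * 6 + [("b", "x" * 10)] * 2
              + [("c", "x" * 10)] * 2 + [("d", "x" * 10)] * 2)
    assignment, loads = pre_partition(sample, num_partitions=2)
    print(assignment, loads)
```

In this toy sample the key `hot` carries half of the estimated load, so it gets a partition to itself while the three lighter keys share the other. In a real Spark job the lookup in `partition_for` would sit inside a custom partitioner; that wiring is omitted here.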

Key words: data skew, Spark, shuffle, partitioning algorithm, load balancing

CLC Number: TP399

YAN Yi-fei, WANG Zhi-li, QIU Xue-song, WANG Jia-lu. A Shuffle Partition Optimization Scheme Based on Data Skew Model in Spark[J]. Journal of Beijing University of Posts and Telecommunications, 2020, 43(2): 116-121.
