High Performance Row-Based Hashing GPU SpGEMM


Journal of Beijing University of Posts and Telecommunications, 2019, Vol. 42, Issue 3: 106-113. doi: 10.13190/j.jbupt.2018-252


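TANG Yang^1, ZHAO Da-fei^2,3, HUANG Zhi-bin^2,3, DAI Zhi-tao^2,3

1. School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China;
3. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China

Received: 2018-10-09; Online: 2019-06-28; Published: 2019-06-20
About the authors: TANG Yang (b. 1997), male, master's student; HUANG Zhi-bin (b. 1978), male, lecturer and master's supervisor, E-mail: huangzb@bupt.edu.cn.
Funding:
Fundamental Research Funds for the Central Universities (2017RC42); IBM SUR project (IA2016010); key support project of the National Engineering Laboratory for Big Data Application Technologies for Improving Government Governance Capabilities; China Postdoctoral Science Foundation General Program (2014M550662)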


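Abstract

Abstract: To address the performance problem of general sparse matrix-matrix multiplication (SpGEMM), this paper presents RBSparse, a GPU-accelerated SpGEMM algorithm based on task classification and a low-latency hash table. RBSparse combines a low-cost pre-analysis method that estimates the complexity of each sub-task with a hash-table method that runs in low-latency shared memory for maximum efficiency. By addressing both load balancing and memory latency, RBSparse significantly reduces total computation time. RBSparse is compared with BHSparse, the previous state-of-the-art SpGEMM algorithm. The results show that RBSparse delivers on average 3.1 times the performance of BHSparse, and up to 14.49 times in the best case.

Keywords: sparse matrix-matrix multiplication, graphics processing unit, performance optimization, hash table, shared memory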
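
The abstract names two ingredients: a cheap pre-analysis that sorts row sub-tasks by complexity, and a hash table that accumulates each output row. The sequential Python sketch below illustrates the general idea only; it is not the authors' RBSparse implementation. It assumes CSR inputs given as `(ptr, idx, val)` tuples, and the function names, bin thresholds, and multiplicative hash constant are all hypothetical choices made for illustration. On a GPU, each bin would plausibly be handled by a differently configured kernel, with the per-row table held in shared memory.

```python
# Sequential CPU sketch of row-wise hashed SpGEMM (C = A * B) on CSR inputs.
# Names, bin thresholds and the hash function are illustrative assumptions,
# not details taken from the paper.

EMPTY = -1  # marker for an unused hash slot


def row_upper_bounds(a_ptr, a_idx, b_ptr):
    """Pre-analysis: upper bound on intermediate products per row of C."""
    bounds = []
    for i in range(len(a_ptr) - 1):
        total = 0
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k = a_idx[p]
            total += b_ptr[k + 1] - b_ptr[k]  # nnz of row k of B
        bounds.append(total)
    return bounds


def classify_rows(bounds, thresholds=(32, 512)):
    """Group row indices into bins of similar cost (thresholds are assumed)."""
    bins = [[] for _ in range(len(thresholds) + 1)]
    for i, bound in enumerate(bounds):
        bins[sum(bound > t for t in thresholds)].append(i)
    return bins


def table_size(bound):
    """Smallest power of two >= max(bound, 1), so a mask can replace modulo."""
    size = 1
    while size < bound:
        size *= 2
    return size


def hash_row(i, a, b, size):
    """Accumulate row i of C in an open-addressing table with linear probing."""
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    keys = [EMPTY] * size
    vals = [0.0] * size
    for p in range(a_ptr[i], a_ptr[i + 1]):
        k = a_idx[p]
        for q in range(b_ptr[k], b_ptr[k + 1]):
            col = b_idx[q]
            slot = (col * 107) & (size - 1)
            # Probe until the column's slot or a free slot is found.
            while keys[slot] != EMPTY and keys[slot] != col:
                slot = (slot + 1) & (size - 1)
            keys[slot] = col
            vals[slot] += a_val[p] * b_val[q]
    # Compact the table and sort by column index to obtain a CSR row.
    return sorted((c, v) for c, v in zip(keys, vals) if c != EMPTY)


def spgemm(a, b):
    """Multiply two CSR matrices, processing rows bin by bin."""
    bounds = row_upper_bounds(a[0], a[1], b[0])
    rows = [None] * len(bounds)
    for group in classify_rows(bounds):  # one bin ~ one kernel launch on a GPU
        for i in group:
            rows[i] = hash_row(i, a, b, table_size(bounds[i]))
    c_ptr, c_idx, c_val = [0], [], []
    for row in rows:
        for col, val in row:
            c_idx.append(col)
            c_val.append(val)
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

Because the table size is at least the upper bound on products for the row, it can never fill up, so linear probing always terminates. For example, with `a = ([0, 2, 2, 3], [0, 2, 1], [1, 2, 3])` and `b = ([0, 1, 2, 4], [1, 0, 1, 2], [4, 5, 6, 7])`, `spgemm(a, b)` returns `([0, 2, 2, 3], [1, 2, 0], [16.0, 14.0, 15.0])`.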


CLC number: TP391

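TANG Yang, ZHAO Da-fei, HUANG Zhi-bin, DAI Zhi-tao. High Performance Row-Based Hashing GPU SpGEMM[J]. Journal of Beijing University of Posts and Telecommunications, 2019, 42(3): 106-113.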


