Journal of Beijing University of Posts and Telecommunications

  • EI Core Journal

Journal of Beijing University of Posts and Telecommunications ›› 2019, Vol. 42 ›› Issue (6): 111-117. doi: 10.13190/j.jbupt.2019-147

• Research Report •

Spam Web Detection Based on Hybrid-Sampling and Genetic Algorithm

LIU Han

  1. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China;
  2. Key Laboratory of Trustworthy Distributed Computing and Service (Beijing University of Posts and Telecommunications), Ministry of Education, Beijing 100876, China
  • Received: 2019-11-22 Online: 2019-12-28 Published: 2019-11-15
  • About the author: LIU Han (1997-), female, master's student, E-mail: liu_han@bupt.edu.cn.
  • Supported by:
    National Key Research and Development Program of China (2017YFC1307705)

Abstract: Spam web page detection is often hampered by imbalanced data and a high-dimensional feature space. To address both problems, an ensemble classification algorithm based on random hybrid sampling and a genetic algorithm is proposed. First, random hybrid sampling produces multiple balanced training subsets: random sampling reduces the number of majority-class samples, and the synthetic minority over-sampling technique (SMOTE) generates additional minority-class samples. Next, an improved genetic algorithm reduces the dimensionality of the training data, yielding multiple training subsets with optimal features. Extreme gradient boosting (XGBoost) is then used as the classifier: the balanced subsets are used to train multiple classifiers, which are combined by simple voting into a new classifier. Finally, this classifier predicts the test set to give the final result. Experiments show that, compared with XGBoost, the proposed algorithm improves accuracy by about 19.25%, reduces the time needed to build the learning model, and improves classification performance.

Key words: spam web detection, hybrid sampling, ensemble classification, genetic algorithm, extreme gradient boosting
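
The sampling and voting steps described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: every function name and parameter (`hybrid_sample`, `target`, `k`, the number of subsets) is hypothetical, the data are synthetic, the genetic-algorithm feature selection step is omitted, and a nearest-centroid learner stands in for XGBoost so that the example needs only the Python standard library.

```python
import random

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

def hybrid_sample(majority, minority, target, k, rng):
    """Return one balanced subset with `target` samples per class.
    Assumes len(minority) <= target <= len(majority)."""
    kept = rng.sample(majority, target)  # random under-sampling
    extra = smote_like(minority, target - len(minority), k, rng)
    return kept, minority + extra

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def train_centroid_classifier(neg, pos):
    """Stand-in base learner (NOT XGBoost): nearest class centroid."""
    c_neg, c_pos = centroid(neg), centroid(pos)
    def predict(x):
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        return 1 if d_pos < d_neg else 0
    return predict

def vote(classifiers, x):
    """Simple majority vote over the ensemble (ties go to class 0)."""
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

rng = random.Random(0)
# Toy imbalanced data: 200 normal pages (label 0), 20 spam pages (label 1).
normal = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
spam = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(20)]

# Build several balanced subsets and train one base learner on each.
ensemble = []
for _ in range(5):
    neg, pos = hybrid_sample(normal, spam, target=60, k=5, rng=rng)
    ensemble.append(train_centroid_classifier(neg, pos))
```

Calling `vote(ensemble, x)` on a feature vector `x` returns the majority label across the five base learners. Because each subset draws a different random slice of the majority class, the base learners see different data, which is what makes voting over them worthwhile.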

CLC Number: