Journal of Beijing University of Posts and Telecommunications

  • EI core journal

JOURNAL OF BEIJING UNIVERSITY OF POSTS AND TELECOM ›› 2019, Vol. 42 ›› Issue (6): 111-117. doi: 10.13190/j.jbupt.2019-147

• Reports

Spam Web Detection Based on Hybrid-Sampling and Genetic Algorithm

LIU Han   

  1. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China;
  2. Key Laboratory of Trustworthy Distributed Computing and Service (Beijing University of Posts and Telecommunications), Ministry of Education, Beijing 100876, China
  • Received: 2019-11-22 Online: 2019-12-28 Published: 2019-11-15

Abstract: Spam web detection is often hampered by imbalanced data and a high-dimensional feature space. To address these two problems, an ensemble classification algorithm based on random hybrid sampling and a genetic algorithm is proposed. First, a number of balanced training subsets are obtained by reducing the number of majority-class samples through random sampling and by generating minority-class samples with the synthetic minority over-sampling technique (SMOTE). Then, an improved genetic algorithm is used to reduce the dimensionality of the training data, yielding multiple training subsets with optimal features. Extreme gradient boosting (XGBoost) is used as the base classifier and trained on each balanced subset, and a new classifier is obtained by combining the resulting classifiers through simple voting. Finally, the ensemble is applied to the test set to produce the final prediction. Experiments show that, compared with XGBoost, the proposed algorithm improves accuracy by about 19.25%, reduces the time needed to build the learning model, and improves classification performance.
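
The hybrid-sampling and simple-voting steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`smote`, `hybrid_sample`, `majority_vote`), the neighbour count `k`, and the subset size are assumptions chosen for illustration, and the genetic-algorithm feature selection and the XGBoost training steps are omitted. Labels are assumed to be 0 for the majority class and 1 for the minority (spam) class.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if n_new > 0 and len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def hybrid_sample(majority, minority, subset_size, k=5, rng=None):
    """Build one balanced subset by under-sampling the majority class
    and over-sampling the minority class (assumed 50/50 split)."""
    rng = rng or random.Random(0)
    half = subset_size // 2
    maj_part = rng.sample(majority, min(half, len(majority)))
    if len(minority) >= half:
        min_part = rng.sample(minority, half)
    else:
        min_part = list(minority) + smote(minority, half - len(minority), k, rng)
    features = maj_part + min_part
    labels = [0] * len(maj_part) + [1] * len(min_part)
    return features, labels


def majority_vote(predictions):
    """Simple voting over per-classifier 0/1 label lists; ties go to class 0."""
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


# Toy data: 200 majority samples and 20 minority samples in two dimensions.
rng = random.Random(42)
majority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
minority = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]

# Five balanced subsets, one per base classifier in the ensemble.
subsets = [hybrid_sample(majority, minority, 100, rng=rng) for _ in range(5)]
```

In a fuller pipeline, each subset would be passed through feature selection and used to train one base classifier, and `majority_vote` would combine the per-classifier predictions on the test set.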

Key words: spam web detection, hybrid-sampling, ensemble classification, genetic algorithm, extreme gradient boosting
