Explicit content features of webpages are often unavailable because of distractions such as advertisements, insufficient permissions, privacy protection, or deceptive disguises. To address the challenge of classifying webpages with severe content feature deficiency, a method combining graph embedding and extreme gradient boosting (XGBoost) was proposed. The method leveraged implicit relational features in webpage hyperlink networks for multi-class classification. First, a hyperlink network was constructed from the link relationships between webpages. Then, node features were extracted using graph embedding models, and statistical structural features such as clustering coefficients and PageRank values were concatenated with them to form dense feature vectors. Finally, ensemble learning models, including XGBoost, were trained on these vectors to classify the webpages. Experiments on a real Wikipedia dataset showed that the Struct2Vec*+XGBoost approach achieved accuracy, precision, recall, and F1-score of 0.9875, 0.9659, 0.9713, and 0.9641, respectively, outperforming the comparison models. The findings demonstrate the effectiveness of implicit link-based features for webpage classification in scenarios with content feature deficiency.
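The structural-feature step described above (clustering coefficient and PageRank per webpage) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the toy link list, the function names, the damping factor of 0.85, and the fixed iteration count are all assumptions made for illustration, and the graph embedding and XGBoost training stages are deliberately left out.

```python
from collections import defaultdict


def build_graph(links):
    """Build directed out-link sets and undirected neighbor sets from (src, dst) pairs."""
    out_links = defaultdict(set)
    neighbors = defaultdict(set)
    nodes = set()
    for src, dst in links:
        nodes.update((src, dst))
        if src != dst:  # ignore self-links
            out_links[src].add(dst)
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    return sorted(nodes), out_links, neighbors


def clustering_coefficient(node, neighbors):
    """Fraction of pairs of a node's neighbors that are linked (undirected view)."""
    nbrs = sorted(neighbors[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return 2.0 * linked / (k * (k - 1))


def pagerank(nodes, out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank of dangling pages is spread uniformly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


def structural_features(links):
    """Return {page: [clustering coefficient, PageRank]} for a list of hyperlinks."""
    nodes, out_links, neighbors = build_graph(links)
    ranks = pagerank(nodes, out_links)
    return {v: [clustering_coefficient(v, neighbors), ranks[v]] for v in nodes}


# Hypothetical toy hyperlink network: A -> B -> C -> A, and C -> D.
toy_links = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
features = structural_features(toy_links)
for page, vector in features.items():
    print(page, vector)
```

In the pipeline the abstract describes, each of these two-element vectors would be concatenated with the node's graph embedding before being passed to the classifier.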

| Family | Number of genera | Number of species | Percentage of total species (%) | Genus | Number of species | Percentage of total species (%) |
|---|---|---|---|---|---|---|
| Amanitaceae | 2 | 11 | 5.26 | Amanita | 10 | 4.78 |
| Mycenaceae | 2 | 12 | 5.74 | Inocybe | 5 | 2.39 |
| Polyporaceae | 8 | 14 | 6.70 | Laccaria | 5 | 2.39 |
| Russulaceae | 3 | 23 | 11.00 | Marasmius | 6 | 2.87 |
| | | | | Mycena | 11 | 5.26 |
| | | | | Pluteus | 5 | 2.39 |
| | | | | Russula | 17 | 8.13 |
| | | | | Trametes | 5 | 2.39 |