Article(id=1261267656154406975, tenantId=1146029695717560320, journalId=1146123166801305609, issueId=1261262687258985194, articleNumber=null, orderNo=null, doi=10.12404/j.issn.1671-1815.2405389, pmid=null, cstr=null, oa=null, hot=null, price=null, onlineType=0, articleFormat=0, articleType=null, articleTypeStr=research-article, receivedDate=1721145600000, receivedDateStr=2024-07-17, revisedDate=1744387200000, revisedDateStr=2025-04-12, acceptedDate=null, acceptedDateStr=null, onlineDate=1778639242446, onlineDateStr=2026-05-13, pubDate=1752768000000, pubDateStr=2025-07-18, doiRegisterDate=null, doiRegisterDateStr=null, onlineIssueDate=1778639242446, onlineIssueDateStr=2026-05-13, onlineJustAcceptDate=null, onlineJustAcceptDateStr=null, onlineFirstDate=null, onlineFirstDateStr=null, sourceXml=null, magXml=null, createTime=1778639242446, creator=13701087609, updateTime=1778639242446, updator=13701087609, issue=Issue{id=1261262687258985194, tenantId=1146029695717560320, journalId=1146123166801305609, year='2025', volume='25', issue='20', pageStart='8317', pageEnd='8759', issueExtLink='null', onlineDate='null', pubDate='null', beforeIssueId=null, nextIssueId=null, price=null, status=1, issueComplete=1, articleOrder=1, issueType=-1, specialIssue=null, createTime=1778638057769, creator=13701087609, updateTime=1778753106634, updator=13701087609, preIssue=null, nextIssue=null, ext={EN=IssueExt(id=1261745237240722095, tenantId=1146029695717560320, journalId=1146123166801305609, issueId=1261262687258985194, language=EN, specialIssueTitle=, coverIllustrator=null, specialIssueEditor=, specialIssueAbout=), CN=IssueExt(id=1261745237240722096, tenantId=1146029695717560320, journalId=1146123166801305609, issueId=1261262687258985194, language=CN, specialIssueTitle=, coverIllustrator=null, specialIssueEditor=, specialIssueAbout=)}, issueFiles=null}, startPage=8604, endPage=8614, ext={EN=ArticleExt(id=1261267657806962759, articleId=1261267656154406975, tenantId=1146029695717560320, 
journalId=1146123166801305609, language=EN, title=Multi-Classification Prediction of Web Pages with Missing Content Features Based on Graph Embedding and Ensemble Classification Algorithm, columnId=1156262729162810294, journalTitle=Science Technology and Engineering, columnName=Papers·Automation and Computational Technology, runingTitle=null, highlight=null, articleAbstract=

Explicit content features of webpages are often unavailable because of noise (such as advertisements), insufficient permissions, privacy protection, or malicious disguise. To classify webpages whose content features are severely deficient, a method combining graph embedding with extreme gradient boosting (XGBoost) is proposed; it uses the implicit relational features in the webpage hyperlink network for multi-class classification. First, a hyperlink network is constructed from the webpages and the hyperlinks between them. Next, the implicit relational features of each node (webpage) are extracted with graph embedding models, and statistical structural features such as the clustering coefficient and the PageRank value are concatenated with them to form a dense feature vector per node. Finally, ensemble learning models, including XGBoost, are trained on these vectors to predict webpage classes. Experiments on a real Wikipedia hyperlink dataset show that the Struct2Vec*+XGBoost combination reaches an accuracy of 0.9875, a precision of 0.9659, a recall of 0.9713, and an F1-score of 0.9641, outperforming the comparison models. The findings indicate that implicit link-based features are effective for webpage classification when content features are missing.
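
The feature-construction step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the Python standard library, the function names (`build_graph`, `pagerank`, `clustering`, `node_features`) are hypothetical, the graph-embedding vectors are assumed to be computed elsewhere and passed in as a plain dictionary, and the classifier training stage (XGBoost in the paper) is omitted.

```python
def build_graph(edges):
    """Build directed out-links and an undirected neighbor view from (src, dst) pairs."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: set() for v in nodes}
    neighbors = {v: set() for v in nodes}
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)
            neighbors[src].add(dst)
            neighbors[dst].add(src)
    return nodes, out_links, neighbors


def pagerank(nodes, out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank held by dangling pages is spread uniformly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


def clustering(v, neighbors):
    """Local clustering coefficient of v on the undirected view of the graph."""
    nbrs = list(neighbors[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def node_features(edges, embeddings=None):
    """Concatenate [embedding..., clustering coefficient, PageRank] for every node.

    `embeddings` is an assumed input: a dict mapping node -> list of floats
    produced by any graph-embedding model.
    """
    nodes, out_links, neighbors = build_graph(edges)
    rank = pagerank(nodes, out_links)
    features = {}
    for v in nodes:
        emb = list(embeddings[v]) if embeddings else []
        features[v] = emb + [clustering(v, neighbors), rank[v]]
    return features


if __name__ == "__main__":
    # Toy hyperlink network: page names are placeholders.
    toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
    for page, vector in node_features(toy_edges).items():
        print(page, vector)
```

The resulting per-node vectors are the kind of dense input that a gradient-boosted tree classifier would then be trained on; which embedding model and which hyperparameters to use are left open here.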

, correspAuthors=Bin LIAO, authorNote=null, correspAuthorsNote=null, copyrightStatement=null, copyrightOwner=null, extLink=null, articleAbsUrl=null, sourceXml=null, magXml=null, pdfUrl=null, pdf=null, pdfFileSize=null, pdfExtLink=null, richHtmlUrl=null, mobilePdfUrl=null, reviewReport=null, pdfFirstPage=null, abstractGraph=null, abstractGraphContent=null, abstractVideo=null, citation=null, cebUrl=null, magXmlContent=null, mapNumber=null, authorCompany=null, fund=null, authors=null, authorsList=Tao ZHANG, Bin LIAO, Jiong YU), CN=ArticleExt(id=1261267666673721464, articleId=1261267656154406975, tenantId=1146029695717560320, journalId=1146123166801305609, language=CN, title=基于图嵌入与集成分类算法的内容特征缺失网页多分类预测方法, columnId=1156262729783567290, journalTitle=科学技术与工程, columnName=论文·自动化技术、计算机技术, runingTitle=null, highlight=null, articleAbstract=

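Because of noise (such as advertisements), insufficient permissions, privacy protection, or malicious disguise, the explicit content features of a large number of webpages cannot be obtained promptly and completely. Against this background, to solve the problem of how to classify webpages effectively when their content features are severely missing, a webpage multi-classification method is proposed that is based on graph embedding and the ensemble classification algorithm XGBoost (extreme gradient boosting) and uses the implicit relational features in the webpage link network. First, a webpage link network is constructed from the webpages and the hyperlinks between them. Then, the implicit relational features of the nodes (webpages) in the link network are extracted with graph embedding models. Next, statistical structural features of the nodes, such as the clustering coefficient and the PageRank value, are extracted, and together these form the dense feature vector of each node. Finally, a node classification model built on ensemble learning models such as XGBoost is used to classify the webpages. Experimental results on a real Wikipedia webpage link dataset show that, with explicit webpage content features completely absent, the proposed Struct2Vec*+XGBoost combination achieves good webpage classification, reaching 0.9875, 0.9659, 0.9713, and 0.9641 on accuracy, precision, recall, and F1-score, respectively.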

, correspAuthors=廖彬, authorNote=null, correspAuthorsNote=
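* Bin LIAO (born 1986), male, Han, from Neijiang, Sichuan; Ph.D., associate professor. Research interests: machine learning, data mining, and big data computing models. E-mail: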
, copyrightStatement=null, copyrightOwner=null, extLink=null, articleAbsUrl=null, sourceXml=/W+RpxSmc9a2gaaqHR9tpQ==, magXml=gxjBfDcegKu7nlbTc9l5WQ==, pdfUrl=null, pdf=au57owMchFqGXo00Ppmy1A==, pdfFileSize=14875843, pdfExtLink=null, richHtmlUrl=null, mobilePdfUrl=null, reviewReport=null, pdfFirstPage=null, abstractGraph=c570+KtSE0mXV2XputntuQ==, abstractGraphContent=null, abstractVideo=null, citation=null, cebUrl=null, magXmlContent=MchDmp0SRrIiM6yWQBn3Nw==, mapNumber=null, authorCompany=null, fund=null, authors=

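Tao ZHANG (born 1988), female, Han, from Fuyang, Anhui; Ph.D., lecturer. Research interests: machine learning, data mining, and complex network analysis. E-mail: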

, authorsList=张陶, 廖彬, 于炯)}, authors=[Author(id=1261377031913259429, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, orderNo=0, firstName=null, middleName=null, lastName=null, nameCn=null, orcid=null, stid=null, country=null, authorPic=null, dead=0, email=zt59921661@126.com, emailSecond=null, emailThird=null, correspondingAuthor=0, authorType=1, ext={EN=AuthorExt(id=1261377032416575918, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, authorId=1261377031913259429, language=EN, stringName=Tao ZHANG, firstName=Tao, middleName=null, lastName=ZHANG, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=1, 2, address=1 College of Information Engineering, Guizhou University of Traditional Chinese Medicine, Guiyang 550025, China
2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null), CN=AuthorExt(id=1261377034132046265, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, authorId=1261377031913259429, language=CN, stringName=张陶, firstName=null, middleName=null, lastName=null, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=1, 2, address=1 贵州中医药大学信息工程学院, 贵阳 550025
2 新疆大学信息科学与工程学院, 乌鲁木齐 830046, bio={"content":"

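Tao ZHANG (born 1988), female, Han, from Fuyang, Anhui; Ph.D., lecturer. Research interests: machine learning, data mining, and complex network analysis. E-mail: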

"}, bioImg=null, bioContent=

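Tao ZHANG (born 1988), female, Han, from Fuyang, Anhui; Ph.D., lecturer. Research interests: machine learning, data mining, and complex network analysis. E-mail: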

, aboutCorrespAuthor=null)}, companyList=[AuthorCompany(id=1261377029438620028, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=1, ext=[AuthorCompanyExt(id=1261377029451202941, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377029438620028, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=1 College of Information Engineering, Guizhou University of Traditional Chinese Medicine, Guiyang 550025, China), AuthorCompanyExt(id=1261377029463785854, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377029438620028, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=1 贵州中医药大学信息工程学院, 贵阳 550025)]), AuthorCompany(id=1261377030470418821, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=2, ext=[AuthorCompanyExt(id=1261377030491390344, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377030470418821, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China), AuthorCompanyExt(id=1261377030520750474, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377030470418821, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=2 新疆大学信息科学与工程学院, 乌鲁木齐 830046)])]), Author(id=1261377034513727938, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, orderNo=1, firstName=null, middleName=null, lastName=null, nameCn=null, orcid=null, stid=null, country=null, authorPic=null, dead=0, email=liaobin665@163.com, 
emailSecond=null, emailThird=null, correspondingAuthor=1, authorType=1, ext={EN=AuthorExt(id=1261377035319034319, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, authorId=1261377034513727938, language=EN, stringName=Bin LIAO, firstName=Bin, middleName=null, lastName=LIAO, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=3, *, address=3 College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null), CN=AuthorExt(id=1261377036543771099, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, authorId=1261377034513727938, language=CN, stringName=廖彬, firstName=null, middleName=null, lastName=null, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=3, *, address=3 贵州财经大学大数据统计学院, 贵阳 550025, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null)}, companyList=[AuthorCompany(id=1261377031254753686, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=3, ext=[AuthorCompanyExt(id=1261377031275725208, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377031254753686, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=3 College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China), AuthorCompanyExt(id=1261377031451885978, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377031254753686, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=3 贵州财经大学大数据统计学院, 贵阳 550025)])]), Author(id=1261377038317961702, tenantId=1146029695717560320, journalId=1146123166801305609, 
articleId=1261267656154406975, orderNo=2, firstName=null, middleName=null, lastName=null, nameCn=null, orcid=null, stid=null, country=null, authorPic=null, dead=0, email=null, emailSecond=null, emailThird=null, correspondingAuthor=0, authorType=1, ext={EN=AuthorExt(id=1261377039093907958, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, authorId=1261377038317961702, language=EN, stringName=Jiong YU, firstName=Jiong, middleName=null, lastName=YU, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=2, address=2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null), CN=AuthorExt(id=1261377040251535877, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, authorId=1261377038317961702, language=CN, stringName=于炯, firstName=null, middleName=null, lastName=null, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=2, address=2 新疆大学信息科学与工程学院, 乌鲁木齐 830046, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null)}, companyList=[AuthorCompany(id=1261377030470418821, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=2, ext=[AuthorCompanyExt(id=1261377030491390344, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377030470418821, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China), AuthorCompanyExt(id=1261377030520750474, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377030470418821, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, 
departmentName=null, remark=2 新疆大学信息科学与工程学院, 乌鲁木齐 830046)])])], keywords=[Keyword(id=1261377042726175261, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, orderNo=1, keyword=missing content features), Keyword(id=1261377042910724646, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, orderNo=2, keyword=graph embedding), Keyword(id=1261377043359515185, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, orderNo=3, keyword=webpage hyperlink network), Keyword(id=1261377043749585465, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, orderNo=4, keyword=webpage multi-classification), Keyword(id=1261377044861076037, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, orderNo=1, keyword=内容特征缺失), Keyword(id=1261377045465055822, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, orderNo=2, keyword=图嵌入), Keyword(id=1261377047511876185, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, orderNo=3, keyword=网页链接网络), Keyword(id=1261377048363319906, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, orderNo=4, keyword=网页多分类)], refs=[Reference(id=1261377071838839621, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=45, issue=3, pageStart=669, pageEnd=675, url=null, language=null, rfNumber=[1], rfOrder=0, authorNames=王法玉, 于晓文, 陈洪涛, journalName=计算机工程与设计, refType=null, unstructuredReference=王法玉, 于晓文, 陈洪涛. 基于欠采样和多层集成学习的恶意网页识别[J]. 
计算机工程与设计, 2024, 45(3): 669-675., articleTitle=基于欠采样和多层集成学习的恶意网页识别, refAbstract=null), Reference(id=1261377072019194698, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=45, issue=3, pageStart=669, pageEnd=675, url=null, language=null, rfNumber=[1], rfOrder=1, authorNames=Wang Fayu, Yu Xiaowen, Chen Hongtao, journalName=Computer Engineering and Design, refType=null, unstructuredReference=Wang Fayu, Yu Xiaowen, Chen Hongtao. Malicious web page recognition based on undersampling and multi-layer ensemble learning[J]. Computer Engineering and Design, 2024, 45(3): 669-675., articleTitle=Malicious web page recognition based on undersampling and multi-layer ensemble learning, refAbstract=null), Reference(id=1261377072325378898, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=23, issue=10, pageStart=4279, pageEnd=4285, url=null, language=null, rfNumber=[2], rfOrder=2, authorNames=张明杰, 肖奇荣, 朱烨行, journalName=科学技术与工程, refType=null, unstructuredReference=张明杰, 肖奇荣, 朱烨行. 基于XGBoost模型的融合多特征微博信息传播预测方法[J]. 科学技术与工程, 2023, 23(10): 4279-4285., articleTitle=基于XGBoost模型的融合多特征微博信息传播预测方法, refAbstract=null), Reference(id=1261377074091180888, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=23, issue=10, pageStart=4279, pageEnd=4285, url=null, language=null, rfNumber=[2], rfOrder=3, authorNames=Zhang Mingjie, Xiao Qirong, Zhu Yehang, journalName=Science Technology and Engineering, refType=null, unstructuredReference=Zhang Mingjie, Xiao Qirong, Zhu Yehang. Prediction method of microblog information dissemination based on XGBoost model and multi-feature fusion[J]. 
Science Technology and Engineering, 2023, 23(10): 4279-4285., articleTitle=Prediction method of microblog information dissemination based on XGBoost model and multi-feature fusion, refAbstract=null), Reference(id=1261377074426725217, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2022, volume=39, issue=4, pageStart=1043, pageEnd=1048, url=null, language=null, rfNumber=[3], rfOrder=4, authorNames=翁彬月, 秦永彬, 黄瑞章, journalName=计算机应用研究, refType=null, unstructuredReference=翁彬月, 秦永彬, 黄瑞章, . NEMTF: 基于多维度文本特征的新闻网页信息提取方法[J]. 计算机应用研究, 2022, 39(4): 1043-1048., articleTitle=NEMTF: 基于多维度文本特征的新闻网页信息提取方法, refAbstract=null), Reference(id=1261377074720326504, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2022, volume=39, issue=4, pageStart=1043, pageEnd=1048, url=null, language=null, rfNumber=[3], rfOrder=5, authorNames=Weng Binyue, Qin Yongbin, Huang Ruizhang, journalName=Application Research of Computers, refType=null, unstructuredReference=Weng Binyue, Qin Yongbin, Huang Ruizhang, et al. NEMTF: method of news Web content extraction based on multi-dimensional text features[J]. Application Research of Computers, 2022, 39(4): 1043-1048., articleTitle=NEMTF: method of news Web content extraction based on multi-dimensional text features, refAbstract=null), Reference(id=1261377074938430317, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=48, issue=6, pageStart=1265, pageEnd=1268, url=null, language=null, rfNumber=[4], rfOrder=6, authorNames=周文文, 韩斌, 黄树成, journalName=计算机与数字工程, refType=null, unstructuredReference=周文文, 韩斌, 黄树成. 结合文本语义图和词频统计的网页分类算法研究[J]. 
计算机与数字工程, 2020, 48(6): 1265-1268, 1313., articleTitle=结合文本语义图和词频统计的网页分类算法研究, refAbstract=null), Reference(id=1261377075055870834, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=48, issue=6, pageStart=1265, pageEnd=1268, url=null, language=null, rfNumber=[4], rfOrder=7, authorNames=Zhou Wenwen, Han Bin, Huang Shucheng, journalName=Computer and Digital Engineering, refType=null, unstructuredReference=Zhou Wenwen, Han Bin, Huang Shucheng. Research on web page classification algorithm combining text semantic graph and word frequency statistics[J]. Computer and Digital Engineering, 2020, 48(6): 1265-1268, 1313., articleTitle=Research on web page classification algorithm combining text semantic graph and word frequency statistics, refAbstract=null), Reference(id=1261377075273974651, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=41, issue=7, pageStart=1395, pageEnd=1399, url=null, language=null, rfNumber=[5], rfOrder=8, authorNames=耿宜鹏, 鞠时光, 蔡文鹏, journalName=小型微型计算机系统, refType=null, unstructuredReference=耿宜鹏, 鞠时光, 蔡文鹏, . 基于Skip-PTM的网页主题分类与主题变迁的研究[J]. 小型微型计算机系统, 2020, 41(7): 1395-1399., articleTitle=基于Skip-PTM的网页主题分类与主题变迁的研究, refAbstract=null), Reference(id=1261377075399803779, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=41, issue=7, pageStart=1395, pageEnd=1399, url=null, language=null, rfNumber=[5], rfOrder=9, authorNames=Geng Yipeng, Ju Shiguang, Cai Wenpeng, journalName=Journal of Chinese Computer Systems, refType=null, unstructuredReference=Geng Yipeng, Ju Shiguang, Cai Wenpeng, et al. Research on topic classification and topic change of web pages based on Skip-PTM[J]. 
Journal of Chinese Computer Systems, 2020, 41(7): 1395-1399., articleTitle=Research on topic classification and topic change of web pages based on Skip-PTM, refAbstract=null), Reference(id=1261377075622101898, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2018, volume=18, issue=23, pageStart=81, pageEnd=89, url=null, language=null, rfNumber=[6], rfOrder=10, authorNames=冯健, 张莹, journalName=科学技术与工程, refType=null, unstructuredReference=冯健, 张莹. 基于文档对象模型结构聚类的钓鱼网页检测方法[J]. 科学技术与工程, 2018, 18(23): 81-89., articleTitle=基于文档对象模型结构聚类的钓鱼网页检测方法, refAbstract=null), Reference(id=1261377075991200661, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2018, volume=18, issue=23, pageStart=81, pageEnd=89, url=null, language=null, rfNumber=[6], rfOrder=11, authorNames=Feng Jian, Zhang Ying, journalName=Science Technology and Engineering, refType=null, unstructuredReference=Feng Jian, Zhang Ying. A detection method for phishing webpage based on DOM structure clustering[J]. Science Technology and Engineering, 2018, 18(23): 81-89., articleTitle=A detection method for phishing webpage based on DOM structure clustering, refAbstract=null), Reference(id=1261377076196721563, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=7, issue=null, pageStart=995, pageEnd=1004, url=null, language=null, rfNumber=[7], rfOrder=12, authorNames=Deng L, Du X, Shen J Z, journalName=Frontiers of Information Technology & Electronic Engineering, refType=null, unstructuredReference=Deng L, Du X, Shen J Z. Web page classification based on heterogeneous features and a combination of multiple classifiers[J]. 
Frontiers of Information Technology & Electronic Engineering, 2020, 7: 995-1004., articleTitle=Web page classification based on heterogeneous features and a combination of multiple classifiers, refAbstract=null), Reference(id=1261377076460962722, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=46, issue=12, pageStart=97, pageEnd=102, url=null, language=null, rfNumber=[8], rfOrder=13, authorNames=淮晓永, 韩晓东, 高若辰, journalName=电子技术应用, refType=null, unstructuredReference=淮晓永, 韩晓东, 高若辰, . 一种自适应网页结构化信息提取方法[J]. 电子技术应用, 2020, 46(12): 97-102., articleTitle=一种自适应网页结构化信息提取方法, refAbstract=null), Reference(id=1261377076611957674, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=46, issue=12, pageStart=97, pageEnd=102, url=null, language=null, rfNumber=[8], rfOrder=14, authorNames=Huai Xiaoyong, Han Xiaodong, Gao Ruochen, journalName=Application of Electronic Technique, refType=null, unstructuredReference=Huai Xiaoyong, Han Xiaodong, Gao Ruochen, et al. An adaptive web page structured information extraction method[J]. Application of Electronic Technique, 2020, 46(12): 97-102., articleTitle=An adaptive web page structured information extraction method, refAbstract=null), Reference(id=1261377076741981103, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=40, issue=2, pageStart=320, pageEnd=325, url=null, language=null, rfNumber=[9], rfOrder=15, authorNames=洪良怡, 朱松林, 王轶骏, journalName=计算机应用与软件, refType=null, unstructuredReference=洪良怡, 朱松林, 王轶骏, . 基于卷积神经网络的暗网网页分类研究[J]. 
计算机应用与软件, 2023, 40(2): 320-325, 330., articleTitle=基于卷积神经网络的暗网网页分类研究, refAbstract=null), Reference(id=1261377078528754611, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=40, issue=2, pageStart=320, pageEnd=325, url=null, language=null, rfNumber=[9], rfOrder=16, authorNames=Hong Liangyi, Zhu Songlin, Wang Yijun, journalName=Computer Applications and Software, refType=null, unstructuredReference=Hong Liangyi, Zhu Songlin, Wang Yijun, et al. Darknet web page classification based on convolutional neural network[J]. Computer Applications and Software, 2023, 40(2): 320-325, 330., articleTitle=Darknet web page classification based on convolutional neural network, refAbstract=null), Reference(id=1261377078751052730, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=41, issue=4, pageStart=391, pageEnd=396, url=null, language=null, rfNumber=[10], rfOrder=17, authorNames=张紫妍, 韩斌, 姜元昊, journalName=计算机仿真, refType=null, unstructuredReference=张紫妍, 韩斌, 姜元昊, . 融合差分进化的网页暗链集成分类检测方法[J]. 计算机仿真, 2024, 41(4): 391-396., articleTitle=融合差分进化的网页暗链集成分类检测方法, refAbstract=null), Reference(id=1261377078990128060, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=41, issue=4, pageStart=391, pageEnd=396, url=null, language=null, rfNumber=[10], rfOrder=18, authorNames=Zhang Ziyan, Han Bin, Jiang Yuanhao, journalName=Computer Simulation, refType=null, unstructuredReference=Zhang Ziyan, Han Bin, Jiang Yuanhao, et al. Integrated Classification and detection method of web page hidden hyperlink based on differential evolution[J]. 
Computer Simulation, 2024, 41(4): 391-396., articleTitle=Integrated Classification and detection method of web page hidden hyperlink based on differential evolution, refAbstract=null), Reference(id=1261377079120151488, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=9, issue=3, pageStart=176, pageEnd=190, url=null, language=null, rfNumber=[11], rfOrder=19, authorNames=杨胜杰, 陈朝阳, 徐逸, journalName=信息安全学报, refType=null, unstructuredReference=杨胜杰, 陈朝阳, 徐逸, . 基于深度学习与特征融合的恶意网页识别方法研究[J]. 信息安全学报, 2024, 9(3): 176-190., articleTitle=基于深度学习与特征融合的恶意网页识别方法研究, refAbstract=null), Reference(id=1261377079396975561, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=9, issue=3, pageStart=176, pageEnd=190, url=null, language=null, rfNumber=[11], rfOrder=20, authorNames=Yang Shengjie, Chen Zhaoyang, Xu Yi, journalName=Journal of Cyber Security, refType=null, unstructuredReference=Yang Shengjie, Chen Zhaoyang, Xu Yi, et al. Research on malicious web page identification method based on deep learning and feature fusion[J]. Journal of Cyber Security, 2024, 9(3): 176-190., articleTitle=Research on malicious web page identification method based on deep learning and feature fusion, refAbstract=null), Reference(id=1261377079652828109, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=13, issue=1, pageStart=54, pageEnd=null, url=null, language=null, rfNumber=[12], rfOrder=21, authorNames=Giamphy E, Guillaume J L, Doucet A, journalName=Social Network Analysis and Mining, refType=null, unstructuredReference=Giamphy E, Guillaume J L, Doucet A, et al. A survey on bipartite graphs embedding[J]. 
Social Network Analysis and Mining, 2023, 13(1): 54., articleTitle=A survey on bipartite graphs embedding, refAbstract=null), Reference(id=1261377079938040790, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=40, issue=6, pageStart=1601, pageEnd=1613, url=null, language=null, rfNumber=[13], rfOrder=22, authorNames=李青, 王一晨, 杜承烈, journalName=计算机应用研究, refType=null, unstructuredReference=李青, 王一晨, 杜承烈. 图表示学习方法研究综述[J]. 计算机应用研究, 2023, 40(6): 1601-1613., articleTitle=图表示学习方法研究综述, refAbstract=null), Reference(id=1261377080437162971, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=40, issue=6, pageStart=1601, pageEnd=1613, url=null, language=null, rfNumber=[13], rfOrder=23, authorNames=Li Qing, Wang Yichen, Du Chenglie, journalName=Application Research of Computers, refType=null, unstructuredReference=Li Qing, Wang Yichen, Du Chenglie. Survey on graph representation learning methods[J]. Application Research of Computers, 2023, 40(6): 1601-1613., articleTitle=Survey on graph representation learning methods, refAbstract=null), Reference(id=1261377080806261728, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2014, volume=null, issue=null, pageStart=701, pageEnd=710, url=null, language=null, rfNumber=[14], rfOrder=24, authorNames=Perozzi B, AL-Rfou R, Skiena S, journalName=Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining, refType=null, unstructuredReference=Perozzi B, AL-Rfou R, Skiena S. DeepWalk: online learning of social representations[C]// Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining. 
New York: ACM, 2014: 701-710., articleTitle=DeepWalk: online learning of social representations, refAbstract=null), Reference(id=1261377081028559844, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2016, volume=null, issue=null, pageStart=855, pageEnd=864, url=null, language=null, rfNumber=[15], rfOrder=25, authorNames=Grover A, Leskovec J, journalName=Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, refType=null, unstructuredReference=Grover A, Leskovec J. Node2Vec: scalable feature learning for networks[C]// Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 855-864., articleTitle=Node2Vec: scalable feature learning for networks, refAbstract=null), Reference(id=1261377081246663654, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2018, volume=null, issue=null, pageStart=701, pageEnd=710, url=null, language=null, rfNumber=[16], rfOrder=26, authorNames=Cai C, Wang Y, journalName=Proceedings of the International Conference on Learning Representation, refType=null, unstructuredReference=Cai C, Wang Y. A simple yet effective baseline for non-attributed graph classification[C]// Proceedings of the International Conference on Learning Representation. 
New York: ACM, 2018: 701-710., articleTitle=A simple yet effective baseline for non-attributed graph classification, refAbstract=null), Reference(id=1261377083197015021, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2020, volume=null, issue=null, pageStart=3125, pageEnd=3132, url=null, language=null, rfNumber=[17], rfOrder=27, authorNames=Rozemberczki B, Kiss O, Sarkar R, journalName=Proceedings of the 29th ACM International Conference on Information & Knowledge Management, refType=null, unstructuredReference=Rozemberczki B, Kiss O, Sarkar R. Karate club: an api oriented open-source python framework for unsupervised learning on graphs[C]// Proceedings of the 29th ACM International Conference on Information & Knowledge Management. New York: ACM, 2020: 3125-3132., articleTitle=Karate club: an api oriented open-source python framework for unsupervised learning on graphs, refAbstract=null), Reference(id=1261377083503199218, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2018, volume=null, issue=null, pageStart=1476, pageEnd=1481, url=null, language=null, rfNumber=[18], rfOrder=28, authorNames=Yang H, Pan S, Zhang P, journalName=Proceedings of the International Conference on Data Mining. Piscataway, refType=null, unstructuredReference=Yang H, Pan S, Zhang P, et al. Binarized attributed network embedding[C]// Proceedings of the International Conference on Data Mining. 
Piscataway, NJ: IEEE, 2018: 1476-1481., articleTitle=Binarized attributed network embedding, refAbstract=null), Reference(id=1261377083666777078, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=46, issue=7, pageStart=1532, pageEnd=1552, url=null, language=null, rfNumber=[19], rfOrder=29, authorNames=张子威, 王鑫, 朱文武, journalName=计算机学报, refType=null, unstructuredReference=张子威, 王鑫, 朱文武. 图神经架构搜索综述[J]. 计算机学报, 2023, 46(7): 1532-1552., articleTitle=图神经架构搜索综述, refAbstract=null), Reference(id=1261377084035875834, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2023, volume=46, issue=7, pageStart=1532, pageEnd=1552, url=null, language=null, rfNumber=[19], rfOrder=30, authorNames=Zhang Ziwei, Wang Xin, Zhu Wenwu, journalName=Chinese Journal of Computers, refType=null, unstructuredReference=Zhang Ziwei, Wang Xin, Zhu Wenwu. Graph neural architecture search: a survey[J]. Chinese Journal of Computers. 2023, 46(7): 1532-1552., articleTitle=Graph neural architecture search: a survey, refAbstract=null), Reference(id=1261377084442722304, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2016, volume=null, issue=null, pageStart=785, pageEnd=794, url=null, language=null, rfNumber=[20], rfOrder=31, authorNames=Chen T, Guestrin C, journalName=Proceedings of the 22th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, refType=null, unstructuredReference=Chen T, Guestrin C. XGBoost: a scalable tree boosting system[C]// Proceedings of the 22th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 
New York: ACM, 2016: 785-794., articleTitle=XGBoost: a scalable tree boosting system, refAbstract=null), Reference(id=1261377084593717252, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=60, issue=3, pageStart=406, pageEnd=415, url=null, language=null, rfNumber=[21], rfOrder=32, authorNames=孙辰星, 刘伟, 卢彬, journalName=南京大学学报(自然科学), refType=null, unstructuredReference=孙辰星, 刘伟, 卢彬, . 多视角网页分类数据集构建及性能评估[J]. 南京大学学报(自然科学), 2024, 60(3): 406-415., articleTitle=多视角网页分类数据集构建及性能评估, refAbstract=null), Reference(id=1261377084711157770, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=60, issue=3, pageStart=406, pageEnd=415, url=null, language=null, rfNumber=[21], rfOrder=33, authorNames=Sun Chenxing, Liu Wei, Lu Bin, journalName=Journal of Nanjing University(Natural Science), refType=null, unstructuredReference=Sun Chenxing, Liu Wei, Lu Bin, et al. Multi-View webpage classification dataset construction and evaluation[J]. Journal of Nanjing University(Natural Science), 2024, 60(3): 406-415., articleTitle=Multi-View webpage classification dataset construction and evaluation, refAbstract=null), Reference(id=1261377084996370444, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2012, volume=22, issue=7, pageStart=93, pageEnd=106, url=null, language=null, rfNumber=[22], rfOrder=34, authorNames=Pedroche F, journalName=International Journal of Bifurcation and Chaos, refType=null, unstructuredReference=Pedroche F. A Model to Classify users of social networks based on PageRank[J]. 
International Journal of Bifurcation and Chaos, 2012, 22(7): 93-106., articleTitle=A Model to Classify users of social networks based on PageRank, refAbstract=null), Reference(id=1261377085793288212, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2015, volume=null, issue=null, pageStart=1067, pageEnd=1077, url=null, language=null, rfNumber=[23], rfOrder=35, authorNames=Tang J, Qu M, Wang M, journalName=Proceedings of the 24th International Conference on World Wide Web, refType=null, unstructuredReference=Tang J, Qu M, Wang M, et al. LINE: large-scale information network embedding[C]// Proceedings of the 24th International Conference on World Wide Web. New York: ACM, 2015: 1067-1077., articleTitle=LINE: large-scale information network embedding, refAbstract=null), Reference(id=1261377087735250970, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2016, volume=null, issue=null, pageStart=1225, pageEnd=1234, url=null, language=null, rfNumber=[24], rfOrder=36, authorNames=Wang D, Cui P, Zhu W, journalName=Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, refType=null, unstructuredReference=Wang D, Cui P, Zhu W. Structural Deep network embedding[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 
New York: ACM, 2016: 1225-1234., articleTitle=Structural Deep network embedding, refAbstract=null), Reference(id=1261377088108544031, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2017, volume=null, issue=null, pageStart=385, pageEnd=394, url=null, language=null, rfNumber=[25], rfOrder=37, authorNames=Ribeiro L F R, Saverese P H P, Figueiredo D R, journalName=Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, refType=null, unstructuredReference=Ribeiro L F R, Saverese P H P, Figueiredo D R. Struct2Vec: learning node representations from structural identity[C]// Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2017: 385-394., articleTitle=Struct2Vec: learning node representations from structural identity, refAbstract=null), Reference(id=1261377088293093411, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2022, volume=28, issue=1, pageStart=614, pageEnd=622, url=null, language=null, rfNumber=[26], rfOrder=38, authorNames=Ruitmark V D, Billeter M, Eisemann E, journalName=IEEE Transactions on Visualization and Computer Graphics, refType=null, unstructuredReference=Ruitmark V D, Billeter M, Eisemann E. An efficient dual-hierarchy t-SNE minimization[J]. IEEE Transactions on Visualization and Computer Graphics, 2022, 28(1): 614-622., articleTitle=An efficient dual-hierarchy t-SNE minimization, refAbstract=null), Reference(id=1261377088913850412, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=29, issue=8, pageStart=2333, pageEnd=2349, url=null, language=null, rfNumber=[27], rfOrder=39, authorNames=谢斌, 徐燕, 王冠超, journalName=中国图象图形学报, refType=null, unstructuredReference=谢斌, 徐燕, 王冠超, . 
t-SNE最大化的自适应彩色图像灰度化方法[J]. 中国图象图形学报, 2024, 29(8): 2333-2349., articleTitle=t-SNE最大化的自适应彩色图像灰度化方法, refAbstract=null), Reference(id=1261377089354252339, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, doi=null, pmid=null, pmcid=null, year=2024, volume=29, issue=8, pageStart=2333, pageEnd=2349, url=null, language=null, rfNumber=[27], rfOrder=40, authorNames=Xie Bin, Xu Yan, Wang Guanchao, journalName=Journal of Image and Graphics, refType=null, unstructuredReference=Xie Bin, Xu Yan, Wang Guanchao, et al. Adaptive decolorization method based on t-SNE maximization[J]. Journal of Image and Graphics, 2024, 29(8): 2333-2349., articleTitle=Adaptive decolorization method based on t-SNE maximization, refAbstract=null)], funds=[Fund(id=1261377070316307245, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, awardId=61562078, language=CN, fundingSource=国家自然科学基金(61562078), fundOrder=null, country=null), Fund(id=1261377070815429431, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, awardId=[2024]07号, language=CN, fundingSource=贵州中医药大学博士启动基金([2024]07号), fundOrder=null, country=null)], companyList=[AuthorCompany(id=1261377029438620028, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=1, ext=[AuthorCompanyExt(id=1261377029451202941, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377029438620028, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=1 College of Information Engineering, Guizhou University of Traditional Chinese Medicine, Guiyang 550025, China), AuthorCompanyExt(id=1261377029463785854, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377029438620028, language=CN, country=null, province=null, city=null, postcode=null, 
companyName=null, departmentName=null, remark=1 贵州中医药大学信息工程学院, 贵阳 550025)]), AuthorCompany(id=1261377030470418821, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=2, ext=[AuthorCompanyExt(id=1261377030491390344, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377030470418821, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China), AuthorCompanyExt(id=1261377030520750474, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377030470418821, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=2 新疆大学信息科学与工程学院, 乌鲁木齐 830046)]), AuthorCompany(id=1261377031254753686, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, xref=3, ext=[AuthorCompanyExt(id=1261377031275725208, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377031254753686, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=3 College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China), AuthorCompanyExt(id=1261377031451885978, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, companyId=1261377031254753686, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=3 贵州财经大学大数据统计学院, 贵阳 550025)])], figs=[ArticleFig(id=1261377051760706167, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.1, caption=Distribution of node’s degree, in degree and out degree, 
figureFileSmall=IiB657JPlrsRIdp4ShZ5lg==, figureFileBig=B99DtOohG9l9AaHjmRREPw==, tableContent=null), ArticleFig(id=1261377052494709375, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图1, caption=节点的度、入度及出度的分布情况, figureFileSmall=IiB657JPlrsRIdp4ShZ5lg==, figureFileBig=B99DtOohG9l9AaHjmRREPw==, tableContent=null), ArticleFig(id=1261377054143070865, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.2, caption=Node’s distribution of clustering coefficients and PageRank, figureFileSmall=RdIGRvsM6mLB414Wl+CxhA==, figureFileBig=kEt0D+jxknW2Ue5sAvN15w==, tableContent=null), ArticleFig(id=1261377056273777308, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图2, caption=节点的聚类系数及PageRank的分布情况, figureFileSmall=RdIGRvsM6mLB414Wl+CxhA==, figureFileBig=kEt0D+jxknW2Ue5sAvN15w==, tableContent=null), ArticleFig(id=1261377056751927972, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.3, caption=Comparison of feature importance distribution (taking DeepWalk and DeepWalk* as examples), figureFileSmall=Qagfa3bmee5PtKDg8JL8GA==, figureFileBig=/rDruXehpvMjpZiYW+rDhg==, tableContent=null), ArticleFig(id=1261377057582400173, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图3, caption=隐含关系特征重要性分布情况对比(以DeepWalk和DeepWalk*为例), figureFileSmall=Qagfa3bmee5PtKDg8JL8GA==, figureFileBig=/rDruXehpvMjpZiYW+rDhg==, tableContent=null), ArticleFig(id=1261377057980859062, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.4, caption=Comparison of feature importance under different feature extraction schemes (DeepWalk and Struct2Vec), figureFileSmall=zZCDV4xoP7ioKYxsdzSwyg==, figureFileBig=xdiPhU/AT48e3Dz9gjS/Ug==, 
tableContent=null), ArticleFig(id=1261377059163652801, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图4, caption=不同特征提取方案下的特征重要性对比(DeepWalk与Struct2Vec), figureFileSmall=zZCDV4xoP7ioKYxsdzSwyg==, figureFileBig=xdiPhU/AT48e3Dz9gjS/Ug==, tableContent=null), ArticleFig(id=1261377061223056074, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.5, caption=Comparison of learning curves for each classification model (under Struct2Vec* feature extraction scheme), figureFileSmall=lYx7hJ8RPFMEel6uyNwaDg==, figureFileBig=QAdfqtjkFOv72gFlA9gHcw==, tableContent=null), ArticleFig(id=1261377062095471313, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图5, caption=各分类模型的学习曲线对比(Struct2Vec*特征提取方案下)

Training denotes the training set; Cross-validation denotes the cross-validation set

, figureFileSmall=lYx7hJ8RPFMEel6uyNwaDg==, figureFileBig=QAdfqtjkFOv72gFlA9gHcw==, tableContent=null), ArticleFig(id=1261377063026606810, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.6, caption=Comparison of the classification performance of each classification model, figureFileSmall=0Q86MpUQLEhT0PrmVkJ0ZA==, figureFileBig=B25QxOvzsi6MyfpbhklWMQ==, tableContent=null), ArticleFig(id=1261377065098592995, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图6, caption=各分类模型的分类性能对比, figureFileSmall=0Q86MpUQLEhT0PrmVkJ0ZA==, figureFileBig=B25QxOvzsi6MyfpbhklWMQ==, tableContent=null), ArticleFig(id=1261377065568355050, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Fig.7, caption=Scatter distribution of node features on two-dimensional space after dimensionality reduction, figureFileSmall=yUZ4kk7+wz9D/07QiW208g==, figureFileBig=grqEWFxUGg2cic+hga8Wow==, tableContent=null), ArticleFig(id=1261377065908093683, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=CN, label=图7, caption=节点特征降维后在二维空间上的散点分布情况

The numbers 0-16 in the legend are node (web page) category indices; there are 17 categories in total

, figureFileSmall=yUZ4kk7+wz9D/07QiW208g==, figureFileBig=grqEWFxUGg2cic+hga8Wow==, tableContent=null), ArticleFig(id=1261377066386244346, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Table 1, caption=

Hyperparameter configuration of graph embedding models

, figureFileSmall=null, figureFileBig=null, tableContent=
| Model | Model parameters | Training parameters |
| --- | --- | --- |
| DeepWalk | walk_length=10, num_walks=80, workers=4 | embed_size=128, window_size=5, workers=3, iter=3 |
| Node2Vec | walk_length=10, num_walks=80, p=0.25, q=4, workers=4, use_rejection_sampling=0 | embed_size=128, window_size=5, workers=3, iter=3 |
| LINE | embedding_size=128, order='second', negative_ratio=5 | batch_size=1024, epochs=50, initial_epoch=0, verbose=2, times=1 |
| SDNE | hidden_size=[256, 128], alpha=1×10⁻⁶, beta=5., nu1=1×10⁻⁵, nu2=1×10⁻⁴ | batch_size=3000, epochs=40, initial_epoch=0, verbose=2 |
| Struct2Vec | walk_length=10, num_walks=80, workers=4, verbose=40, stay_prob=0.3, opt1_reduce_len=True, opt2_reduce_sim_calc=True, opt3_num_layers=None, temp_path='./temp_Struct2Vec/', reuse=False | embed_size=128, window_size=5, workers=3, iter=5 |
), ArticleFig(id=1261377066906338051, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Table 2, caption=

Hyperparameter configuration of the prediction model

, figureFileSmall=null, figureFileBig=null, tableContent=
| Algorithm | Parameter configuration |
| --- | --- |
| SVM | C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=1×10⁻³, cache_size=200, max_iter=-1, decision_function_shape='ovr' |
| Logistic Regression | penalty='l2', C=1.0, fit_intercept=True, intercept_scaling=1, dual=False, tol=1×10⁻⁴, class_weight=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0 |
| RandomForest | n_estimators=100, criterion="gini", max_features="auto", min_impurity_decrease=0., bootstrap=True, verbose=0, ccp_alpha=0.0, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0 |
| GradientBoost | loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0., max_depth=3, min_impurity_decrease=0., validation_fraction=0.1, tol=1×10⁻⁴ |
| XGBoost | max_depth=3, learning_rate=0.1, n_estimators=100, verbosity=1, silent=None, objective="binary:logistic", booster='gbtree', n_jobs=1, min_child_weight=1, subsample=1, colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5 |
), ArticleFig(id=1261377067615175442, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Table 3, caption=

Performance comparison of web page classification using implicit relational features

, figureFileSmall=null, figureFileBig=null, tableContent=
| Method | Accuracy | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- |
| PageRank+SVM[22] | 0.5704 | 0.5896 | 0.6211 | 0.5799 |
| LDP+SVM[16] | 0.5821 | 0.3723 | 0.3368 | 0.3175 |
| DeepWalk+LR[17] | 0.6965 | 0.6852 | 0.6092 | 0.6301 |
| BANE+SVM[18] | 0.7193 | 0.6892 | 0.5725 | 0.6076 |
| DeepWalk*+XGBoost | 0.9771 | 0.9804 | 0.9361 | 0.9478 |
| LINE*+XGBoost | 0.9771 | 0.9822 | 0.9220 | 0.9460 |
| Node2Vec*+XGBoost | 0.9854 | 0.9153 | 0.9247 | 0.9195 |
| SDNE*+XGBoost | 0.9688 | 0.7978 | 0.8102 | 0.8036 |
| Struct2Vec*+XGBoost | 0.9875 | 0.9659 | 0.9713 | 0.9641 |
), ArticleFig(id=1261377069523583774, tenantId=1146029695717560320, journalId=1146123166801305609, articleId=1261267656154406975, language=EN, label=Table 4, caption=

Comparison of classification performance of different classification models

, figureFileSmall=null, figureFileBig=null, tableContent=
| Node feature extraction method | Classification method | Accuracy | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| DeepWalk* | LR | 0.6965 | 0.6852 | 0.6092 | 0.6301 |
| DeepWalk* | SVM | 0.7193 | 0.6892 | 0.5725 | 0.6076 |
| DeepWalk* | RandomForest | 0.7796 | 0.6945 | 0.5470 | 0.5881 |
| DeepWalk* | GradientBoost | 0.9626 | 0.8026 | 0.8039 | 0.8027 |
| DeepWalk* | XGBoost | 0.9771 | 0.9224 | 0.8876 | 0.8948 |
| LINE* | LR | 0.6341 | 0.5587 | 0.5250 | 0.5343 |
| LINE* | SVM | 0.7006 | 0.6654 | 0.5168 | 0.5491 |
| LINE* | RandomForest | 0.7422 | 0.6201 | 0.5302 | 0.5528 |
| LINE* | GradientBoost | 0.9834 | 0.8591 | 0.8702 | 0.8625 |
| LINE* | XGBoost | 0.9709 | 0.9403 | 0.9134 | 0.9209 |
| Node2Vec* | LR | 0.6632 | 0.5542 | 0.5171 | 0.5262 |
| Node2Vec* | SVM | 0.7193 | 0.6631 | 0.5407 | 0.5719 |
| Node2Vec* | RandomForest | 0.7568 | 0.7327 | 0.5980 | 0.6388 |
| Node2Vec* | GradientBoost | 0.9626 | 0.8382 | 0.8157 | 0.8241 |
| Node2Vec* | XGBoost | 0.9730 | 0.9167 | 0.8802 | 0.8824 |
| SDNE* | LR | 0.6674 | 0.6116 | 0.5483 | 0.5690 |
| SDNE* | SVM | 0.6383 | 0.6207 | 0.4553 | 0.4898 |
| SDNE* | RandomForest | 0.7548 | 0.6475 | 0.5314 | 0.5517 |
| SDNE* | GradientBoost | 0.9626 | 0.8023 | 0.7988 | 0.8002 |
| SDNE* | XGBoost | 0.9834 | 0.9011 | 0.9230 | 0.9067 |
| Struct2Vec* | LR | 0.3035 | 0.2020 | 0.2081 | 0.2010 |
| Struct2Vec* | SVM | 0.3451 | 0.2817 | 0.2043 | 0.1954 |
| Struct2Vec* | RandomForest | 0.5967 | 0.3917 | 0.3441 | 0.3455 |
| Struct2Vec* | GradientBoost | 0.9730 | 0.8942 | 0.9182 | 0.8999 |
| Struct2Vec* | XGBoost | 0.9792 | 0.9852 | 0.9391 | 0.9511 |
)], attaches=null, journal=Journal(id=1146119176004939786, delFlag=0, nameCn=科学技术与工程, nameEn=Science Technology and Engineering, nameHistory1=null, nameHistory2=null, issn=1671-1815, eissn=, cn=11-4688/T, coden=null, periodic=4, language=CN, oaType=是, ccby=null, superviseOffice=null, ownerOffice=null, pubOffice=null, editorOffice=null, officeType=null, aims=null, clcCode=null, officeProv=null, officeCity=null, officeAddr=null, officeZip=null, officeEmail=null, officePhone=null, editDirector=null, officeDirector=null, officeDirectorPhone=null, officeStaffNum=null, officeEmpNum=null, coverPicUrl=UKU/O7GSka5polgCTkbIIw==, journalPrice=null, startedYear=null, abbrevIsoEn=Sci Technol Eng, journalRemark=null, publicationField=null, createdTime=null, updatedTime=1754445529766, createdBy=null, updatedBy=13701087609, firstLetterCn=S, firstLetterEn=S, subjectCode=Natural Sciences, subjectName=自然科学, subjectCodeEn=Natural Sciences, subjectNameEn=null, picCn=UKU/O7GSka5polgCTkbIIw==, picEn=5hwlULoNwcbj3xUmVi9MAQ==, jcr=null, cjcr=null, exts=[JournalExt(id=1159791870395564357, language=CN, name=科学技术与工程, nameHistory1=null, nameHistory2=null, managedBy=, sponsoredBy=, publishedBy=, editorOffice=, officeProv=null, officeCity=null, officeAddr=, officeZip=, editDirector=null, officeDirector=null, officePhone=null, coverPicUrl=null, journalRemark=, submitArticleUrl=null, websiteUrl=http://www.stae.com.cn/jsygc/home, createdTime=1754445529793, updatedTime=1754445529793, createdBy=13701087609, updatedBy=13701087609, submissionGuidelinesUrl=http://www.stae.com.cn/jsygc/site/menus/20090429150146001, submissionAuthorUrl=http://www.stae.com.cn/jsygc/author/login, submissionEditorUrl=http://www.stae.com.cn/jsygc/editor/login, submissionReviewUrl=http://www.stae.com.cn/jsygc/reviewer/login, submissionCeEditorUrl=, submissionAeEditorUrl=, option={"copyright":""}), JournalExt(id=1159791870441701702, language=EN, name=Science Technology and Engineering, nameHistory1=null, nameHistory2=null, 
managedBy=, sponsoredBy=, publishedBy=, editorOffice=, officeProv=null, officeCity=null, officeAddr=, officeZip=, editDirector=null, officeDirector=null, officePhone=null, coverPicUrl=null, journalRemark=, submitArticleUrl=null, websiteUrl=http://www.stae.com.cn/jsygc/home, createdTime=1754445529804, updatedTime=1754445529804, createdBy=13701087609, updatedBy=13701087609, submissionGuidelinesUrl=, submissionAuthorUrl=http://www.stae.com.cn/jsygc/author/login, submissionEditorUrl=http://www.stae.com.cn/jsygc/editor/login, submissionReviewUrl=http://www.stae.com.cn/jsygc/reviewer/login, submissionCeEditorUrl=, submissionAeEditorUrl=, option={"copyright":""})], databaseList=null, tenantJournalId=1146123166801305609, websiteList=[Website(id=1148243202391400884, webName=null, webTitle=null, webDomain=null, webCopyrigh=null, webIpcNo=null, seoTitle=null, seoKeywords=null, seoDescription=null, tenantJournalId=null, journalId=1146123166801305609, journalNameCn=null, journalNameEn=null, grayFlag=null, tenantId=1146029695717560320, platformId=null, journalGroupId=null, journalGroupNameCn=null, journalGroupNameEn=null, type=1, domain=https://castjournals.cast.org.cn/joweb/kxjsygc/CN, language=CN, createTime=1751692112777, createBy=18614031015, updateTime=1753520965431, updateBy=18614031015, name=科学技术与工程-中文站点, tplId=1146099689490845704, title=科学技术与工程, delFlag=0, indexPage=/home, props=[WebsiteProps(id=1148622798802673703, tenantId=1146029695717560320, journalId=null, journalGroupId=null, siteId=1148243202391400884, code=articleTextType, value=kx, createTime=1751782615614, updateTime=1751782615614, creator=18614031015, updator=18614031015), WebsiteProps(id=1148622798781702180, tenantId=1146029695717560320, journalId=null, journalGroupId=null, siteId=1148243202391400884, code=banner, value=null, createTime=1751782615609, updateTime=1751782615609, creator=18614031015, updator=18614031015), WebsiteProps(id=1148622798769119267, tenantId=1146029695717560320, journalId=null, 
Multi-Classification Prediction of Web Pages with Missing Content Features Based on Graph Embedding and Ensemble Classification Algorithm
Tao ZHANG 1, 2, Bin LIAO 3, *, Jiong YU 2
Science Technology and Engineering | Papers·Automation and Computational Technology, 2025, 25(20): 8604-8614
Affiliations
  • 1 College of Information Engineering, Guizhou University of Traditional Chinese Medicine, Guiyang 550025, China
  • 2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • 3 College of Big Data Statistics, Guizhou University of Finance and Economics, Guiyang 550025, China
  • Tao ZHANG (born 1988), female, Han ethnicity, from Fuyang, Anhui; PhD, lecturer. Research interests: machine learning, data mining, and complex network analysis. E-mail:
  • * Corresponding author: Bin LIAO (born 1986), male, Han ethnicity, from Neijiang, Sichuan; PhD, associate professor. Research interests: machine learning, data mining, and big data computing models. E-mail:
Published: 2025-07-18 doi: 10.12404/j.issn.1671-1815.2405389

Explicit content features of webpages often cannot be obtained promptly and completely because of noise (such as advertisements), insufficient permissions, privacy protection, or malicious disguise. To classify webpages whose content features are severely missing, a method is proposed that combines graph embedding with the ensemble classifier XGBoost (extreme gradient boosting) and uses the implicit relational features in the webpage link network for multi-class classification. First, a webpage link network is constructed from webpages and the hyperlinks between them. Then, graph embedding models extract the implicit relational features of each node (webpage) in the link network. Next, statistical structural features such as the clustering coefficient and PageRank value are extracted and concatenated with the embeddings to form dense node feature vectors. Finally, ensemble learning models, including XGBoost, are trained to predict webpage classes. Experiments on a real Wikipedia link dataset show that, with no explicit content features at all, the Struct2Vec*+XGBoost combination achieves accuracy, precision, recall, and F1-score of 0.9875, 0.9659, 0.9713, and 0.9641, respectively, outperforming the comparison models. The findings demonstrate the effectiveness of implicit link-based features for webpage classification when content features are missing.

missing content features  /  graph embedding  /  webpage hyperlink network  /  webpage multi-classification
Tao ZHANG, Bin LIAO, Jiong YU. Multi-Classification Prediction of Web Pages with Missing Content Features Based on Graph Embedding and Ensemble Classification Algorithm[J]. Science Technology and Engineering, 2025, 25(20): 8604-8614. DOI: 10.12404/j.issn.1671-1815.2405389
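The rapid growth of the Internet has produced massive volumes of webpage data with considerable latent value, and managing and mining these data effectively has become an important research topic. Webpage classification, the foundation of web mining, is essential for applications such as identifying fraudulent pages and ranking page importance[1-2]. Most existing approaches rely on predefined topics: they extract features from explicit information such as page text, page structure, and hyperlink URLs, and then assign categories with machine learning models such as logistic regression (LR), support vector machine (SVM), random forest (RF), and neural networks. However, as the web environment grows more complex, noise (such as advertisements), permission restrictions, privacy protection, and malicious disguise prevent the explicit features of many pages from being obtained promptly and completely, and traditional methods then struggle to classify pages effectively.
To address missing explicit content features, a new approach is proposed: recasting webpage classification as node classification on a webpage link network. The network is formed by the hyperlinks between pages: each page is a node, node attributes represent page content, and edges represent links. When page attribute information is incomplete, pages can therefore be treated as nodes of a non-attributed graph and classified from their link relationships and the network topology.
On this basis, a webpage link network is constructed from webpages and their hyperlinks, and five mainstream graph embedding models are used to capture the implicit relational features of each node (webpage), yielding a graph embedding feature vector per node. Multi-dimensional statistical structural features, including degree, in-degree, out-degree, clustering coefficient, and page importance as measured by the PageRank value, are then extracted and concatenated with the embedding features to form combined node features, on which XGBoost is trained to obtain the best multi-class model. Experiments on a real Wikipedia network dataset explore how the implicit relational features in the link network affect webpage classification and provide a reference for subsequent research.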
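Many webpage classification methods based on explicit features have been proposed, using text content features (such as title, author, abstract, and topical content) and structural features (such as page layout and HTML tag structure). Text-based methods[3-5] originate in text classification: they extract page text through context analysis, semantic analysis, and summarization, and then apply a text classification algorithm. Because webpages are semi-structured, noise such as advertisements interferes with the extraction of useful text, so classification that relies on text features alone performs poorly. Methods based on page structure[6-8] combine text and structural features with traditional machine learning algorithms and improve classification results. The accuracy of both families, however, is limited by the extraction of explicit content features, especially for pages that are image-dominated, have hidden content, or are access-restricted. Extracting URL information to identify a page's function or topic offers another route, but traditional URL-based methods[9-11] rely mainly on surface syntactic or structural information and do not fully exploit the implicit relationships between pages, which limits their performance.
With advances in deep learning, graph embedding[12-13] has emerged as a representation learning method with strong potential in many fields. It maps the nodes of a graph (such as webpages) into a low-dimensional vector space while preserving structural information and relationships between nodes, such as similarity, link patterns, and community structure, which traditional content features rarely capture. The resulting low-dimensional vectors are convenient inputs for machine learning tasks such as node classification. Graph embedding models such as DeepWalk[14] and Node2Vec[15] first perform random walks on the network under some strategy to capture structural information and learn node representations, which are then used in downstream tasks. Cai et al.[16] proposed a baseline for non-attributed graph classification that mines node structural features with the Local Degree Profile and classifies with an SVM, but its accuracy on datasets such as COLLAB is mostly below 80%. Rozemberczki et al.[17] learned node feature vectors with DeepWalk and Role2Vec and then trained a logistic regression classifier. The binarized attributed network embedding (BANE) model[18] improved the F1 score by about 3% over models such as DeepWalk for node feature extraction in complex networks.
Although the end-to-end mode of graph neural networks[19] offers synergy between implicit feature learning and classification, its performance is limited by the completeness of node attributes. Graph embedding algorithms such as DeepWalk and Node2Vec, by contrast, place almost no requirement on the completeness of node attributes. A webpage classification method based on graph embedding and the ensemble algorithm XGBoost is therefore proposed. It differs from prior work in three respects: ① unlike traditional methods based on explicit content features, the proposed method does not depend on explicit page content but classifies pages by their implicit relational features in the link network, offering a new route for pages whose explicit content cannot be obtained; ② unlike references [16-17], it adopts a hybrid feature extraction strategy that supplements graph embedding features with statistical node features such as connectivity, clustering, importance, and centrality, describing the relationships between pages more completely and improving classification accuracy; ③ previous studies focused on generating node embeddings and used simple classifiers (such as SVM[16,18] and logistic regression[17]), with little exploration of better classification models. Here, ensemble learning models such as XGBoost are applied to node classification on the webpage link network, and experiments verify the effectiveness of pairing the features with an ensemble classifier.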
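Definition 1. The webpage link network built from webpage link data is $G=(V,E,\Psi,\Omega)$. $V$ is the set of all nodes (webpages); if the total number of pages is $n$, then $\left|V\right|=n$ and the $i$-th page is $v_i\in V$ $(1\le i\le n)$. $E$ is the set of all edges (links); if the total number of links is $m$, then $\left|E\right|=m$, and a link between two pages is $e_{ij}=\mathrm{edge}(v_i,v_j)\in E$ $(1\le i\le n,\ 1\le j\le n)$, where $\mathrm{edge}(v_i,v_j)$ is the link (edge) between pages $v_i$ and $v_j$. $\Psi$ is the set of page attribute features, and the attribute features of node $v_i$ are $\Psi(v_i)$. $\Omega$ is the set of link attribute features, and the attribute features of edge $e_{ij}$ are $\Omega(e_{ij})$. When explicit page content is hard to extract and attribute features are therefore missing, that is, when $\Psi$ and $\Omega$ are empty, the webpage link network $G$ is a non-attributed graph.
Definition 2. Node classification on a non-attributed webpage link network. In a non-attributed webpage link network $G$, some nodes (pages) carry class labels while the labels of the other nodes are unknown. The task is to train a model on the connection relationships and network topology of the labeled nodes in order to predict the class labels of the unlabeled pages. Let $Y$ be the set of page labels, where $y_i\in Y$ $(1\le i\le n)$ is the class label of node (page) $v_i$, $y_i\in\{c_1,c_2,\dots,c_k\}$, and $k$ is the total number of class labels. When $k=2$ the problem is binary webpage classification; when $k>2$ it is multi-class webpage classification.
Definition 3. Loss function. Given the webpage link network $G$ and the page labels $y_i\in Y$ $(1\le i\le n)$, the classification model aims to minimize the loss $L$, expressed as
$$L=\sum_{i=1}^{n} l\left[y_i,f(v_i)\right] \tag{1}$$
In Eq. (1), $y_i$ is the true label of node $v_i$; $f(v_i)$ is the label predicted by the webpage classification model; $l$ is the per-node loss that measures the difference between the predicted and true labels; and $n$ is the total number of pages in the webpage link network.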
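To extract the deep implicit relational features of the webpage link network, graph embedding techniques such as DeepWalk and Node2Vec are used. They capture the complex implicit relationships between pages and reduce the high-dimensional network data to low-dimensional embedding vectors while preserving the implicit relationships between nodes. The graph embedding mapping can be written as

$$f(X)\rightarrow M_{n\times s} \tag{2}$$

In Eq. (2), $X$ is the input network of the graph embedding model, namely the webpage link network $G$; $M_{n\times s}$ is the training result, an $n\times s$ matrix, where $n$ is the number of pages in the link network and $s$ (128 by default in this paper) is the dimension of the node embedding vectors.
These vectors combine each node's local features with its position and role in the whole network.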
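To capture each node's structure, position, and importance in the webpage link network, four statistical features are extracted: degree (deg), in-degree (in_degree, written indeg), out-degree (out_degree, written outdeg), and PageRank (PR). The clustering coefficient (clu) is added to measure the topology of the node's local neighborhood, and these features are combined with the node embedding vector to form the node's feature vector.
The statistical features of all nodes in the webpage link network $G$ are denoted $A_{n\times 5}$, where the features $A(v_i)$ of any node $v_i$ are

$$A(v_i)=\left[\mathrm{deg}(v_i),\ \mathrm{indeg}(v_i),\ \mathrm{outdeg}(v_i),\ \mathrm{clu}(v_i),\ \mathrm{PR}(v_i)\right] \tag{3}$$

In Eq. (3), the clustering coefficient of node $v_i$ is computed by Eq. (4), where the numerator counts the links $e_{jk}$ between pairs of neighbors $v_j$, $v_k$ of $v_i$.
$$\mathrm{clu}(v_i)=\frac{\left|\left\{e_{jk}\right\}\right|}{\mathrm{deg}(v_i)\left[\mathrm{deg}(v_i)-1\right]} \tag{4}$$
The PR (PageRank) value of node $v_i$ in the webpage link network $G$ is computed by Eq. (5).
$$\mathrm{PR}(v_i)=\alpha\sum_{v_j\in \mathrm{out}(v_i)}\frac{\mathrm{PR}(v_j)}{\left|\mathrm{outdeg}(v_j)\right|}+\frac{1-\alpha}{n} \tag{5}$$
In Eq. (5), $\mathrm{out}(v_i)$ is the set of nodes whose out-links point to $v_i$; $\left|\mathrm{outdeg}(v_j)\right|$ is the number of out-links of node $v_j$; and $\alpha$ is the damping factor, set to 0.85.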
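The five statistical features of Eq. (3) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `structural_features` is hypothetical, self-loops are ignored, and the rank of pages without out-links is spread evenly over all pages. None of these choices is specified in the text, and the authors' NetworkX-based computation may differ, in particular for the clustering coefficient of a directed graph.

```python
from collections import defaultdict

def structural_features(edges, alpha=0.85, iters=100):
    """Return {node: [deg, indeg, outdeg, clu, PR]} for a directed edge list."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        if u != v:                      # assumption: self-loops are ignored
            succ[u].add(v)
            pred[v].add(u)
    nodes = sorted(set(succ) | set(pred))
    n = len(nodes)

    # PageRank by power iteration, following the form of Eq. (5).
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Assumption: rank held by pages with no out-links is shared evenly.
        dangling = sum(pr[v] for v in nodes if not succ[v])
        pr = {
            v: alpha * (sum(pr[u] / len(succ[u]) for u in pred[v]) + dangling / n)
            + (1 - alpha) / n
            for v in nodes
        }

    feats = {}
    for v in nodes:
        indeg, outdeg = len(pred[v]), len(succ[v])
        deg = indeg + outdeg
        nbrs = succ[v] | pred[v]
        # Directed links among the neighbours of v (numerator of Eq. (4)).
        links = sum(1 for a in nbrs for b in succ[a] if b in nbrs)
        clu = links / (deg * (deg - 1)) if deg > 1 else 0.0
        feats[v] = [deg, indeg, outdeg, clu, pr[v]]
    return feats

# Toy graph: a directed triangle 0 -> 1 -> 2 -> 0 plus a leaf 2 -> 3.
toy_feats = structural_features([(0, 1), (1, 2), (2, 0), (2, 3)])
```

Each returned row has the same order as Eq. (3), so it can be appended directly to the embedding vector of the same node.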
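The node embedding feature matrix $M_{n\times s}$, the node statistical structural feature matrix $A_{n\times 5}$, and the node label set $Y$ of the webpage link network $G$ are joined on node ID to form the final training dataset $D$:

$$D=\left\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\right\} \tag{6}$$

In Eq. (6), $\left|D\right|=\left|V\right|=n$ is the number of nodes; $x_i=(x_i^1,x_i^2,\dots,x_i^{s+5})$ is the final feature vector of node $v_i$, with $s+5$ the dimension of the node features; and $y_i\in\{c_1,c_2,\dots,c_k\}$ is the class label of the page.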
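A minimal sketch of the join in Eq. (6), assuming the embeddings, statistical features, and labels are each held in a dictionary keyed by node ID. The function name `build_dataset` and the toy values are hypothetical, and the toy embedding uses s = 3 rather than 128 to keep the example short.

```python
def build_dataset(embeddings, stats, labels):
    """Join embedding rows, statistical rows and labels on node ID (Eq. (6))."""
    ids, X, y = [], [], []
    for node_id in sorted(labels):
        # Keep only nodes that have both kinds of features.
        if node_id in embeddings and node_id in stats:
            X.append(list(embeddings[node_id]) + list(stats[node_id]))
            y.append(labels[node_id])
            ids.append(node_id)
    return ids, X, y

# Toy inputs: s = 3 embedding dimensions plus the 5 statistical features.
emb = {1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6]}
sta = {1: [2, 1, 1, 0.5, 0.3], 2: [3, 1, 2, 0.0, 0.7]}
lab = {1: "civil engineering", 2: "transport"}
ids, X, y = build_dataset(emb, sta, lab)
```

Sorting by node ID keeps the rows of the feature matrix and the label vector aligned, which matters once the data are shuffled for cross-validation.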
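XGBoost, proposed by Chen et al.[20], is an ensemble learning method based on gradient-boosted decision trees and is widely recognized for its resistance to overfitting and its generalization performance. A reliable XGBoost model is trained on the implicit relational features of webpages mined in Section 3.2, with the goal of improving the classification accuracy of pages in the webpage link network.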
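For reference, the additive model and regularized objective of XGBoost, as given by Chen et al.[20], can be summarized as below. The notation is adapted for this article and is an assumption of this sketch: $\mathcal{R}$ is used for the regularizer to avoid a clash with the edge-attribute set $\Omega$ of Definition 1, $T$ is the number of trees, $J$ the number of leaves of a tree, and $w$ its vector of leaf weights.

```latex
\hat{y}_i = \sum_{t=1}^{T} f_t(x_i), \qquad
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{t=1}^{T} \mathcal{R}(f_t), \qquad
\mathcal{R}(f) = \gamma J + \tfrac{1}{2}\lambda \lVert w \rVert^{2}
```

Each tree $f_t$ is added in turn to reduce the objective given the trees already fitted, and the penalty terms $\gamma$ and $\lambda$ discourage overly large trees and leaf weights, which is the source of the resistance to overfitting mentioned above.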
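The experimental environment is as follows: Ubuntu 18 operating system, Python 3.7, TensorFlow 1.14, NetworkX 2.2, and scikit-learn 0.23.1. The hardware is a Core i7-8750H 2.20 GHz CPU with 16 GB RAM. The experimental data and code are open source at https://gitee.com/zhangtao66/wiki_networks_classification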
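The webpage link network[21] is constructed in the following steps.
Step 1. A Python crawler collects pages from Wikipedia (http://en.wikipedia.org/) in 17 categories (labels) under the top-level category of engineering, technology, and applied sciences, covering disciplines such as transport, architecture, civil engineering, and electrical engineering.
Step 2. The link relationships between the crawled pages are represented as graph edges, and the webpage link network is built with NetworkX.
Step 3. The first category label of each page is used as the label of that page.
Step 4. The resulting webpage link network has 2,405 nodes and 17,981 edges. The mean node degree is 13.74, the mean in-degree and mean out-degree are both 6.87, the mean clustering coefficient is 0.3238, the mean PageRank under default parameters is 0.0004158, and the graph density is 0.002858.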
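The summary statistics in Step 4 are linked by simple identities for a directed graph: PageRank values sum to 1, so their mean is 1/n; the mean in-degree equals the number of edges divided by n; and the density is the number of edges divided by n(n-1). The short check below works through this arithmetic. The reported mean degree and density agree with each other (both imply about 16,522 directed edges), while the edge count of 17,981 gives slightly larger values. One possible explanation, which is an assumption here and not stated in the text, is that duplicate links or self-loops are excluded when the statistics are computed.

```python
n, m_reported = 2405, 17981

mean_pagerank = 1 / n                        # PageRank values sum to 1
edges_from_mean = 6.87 * n                   # edges implied by the mean in-degree
density_from_mean = edges_from_mean / (n * (n - 1))
mean_in_degree_raw = m_reported / n          # using the reported edge count
density_raw = m_reported / (n * (n - 1))

print(round(mean_pagerank, 7))       # 0.0004158
print(round(edges_from_mean))        # 16522
print(round(density_from_mean, 6))   # 0.002858
print(round(mean_in_degree_raw, 2))  # 7.48
print(round(density_raw, 6))         # 0.00311
```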
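The degree, in-degree, and out-degree distributions of the nodes are shown in Figure 1. All three are positively skewed and heavy-tailed, with (skewness, kurtosis) values of (5.561, 58.095), (8.623, 126.606), and (2.285, 7.932), respectively. The distributions of the clustering coefficient and the PageRank value are shown in Figure 2. The clustering coefficient has skewness and kurtosis of (0.711, -0.385) and is relatively evenly distributed, whereas the PageRank value has skewness and kurtosis of (7.124, 75.294), showing clear positive skew and a heavy tail, with more than 90% of nodes having PageRank values in the interval [0, 0.003].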
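Skewness and kurtosis values such as those above can be computed from central moments, as in the sketch below. It assumes population (biased) moments and excess kurtosis, that is, kurtosis minus 3; the negative value reported for the clustering coefficient indicates that an excess definition is in use, but the exact estimator is not stated in the text. The function name and the toy degree list are hypothetical.

```python
def skewness_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A few low-degree nodes and one hub give a positive skew, as in Figure 1.
toy_degrees = [1, 1, 1, 2, 2, 3, 10]
toy_skew, toy_kurt = skewness_kurtosis(toy_degrees)
```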
Accuracy, Precision, Recall, and F1-score are used as the evaluation metrics for webpage classification performance.
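For a multi-class task, the four metrics can be computed per class and then averaged. The sketch below assumes macro averaging (an unweighted mean over classes); the text does not say which averaging scheme was used, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

acc, prec, rec, f1 = classification_metrics(
    ["a", "a", "b", "b"], ["a", "b", "b", "b"]
)
```

Because the macro F1 is the mean of per-class F1 scores rather than the harmonic mean of the averaged precision and recall, it can be lower than both of them.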
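The baseline methods used for comparison are as follows.
(1) Traditional method[22]. PageRank is used to classify webpages based on the link structure of the network.
(2) Baseline for non-attributed graph classification[16]. Nodes are classified using local degree features and an SVM.
(3) Graph embedding with logistic regression[17]. Node features are learned with graph embedding models such as DeepWalk and Role2Vec and then classified with logistic regression.
(4) BANE[18]. The BANE model uses the Weisfeiler-Lehman proximity matrix to capture node link and attribute features, and an SVM performs the classification.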
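Five graph embedding models are used to extract the implicit relational features of webpages, as follows.
(1) DeepWalk[14]. The algorithm has two steps: random walks first capture the implicit information in the network, and low-dimensional node feature vectors are then generated from the captured information.
(2) Node2Vec[15]. Its random walks consider both depth-first and breadth-first neighborhoods to capture the implicit information in the network.
(3) LINE[23]. It obtains node embedding features for downstream tasks by learning the first-order and second-order proximity between nodes.
(4) SDNE[24]. It uses a deep autoencoder to optimize the first-order and second-order similarity of nodes, preserving both the local and the global structure of the graph, and is suited to sparse networks.
(5) Struct2Vec[25]. It does not rely on neighborhood proximity but learns the implicit information in the network from the structural similarity of nodes.
The hyperparameter settings of the five graph embedding models are listed in Table 1, and the parameter configurations of the classification models are listed in Table 2.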
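The first step of DeepWalk, uniform random walks over the link network, can be sketched as follows. The function name, the toy adjacency list, and the values of `num_walks` and `walk_length` are hypothetical and are not the settings of Table 1. In the full algorithm the resulting walks are treated as sentences and passed to a Skip-gram model to learn the node vectors; that second step is not shown here.

```python
import random

def random_walks(adj, num_walks=10, walk_length=5, seed=0):
    """Generate uniform random walks starting from every node of the graph."""
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)               # visit start nodes in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj.get(walk[-1], [])
                if not nbrs:             # stop early at a node with no neighbours
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy undirected link network given as an adjacency list.
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
walks = random_walks(adj)
```

Node2Vec differs mainly in how the next node is chosen: instead of a uniform choice it biases the step toward breadth-first or depth-first exploration.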
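The experiments are run on the real webpage link dataset with 5-fold cross-validation, and the results are shown in Table 3, whose last five rows give the performance of the proposed method. Taking DeepWalk*+XGBoost as an example, the name means that the 128-dimensional node embedding features output by DeepWalk, plus the 5-dimensional statistical structural features (indicated by *), are used as the input features of the XGBoost model.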
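The 5-fold protocol can be illustrated with a plain index split. The sketch below assumes a single shuffled, non-stratified split with a fixed seed; whether the authors shuffled or stratified by class is not stated in the text, and the function name is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]    # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With the 2,405 nodes of the dataset, each fold holds 481 test nodes.
splits = list(k_fold_indices(2405))
```

Each node appears in exactly one test fold, so the metrics averaged over the five folds cover the whole dataset once.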
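Table 3 shows that all methods score above 0.5 on the four metrics of Accuracy, Precision, Recall, and F1-score, which supports the effectiveness of using implicit relational features for webpage classification. The proposed method also performs better overall. Struct2Vec*+XGBoost achieves the best Accuracy, Recall, and F1-score, while LINE*+XGBoost achieves the best Precision. Compared with the baselines, the proposed method improves markedly on all four metrics, showing that even without explicit page features, mining the implicit relational features of nodes in the webpage link network can still deliver strong classification results.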
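Training XGBoost on the feature matrices produced by the five graph embedding models gives the importance of each feature dimension, as shown in Figures 3 and 4. Analyzing these values helps show how the top-ranked features of each extraction method relate to the classification results. Figure 3 uses DeepWalk and DeepWalk* as examples to show the importance distribution of the implicit relational features. In both cases the importance is spread across the feature dimensions with a positively skewed, heavy-tailed distribution. This agrees with the feature distributions of the original webpage link network discussed in Section 3.1 (Figures 1 and 2) and indicates that graph embedding models preserve the distribution of implicit relational features when mapping the high-dimensional link network to a low-dimensional space.
Figure 4 compares the feature importance of the DeepWalk and Struct2Vec extraction methods. Ranking the features that influence classification reveals how the key features relate to the results. Figures 4(a) and 4(b) compare the top 20 features of DeepWalk and DeepWalk*, and Figures 4(c) and 4(d) compare the top 20 features of Struct2Vec and Struct2Vec*. The top 20 features of DeepWalk and Struct2Vec are completely different, reflecting their different strategies for sampling training data and leading to different feature distributions, although both map the high-dimensional implicit semantics of nodes in the graph to a low-dimensional space. In the absence of explicit content features, out-degree, clustering coefficient, degree, and PageRank strongly influence the classification results; these features describe a page's position in the link network, its importance, and its links to neighboring pages. In addition, compared with the original DeepWalk and Struct2Vec, DeepWalk* and Struct2Vec* incorporate statistical features such as node connectivity, importance, and centrality, which gives richer feature vectors and makes the contribution of features to the classification results easier to interpret.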
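This experiment examines how different classification models affect webpage classification while keeping the node feature extraction scheme fixed, so that the results are comparable. In addition to the traditional LR and SVM classifiers used in references [17-18], ensemble learning models such as XGBoost and GradientBoost are applied to node classification on the webpage link network. Figure 5 shows the learning curves of the models, and Table 4 and Figure 6 compare model performance in detail.
Figure 5 shows that the ensemble models XGBoost and GradientBoost classify significantly better than SVM, LR, and RandomForest. SVM, LR, and RandomForest still converge poorly when the sample size exceeds 2,000, with a large gap between training and cross-validation performance; for LR in particular, training performance falls as the sample size grows. XGBoost and GradientBoost already fit well with 500 to 750 samples, and their cross-validation performance stays stable as the sample size increases, with no overfitting and little dependence on sample size. Comparing the two, both reach 100% training performance, but XGBoost fits faster and has better cross-validation performance. XGBoost is therefore both more efficient and more accurate than GradientBoost for node classification on the webpage link network.
A close look at Table 4 and Figure 6 shows that, under a common node feature extraction scheme, classification performance differs markedly between models. Figure 6(a) shows that on this webpage link network dataset XGBoost performs best, followed by GradientBoost and then RandomForest, SVM, and LR. With the DeepWalk*, LINE*, Node2Vec, and SDNE feature extraction schemes, the LR metrics mostly fall in the [0.5+, 0.6+] range, while SVM and RandomForest are slightly higher, in the [0.5+, 0.7+] range. GradientBoost and XGBoost, both boosting-based ensemble models, perform far better, with most metrics in the [0.8+, 0.9+] range, a clear advantage over the traditional models. Table 4 also reveals an anomaly: as Figure 6(b) shows, LR and SVM reach about 0.6+ on the DeepWalk*, LINE*, Node2Vec, and SDNE schemes but drop sharply to 0.2+ on the Struct2Vec* scheme, whereas the Struct2Vec+XGBoost combination reaches 0.93+. This result prompted a closer look at the cause.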
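Applying t-SNE dimensionality reduction[26-27] to the node features extracted by the five graph embedding methods (DeepWalk, LINE, Node2Vec, SDNE, and Struct2Vec) gives the two-dimensional scatter plots in Figure 7. Figures 7(a) to 7(d) show that node features of the different classes (17 in total) form clear clusters, so the features extracted by DeepWalk, LINE, Node2Vec, and SDNE remain well separated by class in two dimensions even after reduction. In contrast, Figure 7(e) shows that the features extracted by Struct2Vec have no clear clustering pattern in two dimensions: features of different classes are mixed together without a clear class boundary. This explains why Logistic Regression and SVM perform much worse on Struct2Vec* features. These models are better suited to features with a clear linear or spatial pattern, and their predictions degrade sharply on features such as those of Struct2Vec*, which lack such a pattern.
DeepWalk, LINE, Node2Vec, and SDNE are similar in principle and differ mainly in how training samples are drawn; in the training stage they mostly use a Skip-gram or CBOW structure, so the features they extract are distributed similarly in two dimensions. When these features are combined with SVM or Logistic Regression, model performance therefore differs little (Table 4). On Struct2Vec features, the classification performance of SVM and Logistic Regression drops significantly, whereas XGBoost reaches 0.93+. The reason is that XGBoost builds multiple weak base models iteratively and optimizes learning through an additive model and a forward stagewise algorithm, which lets it adapt to features that are nonlinear or lack a spatial pattern, so it performs well on Struct2Vec features.
The experiments show that ensemble models, XGBoost in particular, deliver excellent classification performance for node classification on a real webpage link network. They also suggest that, for a given set of node relationship data, systematically trying different combinations of feature extraction and classification models is key to improving node classification performance.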
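(1) Multi-class webpage classification in the absence of explicit content features was studied. Graph embedding models extract the implicit relational features of nodes (webpages) in the webpage link network, and these are concatenated with statistical structural features such as the clustering coefficient and PageRank value to form dense node feature vectors, which preserves the implicit relational features while converting high-dimensional link data into low-dimensional feature vectors. On top of the combined node features, ensemble learning models such as XGBoost classify the nodes; the fitting behavior and generalization ability of the models were analyzed, confirming the need to pair the features with a suitable classifier.
(2) Experiments on a real webpage link network dataset show that, without explicit content features, the implicit relational features of nodes in the link network yield good classification results, outperforming existing methods on accuracy, precision, recall, and F1-score. The model also converges quickly and adapts well to scenarios with small sample sizes. The study explores how implicit relational features in the link network affect webpage classification and offers a new perspective for cases where explicit content features cannot be obtained. As privacy protection receives growing attention, the proposed method can classify webpages without relying on personal private information and can identify and filter webpages more accurately, which is significant for advancing data mining, network security, and information retrieval in China.
(3) Because of the way the data source is obtained and the offline training mode, iterating and updating the model is computationally costly, so after deployment more attention must be paid to the effect of feature drift and label drift on model performance. Future work will test the model on different types of webpage link datasets to verify its generalization.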
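  • National Natural Science Foundation of China (61562078)
  • Doctoral Start-up Fund of Guizhou University of Traditional Chinese Medicine ([2024] No. 07)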
References
[1] Wang Fayu, Yu Xiaowen, Chen Hongtao. Malicious web page recognition based on undersampling and multi-layer ensemble learning[J]. Computer Engineering and Design, 2024, 45(3): 669-675.
[2] Zhang Mingjie, Xiao Qirong, Zhu Yehang. Prediction method of microblog information dissemination based on XGBoost model and multi-feature fusion[J]. Science Technology and Engineering, 2023, 23(10): 4279-4285.
[3] Weng Binyue, Qin Yongbin, Huang Ruizhang, et al. NEMTF: method of news Web content extraction based on multi-dimensional text features[J]. Application Research of Computers, 2022, 39(4): 1043-1048.
[4] Zhou Wenwen, Han Bin, Huang Shucheng. Research on web page classification algorithm combining text semantic graph and word frequency statistics[J]. Computer and Digital Engineering, 2020, 48(6): 1265-1268, 1313.
[5] Geng Yipeng, Ju Shiguang, Cai Wenpeng, et al. Research on topic classification and topic change of web pages based on Skip-PTM[J]. Journal of Chinese Computer Systems, 2020, 41(7): 1395-1399.
[6] Feng Jian, Zhang Ying. A detection method for phishing webpage based on DOM structure clustering[J]. Science Technology and Engineering, 2018, 18(23): 81-89.
[7] Deng L, Du X, Shen J Z. Web page classification based on heterogeneous features and a combination of multiple classifiers[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 7: 995-1004.
[8] Huai Xiaoyong, Han Xiaodong, Gao Ruochen, et al. An adaptive web page structured information extraction method[J]. Application of Electronic Technique, 2020, 46(12): 97-102.
[9] Hong Liangyi, Zhu Songlin, Wang Yijun, et al. Darknet web page classification based on convolutional neural network[J]. Computer Applications and Software, 2023, 40(2): 320-325, 330.
[10] Zhang Ziyan, Han Bin, Jiang Yuanhao, et al. Integrated classification and detection method of web page hidden hyperlink based on differential evolution[J]. Computer Simulation, 2024, 41(4): 391-396.
[11] Yang Shengjie, Chen Zhaoyang, Xu Yi, et al. Research on malicious web page identification method based on deep learning and feature fusion[J]. Journal of Cyber Security, 2024, 9(3): 176-190.
[12] Giamphy E, Guillaume J L, Doucet A, et al. A survey on bipartite graphs embedding[J]. Social Network Analysis and Mining, 2023, 13(1): 54.
[13] Li Qing, Wang Yichen, Du Chenglie. Survey on graph representation learning methods[J]. Application Research of Computers, 2023, 40(6): 1601-1613.
[14] Perozzi B, AL-Rfou R, Skiena S. DeepWalk: online learning of social representations[C]// Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2014: 701-710.
[15] Grover A, Leskovec J. Node2Vec: scalable feature learning for networks[C]// Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 855-864.
[16] Cai C, Wang Y. A simple yet effective baseline for non-attributed graph classification[C]// Proceedings of the International Conference on Learning Representation. New York: ACM, 2018: 701-710.
[17] Rozemberczki B, Kiss O, Sarkar R. Karate club: an API oriented open-source Python framework for unsupervised learning on graphs[C]// Proceedings of the 29th ACM International Conference on Information & Knowledge Management. New York: ACM, 2020: 3125-3132.
[18] Yang H, Pan S, Zhang P, et al. Binarized attributed network embedding[C]// Proceedings of the International Conference on Data Mining. Piscataway, NJ: IEEE, 2018: 1476-1481.
[19] Zhang Ziwei, Wang Xin, Zhu Wenwu. Graph neural architecture search: a survey[J]. Chinese Journal of Computers, 2023, 46(7): 1532-1552.
[20] Chen T, Guestrin C. XGBoost: a scalable tree boosting system[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 785-794.
[21] Sun Chenxing, Liu Wei, Lu Bin, et al. Multi-view webpage classification dataset construction and evaluation[J]. Journal of Nanjing University (Natural Science), 2024, 60(3): 406-415.
[22] Pedroche F. A model to classify users of social networks based on PageRank[J]. International Journal of Bifurcation and Chaos, 2012, 22(7): 93-106.
[23] Tang J, Qu M, Wang M, et al. LINE: large-scale information network embedding[C]// Proceedings of the 24th International Conference on World Wide Web. New York: ACM, 2015: 1067-1077.
[24] Wang D, Cui P, Zhu W. Structural deep network embedding[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2016: 1225-1234.
[25] Ribeiro L F R, Saverese P H P, Figueiredo D R. Struct2Vec: learning node representations from structural identity[C]// Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2017: 385-394.
[26] Ruitmark V D, Billeter M, Eisemann E. An efficient dual-hierarchy t-SNE minimization[J]. IEEE Transactions on Visualization and Computer Graphics, 2022, 28(1): 614-622.
[27] Xie Bin, Xu Yan, Wang Guanchao, et al. Adaptive decolorization method based on t-SNE maximization[J]. Journal of Image and Graphics, 2024, 29(8): 2333-2349.
Publication history
  • Received: 2024-07-17
  • Revised: 2025-04-12
  • First online: 2026-05-13
  • Published: 2025-07-18