Weibo topic detection based on improved TF-IDF algorithm
Science & Technology Review | Reviews 2016, 34(2): 282-286
CHEN Shuoying1, JIN Zhensheng2
Affiliations
    1. Department of Network Information Center, Beijing Institute of Technology, Beijing 100081, China;
    2. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
Published: 2016-01-28 doi: 10.3981/j.issn.1000-7857.2016.2.048
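Chinese Weibo posts are updated rapidly and are highly time-sensitive, so the hot topics they produce tend to be bursty, and the representative feature words in the text surge along with them. Exploiting this property, the paper proposes an improved feature-weighting algorithm built on traditional TF-IDF (term frequency-inverse document frequency), called TF-IDF-KE (term frequency-inverse document frequency-kinetic energy), to address the problem that the features of bursty hot topics are not distinct during clustering. Drawing on the kinetic energy principle for physical objects, the algorithm describes the burst value of a feature term using the concept of kinetic energy, adds it to the weight computation to raise the weight of bursty feature terms, and finally applies the CURE (clustering using representatives) algorithm to detect Weibo topics. The method captures the dynamic properties of texts and feature terms, and the experimental results show that it effectively improves topic detection.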
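The weighting idea can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual formula, so everything specific here is an assumption made for illustration: the burst score uses E = 1/2 · m · v², with the "mass" m taken as a term's count in the current time window and the "velocity" v as its growth since the previous window, and the score is folded into TF-IDF as a multiplicative boost controlled by a hypothetical parameter `lam`. The function names are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Plain TF-IDF weights for tokenized documents, one dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def kinetic_energy(prev_count, curr_count):
    """Assumed burst score E = 1/2 * m * v^2.

    m is the term's count in the current window; v is its growth since the
    previous window (clamped to 0 when the count fell or stayed flat).
    """
    velocity = max(curr_count - prev_count, 0)
    return 0.5 * curr_count * velocity ** 2


def tf_idf_ke(docs, prev_counts, curr_counts, lam=1.0):
    """Assumed combination: weight = tfidf * (1 + lam * E / max_E)."""
    energy = {
        term: kinetic_energy(prev_counts.get(term, 0), count)
        for term, count in curr_counts.items()
    }
    # Normalize by the largest energy; fall back to 1.0 if nothing is bursting.
    max_energy = max(energy.values(), default=0.0) or 1.0
    return [
        {
            term: weight * (1.0 + lam * energy.get(term, 0.0) / max_energy)
            for term, weight in doc_weights.items()
        }
        for doc_weights in tf_idf(docs)
    ]
```

With this choice, a term whose count is flat keeps its plain TF-IDF weight, while the fastest-growing term has its weight multiplied by `1 + lam`. An additive variant would be equally consistent with the abstract's wording.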
Topic detection and tracking (TDT) is a natural language processing task that addresses the problem of information explosion, and TDT on Weibo has become a central issue in recent years. Topic detection on long texts is widely used in industry and gives good results, but performance on Weibo is usually poor: the posts are short and their meaning is often unclear, so clustering algorithms work less well. This paper therefore looks for a way to improve clustering for Weibo. Weibo content is updated quickly and is highly time-sensitive. Hot topics on Weibo are bursty, and the frequency of their representative words rises sharply. Raising the weight of these representative words is therefore a good way to make the features of short texts more prominent. The burstiness of a word is treated by analogy with the kinetic energy of a moving object, and the kinetic energy formula is used to quantify it. On this basis an improved feature extraction algorithm, TF-IDF-KE (term frequency-inverse document frequency-kinetic energy), is proposed. It combines a kinetic energy term with TF-IDF (term frequency-inverse document frequency): the kinetic energy formula evaluates the burstiness of each word, and the resulting value is added to the weight formula, so that important words receive higher weights during feature extraction. Finally, the CURE (clustering using representatives) algorithm carries out the Weibo topic detection task. The method describes the burstiness of texts and features and, to a certain extent, solves the problem that the features of bursty hot topics are not distinct during clustering.
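The clustering step can be sketched as follows. This is a naive illustration of the core CURE idea (a few well-scattered representative points per cluster, shrunk toward the centroid, with cluster distance measured between representatives), not the paper's implementation. It omits the sampling, partitioning, and heap/k-d tree machinery of the full algorithm, and the parameter names `c` (representatives per cluster) and `alpha` (shrink factor) are chosen here for illustration. Points are plain tuples, standing in for whatever feature vectors the weighting step produces.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def representatives(points, c=3, alpha=0.5):
    """Pick up to c well-scattered points, then shrink them toward the centroid."""
    center = centroid(points)
    distinct = list(dict.fromkeys(points))  # drop duplicates, keep order
    chosen = []
    for _ in range(min(c, len(distinct))):
        if not chosen:
            # First representative: the point farthest from the centroid.
            nxt = max(distinct, key=lambda p: math.dist(p, center))
        else:
            # Later ones: the point farthest from all representatives so far.
            nxt = max(
                (p for p in distinct if p not in chosen),
                key=lambda p: min(math.dist(p, r) for r in chosen),
            )
        chosen.append(nxt)
    # Shrinking by alpha dampens the influence of outliers.
    return [
        tuple(r_i + alpha * (m_i - r_i) for r_i, m_i in zip(rep, center))
        for rep in chosen
    ]


def cluster_distance(reps_a, reps_b):
    """Distance between clusters = closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


def cure(points, k, c=3, alpha=0.5):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > k:
        reps = [representatives(cluster, c, alpha) for cluster in clusters]
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: cluster_distance(reps[ij[0]], reps[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recomputing every representative set on each merge keeps the sketch short at the cost of roughly cubic running time, which is only reasonable for small examples.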
The experimental results show that the method improves topic detection to some degree, achieving a better accuracy rate (precision) P as well as better recall R and F values. TF-IDF-KE is therefore an effective optimization that is well suited to the TDT task.
Weibo  /  TF-IDF  /  topic detection  /  TDT  /  text clustering
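For reference, the P, R, and F values mentioned in the abstract are commonly computed as below. The sketch assumes the usual set-based definitions for one detected topic cluster compared against one reference topic. The abstract does not describe the paper's exact evaluation protocol, so the function name and the pairing of clusters with topics are assumptions.

```python
def precision_recall_f(detected, relevant):
    """Precision, recall, and F1 for one detected cluster vs. one reference topic.

    detected: IDs of posts the system assigned to the topic.
    relevant: IDs of posts that truly belong to the topic.
    """
    detected, relevant = set(detected), set(relevant)
    hits = len(detected & relevant)
    p = hits / len(detected) if detected else 0.0
    r = hits / len(relevant) if relevant else 0.0
    # F is the harmonic mean of P and R; defined as 0 when both are 0.
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, if a cluster contains four posts of which three belong to a five-post reference topic, precision is 3/4 and recall is 3/5.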
CHEN Shuoying, JIN Zhensheng. Weibo topic detection based on improved TF-IDF algorithm[J]. Science & Technology Review, 2016, 34(2): 282-286. DOI: 10.3981/j.issn.1000-7857.2016.2.048
Article information
doi: 10.3981/j.issn.1000-7857.2016.2.048
  • Received: 2015-04-23
  • Revised: 2015-06-08
  • First published online: 2016-02-04
  • Published: 2016-01-28
Link: https://castjournals.cast.org.cn/joweb/kjdb/CN/10.3981/j.issn.1000-7857.2016.2.048