Weibo topic detection based on improved TF-IDF algorithm
CHEN Shuoying (1), JIN Zhensheng (2)
1. Department of Network Information Center, Beijing Institute of Technology, Beijing 100081, China
2. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
Author note: CHEN Shuoying, associate professor; research interests: computer networks, data mining, Internet of Things; e-mail: chensy@bit.edu.cn
Science & Technology Review (科技导报), 2016, 34(2): 282-286. Column: Reviews. ISSN 1000-7857, CN 11-1421/N
DOI: 10.3981/j.issn.1000-7857.2016.2.048
Received 2015-04-23; revised 2015-06-08; published 2016-01-28; online 2016-02-04
Article page: https://castjournals.cast.org.cn/joweb/kjdb/EN/10.3981/j.issn.1000-7857.2016.2.048
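Chinese Weibo posts are updated quickly and are highly time-sensitive. The hot topics they produce are bursty to some degree, and the representative feature words in the text surge at the same time. Exploiting this property, an improved feature weighting algorithm is proposed on the basis of the traditional TF-IDF (term frequency-inverse document frequency). It is called TF-IDF-KE (term frequency-inverse document frequency-kinetic energy) and addresses the problem that the features of bursty hot topics are not distinct during clustering. Drawing on the principle of the kinetic energy of an object, the algorithm describes the burst value of a feature term with the concept of kinetic energy and adds it to the weight calculation, raising the weight of bursty feature terms. Finally, the CURE (clustering using representatives) algorithm is used to perform topic detection on Weibo. The method describes the dynamic properties of texts and feature terms, and experimental results show that it can effectively improve topic detection.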
Keywords
Weibo / TF-IDF / topic detection / TDT / text clustering
Abstract
Topic detection and tracking (TDT) is a natural language processing task that addresses the problem of information explosion. TDT for Weibo has become a central issue in recent years. Topic detection on long texts is widely used in industry with good results, but performance on Weibo is usually poor: the posts are short and their meaning is often unclear, so clustering algorithms do not perform well in topic detection. This paper therefore looks for a new way to improve clustering on Weibo. Weibo is updated quickly and is highly time-sensitive. Hot topics on Weibo are bursty, and the frequency of their representative words rises sharply. Raising the weight of these representative words is thus a good way to make the features of short texts more prominent. The burstiness of a word is treated by analogy with the kinetic energy of a moving object, and the kinetic energy formula is used in this paper. On this basis, an improved feature extraction algorithm named TF-IDF-KE (term frequency-inverse document frequency-kinetic energy) is proposed, which combines kinetic energy with TF-IDF (term frequency-inverse document frequency). The kinetic energy formula is used to evaluate the burstiness of each word, and the resulting value is added to the weight formula, so that the weights of important words are raised during feature extraction. Finally, the CURE (clustering using representatives) algorithm is applied to complete the Weibo topic detection task. The method describes the burstiness of texts and features and, to a certain extent, solves the problem that the features of bursty hot topics are not distinct during clustering.
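The weighting idea described above can be illustrated with a minimal Python sketch. The abstract does not state the exact formula, so everything specific here is an assumption made for illustration only: the "mass" of a term is taken as its relative frequency in the current time window, the "velocity" as the increase in that frequency over the previous window, the burst score as E = 0.5 * m * v^2, and the final weight as TF-IDF scaled by (1 + alpha * normalized E). The function names and the parameter `alpha` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def kinetic_energy(prev_window, curr_window):
    """Assumed burst score per term: E = 0.5 * m * v**2.

    m = relative frequency of the term in the current window (assumption).
    v = rise in relative frequency since the previous window, clamped
        at zero so that fading terms get no boost (assumption).
    """
    prev = Counter(t for doc in prev_window for t in doc)
    curr = Counter(t for doc in curr_window for t in doc)
    prev_total = sum(prev.values()) or 1
    curr_total = sum(curr.values()) or 1
    energy = {}
    for term, count in curr.items():
        mass = count / curr_total
        velocity = max(0.0, mass - prev[term] / prev_total)
        energy[term] = 0.5 * mass * velocity ** 2
    return energy


def tf_idf_ke(prev_window, curr_window, alpha=1.0):
    """Scale TF-IDF weights by (1 + alpha * E / max E); hypothetical form."""
    base = tf_idf(curr_window)
    energy = kinetic_energy(prev_window, curr_window)
    peak = max(energy.values(), default=0.0)
    boosted = []
    for doc_weights in base:
        boosted.append({
            t: w * (1.0 + alpha * (energy[t] / peak if peak > 0 else 0.0))
            for t, w in doc_weights.items()
        })
    return boosted


# Toy tokenized posts from two consecutive time windows (made-up data).
previous_posts = [["weather", "sunny"], ["lunch", "noodles"]]
current_posts = [
    ["earthquake", "weather"],
    ["earthquake", "rescue"],
    ["lunch", "weather"],
]
for weights in tf_idf_ke(previous_posts, current_posts):
    print(weights)
```

In this toy data, "earthquake" is absent from the earlier window and frequent in the current one, so it receives the largest boost, while "lunch" becomes relatively less frequent and keeps its plain TF-IDF weight. The boosted vectors would then be passed to a clustering step such as CURE, which is not shown here.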
The experimental results show that the method improves topic detection to some degree, achieving better precision P, recall R, and F values. TF-IDF-KE is therefore an effective optimization method that is well suited to the TDT task.
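For readers unfamiliar with the P, R, and F measures mentioned above, the following Python sketch shows the standard definitions for a single topic. How the paper matches detected clusters to reference topics is not stated in the abstract, so the function below simply assumes that a set of detected post ids and a set of reference post ids for one topic are already available; the function name and the sample ids are hypothetical.

```python
def precision_recall_f(detected, reference):
    """Return (P, R, F) for one topic.

    detected:  ids of posts the system assigned to the topic.
    reference: ids of posts that truly belong to the topic.
    """
    detected, reference = set(detected), set(reference)
    hits = len(detected & reference)
    p = hits / len(detected) if detected else 0.0
    r = hits / len(reference) if reference else 0.0
    # F is the harmonic mean of P and R.
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Made-up ids: 3 of the 4 detected posts are correct, out of 5 relevant posts.
print(precision_recall_f({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
```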