Science & Technology Review (科技导报) | Papers | 2024, 42(23): 135-144
基于CRF模型的《里耶秦简》自动断句与分词研究
冯慧敏, 郭帅帅, 刘铭
Automatic sentence segmentation and word segmentation for Liye Qin Bamboo manuscripts based on CRF model
FENG Huimin, GUO Shuaishuai, LIU Ming
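Affiliations: 1. Department of Basic Courses, Shandong Agricultural Engineering University, Jinan 250100, China; 2. Institute for Advanced Study in History of Science, Northwest University, Xi'an 710127, China (FENG Huimin: 1, 2; GUO Shuaishuai: 2; LIU Ming: 2)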
Published: 2024-12-13
doi: 10.3981/j.issn.1000-7857.2023.05.00812
Research on ancient Chinese information processing seldom uses unearthed documents as its corpus. The Liye Qin bamboo manuscripts are ten times as numerous as all previously unearthed Qin slips and fill many gaps in the historical record of the Qin Dynasty. In this paper, we use them as the experimental corpus and explore automatic sentence segmentation and word segmentation of unearthed documents based on the conditional random field (CRF) model. Drawing on the actual characteristics of the corpus, we set up different feature templates to verify the generalization ability of the model's sequence labeling across tasks. As a comparative experiment, we set up a joint approach to sentence segmentation and word segmentation in order to select the better-performing processing scheme. We also designed comparative experiments with deep learning methods and pretrained models. The results show that the joint CRF labeling approach improved overall performance on each task, with F1-scores of 75.79% for automatic sentence segmentation and 94.44% for automatic word segmentation. Because it is also faster, this approach is better suited to the Liye Qin bamboo slips. The results can support the proofreading of the last three volumes of the Liye Qin bamboo slips and the in-depth processing and construction of the corpus.
Keywords: CRF model / Liye Qin bamboo manuscripts / automatic sentence segmentation / automatic word segmentation
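The abstract describes a joint sequence-labeling scheme scored by F1, but this page does not give the tag set or the feature templates. The Python sketch below is therefore only an illustration under stated assumptions: the `B/M/E/S` word-position labels, the `Y/N` sentence-break flag, the function names, and the toy characters are all hypothetical and are not taken from the paper. It shows how one tag per character can carry both decisions, and how F1 can be computed separately for word spans and sentence breaks. Tag sequences of this kind would normally be passed to a CRF toolkit for training; that step is omitted here.

```python
from typing import List, Tuple


def joint_tags(sentences: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten segmented sentences into characters and joint tags.

    Assumed tag format: '<word position>-<sentence flag>', where the word
    position is B/M/E/S (begin, middle, end, single-character word) and the
    flag is 'Y' if a sentence break follows the character, else 'N'.
    """
    chars, tags = [], []
    for words in sentences:
        for w_idx, word in enumerate(words):
            for c_idx, ch in enumerate(word):
                if len(word) == 1:
                    pos = "S"
                elif c_idx == 0:
                    pos = "B"
                elif c_idx == len(word) - 1:
                    pos = "E"
                else:
                    pos = "M"
                # The last character of the last word closes the sentence.
                closes = w_idx == len(words) - 1 and c_idx == len(word) - 1
                chars.append(ch)
                tags.append(f"{pos}-{'Y' if closes else 'N'}")
    return chars, tags


def word_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Recover (start, end) word spans from the word-position part."""
    spans, start = [], 0
    for i, tag in enumerate(tags):
        pos = tag.split("-")[0]
        if pos in ("B", "S"):
            start = i
        if pos in ("E", "S"):
            spans.append((start, i + 1))
    return spans


def break_positions(tags: List[str]) -> List[int]:
    """Indices of characters that are followed by a sentence break."""
    return [i for i, tag in enumerate(tags) if tag.endswith("-Y")]


def f1_score(gold, pred) -> float:
    """F1 over two collections of items (word spans or break indices)."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy input: placeholder characters, not real slip text.
gold_chars, gold = joint_tags([["甲乙", "丙"], ["丁", "戊己庚"]])
# A made-up prediction that gets two of the four words wrong.
pred = ["B-N", "E-N", "S-Y", "B-N", "E-N", "B-N", "E-Y"]

print(list(zip(gold_chars, gold)))
print("word F1:", f1_score(word_spans(gold), word_spans(pred)))
print("sentence F1:", f1_score(break_positions(gold), break_positions(pred)))
```

Keeping both labels in one tag lets a single model decode word boundaries and sentence breaks in one pass, which is one plausible reading of the "joint approach" in the abstract. Separate models would instead use two tag sets over the same characters.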
FENG Huimin, GUO Shuaishuai, LIU Ming. Automatic sentence segmentation and word segmentation for Liye Qin Bamboo manuscripts based on CRF model[J]. Science & Technology Review, 2024, 42(23): 135-144. DOI: 10.3981/j.issn.1000-7857.2023.05.00812
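Article information
Received: 2023-05-29
Revised: 2023-10-30
Published: 2024-12-13
First online: 2025-01-06
Funding: Shaanxi Provincial Key Research and Development Program project (2019ZDLGY17-03); Northwest University Graduate Innovation Project (CX2023045); Shandong Agricultural Engineering University Research Start-up Fund (2024GCCZR-17)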
https://castjournals.cast.org.cn/joweb/kjdb/CN/10.3981/j.issn.1000-7857.2023.05.00812