Article(id=1251458155910217837, tenantId=1146029695717560320, journalId=1251194880429441115, issueId=1251458153020342360, articleNumber=null, orderNo=null, doi=10.3979/j.issn.1673-825X.202406270158, pmid=null, cstr=null, oa=null, hot=null, price=null, onlineType=0, articleFormat=0, articleType=null, articleTypeStr=null, receivedDate=1719417600000, receivedDateStr=2024-06-27, revisedDate=1757952000000, revisedDateStr=2025-09-16, acceptedDate=null, acceptedDateStr=null, onlineDate=1776300475336, onlineDateStr=2026-04-16, pubDate=null, pubDateStr=null, doiRegisterDate=null, doiRegisterDateStr=null, onlineIssueDate=1776300475336, onlineIssueDateStr=2026-04-16, onlineJustAcceptDate=null, onlineJustAcceptDateStr=null, onlineFirstDate=null, onlineFirstDateStr=null, sourceXml=null, magXml=null, createTime=1776300475336, creator=13041195026, updateTime=1776300475336, updator=13041195026, issue=Issue{id=1251458153020342360, tenantId=1146029695717560320, journalId=1251194880429441115, year='2025', volume='37', issue='5', pageStart='627', pageEnd='780', issueExtLink='null', onlineDate='null', pubDate='null', beforeIssueId=null, nextIssueId=null, price=null, status=1, issueComplete=1, articleOrder=1, issueType=1, specialIssue=null, createTime=1776300474648, creator=13041195026, updateTime=1776311939434, updator=13041195026, preIssue=null, nextIssue=null, ext={EN=IssueExt(id=1251506239914586238, tenantId=1146029695717560320, journalId=1251194880429441115, issueId=1251458153020342360, language=EN, specialIssueTitle=, coverIllustrator=null, specialIssueEditor=, specialIssueAbout=), CN=IssueExt(id=1251506239914586239, tenantId=1146029695717560320, journalId=1251194880429441115, issueId=1251458153020342360, language=CN, specialIssueTitle=, coverIllustrator=null, specialIssueEditor=, specialIssueAbout=)}, issueFiles=null}, startPage=741, endPage=747, ext={EN=ArticleExt(id=1251458157621493902, articleId=1251458155910217837, tenantId=1146029695717560320, journalId=1251194880429441115, 
language=EN, title=Distributed computing LSTM accelerator based on pulsating array architecture, columnId=1251458154354131041, journalTitle=Journal of Chongqing University of Posts and Telecommunications(Natural Science Edition), columnName=Artificial Intelligence and Big Data, runingTitle=null, highlight=null, articleAbstract=

To address the low computational efficiency and high power consumption of deploying long short-term memory (LSTM) neural networks on resource-limited edge computing devices, a distributed-computing LSTM accelerator based on a systolic array architecture is proposed. The design stores input data in a distributed manner to reduce data movement and power consumption, and passes data between computing units in a systolic manner to lower their idle rate and improve computational efficiency. Validation on a VU13P field-programmable gate array (FPGA) shows that, at an operating frequency of 200 MHz, the accelerator achieves an effective computing power of 179.2 GOPS with a dynamic power consumption of 0.343 W, giving an energy efficiency of 522.4 GOPS/W. Compared with typical existing designs, energy efficiency improves by more than 34%.
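
The systolic data flow described in the abstract can be illustrated with a small software model. The sketch below is not the accelerator's actual design: it assumes a hypothetical one-dimensional chain of computing units, each holding one row of a weight matrix locally, with input values entering at the first unit and moving one unit to the right per cycle. The function name `systolic_matvec` and the chain layout are assumptions made for illustration only.

```python
def systolic_matvec(weights, x):
    """Simulate y = weights @ x on a hypothetical 1-D systolic chain.

    Unit i stores row i of `weights` locally (weights never move);
    input values are shifted from unit to unit, one hop per cycle.
    Returns the result vector and the number of cycles used.
    """
    n_units, n_in = len(weights), len(x)
    acc = [0] * n_units            # one accumulator per computing unit
    regs = [None] * n_units        # input register of each unit: (index, value)
    cycles = n_in + n_units - 1    # last input reaches the last unit at the end

    for t in range(cycles):
        # Shift phase: every unit takes the value its left neighbour held.
        for i in range(n_units - 1, 0, -1):
            regs[i] = regs[i - 1]
        # The first unit receives the next input, or a bubble once x is exhausted.
        regs[0] = (t, x[t]) if t < n_in else None

        # Compute phase: each unit multiplies the passing input by its local weight.
        for i in range(n_units):
            if regs[i] is not None:
                j, value = regs[i]
                acc[i] += weights[i][j] * value

    return acc, cycles


if __name__ == "__main__":
    y, cycles = systolic_matvec([[1, 2, 3], [4, 5, 6]], [1, 0, -1])
    print(y, cycles)
```

In this model each input value is read from memory once and then reused by every unit as it passes along the chain, which is the general reason systolic designs reduce data movement; the units are idle only while the pipeline fills and drains.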

, correspAuthors=null, authorNote=null, correspAuthorsNote=null, copyrightStatement=null, copyrightOwner=null, extLink=null, articleAbsUrl=null, sourceXml=null, magXml=null, pdfUrl=null, pdf=null, pdfFileSize=null, pdfExtLink=null, richHtmlUrl=null, mobilePdfUrl=null, reviewReport=null, pdfFirstPage=null, abstractGraph=null, abstractGraphContent=null, abstractVideo=null, citation=null, cebUrl=null, magXmlContent=null, mapNumber=null, authorCompany=null, fund=null, authors=null, authorsList=Hongsheng ZHANG, Zhuoli CHENG), CN=ArticleExt(id=1251458163355107665, articleId=1251458155910217837, tenantId=1146029695717560320, journalId=1251194880429441115, language=CN, title=基于脉动阵列架构的分布式计算LSTM加速器, columnId=1251458154492543075, journalTitle=重庆邮电大学学报(自然科学版), columnName=人工智能与大数据, runingTitle=null, highlight=null, articleAbstract=

针对在资源有限的边缘计算端部署长短时记忆(long short-term memory,LSTM)神经网络遇到的计算效率低、功耗高的问题,提出一种基于脉动阵列架构的分布式计算LSTM加速器设计方案。通过将输入数据分布式存储,从而以减少数据的流动性并降低功耗;通过脉动的方式传递数据,从而减少计算单元的空置率并提高计算效率。在VU13P系列现场可编程门阵列(field programmable gate array,FPGA)的验证结果表明,所设计的LSTM加速器在200 MHz的工作频率下有效算力179.2 GOPS,动态功耗0.343 W,能效比522.4 GOPS/W,相较于当前典型设计,能效比提升34%以上。

, correspAuthors=null, authorNote=null, correspAuthorsNote=
张红升
, copyrightStatement=null, copyrightOwner=null, extLink=null, articleAbsUrl=null, sourceXml=9lyiWxrtBsCNYBohY7MkBQ==, magXml=X8pU+/V/TBXlYUO8oybWlA==, pdfUrl=null, pdf=P38Ash4xzzCCDInqYIn7pQ==, pdfFileSize=6892363, pdfExtLink=null, richHtmlUrl=null, mobilePdfUrl=null, reviewReport=null, pdfFirstPage=null, abstractGraph=+qb+XKSZ1VA8WFk1ooZWpg==, abstractGraphContent=null, abstractVideo=null, citation=null, cebUrl=null, magXmlContent=oUGB+rojR7EsiOv2P55CJg==, mapNumber=null, authorCompany=null, fund=null, authors=

张红升,教授,博士生导师,博士,主要研究方向为超大规模集成电路与Soc设计、神经网络加速器设计等。E-mail:

成卓立,硕士研究生,主要研究方向为神经网络加速器设计,可重构设计。E-mail:

, authorsList=张红升, 成卓立)}, authors=[Author(id=1251458163699040618, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, orderNo=0, firstName=null, middleName=null, lastName=null, nameCn=null, orcid=null, stid=null, country=null, authorPic=null, dead=0, email=zhanghs@cqupt.edu.cn, emailSecond=null, emailThird=null, correspondingAuthor=0, authorType=1, ext={EN=AuthorExt(id=1251458165322236273, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, authorId=1251458163699040618, language=EN, stringName=Hongsheng ZHANG, firstName=Hongsheng, middleName=null, lastName=ZHANG, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=null, address=School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P R China, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null), CN=AuthorExt(id=1251458165414510968, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, authorId=1251458163699040618, language=CN, stringName=张红升, firstName=null, middleName=null, lastName=null, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=null, address=重庆邮电大学 光电工程学院,重庆 400065, bio={"content":"

张红升,教授,博士生导师,博士,主要研究方向为超大规模集成电路与Soc设计、神经网络加速器设计等。E-mail:

"}, bioImg=null, bioContent=

张红升,教授,博士生导师,博士,主要研究方向为超大规模集成电路与Soc设计、神经网络加速器设计等。E-mail:

, aboutCorrespAuthor=null)}, companyList=[AuthorCompany(id=1251458163564822880, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, xref=null, ext=[AuthorCompanyExt(id=1251458163573211489, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, companyId=1251458163564822880, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P R China), AuthorCompanyExt(id=1251458163577405794, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, companyId=1251458163564822880, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=重庆邮电大学 光电工程学院,重庆 400065)])]), Author(id=1251458165498397055, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, orderNo=1, firstName=null, middleName=null, lastName=null, nameCn=null, orcid=null, stid=null, country=null, authorPic=null, dead=0, email=1390764213@qq.com, emailSecond=null, emailThird=null, correspondingAuthor=0, authorType=1, ext={EN=AuthorExt(id=1251458165594866052, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, authorId=1251458165498397055, language=EN, stringName=Zhuoli CHENG, firstName=Zhuoli, middleName=null, lastName=CHENG, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=null, address=School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P R China, bio=null, bioImg=null, bioContent=null, aboutCorrespAuthor=null), CN=AuthorExt(id=1251458165682946437, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, authorId=1251458165498397055, language=CN, stringName=成卓立, 
firstName=null, middleName=null, lastName=null, prefix=null, suffix=null, authorComment=null, nameInitials=null, affiliation=null, department=null, xref=null, address=重庆邮电大学 光电工程学院,重庆 400065, bio={"content":"

成卓立,硕士研究生,主要研究方向为神经网络加速器设计,可重构设计。E-mail:

"}, bioImg=null, bioContent=

成卓立,硕士研究生,主要研究方向为神经网络加速器设计,可重构设计。E-mail:

, aboutCorrespAuthor=null)}, companyList=[AuthorCompany(id=1251458163564822880, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, xref=null, ext=[AuthorCompanyExt(id=1251458163573211489, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, companyId=1251458163564822880, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P R China), AuthorCompanyExt(id=1251458163577405794, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, companyId=1251458163564822880, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=重庆邮电大学 光电工程学院,重庆 400065)])])], keywords=[Keyword(id=1251458165808775563, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, orderNo=1, keyword=long short-term memory (LSTM)), Keyword(id=1251458165922021776, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, orderNo=2, keyword=field-programmable gate array (FPGA)), Keyword(id=1251458166018490771, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, orderNo=3, keyword=hardware accelerator), Keyword(id=1251458166106571159, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, orderNo=4, keyword=pulsating array), Keyword(id=1251458166186262940, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, orderNo=1, keyword=长短时记忆(LSTM)), Keyword(id=1251458166282731936, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, orderNo=2, keyword=现场可编程门阵列(FPGA)), 
Keyword(id=1251458166471475623, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, orderNo=3, keyword=硬件加速器), Keyword(id=1251458166731522478, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, orderNo=4, keyword=脉动阵列)], refs=[Reference(id=1251458170711917114, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2016, volume=null, issue=57, pageStart=345, pageEnd=420, url=null, language=null, rfNumber=[1], rfOrder=0, authorNames=GOLDBERG Y, journalName=Journal of Artificial Intelligence Research, refType=null, unstructuredReference=GOLDBERG Y. A primer on neural network models for natural language processing[J]. Journal of Artificial Intelligence Research, 2016(57): 345-420., articleTitle=A primer on neural network models for natural language processing, refAbstract=null), Reference(id=1251458170804191807, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2017, volume=null, issue=null, pageStart=75, pageEnd=84, url=null, language=null, rfNumber=[2], rfOrder=1, authorNames=HAN S, KANG J, MAO H, journalName=null, refType=null, unstructuredReference=HAN S, KANG J, MAO H, et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA[C]//Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 
New York, NY, USA: ACM, 2017: 75-84., articleTitle=ESE: Efficient speech recognition engine with sparse LSTM on FPGA, refAbstract=null), Reference(id=1251458170888077893, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2015, volume=null, issue=null, pageStart=2625, pageEnd=2634, url=null, language=null, rfNumber=[3], rfOrder=2, authorNames=DONAHUE J, HENDRICKS L A, GUADARRAMA S, journalName=null, refType=null, unstructuredReference=DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE, 2015: 2625-2634., articleTitle=Long-term recurrent convolutional networks for visual recognition and description, refAbstract=null), Reference(id=1251458170976158282, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2001, volume=null, issue=null, pageStart=null, pageEnd=null, url=null, language=null, rfNumber=[4], rfOrder=3, authorNames=HOCHREITER S, BENGIO Y, FRASCONI P, journalName=null, refType=null, unstructuredReference=HOCHREITER S, BENGIO Y, FRASCONI P, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies[C]//A Field Guide to Dynamical Recurrent Neural Networks. Piscataway, NJ, USA: IEEE, 2001., articleTitle=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, refAbstract=null), Reference(id=1251458171039072843, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2019, volume=9, issue=2, pageStart=280, pageEnd=291, url=null, language=null, rfNumber=[5], rfOrder=4, authorNames=WANG M, WANG Z, LU J, journalName=IEEE Journal on Emerging and Selected Topics in Circuits and Systems. 
USA: IEEE, refType=null, unstructuredReference=WANG M, WANG Z, LU J, et al. E-LSTM: An efficient hardware architecture for long short-term memory[J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems. USA: IEEE, 2019, 9(2): 280-291., articleTitle=E-LSTM: An efficient hardware architecture for long short-term memory, refAbstract=null), Reference(id=1251458171110376014, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2021, volume=49, issue=2, pageStart=209, pageEnd=215, url=null, language=null, rfNumber=[6], rfOrder=5, authorNames=高琛, 张帆, 高彦钊, journalName=电子学报, refType=null, unstructuredReference=高琛,张帆,高彦钊.利用数据稀疏性的LSTM加速器设计[J].电子学报, 2021, 49(2): 209-215., articleTitle=利用数据稀疏性的LSTM加速器设计, refAbstract=null), Reference(id=1251458171190067794, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2021, volume=49, issue=2, pageStart=209, pageEnd=215, url=null, language=null, rfNumber=[6], rfOrder=6, authorNames=GAO C, ZHANG F, GAO Y Z, journalName=Acta Electronica Sinica, refType=null, unstructuredReference=GAO C, ZHANG F, GAO Y Z. Design of LSTM accelerator using data sparsity[J]. Acta Electronica Sinica, 2021, 49(2): 209-215., articleTitle=Design of LSTM accelerator using data sparsity, refAbstract=null), Reference(id=1251458171278148183, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2021, volume=10, issue=8, pageStart=882, pageEnd=null, url=null, language=null, rfNumber=[7], rfOrder=7, authorNames=ZHENG Y, YANG H, JIA Y, journalName=Electronics, refType=null, unstructuredReference=ZHENG Y, YANG H, JIA Y, et al.PermLSTM: A high energy-efficiency LSTM accelerator architecture[J]. 
Electronics, 2021, 10(8): 882., articleTitle=PermLSTM: A high energy-efficiency LSTM accelerator architecture, refAbstract=null), Reference(id=1251458171366228571, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2023, volume=null, issue=null, pageStart=42, pageEnd=48, url=null, language=null, rfNumber=[8], rfOrder=8, authorNames=LI S, ZHU S, LUO X, journalName=null, refType=null, unstructuredReference=LI S, ZHU S, LUO X, et al. An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning[C]//2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). USA: IEEE, 2023: 42-48., articleTitle=An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning, refAbstract=null), Reference(id=1251458171445920351, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2021, volume=30, issue=2, pageStart=227, pageEnd=237, url=null, language=null, rfNumber=[9], rfOrder=9, authorNames=QUE Z, NAKAHARA H, NURVITADHI E, journalName=IEEE Transactions on Very Large Scale Integration (VLSI) Systems, refType=null, unstructuredReference=QUE Z, NAKAHARA H, NURVITADHI E, et al. Recurrent neural networks with column-wise matrix-vector multiplication on FPGAs[J]. 
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2021, 30(2):227-237., articleTitle=Recurrent neural networks with column-wise matrix-vector multiplication on FPGAs, refAbstract=null), Reference(id=1251458171534000739, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2023, volume=42, issue=11, pageStart=6660, pageEnd=6683, url=null, language=null, rfNumber=[10], rfOrder=10, authorNames=JOSEPH T, BINDIYA T S, journalName=Circuits, Systems, and Signal Processing, refType=null, unstructuredReference=JOSEPH T, BINDIYA T S. Performance-driven LSTM accelerator hardware using split-matrix-based MVM[J]. Circuits, Systems, and Signal Processing, 2023, 42(11): 6660-6683., articleTitle=Performance-driven LSTM accelerator hardware using split-matrix-based MVM, refAbstract=null), Reference(id=1251458171622081128, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=1997, volume=9, issue=8, pageStart=1735, pageEnd=1780, url=null, language=null, rfNumber=[11], rfOrder=11, authorNames=HOCHREITER S, SCHMIDHUBER J, journalName=Neural Computation, refType=null, unstructuredReference=HOCHREITER S, SCHMIDHUBER J. 
Long short-term memory[J].Neural Computation,1997,9(8):1735-1780., articleTitle=Long short-term memory, refAbstract=null), Reference(id=1251458171697578605, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2008, volume=42, issue=9, pageStart=1611, pageEnd=1615, url=null, language=null, rfNumber=[12], rfOrder=12, authorNames=田翔, 周凡, 陈耀武, journalName=浙江大学学报:工学版, refType=null, unstructuredReference=田翔,周凡,陈耀武,.基于FPGA的实时双精度浮点矩阵乘法器设计[J].浙江大学学报:工学版, 2008, 42(9): 1611-1615., articleTitle=基于FPGA的实时双精度浮点矩阵乘法器设计, refAbstract=null), Reference(id=1251458171777270383, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2008, volume=42, issue=9, pageStart=1611, pageEnd=1615, url=null, language=null, rfNumber=[12], rfOrder=13, authorNames=TIAN X, ZHOU F, CHEN Y W, journalName=Journal of Zhejiang University: Engineering Edition, refType=null, unstructuredReference=TIAN X, ZHOU F, CHEN Y W, et al. Design of real-time double-precision floating-point matrix multiplier based on FPGA[J]. Journal of Zhejiang University: Engineering Edition,2008, 42(9):1611-1615., articleTitle=Design of real-time double-precision floating-point matrix multiplier based on FPGA, refAbstract=null), Reference(id=1251458171852767860, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2002, volume=86, issue=11, pageStart=2278, pageEnd=2324, url=null, language=null, rfNumber=[13], rfOrder=14, authorNames=LECUN Y, BOTTOU L, BENGIO Y, journalName=Proceedings of the IEEE, refType=null, unstructuredReference=LECUN Y, BOTTOU L, BENGIO Y, et al. Gradientbased learning applied to document recognition[J]. 
Proceedings of the IEEE, 2002, 86(11): 2278-2324., articleTitle=Gradientbased learning applied to document recognition, refAbstract=null), Reference(id=1251458171945042553, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2021, volume=49, issue=4, pageStart=729, pageEnd=null, url=null, language=null, rfNumber=[14], rfOrder=15, authorNames=刘杰, 葛一凡, 田明, journalName=电子学报, refType=null, unstructuredReference=刘杰,葛一凡,田明,.基于ZYnQ的可重构卷积神经网络加速器[J].电子学报, 2021, 49(4): 729., articleTitle=基于ZYnQ的可重构卷积神经网络加速器, refAbstract=null), Reference(id=1251458172062483069, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2021, volume=49, issue=4, pageStart=729, pageEnd=null, url=null, language=null, rfNumber=[14], rfOrder=16, authorNames=LIU J, GE Y F, TIAN M, journalName=Acta Electronica Sinica, refType=null, unstructuredReference=LIU J, GE Y F, TIAN M, et al. Reconfigurable convolutional neural network accelerator based on ZYNQ[J]. Acta Electronica Sinica, 2021, 49(4):729., articleTitle=Reconfigurable convolutional neural network accelerator based on ZYNQ, refAbstract=null), Reference(id=1251458172133786241, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2025-09-18, volume=null, issue=null, pageStart=null, pageEnd=null, url=null, language=null, rfNumber=[15], rfOrder=17, authorNames=GHASEMZADEH S A, TAVAKOLI E B, KAMAL M, journalName=null, refType=null, unstructuredReference=GHASEMZADEH S A, TAVAKOLI E B, KAMAL M,et al. BRDS: An FPGA-based LSTM accelerator with row-balanced dual-ratiosparsification[EB/OL]. (2021-01-07)[2025-09-18]. 
https://arxiv.org/abs/2101.02667., articleTitle=BRDS: An FPGA-based LSTM accelerator with row-balanced dual-ratiosparsification, refAbstract=null), Reference(id=1251458172209283717, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, doi=null, pmid=null, pmcid=null, year=2023, volume=12, issue=7, pageStart=1731, pageEnd=null, url=null, language=null, rfNumber=[16], rfOrder=18, authorNames=MAO N, YANG H, HUANG Z, journalName=Electronics, refType=null, unstructuredReference=MAO N, YANG H, HUANG Z. An Instruction-Driven Batch-Based High-Performance Resource-Efficient LSTM Accelerator on FPGA[J].Electronics,2023,12(7):1731., articleTitle=An Instruction-Driven Batch-Based High-Performance Resource-Efficient LSTM Accelerator on FPGA, refAbstract=null)], funds=null, companyList=[AuthorCompany(id=1251458163564822880, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, xref=null, ext=[AuthorCompanyExt(id=1251458163573211489, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, companyId=1251458163564822880, language=EN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P R China), AuthorCompanyExt(id=1251458163577405794, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, companyId=1251458163564822880, language=CN, country=null, province=null, city=null, postcode=null, companyName=null, departmentName=null, remark=重庆邮电大学 光电工程学院,重庆 400065)])], figs=[ArticleFig(id=1251458166937043383, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.1, caption=LSTM basic unit structure diagram, figureFileSmall=6bDuTev53/rmO06hszuY8A==, figureFileBig=+qb+XKSZ1VA8WFk1ooZWpg==, tableContent=null), 
ArticleFig(id=1251458167071261118, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图1, caption=LSTM基本单元结构, figureFileSmall=6bDuTev53/rmO06hszuY8A==, figureFileBig=+qb+XKSZ1VA8WFk1ooZWpg==, tableContent=null), ArticleFig(id=1251458167301947850, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.2, caption=Overall system architecture of LSTM accelerator in this article, figureFileSmall=faMKQG7kN0Svil6Rbu7d0g==, figureFileBig=tcgrTNqqv88XzTFaGsBJBA==, tableContent=null), ArticleFig(id=1251458167390028237, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图2, caption=本文LSTM加速器整体系统架构, figureFileSmall=faMKQG7kN0Svil6Rbu7d0g==, figureFileBig=tcgrTNqqv88XzTFaGsBJBA==, tableContent=null), ArticleFig(id=1251458167461331412, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.3, caption=Computational array(PE Array)architecture, figureFileSmall=ni1vfSi8F8QIGxA/S0R9jg==, figureFileBig=jTJH6whr3cYojuP1Ni7IqQ==, tableContent=null), ArticleFig(id=1251458167570383321, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图3, caption=计算阵列(PE Array)架构, figureFileSmall=ni1vfSi8F8QIGxA/S0R9jg==, figureFileBig=jTJH6whr3cYojuP1Ni7IqQ==, tableContent=null), ArticleFig(id=1251458167645880800, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.4, caption=Schematic diagram of vector multiplication operation, figureFileSmall=5aci4tP9xdpT16pcc++RHg==, figureFileBig=Ue1R9vKhz+j4D2Yw96UNHg==, tableContent=null), ArticleFig(id=1251458167750738408, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图4, caption=向量乘操作示意图, figureFileSmall=5aci4tP9xdpT16pcc++RHg==, 
figureFileBig=Ue1R9vKhz+j4D2Yw96UNHg==, tableContent=null), ArticleFig(id=1251458167826235884, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.5, caption=Weight parameter cutting method, figureFileSmall=PK4qzFypi5lwsmZFyZzd7A==, figureFileBig=NN5xRhWfjVkdfmHBJ/vTlw==, tableContent=null), ArticleFig(id=1251458167964647921, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图5, caption=权重参数切割方式, figureFileSmall=PK4qzFypi5lwsmZFyZzd7A==, figureFileBig=NN5xRhWfjVkdfmHBJ/vTlw==, tableContent=null), ArticleFig(id=1251458168056922614, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.6, caption=Distributed computing architecture diagram based on pulsating array, figureFileSmall=77b1tmWLqlmXinpmTa2jCg==, figureFileBig=QziVEcuVh3kxxl4PIEDG/Q==, tableContent=null), ArticleFig(id=1251458168132420089, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图6, caption=基于脉动阵列的分布式计算架构图, figureFileSmall=77b1tmWLqlmXinpmTa2jCg==, figureFileBig=QziVEcuVh3kxxl4PIEDG/Q==, tableContent=null), ArticleFig(id=1251458169726255613, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.7, caption=PE structure, figureFileSmall=pshEOnGrix/dI473RUNXHA==, figureFileBig=uf6WZDHqxyYH1cS7M9TO7A==, tableContent=null), ArticleFig(id=1251458169843696133, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图7, caption=PE结构, figureFileSmall=pshEOnGrix/dI473RUNXHA==, figureFileBig=uf6WZDHqxyYH1cS7M9TO7A==, tableContent=null), ArticleFig(id=1251458169961136651, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Fig.8, caption=Accelerator power consumption diagram and the proportion of 
power consumption of each resource in the calculation module, figureFileSmall=Hmyn70m5ZYpjcnk43a9wDA==, figureFileBig=hAftiC9J4eL3Sfj2uwGM5w==, tableContent=null), ArticleFig(id=1251458170095354384, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=图8, caption=加速器功耗及计算模块中各个资源功耗占比图, figureFileSmall=Hmyn70m5ZYpjcnk43a9wDA==, figureFileBig=hAftiC9J4eL3Sfj2uwGM5w==, tableContent=null), ArticleFig(id=1251458170200211991, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Tab.1, caption=

Hardware resource utilization

, figureFileSmall=null, figureFileBig=null, tableContent=
Resource type | Usage
LUT | 15257
Register | 16836
DSP | 507
BRAM | 256
), ArticleFig(id=1251458170279903775, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=表1, caption=

硬件资源消耗

, figureFileSmall=null, figureFileBig=null, tableContent=
资源类型 | 消耗量
LUT | 15257
Register | 16836
DSP | 507
BRAM | 256
), ArticleFig(id=1251458170367984165, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Tab.2, caption=

Performance comparison between FPGA and CPU

, figureFileSmall=null, figureFileBig=null, tableContent=
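Category | VU13P | CPU
Computation time per image/μs | 24.26 | 2630
Static power/W | 2.958 | 14.2
Peak power/W | 3.301 | 41.5
Actual power/W | 0.343 | 27.3
Effective computing power/GOPS | 179.2 | 1.07
Energy efficiency/(GOPS/W) | 522.4 | 0.039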
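
The power and efficiency rows of this table can be cross-checked with simple arithmetic. The sketch below assumes, as the figures suggest but the record does not state explicitly, that actual power equals peak power minus static power and that energy efficiency equals effective computing power divided by actual power. The function names are hypothetical.

```python
def dynamic_power_w(peak_w, static_w):
    """Power attributable to the workload: peak minus static (assumed relation)."""
    return peak_w - static_w


def energy_efficiency(gops, power_w):
    """Energy efficiency in GOPS/W."""
    return gops / power_w


# Values taken from the VU13P and CPU columns of the table.
fpga_dynamic = dynamic_power_w(3.301, 2.958)
fpga_eff = energy_efficiency(179.2, fpga_dynamic)
cpu_dynamic = dynamic_power_w(41.5, 14.2)
cpu_eff = energy_efficiency(1.07, cpu_dynamic)

print(round(fpga_dynamic, 3), round(fpga_eff, 1))
print(round(cpu_dynamic, 1), round(cpu_eff, 3))
```

Under these assumptions the computed values agree with the tabulated actual power and energy efficiency for both platforms.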
), ArticleFig(id=1251458170447675947, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=表2, caption=

FPGA与CPU数据对比

, figureFileSmall=null, figureFileBig=null, tableContent=
类别 | VU13P | CPU
单张图片计算时间/μs | 24.26 | 2630
静态功耗/W | 2.958 | 14.2
峰值功耗/W | 3.301 | 41.5
实际功耗/W | 0.343 | 27.3
有效算力/GOPS | 179.2 | 1.07
能效比/(GOPS/W) | 522.4 | 0.039
), ArticleFig(id=1251458170531562033, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=EN, label=Tab.3, caption=

Performance comparison with prior LSTM accelerators

, figureFileSmall=null, figureFileBig=null, tableContent=
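Category | Ref. [7] | Ref. [9] | Ref. [15] | Ref. [16] | This work
FPGA model | Arria 10 | Stratix 10 GX2800 | XCKU9P | Alveo U50 | VU13P
Operating frequency/MHz | 150 | 250 | 200 | 280 | 200
DSP count | 0 | 5245 | 1600 | 4224 | 507
LUT count | 257871487232560012293515257
Effective computing power/GOPS | 2220 | 2957 | 4200 | 2036 | 179.2
Actual power/W | 5.7989.032.30.343
Energy efficiency/(GOPS/W) | 398.5302177.862.84522.4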
), ArticleFig(id=1251458170607059510, tenantId=1146029695717560320, journalId=1251194880429441115, articleId=1251458155910217837, language=CN, label=表3, caption=

与其他文献LSTM加速器性能对比

, figureFileSmall=null, figureFileBig=null, tableContent=
类别 | 文献[7] | 文献[9] | 文献[15] | 文献[16] | 本文
FPGA型号 | Arria 10 | Stratix 10 GX2800 | XCKU9P | Alveo U50 | VU13P
工作频率/MHz | 150 | 250 | 200 | 280 | 200
DSP/个 | 0 | 5245 | 1600 | 4224 | 507
LUT/个 | 257871487232560012293515257
有效算力/GOPS | 2220 | 2957 | 4200 | 2036 | 179.2
实际功耗/W | 5.7989.032.30.343
能效比/(GOPS/W) | 398.5302177.862.84522.4
Distributed computing LSTM accelerator based on pulsating array architecture
Hongsheng ZHANG, Zhuoli CHENG
Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) | Artificial Intelligence and Big Data, 2025, 37(5): 741-747
doi: 10.3979/j.issn.1673-825X.202406270158
Author information
  • School of Optoelectronic Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
  • Hongsheng ZHANG: professor, doctoral supervisor, Ph.D. Main research interests: very-large-scale integrated circuits and SoC design, and neural network accelerator design.
  • Zhuoli CHENG: master's student. Main research interests: neural network accelerator design and reconfigurable design.

To address the low computational efficiency and high power consumption encountered when deploying long short-term memory (LSTM) neural networks on resource-limited edge computing devices, a distributed-computing LSTM accelerator based on a systolic array architecture is proposed. The design stores input data in a distributed manner to reduce data movement and power consumption, and passes data systolically to reduce the idle rate of the computing units and raise computational efficiency. Validation on a VU13P field-programmable gate array (FPGA) shows that, at an operating frequency of 200 MHz, the accelerator achieves an effective computing power of 179.2 GOPS, a dynamic power consumption of 0.343 W, and an energy efficiency of 522.4 GOPS/W. Compared with typical existing designs, energy efficiency improves by more than 34%.

long short-term memory (LSTM)  /  field-programmable gate array (FPGA)  /  hardware accelerator  /  systolic array
Hongsheng ZHANG, Zhuoli CHENG. Distributed computing LSTM accelerator based on pulsating array architecture[J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2025, 37(5): 741-747. DOI: 10.3979/j.issn.1673-825X.202406270158
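In recent years, recurrent neural networks (RNNs) have achieved notable success in applications such as natural language processing (NLP) [1], speech-to-text conversion [2], and video analysis [3]. Because of their structure, however, RNNs suffer from the long-range dependency problem [4] when large models are trained, which makes long-sequence tasks hard to handle. The long short-term memory (LSTM) network, one of the most popular RNN variants, introduces gating mechanisms that control the rate at which information accumulates, so it can handle more complex long-sequence tasks [5]. This comes at the cost of a larger computational load, and hardware acceleration of LSTM has therefore attracted considerable research attention.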
Foundation Item: Major Project of Chongqing Technology Innovation and Application Development Special Project (CSTB2023TIADSTX0023)
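Edge applications of LSTM usually place high demands on the computing speed and power consumption of the hardware. A central processing unit (CPU) is an instruction-stream processor with low parallel computing efficiency, while a graphics processing unit (GPU) consumes considerable power, so neither readily meets the requirements of edge scenarios. Field-programmable gate arrays (FPGAs), being programmable and low-power, have attracted wide attention for LSTM hardware acceleration.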
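Current research on FPGA-based LSTM accelerators falls into two directions. The first exploits the data sparsity of the LSTM model, using algorithms to reduce the number of weight parameters and thereby lower the computational load and power consumption. Gao et al. [6], building on the Delta network algorithm, constructed sparsity in the input sequence and, while avoiding irregular data loading, filtered out redundant matrix-vector multiplications to accelerate LSTM forward propagation. Zheng et al. [7] used permuted block-diagonal mask matrices to generate sparse models, which greatly reduced the index power after pruning. Li et al. [8] proposed a column-wise pruning strategy that removes all column indices and most row indices of the remaining weights, lowering the power spent reading weight indices and improving overall accelerator performance. Because this approach introduces extra operations such as pruning, it inevitably lengthens circuit run time. For large models the added time may even exceed the time needed for the computation itself, so the approach is mainly suited to small models.
The second direction designs specialized computing architectures around the computational characteristics of the LSTM model in order to raise computing efficiency. Que et al. [9] replaced row-wise operations with column-wise matrix-vector multiplication, which eliminates data dependencies during RNN inference and raises system throughput. Mapping row-wise operations onto column-wise matrix operations, however, depends heavily on the detailed layer shapes, so too much time is spent arranging data when the data volume is large. Joseph et al. [10] proposed a hardware architecture that accelerates matrix-vector multiplication with a systolic array structure and parallel data processing units. It uses a split-matrix method that decomposes large matrices into smaller ones, reducing the hardware complexity of parallel computation. Parallel processing, however, must account for data buffer occupancy, and a large model places greater demands on hardware resources.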
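To avoid the indexing latency incurred when data sparsity is used to reduce data volume, and to relieve the data buffering pressure of matrix-style computing architectures, this paper designs and implements a distributed-computing LSTM accelerator based on a systolic array, following the computational characteristics of the LSTM algorithm. First, by analyzing the vector multiplication in conventional LSTM accelerators, a weight-parameter slicing scheme is proposed: input data are stored in a distributed manner and the weights flow past them to be multiplied, which reduces data movement and read power. Second, a distributed computing architecture based on a systolic array is designed and implemented. Passing data systolically reduces the idle rate of the computing units and raises computing efficiency. Finally, the design is verified on a VU13P-series FPGA board. The experimental results show improvements of varying degree over other typical current designs.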
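The LSTM model was first proposed by S. Hochreiter and J. Schmidhuber [11] in 1997. By introducing gating mechanisms, it effectively addresses the exploding or vanishing gradient problem of RNNs. The basic LSTM cell structure is shown in Fig. 1. Relative to the conventional RNN model, the LSTM network improves on two aspects.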
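The first is the introduction of four gating mechanisms that control the paths along which information is passed. They are computed by Eqs. (1)-(4) and are, respectively, the input gate $i_t$, the output gate $o_t$, the forget gate $f_t$, and the update gate $g_t$. The input gate $i_t$ controls how much of the current update state is retained. The output gate $o_t$ controls how much of the current internal state $c_t$ is output to the external state $h_t$. The forget gate $f_t$ controls how much of the previous internal state $c_{t-1}$ is forgotten. The update gate $g_t$ applies a nonlinear function to obtain the current update state.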
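The second is the introduction of a new internal state $c_t$ dedicated to linear recurrent information transfer, computed by Eq. (5). At the same time, information is output nonlinearly to the external state $h_t$ of the hidden layer, as given by Eq. (6).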
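In Eqs. (1)-(6): σ denotes the Sigmoid function; tanh denotes the Tanh function; ⊗ denotes vector multiplication (the matrix-vector product); × denotes element-wise vector multiplication; $x_t$ is the current input; $h_t$ is the external state; $W$ is a weight parameter matrix; and $b$ is a bias parameter.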
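Equations (1)-(6) are not reproduced in this version of the text. As a reading aid, the standard LSTM formulation written with the symbols defined above would be as follows. The gate order follows the description in the text, while the subscript naming of the weight matrices and biases ($W_{xi}$, $W_{hi}$, $b_i$, and so on) is an assumption rather than the authors' notation.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} \otimes x_t + W_{hi} \otimes h_{t-1} + b_i\right) \tag{1}\\
o_t &= \sigma\left(W_{xo} \otimes x_t + W_{ho} \otimes h_{t-1} + b_o\right) \tag{2}\\
f_t &= \sigma\left(W_{xf} \otimes x_t + W_{hf} \otimes h_{t-1} + b_f\right) \tag{3}\\
g_t &= \tanh\left(W_{xg} \otimes x_t + W_{hg} \otimes h_{t-1} + b_g\right) \tag{4}\\
c_t &= f_t \times c_{t-1} + i_t \times g_t \tag{5}\\
h_t &= o_t \times \tanh\left(c_t\right) \tag{6}
\end{align}
```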
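A conventional RNN carries only one piece of state information, the short-term memory $h$, whereas the LSTM network adds the state $c$ as long-term memory. This gives the LSTM network higher prediction accuracy on long-sequence tasks.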
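The gate and state updates described in this section can be sketched in plain Python as a software reference for one time step. This is only an illustrative floating-point model of a standard LSTM cell, not the accelerator's data path. The `params` layout and the helper names `sigmoid`, `matvec`, and `lstm_step` are assumptions introduced here.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Matrix-vector product: one multiply-accumulate pass per row of W."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each gate name ("i", "o", "f", "g") to a tuple
    (W_x, W_h, b); this layout is illustrative only.
    """
    pre = {}
    for gate, (W_x, W_h, b) in params.items():
        wx = matvec(W_x, x_t)       # input contribution
        wh = matvec(W_h, h_prev)    # recurrent contribution
        pre[gate] = [p + q + r for p, q, r in zip(wx, wh, b)]
    i_t = [sigmoid(v) for v in pre["i"]]    # input gate
    o_t = [sigmoid(v) for v in pre["o"]]    # output gate
    f_t = [sigmoid(v) for v in pre["f"]]    # forget gate
    g_t = [math.tanh(v) for v in pre["g"]]  # update (candidate) state
    # Long-term memory c_t, then short-term memory h_t.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

The two `matvec` calls per gate are the operations that the hardware described next maps onto its computing arrays.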
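Fig. 2 shows the overall system architecture of the proposed LSTM accelerator. It consists of three parts: a top-level control module (Top Control), a global register (Global REG), and four computing cores (PE Core). The input data $x$ and the intermediate data are both 16 bits wide, and the weight parameters W_in are 8 bits wide. The weights W_in stored in DRAM are delivered to the computing cores over a 64-bit data line, 8 values at a time. The input data $x$ and the LSTM external state $h_t$ are sliced and rearranged in the global register and then delivered to the computing cores over a 128-bit data line, 8 values at a time, in the form $x_t h_{t-1}$. During computation, the top-level control module regulates in real time the flow of W_in and its multiplication with $x_t h_{t-1}$, and writes the result $h_t$ back to the global register. When computation finishes, the final output $h$ is sent back to DRAM. The four computing cores correspond to the four gate structures of the LSTM model. Each computing core consists of two 8×7 computing arrays (PE Array), an adder (ADD), and an activation function unit (Sig/Tanh). The computing arrays implement the vector multiplication ⊗ in Eqs. (1)-(4).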
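The computing array (PE Array) architecture is shown in Fig. 3. It mainly comprises 2 input ports, 7 weight storage modules (WH0-6), a PE array of 56 PEs, and 1 adder tree (ADD Tree) module. The 2 input ports carry, respectively, the input data $x$ or external state $h$, and the corresponding weight matrix $W_x$ or $W_h$. The array operates as follows: ① 8 weights are input at a time, and all weights are stored in order in WH0-6; ② 8 values of $x$ or $h$ are input at a time and stored in order in the 8 PEs of each column; ③ the PE array streams each column's weights W_in past the pre-stored $x$ or $h$, multiplies them, and passes the results systolically to the next column; ④ the adder tree module sums the results to complete the vector multiplication.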
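The vector multiplication in the conventional LSTM model is shown in Fig. 4. Take an input dimension and an LSTM hidden dimension of 56 as an example. At each time step, a one-dimensional 1×56 input is multiplied with a two-dimensional 56×56 weight matrix: the input traverses every row of the weight matrix in a multiply-accumulate operation, each row yields 1 output value, and the result is a one-dimensional 56×1 output. The detailed flow is given in Algorithm 1.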
Algorithm 1 Pseudocode of the vector multiplication flow in the conventional LSTM model
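In loop1 of each time step, an intermediate buffer of 56×16 bit is needed, and 56 multipliers and 55 adders are consumed. The number of cycles needed to produce the result is given by Eq. (7), where N is the input dimension and LSTM hidden dimension (56), the term 1 is the single cycle taken by the 56 multipliers working in parallel, and lb(N) is the number of adder-tree levels needed to sum N values. By Eq. (7), the vector multiplication in the conventional LSTM model takes 392 cycles.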
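Neither the Algorithm 1 pseudocode nor Eq. (7) is reproduced in this text, so the following is a minimal sketch of the row-wise flow described above, under stated assumptions. Each row costs one cycle for the parallel multiplies plus one cycle per adder-tree level, which gives N × (1 + ⌈lb(N)⌉) cycles. This form is inferred from the term definitions and reproduces the stated 392 cycles for N = 56. The function names are illustrative.

```python
import math

def adder_tree(values):
    """Sum values pairwise, level by level; return (sum, number of levels)."""
    level, depth = list(values), 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # an odd element passes through unchanged
            pairs.append(level[-1])
        level, depth = pairs, depth + 1
    return level[0], depth

def rowwise_mvm(W, x):
    """Row-by-row matrix-vector product with a simple cycle count."""
    y, cycles = [], 0
    for row in W:                                    # loop1: one output per row
        products = [w * v for w, v in zip(row, x)]   # N multipliers, 1 cycle
        total, depth = adder_tree(products)          # N - 1 adders in the tree
        y.append(total)
        cycles += 1 + depth
    return y, cycles

def rowwise_cycles(n):
    """Inferred form of Eq. (7): N * (1 + ceil(lb N))."""
    return n * (1 + math.ceil(math.log2(n)))
```

For a 56×56 matrix, `rowwise_mvm` counts the same number of cycles as `rowwise_cycles(56)`, because a 56-input adder tree has 6 levels.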
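The weight slicing scheme is shown in Fig. 5. Analysis of the vector multiplication shows that each input element is always multiplied by the corresponding column of the weight matrix. Because the input data are far fewer than the weights, the input can be held fixed while the weights flow past and traverse it, which reduces data movement and lowers read power. Because the weights stay fixed during inference, they can be sliced and stored in a distributed manner, which relieves the bandwidth pressure caused by streaming them. In this design, the 56×56 weight matrix W is sliced along the column dimension into groups of 8 columns, which are stored in WH0-6 respectively.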
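The slicing can be illustrated with a short sketch. It splits a weight matrix into 8-column groups (one per WH module) and checks that summing the partial products of each group against its stationary slice of the input gives the full matrix-vector product. The helper names `slice_columns` and `mvm_from_slices` are hypothetical, and the code models only the arithmetic, not the memory layout.

```python
def slice_columns(W, group=8):
    """Split W into column groups of width `group`, one per WH module."""
    n_cols = len(W[0])
    return [[row[s:s + group] for row in W] for s in range(0, n_cols, group)]

def mvm_from_slices(slices, x, group=8):
    """Rebuild y = W x from per-group partial sums."""
    n_rows = len(slices[0])
    y = [0] * n_rows
    for g, block in enumerate(slices):
        # Inputs for this group stay fixed while its weights stream past them.
        x_part = x[g * group:(g + 1) * group]
        for r in range(n_rows):
            y[r] += sum(w * v for w, v in zip(block[r], x_part))
    return y
```

With a 56×56 matrix this yields 7 groups of 56 rows by 8 columns, matching WH0-6.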
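Fig. 6 shows the detailed architecture of the 8×7 systolic computing array, whose body consists of 56 PEs in 8 rows and 7 columns. Before computation, the input data are stored in order in PEs 0-55. Once computation starts, in cycle 1 the first row is read from WH0, split, and fed to PE0-7, where it is multiplied by the inputs X0-7. In cycle 2, PE0-7 output their products to PE8-15; WH1 reads out row 2, which is split and fed to PE8-15, where it is multiply-accumulated with the inputs X8-15 and the products output by PE0-7; at the same time, row 2 is read from WH0, split, and fed to PE0-7 for multiplication. The process continues in this way until, at the start of cycle 8, PE48-55 output data to the adder tree module. The adder tree module has 3 levels with 7 adders in total and sums the 8 values output by the 8 rows of the PE array. The detailed flow is given in Algorithm 2.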
Algorithm 2 Pseudocode of the distributed computing flow based on the systolic array architecture
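In loop1 of each time step, because the computation is parallel and pipelined, no intermediate buffer is needed, and 56 multiply-accumulate units and 7 adders are consumed in total. The number of cycles needed to produce the result is given by Eq. (8), where N is the input dimension and LSTM hidden dimension (56), C is the column dimension of the PE array, and lb(H) is the number of adder-tree levels needed to sum the H values of the PE array's row dimension. By Eq. (8), the distributed computation based on the systolic array architecture takes 66 cycles.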
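The Algorithm 2 pseudocode and Eq. (8) are likewise not reproduced here. The sketch below is a simplified cycle-level model of the 8×7 array, not the authors' RTL. It rests on three assumptions: weight row r reaches column j in cycle r + j + 1, each PE holds its result in a register for one cycle, and the adder tree is modeled as a fixed latency of lb(H) cycles. Under these assumptions the cycle count takes the form N + C + lb(H), which is inferred from the term definitions and matches the stated 66 cycles.

```python
import math

ROWS, COLS = 8, 7   # PE array: 8 rows x 7 columns = 56 PEs

def systolic_cycles(n, cols=COLS, rows=ROWS):
    """Inferred form of Eq. (8): N + C + lb(H)."""
    return n + cols + int(math.log2(rows))

def systolic_mvm(W, x):
    """Stream the rows of W through the array; x stays fixed in the PEs."""
    n = len(W)
    levels = int(math.log2(ROWS))                   # 3-level adder tree
    regs = [[None] * COLS for _ in range(ROWS)]     # PE output registers
    pending = {}                                    # cycle -> finished result
    y, cycle = [], 0
    while len(y) < n:
        cycle += 1
        if cycle in pending:                        # result leaves the tree
            y.append(pending.pop(cycle))
        if regs[0][COLS - 1] is not None:           # last column feeds the tree
            pending[cycle + levels] = sum(regs[k][COLS - 1] for k in range(ROWS))
        new = [[None] * COLS for _ in range(ROWS)]
        for j in range(COLS):
            r = cycle - 1 - j                       # weight row seen by column j
            if 0 <= r < n:
                for k in range(ROWS):
                    d = regs[k][j - 1] if j > 0 else 0   # partial sum from the left
                    idx = ROWS * j + k                   # PE number = input index
                    new[k][j] = x[idx] * W[r][idx] + d
        regs = new
    return y, cycle
```

Called with a 56×56 matrix, the model returns the full product together with a cycle count that can be compared against `systolic_cycles(56)` and against the 392 cycles of the row-wise flow.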
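With the proposed distributed computing architecture based on a systolic array, 66 cycles suffice to multiply a 1×56 input with a 56×56 weight matrix. Compared with the 392 cycles needed by the vector multiplication in the conventional LSTM model, this is a 5.9-fold improvement in computing speed. Moreover, because intermediate data are passed systolically to the next stage, no intermediate data need to be buffered, which relieves the memory pressure on the edge computing device on which the model is deployed.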
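Fig. 7 shows the structure of a PE. The PE Control module switches the PE among 3 operating modes: standby, storage, and compute. In standby mode, the PE is idle and produces no output. In storage mode, controlled by x_en, the PE receives the input data X and stores it in the Input_x/h REG register. In compute mode, controlled by MAC_en, the PE receives the accumulated data D and the weight W_in, performs the multiply-accumulate X×W_in+D in one cycle, and outputs the result in the next cycle. X and D are both 16-bit data, represented during computation as two's-complement fixed-point numbers with 1 sign bit, 4 integer bits, and 11 fractional bits. W_in is 8-bit data with 1 sign bit and 7 fractional bits, so its low-order end is padded with zeros to 12 bits before it enters the MAC. The multiply-accumulate result requires a 32-bit width; to keep the bit widths uniform throughout the computation, the Cropping module crops it to 16 bits.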
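The fixed-point arithmetic of a PE in compute mode can be sketched with plain integers. The text specifies the formats but not how D is aligned with the product or how the Cropping module rounds and handles overflow. The sketch therefore assumes that D is shifted left by 11 bits before the addition, that cropping drops the 11 low bits, and that out-of-range results saturate. All helper names are illustrative.

```python
FRAC_X = 11   # fractional bits of X and D (16 bit: 1 sign, 4 integer, 11 fraction)
FRAC_W = 7    # fractional bits of W_in (8 bit: 1 sign, 7 fraction)

def to_fixed(value, frac_bits, total_bits):
    """Quantize a real number to a saturated two's-complement integer."""
    raw = round(value * (1 << frac_bits))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, raw))

def to_float(raw, frac_bits=FRAC_X):
    return raw / (1 << frac_bits)

def pe_mac(x_raw, w_raw, d_raw):
    """Compute X * W_in + D on raw fixed-point integers."""
    w12 = w_raw << 4                          # pad 4 zero LSBs: 8 bit -> 12 bit
    acc = x_raw * w12 + (d_raw << FRAC_X)     # wide accumulator, 22 fraction bits
    cropped = acc >> FRAC_X                   # assumption: drop the 11 low bits
    return max(-32768, min(32767, cropped))   # assumption: saturate to 16 bit
```

For example, with X = 1.5, W_in = 0.5, and D = 0.25, the raw inputs are 3072, 64, and 512, and `pe_mac` returns 2048, which is 1.0 in the 16-bit format.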
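Functional simulation and synthesis verification were carried out in the Xilinx Vivado 2022.2 development environment. An FPGA development board with the Xilinx Virtex UltraScale+ VU13P-series xcvu13p-fhga2104-2-e chip was chosen as the test platform, and the accelerator was coded entirely in Verilog HDL. Hardware resource consumption is listed in Table 1. The on-board DSPs of the VU13P-series FPGA support 16-bit fixed-point multiply-add operations, and BRAM resources are used to emulate the external DRAM that stores the input data and weight parameters.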
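For comparison, a standard LSTM model was built on a PC with the PyTorch library to serve as the CPU experimental platform. The PC uses an Intel Core i5-12400F processor with a clock frequency of 2.5 GHz, 48 GByte of DDR3 memory, and the Windows 11 operating system.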
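The MNIST dataset [13] is used to verify the LSTM accelerator. Because the images in the dataset are 28×28 while the input dimension of the accelerator is 56, the MNIST images are expanded to 56×56 before being fed to the standard LSTM model for training and inference verification.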
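An important metric of a hardware accelerator's performance is its peak computing capability. Ideally, the peak computing performance, that is, the computing power, can be evaluated from the number of operations (multiplications and additions) that can be executed per clock cycle. The formula [12] is given by Eq. (9), in which P is the number of PEs in the accelerator, the factor 2 reflects that each PE performs one multiplication and one addition per cycle, and f is the clock frequency.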
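The accelerator has 4 PE cores, each PE core has 2 MAC arrays, and each MAC array has 56 PEs, for a total of 4×2×56=448 PEs. At a clock frequency of 200 MHz, Eq. (9) gives a computing power of 179.2 GOPS. Processing one 56×56 image takes 4852 cycles, that is, 4852/200 MHz=24.26 μs.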
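The arithmetic above can be checked with a few lines of Python. Eq. (9) is not reproduced in this text; the form 2 × P × f used below is inferred from the term definitions and reproduces the stated 179.2 GOPS. The 0.343 W dynamic power is the figure reported in the abstract, and the variable names are illustrative.

```python
def peak_gops(num_pe, freq_hz):
    """Peak computing power in GOPS: one multiply and one add per PE per cycle."""
    return 2 * num_pe * freq_hz / 1e9

NUM_PE = 4 * 2 * 56        # 4 PE cores x 2 arrays x 56 PEs = 448
FREQ_HZ = 200e6            # 200 MHz clock

peak = peak_gops(NUM_PE, FREQ_HZ)       # peak computing power, GOPS
latency_us = 4852 / FREQ_HZ * 1e6       # time per 56x56 image, microseconds
efficiency = peak / 0.343               # energy efficiency, GOPS/W

print(f"{peak:.1f} GOPS, {latency_us:.2f} us per image, {efficiency:.1f} GOPS/W")
```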
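On the PC, with the CPU as the inference device, run time was measured by calling the time.time() function in a PyTorch program. Inference on 10000 images took 26.3 s, an average of 2.63 ms per image. The standard LSTM model used here requires 2819040 operations, so 2819040/2.63 ms=1.07 GOPS [14]; that is, the effective computing power of the CPU is 1.07 GOPS. Table 2 compares the performance and energy efficiency of the FPGA (VU13P) and the CPU on the image task. The accelerator computes 108 times faster than the CPU platform, consumes 1/79 of its power, and delivers 167 times its computing power.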
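The ratios quoted above follow directly from the measured figures. The short check below uses only numbers stated in the text; the variable names are illustrative.

```python
cpu_time_s = 26.3 / 10000                  # 2.63 ms per image on the CPU
cpu_gops = 2819040 / cpu_time_s / 1e9      # effective CPU computing power

fpga_time_s = 4852 / 200e6                 # 24.26 us per image on the FPGA
speedup = cpu_time_s / fpga_time_s         # computing-speed ratio
power_ratio = 179.2 / cpu_gops             # computing-power ratio

print(f"CPU: {cpu_gops:.2f} GOPS, speedup: {speedup:.0f}x, "
      f"computing power: {power_ratio:.0f}x")
```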
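Fig. 8 shows the power consumption of each accelerator module and the share of power by resource within the computing module. The power report was generated by Vivado 2022.2 after implementation (IMPLEMENTATION) of the LSTM accelerator source code. In Fig. 8, the bar chart on the left shows the power distribution over the accelerator's 3 modules; the computing module, made up of the 4 computing cores, accounts for 84.55% of total power. The pie chart on the right shows the share of power by resource within the computing module. Logic consumes far less power than DSP, because all multiply-add operations in the design are performed directly in DSP resources rather than in LUTs. BRAM power is below 0.001 W and is therefore not listed separately.
Table 3 compares the proposed LSTM accelerator with the accelerators of 4 other references. The energy efficiency of the proposed accelerator is higher than that of references [7], [9], [15], and [16] by 34%, 72%, 193%, and 731% respectively, but its effective computing power is lower than that of all of them. The gap arises because the other works use larger LSTM models and therefore consume far more resources, which is unfavorable for deployment on edge computing devices; the LSTM model chosen for verification here is therefore relatively small. Reference [7] consumes 0 DSPs because it uses LUTs instead of DSPs for multiply-add, so it consumes more LUT resources. Reference [15] consumes fewer LUTs because it computes mainly with DSPs, so it consumes more DSP resources.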
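This paper presents a distributed-computing LSTM accelerator based on a systolic array architecture. Its distributed computing array transfers data and computes systolically, which reduces intermediate buffer occupancy while raising computing speed. Experiments on a VU13P-series FPGA show that, at 200 MHz, the accelerator achieves an effective computing power of 179.2 GOPS, a dynamic power consumption of 0.343 W, and an energy efficiency of 522.4 GOPS/W. Its computing speed and effective computing power are 108 times and 167 times those of the CPU, and its power consumption is 1/79 of the CPU's. Compared with the typical current designs in references [16], [15], [9], and [7], energy efficiency improves by 731%, 193%, 72%, and 34% respectively. The experiments indicate that the accelerator is suited to edge computing devices with limited resources and strict power requirements. In future work, reconfigurability could be added to the systolic array architecture so that the PE computing combinations are adjusted in real time to fit and accelerate neural networks of different sizes and types.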
References
[1] GOLDBERG Y. A primer on neural network models for natural language processing[J]. Journal of Artificial Intelligence Research, 2016(57): 345-420.
[2] HAN S, KANG J, MAO H, et al. ESE: Efficient speech recognition engine with sparse LSTM on FPGA[C]//Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: ACM, 2017: 75-84.
[3] DONAHUE J, ANNE H L, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ, USA: IEEE, 2015: 2625-2634.
[4] HOCHREITER S, BENGIO Y, FRASCONI P, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies[C]//A Field Guide to Dynamical Recurrent Neural Networks. Piscataway, NJ, USA: IEEE, 2001.
[5] WANG M, WANG Z, LU J, et al. E-LSTM: An efficient hardware architecture for long short-term memory[J]. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019, 9(2): 280-291.
[6] GAO C, ZHANG F, GAO Y Z. Design of LSTM accelerator using data sparsity[J]. Acta Electronica Sinica, 2021, 49(2): 209-215.
[7] ZHENG Y, YANG H, JIA Y, et al. PermLSTM: A high energy-efficiency LSTM accelerator architecture[J]. Electronics, 2021, 10(8): 882.
[8] LI S, ZHU S, LUO X, et al. An efficient sparse LSTM accelerator on embedded FPGAs with bandwidth-oriented pruning[C]//2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). USA: IEEE, 2023: 42-48.
[9] QUE Z, NAKAHARA H, NURVITADHI E, et al. Recurrent neural networks with column-wise matrix-vector multiplication on FPGAs[J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2021, 30(2): 227-237.
[10] JOSEPH T, BINDIYA T S. Performance-driven LSTM accelerator hardware using split-matrix-based MVM[J]. Circuits, Systems, and Signal Processing, 2023, 42(11): 6660-6683.
[11] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[12] TIAN X, ZHOU F, CHEN Y W, et al. Design of real-time double-precision floating-point matrix multiplier based on FPGA[J]. Journal of Zhejiang University: Engineering Edition, 2008, 42(9): 1611-1615.
[13] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[14] LIU J, GE Y F, TIAN M, et al. Reconfigurable convolutional neural network accelerator based on ZYNQ[J]. Acta Electronica Sinica, 2021, 49(4): 729.
[15] GHASEMZADEH S A, TAVAKOLI E B, KAMAL M, et al. BRDS: An FPGA-based LSTM accelerator with row-balanced dual-ratio sparsification[EB/OL]. (2021-01-07)[2025-09-18]. https://arxiv.org/abs/2101.02667.
[16] MAO N, YANG H, HUANG Z. An instruction-driven batch-based high-performance resource-efficient LSTM accelerator on FPGA[J]. Electronics, 2023, 12(7): 1731.
Publication history
  • Received: 2024-06-27
  • Revised: 2025-09-16
  • First published online: 2026-04-16