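(National University of Defense Technology, Changsha, Hunan 410073, China)
Abstract: With information technology and computers now ubiquitous, OCR (optical character recognition) is widely used across industries to enter text held in image form into a computer quickly and conveniently and convert it into editable text. As the number of files grows rapidly, however, a single-machine workflow that recognizes files one by one and page by page can no longer meet users' needs when faced with massive image data. Distributed systems are an effective way to store and process massive amounts of information. By analyzing the characteristics of HDFS and using the MapReduce mechanism, this paper proposes a method for recognizing text images in parallel with the Tesseract-OCR engine, which can serve as a reference for large-scale image recognition with OCR in the future.
Keywords: HDFS; MapReduce; OCR; parallel; recognition
CLC number: TP391  Document code: A  Article ID: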
Study of Parallel Character Recognition Technology on the Hadoop Platform
MENG Shuai
(National University of Defense Technology, Changsha 410073, China)
Abstract: As information technology and computers become increasingly widespread, OCR (optical character recognition) technology, which allows text held as images to be entered into a computer conveniently and quickly and converted into text, has been widely adopted in many fields. With the rapid growth in the number of files, however, the single-machine mode of recognizing files one by one and page by page can no longer meet users' needs when faced with massive image data. A distributed system is an effective way to store and process massive amounts of information. By analyzing the characteristics of HDFS and using the MapReduce mechanism, this paper proposes a method that uses the Tesseract-OCR engine to recognize character images in parallel, providing a reference for future large-scale image recognition with OCR technology.
Key words: HDFS; MapReduce; OCR; parallel; recognition
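The parallel recognition idea summarized in the abstract can be illustrated with a minimal local sketch. It is not the paper's implementation: a thread pool stands in for the MapReduce map phase, the function names `tesseract_cli` and `recognize_all` are hypothetical, and the code assumes that a `tesseract` binary is on the PATH and that each image has already been copied from HDFS to the local file system.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def tesseract_cli(image_path: str, lang: str = "eng") -> str:
    """Recognize one image with the tesseract command-line tool.

    Assumption: `tesseract` is installed and on PATH; passing `stdout` as the
    output base makes it print the recognized text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def recognize_all(
    paths: Iterable[str],
    recognize: Callable[[str], str] = tesseract_cli,
    workers: int = 4,
) -> Dict[str, str]:
    """Map step: recognize every image independently and in parallel.

    Each image is an independent unit of work, which is what makes the task
    suitable for a map-only job. The thread pool here is only a local stand-in
    for the cluster's map tasks.
    """
    # Drop blank lines and surrounding whitespace from the input list.
    cleaned = [p.strip() for p in paths if p.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(recognize, cleaned))
    # Key each result by its image path, like a (key, value) map output.
    return dict(zip(cleaned, texts))
```

Passing the recognizer as a parameter keeps the parallel structure separate from the OCR engine, so the sketch can be exercised with a stub function when Tesseract is not installed.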
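About the author:
MENG Shuai (born 1980), male, from Jinzhou, Liaoning Province, is a master's student whose main research interest is parallel and distributed processing.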