How to implement OCR text extraction in Java
Known free tools: Tesseract, PaddleOCR, EasyOCR, RapidOCR. Paid tools: Microsoft Read API / Google Cloud Vision.
The implementations below cover Tesseract, PaddleOCR and RapidOCR.
The environment is local development on Windows.
Tesseract
1. Install the Tesseract OCR engine locally. This is needed because Tess4J is not a pure Java implementation: it calls the native Tesseract library through JNA (Java Native Access). https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.3.0.20221214.exe
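2. Download the trained models: https://github.com/tesseract-ocr/tessdata_best. The code below loads chi_sim, chi_tra and eng, so the matching .traineddata files must be present in the data directory.
3. pom configuration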
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.8.0</version>
    <exclusions>
        <exclusion>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>5.13.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.7.0</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.29</version>
</dependency>
4. Basic Java implementation
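// Imports used by the methods below. JSONObject is assumed to be fastjson's
// com.alibaba.fastjson.JSONObject (it is not listed in the pom above);
// MultipartFile comes from Spring Web. SAVE_PATH and TEST_DATA_PATH are
// constants defined elsewhere in the class (not shown here).
import com.alibaba.fastjson.JSONObject;
import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.tika.Tika;
import org.springframework.web.multipart.MultipartFile;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Save the uploaded file and correct its extension
public File imageCheck(MultipartFile multipartFile) {
    File tempFile = null;
    File reNameFile = null;
    try {
        // Use a UUID as a unique file name
        String uuid = UUID.randomUUID().toString();
        tempFile = new File(SAVE_PATH + multipartFile.getOriginalFilename());
        multipartFile.transferTo(tempFile);
        // Detect the real type from the content. For example, a file uploaded with a
        // .png extension may actually be a JPEG; recognition can fail unless the
        // extension is corrected.
        String mimeType = new Tika().detect(tempFile);
        System.out.println(mimeType);
        String[] type = mimeType.split("/");
        String fileName = SAVE_PATH + uuid + "." + type[1];
        reNameFile = new File(fileName);
        // renameTo reports failure through its return value, so check it
        if (!tempFile.renameTo(reNameFile)) {
            throw new IllegalStateException("Could not rename " + tempFile + " to " + reNameFile);
        }
        // openVC(fileName); // image preprocessing
    } catch (Exception e) {
        if (reNameFile != null) {
            reNameFile.delete();
        }
        throw new RuntimeException("File generation failed", e);
    } finally {
        // No-op once the rename has succeeded; cleans up if an earlier step failed
        if (tempFile != null) {
            tempFile.delete();
        }
    }
    return reNameFile;
}

// PDF text extraction: render each page to an image, then run OCR on it
public String convertPdfToImage(Tesseract tesseract, File pdfFile, int dpi) {
    try (PDDocument document = PDDocument.load(pdfFile)) {
        List<Word> allWords = new ArrayList<>();
        // One renderer is enough for the whole document
        PDFRenderer renderer = new PDFRenderer(document);
        for (int page = 0; page < document.getNumberOfPages(); page++) {
            // Render the page as an image; dpi controls the image quality
            BufferedImage bf = renderer.renderImageWithDPI(page, dpi);
            List<Word> words = tesseract.getWords(bf, ITessAPI.TessPageIteratorLevel.RIL_WORD);
            allWords.addAll(words);
        }
        return JSONObject.toJSONString(allWords);
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}

// Entry point: takes the uploaded image or PDF and runs text recognition on it
public String ocrCore(MultipartFile multipartFile) {
    Tesseract tesseract = new Tesseract();
    File reNameFile = null;
    String result = "";
    try {
        reNameFile = imageCheck(multipartFile);
        tesseract.setDatapath(TEST_DATA_PATH); // path to the trained data
        tesseract.setLanguage("chi_sim+chi_tra+eng"); // recognize Chinese and English together
        if (reNameFile.getAbsolutePath().endsWith(".pdf")) {
            result = convertPdfToImage(tesseract, reNameFile, 300);
        } else {
            BufferedImage bf = ImageIO.read(reNameFile);
            List<Word> words = tesseract.getWords(bf, ITessAPI.TessPageIteratorLevel.RIL_WORD);
            result = JSONObject.toJSONString(words);
        }
        // result = tesseract.doOCR(reNameFile);
        System.out.println(result);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (reNameFile != null) {
            reNameFile.delete();
        }
    }
    return result;
}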
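The extension correction in imageCheck works because Tika looks at the file content instead of trusting the file name. As a rough illustration of that idea only (this is not how Tika is implemented, and the class and method names are hypothetical), a sniffer can compare the first bytes of a file against well-known signatures. The sketch below covers just PNG, JPEG and PDF.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses a file extension from the leading "magic" bytes. */
final class MagicSniffer {

    // Well-known file signatures
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns "png", "jpg" or "pdf", or null when the header matches none of them. */
    static String sniffExtension(byte[] header) {
        if (startsWith(header, PNG)) return "png";
        if (startsWith(header, JPEG)) return "jpg";
        if (startsWith(header, PDF)) return "pdf";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A file named "photo.png" whose content starts with the JPEG signature
        byte[] header = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        System.out.println(sniffExtension(header)); // prints "jpg"
    }
}
```

In a real service Tika remains the better choice, since it knows far more formats than this three-entry table.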
PaddleOCR
1. Python installation
pip install paddlepaddle paddleocr
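To recognize PDF files, the Python packages below are also required. On Windows, poppler must be installed manually: download it from https://github.com/oschwartz10612/poppler-windows/releases/tag/v24.08.0-0/ and add its bin directory to the PATH environment variable.
pip install pdf2image poppler-utils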
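A missing PATH entry for poppler usually only shows up when the first PDF is converted. pdf2image relies on poppler's command-line tools such as pdftoppm, so a quick lookup can be done up front. The following is a minimal sketch from the Java side; the class and method names are hypothetical, and it only checks that a file with the tool's name exists in one of the PATH directories.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper: looks for a command-line tool in the directories of a PATH value. */
final class PathLookup {

    static Optional<Path> findOnPath(String pathValue, String tool) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) continue;
            // Try the bare name (Linux/macOS) and the .exe variant (Windows)
            for (String candidate : new String[] {tool, tool + ".exe"}) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> found = findOnPath(System.getenv("PATH"), "pdftoppm");
        System.out.println(found.map(Path::toString).orElse("pdftoppm not found on PATH"));
    }
}
```

Note that environment variable changes on Windows only reach processes started afterwards, so the Python service needs a restart after editing PATH.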
2. Python implementation: create the OCR endpoint
from flask import Flask, request
from paddleocr import PaddleOCR
from pdf2image import convert_from_path
import numpy as np
import json

app = Flask(__name__)
ocr = PaddleOCR(use_angle_cls=True, lang="ch", use_gpu=True)


def is_pdf(file_path):
    return file_path.lower().endswith('.pdf')


@app.route('/ocr', methods=['POST'])
def ocr_api():
    img_path = request.json['image_path']
    if is_pdf(img_path):
        # Convert the PDF into a list of images, one per page
        images = convert_from_path(img_path, dpi=300)  # higher dpi is sharper but slower
        resp = []
        # Run OCR on every page
        for image in images:
            # Convert the PIL.Image to a numpy array
            img_np = np.array(image)
            result = ocr.ocr(img_np)
            resp.append(result)
        return json.dumps(resp, ensure_ascii=False)
    else:
        result = ocr.ocr(img_path, cls=True)
        return json.dumps(result, ensure_ascii=False)


if __name__ == '__main__':
    # threaded=False: with multiple threads, repeatedly parsing the same image may hit
    # image-decoding conflicts in OpenCV/Pillow and cause intermittent failures
    app.run(host='0.0.0.0', port=5000, threaded=False)
    # Variant of the handler body (requires: from paddleocr import PPStructure).
    # PPStructure runs layout analysis on complex documents and can handle
    # multi-column layouts.
    table_engine = PPStructure(recovery=True, use_pdf2docx_api=True)
    # Process an image or a PDF
    result = table_engine(img_path)
    # result = ocr.ocr(img_path, cls=True)
    texts = []
    for region in result:
        # Keep only text-like regions
        if region['type'] in ('text', 'equation', 'title', 'header', 'figure', 'figure_caption'):
            # one_row is a helper that is not shown in this post
            texts.append(one_row(region['res']))
        if region['type'] == 'table':
            texts.append(region['res']['html'])
    return json.dumps(texts, ensure_ascii=False)
3. Java implementation: call the OCR endpoint implemented in Python
Request example:
OkHttpClient client = new OkHttpClient().newBuilder()
        .build();
MediaType mediaType = MediaType.parse("application/json");
RequestBody body = RequestBody.create(mediaType,
        "{\r\n \"image_path\":\"D:/Users/Administrator/Document/personal/ocr/结构化文档样式/记录纸.pdf\"\r\n}\r\n");
Request request = new Request.Builder()
        .url("http://localhost:5000/ocr")
        .method("POST", body)
        .addHeader("Content-Type", "application/json")
        .build();
Response response = client.newCall(request).execute();
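The hand-escaped JSON string in the request above breaks as soon as the path contains a backslash or a quote, which is common for Windows paths. As an alternative sketch that swaps OkHttp for the JDK's own java.net.http API (Java 11+), the body can be built with a small escaping helper. The class name, method names and the sample path are hypothetical; only the endpoint URL and the image_path field come from the service above.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the POST request for the /ocr endpoint. */
final class OcrRequestSketch {

    /** Escapes a value so it can be placed inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c)); // other control characters
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the JSON body expected by the Flask handler: {"image_path": "..."}. */
    static String buildBody(String imagePath) {
        return "{\"image_path\":\"" + jsonEscape(imagePath) + "\"}";
    }

    /** Builds the request without sending it. */
    static HttpRequest buildRequest(String endpoint, String imagePath) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody(imagePath), StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Sample path is made up for illustration
        System.out.println(buildBody("D:\\scans\\page \"1\".pdf"));
    }
}
```

Sending would then be `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, which requires the Flask service from step 2 to be running. If a JSON library is already on the classpath, letting it serialize the body is the simpler option.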
RapidOCR
1. pom file
<dependency>
    <groupId>io.github.mymonstercat</groupId>
    <artifactId>rapidocr</artifactId>
    <version>0.0.7</version>
</dependency>
<dependency>
    <groupId>io.github.mymonstercat</groupId>
    <artifactId>rapidocr-onnx-platform</artifactId>
    <version>0.0.7</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.29</version>
</dependency>
<dependency>
    <groupId>io.github.mymonstercat</groupId>
    <artifactId>rapidocr-onnx-linux-x86_64</artifactId>
    <version>1.2.2</version>
</dependency>
2. Java implementation
// Entry point
public String rapidocrORC(MultipartFile multipartFile) {
    File reNameFile = null;
    String result = "";
    try {
        reNameFile = imageCheck(multipartFile); // see the Tesseract section
        InferenceEngine engine = InferenceEngine.getInstance(Model.ONNX_PPOCR_V3);
        if (reNameFile.getAbsolutePath().endsWith(".pdf")) {
            result = convertPdfToImage(engine, reNameFile, 300);
        } else {
            OcrResult ocrResult = engine.runOcr(reNameFile.getAbsolutePath());
            result = JSONObject.toJSONString(ocrResult);
        }
        System.out.println(result);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (reNameFile != null) {
            reNameFile.delete();
        }
    }
    return result;
}

// PDF text extraction: render each page to a temporary PNG, then run OCR on that file
public String convertPdfToImage(InferenceEngine engine, File pdfFile, int dpi) {
    try (PDDocument document = PDDocument.load(pdfFile)) {
        List<OcrResult> allResults = new ArrayList<>();
        // One renderer is enough for the whole document
        PDFRenderer renderer = new PDFRenderer(document);
        for (int page = 0; page < document.getNumberOfPages(); page++) {
            // Render the page as an image; dpi controls the image quality
            BufferedImage bf = renderer.renderImageWithDPI(page, dpi);
            String fileName = pdfFile.getAbsolutePath().replace(".pdf", "_" + page + ".png");
            File image = new File(fileName);
            try {
                ImageIO.write(bf, "png", image);
                OcrResult ocrResult = engine.runOcr(fileName);
                allResults.add(ocrResult);
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                // Remove the temporary page image whether or not OCR succeeded
                image.delete();
            }
        }
        return JSONObject.toJSONString(allResults);
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
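The per-page file name above is derived with String.replace(".pdf", ...), which rewrites every ".pdf" occurrence in the absolute path and does nothing for an upper-case ".PDF". That is harmless here because imageCheck generates the name, but a more defensive variant may be useful when the method is reused. The following is a minimal sketch with hypothetical class and method names; it touches only the final extension of the file name.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper for naming the temporary per-page images. */
final class PdfPageNames {

    /** Case-insensitive check, so "scan.PDF" is also treated as a PDF. */
    static boolean isPdf(Path file) {
        return file.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    /** Builds "<base>_<page>.png" next to the PDF, changing only the last extension. */
    static Path pageImagePath(Path pdf, int page) {
        String name = pdf.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return pdf.resolveSibling(base + "_" + page + ".png");
    }

    public static void main(String[] args) {
        // Made-up path whose directory name also contains ".pdf"
        Path pdf = Paths.get("uploads.pdf.d", "report.PDF");
        System.out.println(isPdf(pdf) + " " + pageImagePath(pdf, 0));
    }
}
```

In convertPdfToImage this would replace the fileName line with `pageImagePath(pdfFile.toPath(), page).toString()`.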