I want to scrape a Hindi (Indian language) PDF file with Python

Views: 60 | Published: 2021/6/12 18:35:14 | Tags: python pdf ocr pdfminer pdf-scraping


Problem description

I have written Python code that extracts all the text from a PDF file. The problem is that once the text is extracted, the words lose their grammar. How can I fix this? My code is attached below.

# Ported to Python 3 (assumes the pdfminer.six package is installed).
from io import StringIO  # Python 3 replacement for cStringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Extract the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    # TextConverter writes the extracted text into the in-memory buffer.
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means "process all pages"

        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
    # Release resources only after the text has been read from the buffer.
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("S24A276P001.pdf"))

And here is a screenshot of the PDF.
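
One way to narrow down a problem like this is to check whether the extracted text actually contains Unicode Devanagari characters at all. The sketch below is only a diagnostic, not a fix, and it is not taken from the original post: the helper name `devanagari_ratio` is hypothetical, and the idea that a low ratio points to a non-Unicode (legacy) font encoding in the PDF is an assumption that would need to be confirmed against the actual file.

```python
import unicodedata

# Unicode block for Devanagari, the script used to write Hindi.
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def devanagari_ratio(text):
    """Return the fraction of letters and combining marks in `text`
    that fall inside the Devanagari block (hypothetical helper)."""
    # Keep only letters (L*) and marks (M*); skip digits, spaces, punctuation.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(1 for ch in relevant
               if DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END)
    return hits / len(relevant)


# Proper Unicode Hindi text consists entirely of Devanagari code points.
print(devanagari_ratio("हिन्दी"))  # 1.0

# An illustrative Latin-letter string contains none.
print(devanagari_ratio("fgUnh"))  # 0.0
```

If the value computed on the output of `convert_pdf_to_txt` is close to 0, the text layer of the PDF is probably not stored as Unicode Hindi, and an OCR-based approach (the question is tagged `ocr`) may be worth trying instead. If it is close to 1, the characters are present and the problem is more likely in how they are ordered or combined.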
