I want to scrape a Hindi (Indian language) PDF file with Python


Question

I have written Python code that scrapes all the data from a PDF file. The problem is that once the text is scraped, the words lose their grammar. How can I fix this? My code is below.

# Python 2 code (cStringIO and the print statement), using pdfminer.
from cStringIO import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    # Extracted text is collected in an in-memory buffer.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()  # an empty set means "process every page"

        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
        text = retstr.getvalue()
    # Release the converter and the buffer before returning.
    device.close()
    retstr.close()
    return text


print convert_pdf_to_txt("S24A276P001.pdf")

Here is a screenshot of the PDF.

Recommended answer

The best way to solve the problem is to use the textract module for Python, load the Hindi test data from its GitHub repository, and write the extracted text to a .txt file. This solved my problem.
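The answer does not include code, so the following is only a minimal sketch of what it might look like, written for Python 3. Several things here are assumptions rather than part of the original answer: that textract is run in OCR mode (`method="tesseract"`), that Tesseract's Hindi language data (`"hin"`) is installed, and the helper names `extract_hindi_text`, `devanagari_ratio`, and `save_text`, which are hypothetical. The `devanagari_ratio` check is an added sanity test, not something the answer mentions.

```python
import io
import unicodedata


def extract_hindi_text(pdf_path):
    """Extract text from a PDF with textract and return it as str."""
    # Imported lazily so the helpers below work without textract installed.
    import textract

    # Assumption: OCR through Tesseract with Hindi language data ("hin").
    # The original answer does not say which textract method it used.
    raw = textract.process(pdf_path, method="tesseract", language="hin")
    return raw.decode("utf-8")


def devanagari_ratio(text):
    """Return the fraction of letters and marks in the Devanagari block."""
    # Count only letters (L*) and combining marks (M*); skip digits,
    # punctuation and whitespace.
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return hits / len(letters)


def save_text(text, out_path):
    """Write the extracted text to a UTF-8 encoded file."""
    with io.open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)
```

A possible use would be `text = extract_hindi_text("S24A276P001.pdf")` followed by `save_text(text, "output.txt")`. A low `devanagari_ratio(text)` would suggest the extraction produced something other than Hindi script, although a high ratio alone does not prove the words are spelled correctly.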
