Extracting text written in Hindi from a PDF in Python


Problem description

I want to extract text typed in Hindi from a PDF document. I've attached an image of the sample page I am dealing with.

I've tried using pdfminer to get the text, but the output is garbled (possibly because of the Hindi fonts).

Now I am thinking of splitting the page into three parts, splitting each part in two (separating the English and Hindi text), and then running OCR on each half to get the text. The only issue is that I don't know which font is used for the Hindi text, so I might get garbled text again.

My question is: is there a better way to deal with the Hindi fonts? And how do I find the font name?

Recommended answer

I have tried the following on your PDF and it appears to extract a lot of the text. I am guessing it might not be in the best layout, but I am not able to tell.

# Python 3 version, using pdfminer.six (the maintained fork of pdfminer;
# the import paths are the same as in the original package).
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()    # in-memory buffer that collects the extracted text
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)

    with open(path, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos = set()    # an empty set means "process every page"

        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"))

The output is UTF-8 encoded, so you must make sure your console is capable of displaying it.

For example:

भभग ससखखभककल मतदभतभ 11 1.रजजरभ आसशशकपपथममक ववददपलद रजजरप - सपमपनद779 420 359 0 779ननरभरचक नभमभरलल 2014 0S24उततर पददशवरधभन सभभ कदत कक ससखखभ ,नभम र आरकण सससनत:ललक सभभ कदत कक ससखखभ ,नभम र आरकण सससनत: 1 . पकनरलकण कभ वरररणपकनरलकण कभ ररर : 2014अहतभर कक नतथस: 01.01.2014पकनरलकण कभ सररप: ससककपत पकनरलकणपकभशन कक नतथस: 01.10.2013पकनरमकदण कक नतथस : 15.03.2014
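If the console cannot render Devanagari, one workaround is to write the extracted text to a UTF-8 file and open it in an editor that has a Hindi-capable font. The following is a minimal sketch using only the standard library; `save_text` and `use_utf8_stdout` are hypothetical helper names, not part of pdfminer.

```python
import sys


def save_text(text, out_path):
    """Write `text` to `out_path`, explicitly encoded as UTF-8."""
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(text)


def use_utf8_stdout():
    """Ask Python to encode printed text as UTF-8 (Python 3.7+ only).

    This changes only the encoding Python uses; the terminal still needs
    a font that contains Devanagari glyphs.
    """
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
```

For instance, `save_text(convert_pdf_to_txt("Electoral roll - Faizabad.pdf"), "roll.txt")` would store the result in a file named `roll.txt` (an arbitrary name chosen for illustration).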

To determine the list of fonts the PDF uses, load it into a PDF reader such as Adobe Reader or Foxit Reader and select Properties from the File menu. From there you should be able to select Fonts. When I tried this with Foxit Reader, it displayed the following fonts:

Mangal-Bold
Arial
Mangal
Arial Bold
Times-New-Roman-Bold
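The font names can probably also be read from Python instead of a PDF reader. The sketch below assumes pdfminer.six is installed and walks each page's resource dictionary, collecting the `BaseFont` entry of every font it finds. `list_fonts` and `strip_subset_tag` are hypothetical helper names; fonts referenced only from nested objects (for example form XObjects) are not covered, so the result may be incomplete.

```python
import re

# Embedded subset fonts are named with a six-letter tag, e.g. "ABCDEF+Mangal".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(font_name):
    """Remove a leading subset tag such as 'ABCDEF+' from a font name."""
    return _SUBSET_TAG.sub("", font_name)


def list_fonts(path):
    """Return the sorted BaseFont names found in the page resources."""
    # Imported here so strip_subset_tag stays usable without pdfminer.six.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import literal_name

    names = set()
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            # /Resources -> /Font maps resource names to font dictionaries.
            fonts = resolve1((page.resources or {}).get("Font")) or {}
            for ref in fonts.values():
                spec = resolve1(ref)
                base = spec.get("BaseFont") if isinstance(spec, dict) else None
                if base is not None:
                    names.add(strip_subset_tag(literal_name(base)))
    return sorted(names)
```

Calling `list_fonts("Electoral roll - Faizabad.pdf")` should then return names comparable to the list above, although the exact spelling can differ from what a PDF reader displays.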
