Converting PDF to PNG for Tesseract to process


Question

I'm having an issue at the moment with ImageMagick and Tesseract.

I'm working on a command-line classifier for documents in PHP. The idea is that it takes in PDF documents and uses the League Pipeline package to pass them through a number of steps. The steps I've identified as necessary are as follows:

  1. Convert the PDF to a PNG file
  2. Extract the text from the PNG file
  3. Run the text through a machine learning library to classify it

The main command looks like this:

<?php

namespace Matthewbdaly\LetterClassifier\Commands;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Input\InputArgument;
use League\Pipeline\Pipeline;
use Matthewbdaly\LetterClassifier\Stages\ConvertPdfToPng;
use Matthewbdaly\LetterClassifier\Stages\ReadFile;

class Processor extends Command
{
    protected function configure()
    {
        $this->setName('process')
            ->setDescription('Processes a file')
            ->setHelp('This command processes a file')
            ->addArgument('file', InputArgument::REQUIRED, 'File to process');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $file = $input->getArgument('file');
        $pipeline = (new Pipeline)
            ->pipe(new ConvertPdfToPng)
            ->pipe(new ReadFile);
        $pipeline->process($file);
    }
}

As you can see, it accepts a filename as the first argument, then defines a pipeline for the required steps, before passing the file to the pipeline.

The step for converting the PDF looks like this:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use Imagick;

class ConvertPdfToPng
{
    public function __invoke($file)
    {
        $tmp = tmpfile();
        $uri = stream_get_meta_data($tmp)['uri'];
        $img = new Imagick($file);
        $img->setResolution(300, 300);
        $img->setImageDepth(8);
        $img->setImageFormat('png');
        $img->writeImage($uri);
        return $tmp;
    }
}

It writes a PNG version of the PDF as a temporary file. The generated file looks OK, at least to my eye, but it can't be read correctly by Tesseract. Here's the second step, where Tesseract should process the file:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use thiagoalessio\TesseractOCR\TesseractOCR;

class ReadFile
{

    public function __invoke($file)
    {
        $uri = stream_get_meta_data($file)['uri'];
        $ocr = new TesseractOCR($uri);
        $output = $ocr->lang('eng')->run();
        eval(\Psy\Sh());
    }
}

The output from PsySH looks like this:

=> """
   Am sum\n
   \n
   mm" m mun SuHrkw-l\n
   n m 51mm\n
   \n
   mm\n
   \n
   um um\n
   \n
   ms Murine\n
   1 Elm: 51mm\n
   Emnuumn\n
   \n
   a mu\n
   \n
   m Mm 2m-\n
   Dav st-n-m.\n
   \n
   P‘Eualanfl ma lumnflarvlmamrmy "Hay "mum-m-\n
   we we "mum-m n: "mum," m mun\n
   \n
   vm [harem\n
   \n
   Am smrm
   """

This is not the content of the letter I'm trying to classify; the text is getting mangled. If I run the following commands from the shell, they work as expected, converting the PDF and writing the letter's text to the output file:

convert -density 300 Quote.pdf output.png
tesseract output.png output

And if I hardcode the path in the Tesseract stage to point at the output.png generated by the convert command, that works. So I'm fairly confident the issue is with the step that generates the PNG file. I'm not that experienced with ImageMagick, so I'm unsure why the file can't be processed, but it seems like there's a setting of some kind that I'm missing.

Can anyone suggest what the problem might be?

Recommended answer

I suspect the problem is that Imagick reads the PDF before you call setResolution().
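
Some background on why the order would matter: a PDF is vector data, so ImageMagick rasterizes it at the moment it is read, using whatever density is in effect at that point (72 DPI by default). Setting the resolution afterwards has no effect on pixels that have already been rendered. That is also why -density comes before the input file in the convert command from the question. As a rough way to check whether this is what's happening, you could compare the pixel size of the generated PNG with what you'd expect at 300 DPI. The helper name expectedPixels() below is made up for illustration and is not part of Imagick; it assumes you know the page size in PDF points (1 point = 1/72 inch).

```php
<?php

// Hypothetical helper, not part of Imagick: expected pixel length of a
// page side given in PDF points (1 point = 1/72 inch) at a given DPI.
function expectedPixels(float $points, float $dpi): int
{
    return (int) round($points * $dpi / 72);
}

// A US Letter page is 612 x 792 points.
// Print its expected pixel size at the default density and at 300 DPI.
printf("%d x %d\n", expectedPixels(612, 72), expectedPixels(792, 72));
printf("%d x %d\n", expectedPixels(612, 300), expectedPixels(792, 300));
```

If getImageWidth() and getImageHeight() on the image your stage produces come back close to the 72 DPI figures rather than the 300 DPI ones, the resolution wasn't applied when the PDF was read.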

Try instantiating an empty Imagick object, setting the resolution, and then reading the file:

$img = new Imagick();
$img->setResolution(300, 300);
$img->readImage($file);
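
For reference, here is a sketch of how that change might slot into the ConvertPdfToPng stage from the question. It has not been run against the asker's PDF, so treat it as a starting point. Everything except the reordered calls is taken from the question's code; the clear() call at the end is my own addition to release the image once it has been written.

```php
<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use Imagick;

class ConvertPdfToPng
{
    public function __invoke($file)
    {
        // Keep the handle around: tmpfile() removes the file once it is closed.
        $tmp = tmpfile();
        $uri = stream_get_meta_data($tmp)['uri'];

        // Start from an empty object so the resolution is already set
        // when the PDF is read and rasterized.
        $img = new Imagick();
        $img->setResolution(300, 300);
        $img->readImage($file);

        // Same output settings as in the question.
        $img->setImageDepth(8);
        $img->setImageFormat('png');
        $img->writeImage($uri);

        // Addition (not in the question): free the image resources.
        $img->clear();

        return $tmp;
    }
}
```

The stage still returns the temporary file handle, so the ReadFile stage from the question should work with it unchanged.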
