Converting PDF to PNG for Tesseract to process

Views: 205 Published: 2020/11/27 1:48:56 Tags: php imagemagick tesseract


Problem description

I'm having an issue at the moment with ImageMagick and Tesseract.

I'm working on a command-line classifier for documents in PHP. The idea is that it takes in PDF documents and uses the League Pipeline package to pass each one through a series of steps. The steps I've identified as necessary are as follows:

Convert the PDF to a PNG file
Extract the text from the PNG file
Run the text through a machine learning library to classify it

The main command looks like this:

<?php

namespace Matthewbdaly\LetterClassifier\Commands;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Input\InputArgument;
use League\Pipeline\Pipeline;
use Matthewbdaly\LetterClassifier\Stages\ConvertPdfToPng;
use Matthewbdaly\LetterClassifier\Stages\ReadFile;

class Processor extends Command
{
    protected function configure()
    {
        $this->setName('process')
            ->setDescription('Processes a file')
            ->setHelp('This command processes a file')
            ->addArgument('file', InputArgument::REQUIRED, 'File to process');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $file = $input->getArgument('file');
        $pipeline = (new Pipeline)
            ->pipe(new ConvertPdfToPng)
            ->pipe(new ReadFile);
        $pipeline->process($file);

        // Symfony 5 and later require execute() to return an integer exit code.
        return 0;
    }
}

As you can see, it accepts a filename as the first argument, then defines a pipeline for the required steps, before passing the file to the pipeline.
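For readers unfamiliar with the pipeline pattern: each stage is a callable, and the value one stage returns becomes the argument of the next. The plain-PHP sketch below imitates that flow without the League Pipeline package. The two closures are hypothetical stand-ins for ConvertPdfToPng and ReadFile, not the real stages, and they only manipulate strings.

```php
<?php

// Hypothetical stand-ins for the real stages: each takes a payload and returns a new one.
$convert = function (string $pdfPath): string {
    // Pretend conversion: swap the extension instead of rendering anything.
    return preg_replace('/\.pdf$/i', '.png', $pdfPath);
};

$read = function (string $pngPath): string {
    // Pretend OCR: just report which file would be read.
    return "text extracted from {$pngPath}";
};

// Feed the payload through each stage in order, as a pipeline would.
$result = array_reduce(
    [$convert, $read],
    function ($payload, callable $stage) {
        return $stage($payload);
    },
    'Quote.pdf'
);

echo $result, PHP_EOL; // prints "text extracted from Quote.png"
```

In the question's code the payload handed from the first stage to the second is the temporary file handle rather than a string, but the hand-off works the same way.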

The step for converting the PDF looks like this:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use Imagick;

class ConvertPdfToPng
{
    public function __invoke($file)
    {
        $tmp = tmpfile();
        $uri = stream_get_meta_data($tmp)['uri'];
        $img = new Imagick($file);
        $img->setResolution(300, 300);
        $img->setImageDepth(8);
        $img->setImageFormat('png');
        $img->writeImage($uri);
        return $tmp;
    }
}

It writes a PNG version of the PDF as a temporary file. The generated file looks OK, at least to my eye, but it can't be read correctly by Tesseract. Here's the second step where Tesseract should process the file:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use thiagoalessio\TesseractOCR\TesseractOCR;

class ReadFile
{

    public function __invoke($file)
    {
        $uri = stream_get_meta_data($file)['uri'];
        $ocr = new TesseractOCR($uri);
        $output = $ocr->lang('eng')->run();
        eval(\Psy\Sh());
    }
}

The output from PsySH looks like this:

=> """
   Am sum\n
   \n
   mm" m mun SuHrkw-l\n
   n m 51mm\n
   \n
   mm\n
   \n
   um um\n
   \n
   ms Murine\n
   1 Elm: 51mm\n
   Emnuumn\n
   \n
   a mu\n
   \n
   m Mm 2m-\n
   Dav st-n-m.\n
   \n
   P‘Eualanﬂ ma lumnﬂarvlmamrmy "Hay "mum-m-\n
   we we "mum-m n: "mum," m mun\n
   \n
   vm [harem\n
   \n
   Am smrm
   """

This is not the content of the letter I'm trying to classify - the text is getting mangled. If I run the following commands from the shell, they work as expected, converting the letter and writing its text to the output file:

convert -density 300 Quote.pdf output.png
tesseract output.png output

And if I hardcode the path in the Tesseract stage to point at the output.png generated by the convert command, that works. So I'm fairly confident the issue is with the step that generates the PNG file. I'm not that experienced with ImageMagick, so I'm unsure why the file can't be processed, but it seems like there's a setting of some kind that I'm missing.

Can anyone suggest what the problem might be?
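One likely difference between the two paths, offered as a guess, not a confirmed diagnosis: in the shell command, -density 300 comes before the input file, so it applies while the PDF is being rasterised. In the PHP stage, new Imagick($file) reads the PDF first, at the default density, and the later setResolution() call only changes metadata on an image that has already been rendered at low resolution. The sketch below sets the resolution before reading. The function name, its parameters, and the choice to render only the first page are assumptions made for illustration.

```php
<?php

// A sketch only: renderPdfToPng() and its parameters are hypothetical names.
function renderPdfToPng(string $pdfPath, string $pngPath, int $dpi = 300): void
{
    $img = new Imagick();

    // Set the density BEFORE reading: the PDF is rasterised during the read.
    $img->setResolution($dpi, $dpi);

    // Assumption: only the first page is needed, hence the [0] page selector.
    $img->readImage($pdfPath . '[0]');

    $img->setImageDepth(8);
    $img->setImageFormat('png');

    // getImageBlob() encodes in the format chosen above, whatever the target file is called.
    if (file_put_contents($pngPath, $img->getImageBlob()) === false) {
        throw new RuntimeException("Could not write {$pngPath}");
    }

    $img->clear();
}
```

Inside the existing stage, the same ordering could be applied by creating an empty Imagick object, calling setResolution(300, 300), and only then calling readImage($file).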
