Convert a scanned PDF to text

Question

I work on classified systems, and we need to copy some source code to an unclassified machine. Using any media other than printed copy is forbidden (no CD-ROM, thumb drive, etc.). I tried PDFXchange and it was downright terrible. I copied the text from Visual Studio, converted it all to an OCR font, set the font size to 14, printed, scanned, and the PDFXchange output was riddled with errors.
Do you have any suggestions?

Edit: Sergey Alexandrovich Kryukov suggested I scan to text. Good point. My Epson 3640 printer does not scan to text, just PDF, JPEG, bitmap, and TIFF. If someone knows of a utility that will work with those, I will try it.

Recommended answer

Back before the internet, I remember Apple exchanging programs between Macs via 2D barcodes printed on plain sheets of paper.

Would that fit your need?

In any case, you need to scan to a lossless image format. Avoid JPEG, which degrades the scan to improve compression, which in turn makes it harder for the OCR to deal with it.
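As a rough illustration of the lossless-versus-JPEG point, here is a minimal Python sketch that looks at the first bytes of a scanned file and reports whether the container is a lossy one. The function names (`sniff_format`, `is_lossy_scan`) are made up for this example, and the check only covers the formats mentioned in the question.

```python
# Leading "magic" bytes for the scanner output formats named in the question.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
    (b"BM", "bmp"),
    (b"%PDF-", "pdf"),
]

# Containers that are always lossy. TIFF and PDF are not listed here,
# but either one can still wrap JPEG-compressed image data.
LOSSY_FORMATS = {"jpeg"}


def sniff_format(header: bytes) -> str:
    """Guess the file format from its first few bytes (hypothetical helper)."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def is_lossy_scan(path: str) -> bool:
    """Return True if the file at `path` looks like a JPEG scan."""
    with open(path, "rb") as f:
        header = f.read(8)
    return sniff_format(header) in LOSSY_FORMATS
```

This only inspects the container. A TIFF or PDF produced by a scanner may still hold JPEG-compressed pixels, so it is worth checking the scanner's compression setting as well.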


I suggest you scan to TIFF (best) or JPEG. Then you can use OCR software to recognize those bitmaps as plain text. PDF is the worst option. Chances are your scanner does not provide any OCR and just saves the data as a bitmap inside the PDF. If its software really does perform OCR and you have no other option, getting text out of the PDF is painful. Manually, you can simply copy and paste the data as text, but get rid of that dreadful Adobe software. For Windows, I would recommend Sumatra PDF; the viewers used on Linux are just fine.

If you really want to parse PDF programmatically, you have to specify your platform, the languages you use, and so on.

I don't know what OCR you have, so further detail depends on that. Getting away from PDF would be the best option.

—SA
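The answer above does not name a specific OCR program. As one possible way to try the "scan to TIFF, then OCR" route, the sketch below assumes the open-source Tesseract command-line tool (version 4 or later) is installed and on the PATH; that choice of tool is an assumption of this example, not something the answer recommends. The helper names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, lang: str = "eng", psm: int = 6) -> list:
    """Build the Tesseract command line for one scanned page.

    "stdout" as the output base asks Tesseract to print the recognized
    text instead of writing a .txt file. --psm 6 treats the page as a
    single uniform block of text, which is a guess for printed source code.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path: str, lang: str = "eng", psm: int = 6) -> str:
    """Run Tesseract on a TIFF (or other image) and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


if __name__ == "__main__":
    # "page1.tif" is a placeholder file name for a scanned page.
    print(ocr_image("page1.tif"))
```

Whatever OCR tool is used, source code is unforgiving of misread characters (for example `1` versus `l`, or `0` versus `O`), so the output would still need to be compiled and proofread against the printout.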

