PDF Text Extraction Problem

Question

When I try to extract plain text from a PDF, I get garbled data instead of the actual text. The fonts in that PDF are embedded subsets with names like TT222FO00, and the encoding is custom.

Can anybody help me with this?

Thanks in advance.

[moved up from comment]

This is how I'm doing it:

Recommended answers

Maybe you'd want to try one of these free libraries: http://java-source.net/open-source/pdf-libraries

Hope you'll find something appropriate there :).

Cheers!

—MRB


I can recommend PDF Clown.

It is well documented and works fine.


iText is the third-party library that most developers use. For extraction, please see this discussion: http://stackoverflow.com/questions/4026614/extract-text-from-pdf-files
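
Some background on why the output can look garbled, whichever library is used: with an embedded subset font and a custom encoding, the codes stored in the page content are often just glyph indexes, not standard character codes. An extractor usually needs a code-to-Unicode table (in PDF this is typically the font's ToUnicode CMap) to recover the real text. The sketch below only illustrates that idea with plain Java. The class name, the method names and the three-entry table are made up for the example; it does not parse a real PDF and is not how any particular library is implemented.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphDecodeSketch {

    // Decodes glyph codes through a ToUnicode-style lookup table.
    // Codes with no entry become U+FFFD, the Unicode replacement character.
    public static String decodeWithToUnicode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    // What happens when no table is available and each code is
    // simply treated as if it were a character code.
    public static String decodeNaively(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical subset font: codes 1..3 were assigned to the
        // glyphs for P, D and F (assumed values, for illustration only).
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(1, "P");
        toUnicode.put(2, "D");
        toUnicode.put(3, "F");

        int[] codes = {1, 2, 3};

        // With the table, the original text is recovered.
        System.out.println(decodeWithToUnicode(codes, toUnicode));

        // Without it, the result is control characters, not "PDF".
        System.out.println(decodeNaively(codes).equals("PDF"));
    }
}
```

If the PDF in question has no usable mapping for its fonts, switching libraries may not help by itself, and OCR on the rendered pages may be the only way to get the text back.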

