How to get the Unicode of characters from a PDF using Java and PDFBox


Question

I am using Apache PDFBox and Java to parse PDFs and extract all the information from them. Text extraction works correctly only for English. For other languages I get only special characters. For example, extracting the Arabic character ش gives the string "?" when printed. It works correctly when I change the "Region and language" setting of my computer from English to Arabic, so I think extracting the Unicode values of the characters would solve the problem. How can I get the Unicode of the characters from a PDF, or is there another way to solve this?

Recommended answer

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.6.0/org/apache/pdfbox/util/PDFText2HTML.java

In that class, the private method `String escape(String chars)` converts characters to Unicode.
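Because that method is private, it cannot be called from outside the class, so the idea has to be reproduced in your own code. The following is a minimal sketch of that idea, not a copy of the PDFBox source: it assumes the text has already been extracted into a Java `String` (for example with `PDFTextStripper.getText`), and the class and method names (`UnicodeEscapeSketch`, `escapeNonAscii`, `toCodePoints`) are hypothetical. It replaces every character outside printable ASCII with an HTML numeric character reference and also lists the code points in `U+XXXX` form. A Java `String` already holds Unicode internally, so a "?" that appears only on printing may come from the console's default encoding rather than from the extraction itself; the ASCII-only output here sidesteps that.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of PDFBox.
class UnicodeEscapeSketch {

    // Replace every code point outside printable ASCII (32..126)
    // with an HTML numeric character reference such as &#1588;.
    static String escapeNonAscii(String chars) {
        StringBuilder builder = new StringBuilder(chars.length());
        chars.codePoints().forEach(cp -> {
            if (cp < 32 || cp > 126) {
                builder.append("&#").append(cp).append(';');
            } else {
                builder.appendCodePoint(cp);
            }
        });
        return builder.toString();
    }

    // List the Unicode code points of the text in U+XXXX notation.
    static List<String> toCodePoints(String text) {
        List<String> result = new ArrayList<>();
        text.codePoints().forEach(cp -> result.add(String.format("U+%04X", cp)));
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for text returned by PDFTextStripper.getText(document);
        // \u0634 is the Arabic letter sheen from the question.
        String extracted = "a\u0634";
        System.out.println(escapeNonAscii(extracted)); // prints a&#1588;
        System.out.println(toCodePoints(extracted));   // prints [U+0061, U+0634]
    }
}
```

Iterating over code points rather than `char` values keeps characters outside the Basic Multilingual Plane intact, since those occupy two `char` units in a Java `String`.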
