Getting line locations with iText
Question
How can one find where lines are located in a document with iText?
Suppose I have a table in a PDF document and want to read its contents; I would like to find exactly where the cells are located. To do that, I thought I might find the intersections of the lines.
Recommended answer
I think your only option using iText will be to parse the PDF tokens manually. Before doing that, I would keep a copy of the PDF spec handy.
(I'm a .Net guy, so I use iTextSharp, but apart from some capitalization differences and property declarations the two are almost 100% the same.)
You can get the individual tokens using the PRTokeniser object, which you feed with the bytes returned by calling getPageContent(pageNum) on your PdfReader.
//Get bytes for page 1
byte[] pageBytes = reader.getPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
Then loop through the PRTokeniser:
//C# (iTextSharp) casing, matching the full listing further down
PRTokeniser.TokType tokenType;
string tokenValue;
while (tokeniser.NextToken()) {
    tokenType = tokeniser.TokenType;
    tokenValue = tokeniser.StringValue;
    //...check tokenValue, do something with it
}
As for tokenValue, you'd probably want to look for re and l, the operators for rectangle and line. If you see an re, look at the previous 4 values; if you see an l, look at the previous 2 values. This also means that you need to store each tokenValue in an array so you can look back later.
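The look-back idea can be sketched without iText at all. The Java below is only an illustration under stated assumptions: the class `PathOperatorScan` and its methods are hypothetical names, and splitting on whitespace is a crude stand-in for PRTokeniser that works for plain path operators but not for strings or arrays. It also tracks the m (move-to) operator, because per the PDF spec an l carries only the end point of the segment; the start is the current point.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: collects "re" rectangles and
// "m ... l" line segments from a whitespace-separated content stream.
class PathOperatorScan {
    final List<double[]> rectangles = new ArrayList<>(); // {x, y, width, height}
    final List<double[]> segments = new ArrayList<>();   // {x1, y1, x2, y2}

    static PathOperatorScan scan(String content) {
        PathOperatorScan result = new PathOperatorScan();
        List<Double> operands = new ArrayList<>(); // numbers seen since the last operator
        double curX = 0, curY = 0;
        boolean hasCurrentPoint = false;

        for (String token : content.trim().split("\\s+")) {
            Double number = tryParse(token);
            if (number != null) {
                operands.add(number);
                continue;
            }
            int n = operands.size();
            if (token.equals("re") && n >= 4) {
                // Rectangle: the previous 4 values are x, y, width, height
                result.rectangles.add(new double[] {
                    operands.get(n - 4), operands.get(n - 3),
                    operands.get(n - 2), operands.get(n - 1)});
            } else if (token.equals("m") && n >= 2) {
                // Move-to: remember where the next line starts
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
                hasCurrentPoint = true;
            } else if (token.equals("l") && n >= 2 && hasCurrentPoint) {
                // Line-to: the previous 2 values are the end point
                double x = operands.get(n - 2), y = operands.get(n - 1);
                result.segments.add(new double[] {curX, curY, x, y});
                curX = x;
                curY = y;
            }
            operands.clear(); // operands belong to the operator just handled
        }
        return result;
    }

    // Returns null for anything that is not a number (i.e. an operator)
    static Double tryParse(String token) {
        try {
            return Double.valueOf(token);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        PathOperatorScan r = scan("72 700 200 50 re S 72 650 m 272 650 l S");
        System.out.println(r.rectangles.size() + " rectangle(s), "
            + r.segments.size() + " segment(s)");
    }
}
```

The coordinates collected this way are in whatever coordinate system is active at that point in the stream; if the page uses cm operators to change the transformation matrix, they would still need to be mapped to page space.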
Depending on what you used to create the PDF, you might get some interesting results. For instance, I created a 4-cell table with Microsoft Word and saved it as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
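If duplicates like these get in the way, one possible approach is to collapse rectangles that match within a small tolerance. This is only a sketch: `RectangleDedup`, its methods, and the tolerance value are all hypothetical choices, and rectangles are assumed to be stored as {x, y, width, height} arrays. Negative widths or heights are normalized first, since re allows them and they describe the same area.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: drops rectangles that repeat an earlier one
// within a tolerance. Rectangles are {x, y, width, height}.
class RectangleDedup {

    static List<double[]> dedupe(List<double[]> rects, double tol) {
        List<double[]> unique = new ArrayList<>();
        for (double[] raw : rects) {
            double[] r = normalize(raw);
            boolean seen = false;
            for (double[] u : unique) {
                if (sameRect(r, u, tol)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) unique.add(r);
        }
        return unique;
    }

    // Rewrites a rectangle so that width and height are non-negative
    static double[] normalize(double[] r) {
        double x = r[0], y = r[1], w = r[2], h = r[3];
        if (w < 0) { x += w; w = -w; }
        if (h < 0) { y += h; h = -h; }
        return new double[] {x, y, w, h};
    }

    static boolean sameRect(double[] a, double[] b, double tol) {
        for (int i = 0; i < 4; i++) {
            if (Math.abs(a[i] - b[i]) > tol) return false;
        }
        return true;
    }
}
```

A call such as `RectangleDedup.dedupe(rects, 0.01)` keeps the first occurrence of each rectangle; the comparison is quadratic, which should be fine for the handful of rectangles a table produces.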
Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily. I noted the one line that has .Net-specific code, which needs to be changed from a generic list (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing: .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and the setter, compared to Java's Object.getProperty and Object.setProperty.
Hopefully this at least gets you started!
using System;
using System.Collections.Generic;
using System.Globalization;
using iTextSharp.text.pdf;

//Source file to read from
string sourceFile = "c:\\Hello.pdf";
//Bind a reader to our PDF
PdfReader reader = new PdfReader(sourceFile);
//Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
List<string> buf = new List<string>();
//Get the raw bytes for the page
byte[] pageBytes = reader.GetPageContent(1);
//Get the raw tokens from the bytes
PRTokeniser tokeniser = new PRTokeniser(pageBytes);
//Create some variables to set later
PRTokeniser.TokType tokenType;
string tokenValue;
//PDF numbers always use '.' as the decimal separator, so parse culture-independently
CultureInfo inv = CultureInfo.InvariantCulture;
//Loop through each token
while (tokeniser.NextToken()) {
    //Get the type and value
    tokenType = tokeniser.TokenType;
    tokenValue = tokeniser.StringValue;
    //If the type is a numeric type
    if (tokenType == PRTokeniser.TokType.NUMBER) {
        //Store it in our buffer for later use
        buf.Add(tokenValue);
    //Otherwise we only care about raw commands, which are categorized as "OTHER"
    } else if (tokenType == PRTokeniser.TokType.OTHER) {
        //Look for a rectangle token
        if (tokenValue == "re") {
            //Sanity check, make sure we have enough items in the buffer
            if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
            //Read and convert the values
            float x = float.Parse(buf[buf.Count - 4], inv);
            float y = float.Parse(buf[buf.Count - 3], inv);
            float w = float.Parse(buf[buf.Count - 2], inv);
            float h = float.Parse(buf[buf.Count - 1], inv);
            //..do something with them here
        }
    }
}
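Once line segments have been collected, the intersections the question asks about can be found by pairing each horizontal segment with each vertical one. The Java below is a minimal sketch under assumptions of my own: `GridIntersections` is a hypothetical name, segments are {x1, y1, x2, y2} arrays already in the same coordinate system, and only axis-aligned lines (within a tolerance) are considered, which is the usual case for table borders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds the points where horizontal and vertical
// segments cross. Segments are {x1, y1, x2, y2}.
class GridIntersections {

    static List<double[]> intersections(List<double[]> segments, double tol) {
        List<double[]> points = new ArrayList<>();
        for (double[] h : segments) {
            if (Math.abs(h[1] - h[3]) > tol) continue; // skip: not horizontal
            for (double[] v : segments) {
                if (Math.abs(v[0] - v[2]) > tol) continue; // skip: not vertical
                double x = v[0]; // x of the vertical line
                double y = h[1]; // y of the horizontal line
                boolean onH = x >= Math.min(h[0], h[2]) - tol
                           && x <= Math.max(h[0], h[2]) + tol;
                boolean onV = y >= Math.min(v[1], v[3]) - tol
                           && y <= Math.max(v[1], v[3]) + tol;
                if (onH && onV) points.add(new double[] {x, y});
            }
        }
        return points;
    }
}
```

For a single cell drawn as two horizontal and two vertical lines, this yields the four corners. Rectangles found via re could be fed in as well by first splitting each into its four edges.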