Getting line locations with iText


Question

How can one find where lines are located in a document with iText?

Suppose I have a table in a PDF document and want to read its contents; I would like to find exactly where the cells are located. To do that, I thought I might find the intersections of the lines.

Recommended answer

I think your only option using iText will be to parse the PDF tokens manually. Before doing that, I would have a copy of the PDF spec handy.

(I'm a .Net guy so I use iTextSharp, but apart from some capitalization differences and property declarations the two are almost 100% identical.)

You can get the individual tokens using a PRTokeniser object, which you feed with the bytes returned by calling getPageContent(pageNum) on your PdfReader.

//Get bytes for page 1
byte[] pageBytes = reader.getPageContent(1);
//Get the tokens for page 1
PRTokeniser tokeniser = new PRTokeniser(pageBytes);

Then loop through the PRTokeniser:

//C# (iTextSharp) member casing, matching the full listing further down
PRTokeniser.TokType tokenType;
string tokenValue;

while (tokeniser.NextToken()) {
    tokenType = tokeniser.TokenType;
    tokenValue = tokeniser.StringValue;
    //...check tokenValue, do something with it
}

As for tokenValue, you'll probably want to look for the re and l operators, which stand for rectangle and line. If you see an re, look at the previous 4 values; if you see an l, look at the previous 2 values. This also means that you need to store each tokenValue in an array so you can look back later.
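To make the look-back idea concrete, here is a minimal Java sketch that does not use iText at all. It assumes the token values have already been pulled out of the content stream into a list of strings (for example by a loop like the one above). The class name `PathOperatorScanner` and its `scan` method are made up for illustration. It also tracks the `m` (move-to) operator, because in PDF an `l` carries only the end point of a line; the start is the current point, usually set by a preceding `m` or `l`. Any `cm` transformation is ignored, so the coordinates are only meaningful when the page content is drawn in untransformed user space.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects rectangles and line segments from content-stream tokens. */
class PathOperatorScanner {
    /** Each entry is {x, y, w, h}, taken from the four operands of "re". */
    final List<double[]> rectangles = new ArrayList<>();
    /** Each entry is {x1, y1, x2, y2}: the current point, then the operands of "l". */
    final List<double[]> lines = new ArrayList<>();

    static PathOperatorScanner scan(List<String> tokens) {
        PathOperatorScanner result = new PathOperatorScanner();
        List<Double> operands = new ArrayList<>(); // numbers seen since the last non-number token
        double curX = 0, curY = 0;                 // current point of the path
        boolean hasCurrent = false;

        for (String token : tokens) {
            Double number = tryParse(token);
            if (number != null) {
                operands.add(number);
                continue;
            }
            int n = operands.size();
            if (token.equals("re") && n >= 4) {
                double x = operands.get(n - 4), y = operands.get(n - 3);
                result.rectangles.add(new double[] {x, y, operands.get(n - 2), operands.get(n - 1)});
                curX = x; curY = y; hasCurrent = true; // re starts and closes a subpath at (x, y)
            } else if (token.equals("m") && n >= 2) {
                curX = operands.get(n - 2); curY = operands.get(n - 1); hasCurrent = true;
            } else if (token.equals("l") && n >= 2) {
                double x = operands.get(n - 2), y = operands.get(n - 1);
                if (hasCurrent) result.lines.add(new double[] {curX, curY, x, y});
                curX = x; curY = y; hasCurrent = true;
            }
            operands.clear(); // operands belong to exactly one operator
        }
        return result;
    }

    /** Returns the numeric value of a token, or null if the token is not a number. */
    private static Double tryParse(String token) {
        try {
            return Double.valueOf(token);
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Clearing the operand list after every operator is a design choice of this sketch; the answer's approach of keeping every value and reading the last 4 or 2 gives the same numbers for `re` and `l`.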

Depending on what you used to create the PDF, you might get some interesting results. For instance, I created a 4-cell table with Microsoft Word and saved it as a PDF. For some reason there are two sets of 10 rectangles with many duplicates, but the general idea still works.
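Duplicates like these can be filtered out before any further processing. The following Java sketch shows one possible approach and is not part of iText: it normalizes rectangles with negative width or height, then treats two rectangles as identical when all four values agree within a tolerance. The class name `RectangleDeduplicator`, the `{x, y, w, h}` array layout, and the tolerance value are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: removes (near-)duplicate rectangles given as {x, y, w, h}. */
class RectangleDeduplicator {

    /** Rewrites a rectangle so that width and height are non-negative. */
    static double[] normalize(double[] r) {
        double x = r[2] < 0 ? r[0] + r[2] : r[0];
        double y = r[3] < 0 ? r[1] + r[3] : r[1];
        return new double[] {x, y, Math.abs(r[2]), Math.abs(r[3])};
    }

    /** Returns normalized rectangles, keeping only the first of each group of near-equal ones. */
    static List<double[]> dedupe(List<double[]> rects, double tolerance) {
        List<double[]> unique = new ArrayList<>();
        for (double[] raw : rects) {
            double[] candidate = normalize(raw);
            boolean seen = false;
            for (double[] kept : unique) {
                if (sameRect(candidate, kept, tolerance)) {
                    seen = true;
                    break;
                }
            }
            if (!seen) unique.add(candidate);
        }
        return unique;
    }

    private static boolean sameRect(double[] a, double[] b, double tolerance) {
        for (int i = 0; i < 4; i++) {
            if (Math.abs(a[i] - b[i]) > tolerance) return false;
        }
        return true;
    }
}
```

The pairwise comparison is quadratic in the number of rectangles, which should be acceptable for the handful of shapes a typical table produces.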

Below is C# code targeting iTextSharp 5.1.1.0. You should be able to convert it to Java and iText very easily. I noted the one line that has .Net-specific code, which needs to be adjusted from a generic list (List<string>) to a Java equivalent, probably an ArrayList. You'll also need to adjust some casing: .Net uses Object.Method() whereas Java uses Object.method(). Lastly, .Net accesses properties without gets and sets, so Object.Property is both the getter and setter, compared to Java's Object.getProperty and Object.setProperty.

Hope this at least gets you started!

        //Requires: using System; using System.Collections.Generic; using iTextSharp.text.pdf;

        //Source file to read from
        string sourceFile = "c:\\Hello.pdf";

        //Bind a reader to our PDF
        PdfReader reader = new PdfReader(sourceFile);

        //Create our buffer for previous token values. For Java users, List<string> is a generic list, probably most similar to an ArrayList
        List<string> buf = new List<string>();

        //Get the raw bytes for the page
        byte[]  pageBytes = reader.GetPageContent(1);
        //Get the raw tokens from the bytes
        PRTokeniser tokeniser = new PRTokeniser(pageBytes);

        //Create some variables to set later
        PRTokeniser.TokType tokenType;
        string tokenValue;

        //Loop through each token
        while (tokeniser.NextToken()) {
            //Get the types and value
            tokenType = tokeniser.TokenType;
            tokenValue = tokeniser.StringValue;
            //If the type is a numeric type
            if (tokenType == PRTokeniser.TokType.NUMBER) {
                //Store it in our buffer for later use
                buf.Add(tokenValue);
            //Otherwise we only care about raw commands which are categorized as "OTHER"
            } else if (tokenType == PRTokeniser.TokType.OTHER) {
                //Look for a rectangle token
                if (tokenValue == "re") {
                    //Sanity check, make sure we have enough items in the buffer
                    if (buf.Count < 4) throw new Exception("Not enough elements in buffer for a rectangle");
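                    //Read and convert the values. PDF numbers always use "." as the decimal
                    //separator, so parse with the invariant culture rather than the machine's locale
                    float x = float.Parse(buf[buf.Count - 4], System.Globalization.CultureInfo.InvariantCulture);
                    float y = float.Parse(buf[buf.Count - 3], System.Globalization.CultureInfo.InvariantCulture);
                    float w = float.Parse(buf[buf.Count - 2], System.Globalization.CultureInfo.InvariantCulture);
                    float h = float.Parse(buf[buf.Count - 1], System.Globalization.CultureInfo.InvariantCulture);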
                    //..do something with them here
                }
            }
        }
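
As for the intersections mentioned in the question, once the border segments are known they can be found with plain geometry. The Java sketch below is only an illustration under stated assumptions: segments are `{x1, y1, x2, y2}` arrays, only axis-aligned segments matter, and the class name `GridIntersections` and the `eps` tolerance are hypothetical. Rectangles read from `re` would first have to be split into their four edges, and some PDF producers may draw borders as thin filled rectangles rather than stroked lines, which this sketch does not handle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds crossing points of horizontal and vertical segments. */
class GridIntersections {

    /** Segments are {x1, y1, x2, y2}; returns {x, y} for every horizontal/vertical crossing. */
    static List<double[]> intersect(List<double[]> segments, double eps) {
        List<double[]> horizontals = new ArrayList<>();
        List<double[]> verticals = new ArrayList<>();
        for (double[] s : segments) {
            if (Math.abs(s[1] - s[3]) <= eps) {
                horizontals.add(s);       // y stays the same along the segment
            } else if (Math.abs(s[0] - s[2]) <= eps) {
                verticals.add(s);         // x stays the same along the segment
            }
            // anything else is diagonal and ignored
        }

        List<double[]> points = new ArrayList<>();
        for (double[] h : horizontals) {
            double hy = h[1];
            double hMinX = Math.min(h[0], h[2]);
            double hMaxX = Math.max(h[0], h[2]);
            for (double[] v : verticals) {
                double vx = v[0];
                double vMinY = Math.min(v[1], v[3]);
                double vMaxY = Math.max(v[1], v[3]);
                boolean xInRange = vx >= hMinX - eps && vx <= hMaxX + eps;
                boolean yInRange = hy >= vMinY - eps && hy <= vMaxY + eps;
                if (xInRange && yInRange) {
                    points.add(new double[] {vx, hy});
                }
            }
        }
        return points;
    }
}
```

Sorting the distinct x and y values of the returned points would then give the column and row boundaries of the table cells.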
