Reading a PDF file?


Question

This will be my first time reading a PDF.

I searched around, found several options for doing this in C#, and chose to use iTextSharp.

So far I've only done the basics, such as reading the file and getting the content, without issues.

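// Assumes: using System.Text; using iTextSharp.text.pdf;
PdfReader reader = new PdfReader(iPDF.Text);
if (reader.NumberOfPages >= 2)
{
    // GetPageContent returns the page's raw content stream (drawing operators), not plain text.
    iResult.Text = Encoding.UTF8.GetString(reader.GetPageContent(2));
}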

As you can see, it is very basic code that just reads the 2nd page of the PDF into a text file. However, I've noticed a lot of code in the text file, and I am a bit lost on how to parse out only the data I need.

What I was wondering is whether there is a pattern or something that will help me get only that part of the PDF. Looking at the plain text file, it seems there are things that define the beginning/end of lines, colors, etc.

Part of the extracted data:

1 0 0 1 0 612 cm 0 0 0 rg
0 0 0 RG
28.35 -28.35 735.3 -526.95 re
W
n
0 0 0.502 sc
28.35 -65.5 735.3 -12.75 re
f
28.35 -543.9 735.3 -11.4 re
f
q
92.25 -28.35 560.9 -18 re
W
n
1 1 1 sc
92.25 -28.35 560.9 -18 re
f
BT
1 0 0 1 95.25 -39.1 Tm
0 0 0 sc
/i 10.75 Tf
(Name - Live) T

NOTE: the above is only part of the initial data from page 2, to illustrate what I meant earlier.

Is that data in some kind of tabular format, and how could I extract only the part I need?

Recommended answer

Try using PdfTextExtractor, as it will give you somewhat more human-readable text out of the PDF:

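// Assumes: using System.Text; using iTextSharp.text.pdf.parser;
var sb = new StringBuilder();
for (int page = 2; page <= reader.NumberOfPages; page++)
{
    // Use a fresh strategy per page so text from earlier pages does not carry over.
    var strategy = new SimpleTextExtractionStrategy();
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
}
// Assign once after the loop; assigning inside it would keep only the last page.
iResult.Text = sb.ToString();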
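
For context on the raw output in the question: GetPageContent returns the page's content stream, which is a sequence of PDF drawing operators rather than text. In that stream, `re` draws a rectangle, `rg`/`RG`/`sc` set colors, `BT` begins a text object, `Tf` selects a font and size, `Tm` sets the text position, and the standard `Tj` operator shows a string given in parentheses. The sketch below is a rough illustration of that structure only. `ContentStreamSketch` and `ExtractShownText` are hypothetical names, not part of iTextSharp, and the approach assumes simple literal strings with no font-specific encoding, which many real PDFs do not satisfy. PdfTextExtractor remains the more robust route.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class ContentStreamSketch
{
    // Matches a literal string operand followed by the Tj (show text) operator,
    // e.g. "(Name - Live) Tj". Backslash-escaped characters are allowed inside;
    // unescaped nested parentheses are not handled.
    private static readonly Regex ShowText =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj\b", RegexOptions.Compiled);

    public static List<string> ExtractShownText(string content)
    {
        var result = new List<string>();
        foreach (Match m in ShowText.Matches(content))
        {
            // Undo only the simplest escapes: \( \) and \\.
            // Real PDFs also use octal escapes, hex strings, TJ arrays,
            // and font-specific encodings, none of which are handled here.
            string text = Regex.Replace(m.Groups[1].Value, @"\\([()\\])", "$1");
            result.Add(text);
        }
        return result;
    }
}
```

For example, calling `ContentStreamSketch.ExtractShownText("BT /F1 10 Tf (Name - Live) Tj ET")` on that made-up input returns a list containing the single string `Name - Live`. Positioning operators such as `Tm` are ignored here, so any table layout would still have to be reconstructed from coordinates, which is what the extraction strategies in iTextSharp are designed to help with.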
