iTextSharp extracts wrapped cell contents into new lines - how do you identify which column a given wrapped piece of data now belongs to?


Problem description

I am using iTextSharp to extract data from PDFs. I stumbled across the following problem, illustrated by the scenario below:

I created a sample Excel file to illustrate. Here is what it looks like:

I converted it to a PDF using one of the many free online converters, which generated a PDF that looks like this (I did not apply the Excel styling when generating the PDF):

Extracting the data from the PDF with iTextSharp now returns the following string:

As you can see, wrapped cell data generates new lines, in which the wrapped pieces of data are separated by a single white space.

The problem: how does one now identify which column a given piece of wrapped data belongs to? If only iTextSharp preserved as many white spaces as there are columns...

In my example, how can I identify which column 111 belongs to?

Update 1:

A similar problem occurs whenever a field has more than one word (i.e., contains white spaces). For example, consider the first line of the sample above:

Say it looks like:

---A---  ---B---  ---C---  ---D---
aaaaaaa    bb b     cccc      

iText would again generate the extraction for this one as:

aaaaaaa bb b cccc

Same problem here: the borders of each column have to be determined.

Update 2: A sample of the real PDF file I am working with. This is what the PDF data looks like:

Recommended answer

In addition to Chris's generic answer, here is some background on content parsing in iText(Sharp)...

iText(Sharp) provides a framework for content extraction in the namespace iTextSharp.text.pdf.parser / package com.itextpdf.text.pdf.parser. This framework reads the page content, keeps track of the current graphics state, and forwards information on pieces of content to the IExtRenderListener or IRenderListener / ExtRenderListener or RenderListener that the user (i.e. you) provides. In particular, it does not interpret any structure into this information.

This render listener may be a text extraction strategy (ITextExtractionStrategy / TextExtractionStrategy), i.e. a special render listener designed predominantly to extract a pure text stream without formatting or layout information. For this special case iText(Sharp) additionally provides two sample implementations, the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy.

For your task you need a more sophisticated render listener which either

  • exports the text with coordinates (Chris, in one of his answers, has provided an extended LocationTextExtractionStrategy which can additionally provide positions and bounding boxes of text chunks), allowing you to analyse tabular structures in additional code; or
  • does the analysis of tabular data itself.
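For the first variant, the additional analysis code can be fairly small once the column extents are known. Below is a minimal sketch in Python (deliberately library-independent; it does not call iText), assuming each chunk has already been exported with its bounding box and that the header row tells you the x-range of each column. The Chunk class, the column_of function, and all coordinates are made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of text with the horizontal extent of its bounding box (hypothetical)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline


def column_of(chunk, columns):
    """Return the name of the column whose x-range overlaps the chunk most, or None."""
    best, best_overlap = None, 0.0
    for name, (left, right) in columns.items():
        overlap = min(chunk.x1, right) - max(chunk.x0, left)
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    return best


# Invented column extents, e.g. derived from the bounding boxes of the header cells.
columns = {"A": (50, 150), "B": (150, 250), "C": (250, 350), "D": (350, 450)}

# A wrapped fragment on its own line; only its x-range says where it belongs.
wrapped = Chunk("111", 262.0, 280.0, 688.0)
print(column_of(wrapped, columns))  # prints "C" for these invented coordinates
```

The point is that the wrapped fragment is matched by its horizontal position, not by its position in the extracted string, so the number of white spaces in the text no longer matters.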

I do not have an example for the latter variant because recognizing and parsing tables generically is a whole project in itself. You might want to look into the Tabula project for inspiration; this project is surprisingly good at table extraction.

PS: If you feel more at home extracting structured content from a pure string representation which nonetheless tries to reflect the original layout, you might try something like what is proposed in this answer, a variant of the LocationTextExtractionStrategy that works similarly to the pdftotext -layout tool; only the changes to be applied to the LocationTextExtractionStrategy are shown there.
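To illustrate the idea behind such a layout-preserving string (this is not the code from the linked answer, just a rough library-independent Python sketch), each chunk can be placed on a character grid at a column proportional to its x coordinate. The render_layout function, the fixed char_width, and the sample coordinates are assumptions for illustration:

```python
def render_layout(chunks, char_width=6.0, line_tolerance=2.0):
    """Place (text, x, y) chunks on a character grid so that x offsets survive.

    char_width is an assumed average glyph width in PDF units; y grows upwards,
    as in PDF user space, so larger y values are rendered first.
    """
    lines = []  # list of (y, [(text, x), ...])
    for text, x, y in sorted(chunks, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((text, x))
        else:
            lines.append((y, [(text, x)]))

    rendered = []
    for _, items in lines:
        row = ""
        for text, x in sorted(items, key=lambda item: item[1]):
            col = int(round(x / char_width))
            if row and col <= len(row):
                col = len(row) + 1  # keep at least one space between chunks
            row = row.ljust(col) + text
        rendered.append(row)
    return "\n".join(rendered)


# Invented coordinates: the wrapped "111" sits directly below "cccc".
chunks = [
    ("aaaaaaa", 0, 700),
    ("bb b", 60, 700),
    ("cccc", 120, 700),
    ("111", 120, 688),
]
print(render_layout(chunks))
```

In the resulting string the wrapped fragment starts at the same character offset as the cell above it, which is exactly the "as many white spaces as columns" behaviour asked for in the question. A fixed char_width is a simplification; proportional fonts make the real mapping less tidy.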

PPS: Extracting data from very specific PDF tables may be much easier. For example, have a look at this answer, which demonstrates that, after some PDF analysis, the specific way a given table is created may allow a simple custom render listener to extract the table data. This can make sense for a single PDF with a table spanning many pages, as in the case of that answer, or if you have many PDFs created identically by the same software.

This is why I asked for a representative sample file in a comment to your question.

Regarding your comments

Still with the pdf example above, both with an implementation from scratch of ITextExtractionStrategy and with extending LocationExtractionStrategy, I see that each RenderText is called at the following chunks: Fi, el, d, A, Fi, el, d... and so on. Can this be changed?

The chunks of text you get as separate RenderText calls are not separated by accident or by some random decision of iText. They are the very strings drawn separately in the page content!

In your sample, "Fi", "el", "d", and "A" come in different RenderText calls because the content stream contains operations in which first "Fi" is drawn, then "el", then "d", then "A".

This may sound weird at first. A common cause of such torn-up words is that PDF does not use the kerning information from fonts. To apply kerning, the PDF-generating software therefore has to insert tiny forward or backward jumps between characters which should be farther from or nearer to each other than they would be without kerning. Thus, words are often torn apart at kerning pairs.

So this cannot be changed: you will get those pieces, and it is the job of the text extraction strategy to put them together.
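As a rough illustration of "putting them together" (a library-independent Python sketch, not what any iText strategy literally does), fragments on the same baseline can be glued whenever the horizontal gap between them is smaller than some threshold. The merge_fragments function, the max_gap value, and the coordinates are assumptions:

```python
def merge_fragments(fragments, max_gap=1.0, same_line=0.5):
    """Glue (text, x0, x1, y) fragments that sit on one baseline and nearly touch.

    max_gap is an assumed threshold in PDF units: kerning jumps are tiny, while
    word spaces and column gutters are much wider.
    """
    merged = []
    # Top of the page first (larger y), then left to right.
    for text, x0, x1, y in sorted(fragments, key=lambda f: (-f[3], f[1])):
        if merged:
            ptext, px0, px1, py = merged[-1]
            if abs(py - y) < same_line and x0 - px1 <= max_gap:
                merged[-1] = (ptext + text, px0, max(px1, x1), py)
                continue
        merged.append((text, x0, x1, y))
    return merged


# Invented positions for the pieces "Fi", "el", "d", "A", "Fi", "el", "d", "B".
fragments = [
    ("Fi", 50.0, 58.0, 700.0), ("el", 58.2, 66.0, 700.0),
    ("d", 65.9, 70.0, 700.0), ("A", 73.0, 79.0, 700.0),
    ("Fi", 150.0, 158.0, 700.0), ("el", 158.2, 166.0, 700.0),
    ("d", 165.9, 170.0, 700.0), ("B", 173.0, 179.0, 700.0),
]
print([m[0] for m in merge_fragments(fragments)])  # ['Field', 'A', 'Field', 'B']
```

With a larger threshold the same pass would join "Field" and "A" into one cell text while still keeping the two cells apart; choosing those thresholds is precisely the decision left to the strategy.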

By the way, there are worse PDFs: some PDF generators position each and every glyph separately, most notably generators that predominantly build GUIs but can, as a feature, automatically export GUI canvases as PDFs.

I would expect that in entering the realm of "adding my own implementation" I would have control over how to determine what is a "chunk" of text.

You can... well, you have to decide which of the incoming pieces belong together and which don't. E.g. do glyphs with the same y coordinate form a single line? Or do they form separate lines in different columns which just happen to be located next to each other?

So yes, you decide which glyphs you interpret as a single word or as the content of a single table cell, but your input consists of the groups of glyphs used in the actual PDF content stream.
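One common heuristic for the "same line or separate columns" decision is to project all text extents of the table region onto the x axis and look for vertical strips that no text ever touches; such strips are candidate column gutters. A small library-independent Python sketch under that assumption (find_gutters, min_width, and the sample extents are invented for illustration):

```python
def find_gutters(extents, min_width=5.0):
    """Return x-intervals (left, right) that no text extent covers.

    extents: (x0, x1) pairs for every text piece in the table region,
    collected over all rows. min_width is an assumed minimum gutter width.
    """
    spans = sorted(extents)
    if not spans:
        return []
    gutters = []
    covered_to = spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - covered_to >= min_width:
            gutters.append((covered_to, x0))
        covered_to = max(covered_to, x1)
    return gutters


# Invented extents: three populated columns, including one wrapped fragment.
extents = [(50, 92), (150, 174), (250, 274), (262, 280), (150, 170)]
print(find_gutters(extents))  # [(92, 150), (174, 250)]
```

Pieces separated by a gutter would then be treated as belonging to different cells even if they share a y coordinate. This breaks down when a cell spans several columns or when columns are only sparsely filled, which is part of why generic table recognition is hard.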

Not only that, in none of the interface's methods can I "spot" how/where it deals with non-text data/images - so I could intercede with the spacing issue (RenderImage is not called)

RenderImage will be called for embedded bitmap images, JPEGs, etc. If you want to be informed about vector graphics, your strategy will also have to implement IExtRenderListener, which provides the methods ModifyPath, RenderPath, and ClipPath.
