Empty whitespace conversion in PDFClown

Problem description

I'm having an issue when using the TextExtractor class in PDFClown: occurrences of empty whitespace, also known as a "discretionary newline". These characters are embedded seemingly at random but are ignored by Acrobat Reader. As a result, lines containing them display as a single line in Acrobat but are broken into many lines when the text is extracted, if I specify '\n' as the newline character in TextExtractor.ToString(...).

It appears that PDFClown simply takes any whitespace character and converts it into a single space, ' '. Is there a way to bypass this conversion, so that the original character is extracted instead?

Recommended answer

After more research, it appears that the PDFClown library is very buggy. There are several issues:

  • Converts most forms of space character to a single normal space character.
  • Inserts spaces instead of newlines.
  • If you attempt to use the provided overrides to insert your own character for spaces or newlines, the internal mapping from characters in the extracted array to the boxes for each individual character gets destroyed.
  • Cannot properly decode all embedded fonts.
  • Because it cannot properly decode embedded fonts, it silently omits characters from the extracted text.
  • Cannot reliably handle ligatures or their decomposition; they are often silently dropped from the extracted text altogether.

To come directly to the issue I had: you can detect and remove these "false" whitespace characters by checking whether their bounding rectangles overlap those of other, non-whitespace characters. Given all the other issues with the library, though, my advice is to use PDFBox instead.
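The overlap check can be sketched roughly as follows. This is only an illustration of the geometry in Python, not PDFClown code: `Box`, `Glyph`, `drop_false_whitespace` and `to_text` are hypothetical names, and it assumes you have already collected each extracted character together with its bounding rectangle from whatever extractor you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float

    def overlaps(self, other: "Box") -> bool:
        # Strict inequalities: boxes that only share an edge do not overlap.
        return (self.x < other.x + other.width
                and other.x < self.x + self.width
                and self.y < other.y + other.height
                and other.y < self.y + self.height)


@dataclass(frozen=True)
class Glyph:
    """One extracted character plus its bounding box (hypothetical type)."""
    char: str
    box: Box


def drop_false_whitespace(glyphs):
    """Remove whitespace glyphs whose box overlaps a non-whitespace glyph."""
    solid = [g for g in glyphs if not g.char.isspace()]
    kept = []
    for g in glyphs:
        if g.char.isspace() and any(g.box.overlaps(s.box) for s in solid):
            continue  # sits on top of visible text: treat as a false break
        kept.append(g)
    return kept


def to_text(glyphs):
    """Join the remaining characters back into a string."""
    return "".join(g.char for g in glyphs)


# Usage: the newline's box overlaps "a", so it is dropped;
# the space sits in a real gap between "b" and "c", so it is kept.
glyphs = [
    Glyph("a", Box(0, 0, 10, 12)),
    Glyph("\n", Box(8, 0, 4, 12)),
    Glyph("b", Box(10, 0, 10, 12)),
    Glyph(" ", Box(20, 0, 5, 12)),
    Glyph("c", Box(25, 0, 10, 12)),
]
print(to_text(drop_false_whitespace(glyphs)))  # prints "ab c"
```

The comparison is quadratic in the number of characters, which is usually acceptable per page. Whether touching edges should count as an overlap is a judgment call you may need to tune against your own documents.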

If you're using .NET and you'd like to use PDFBox, you can use Tika On Dot Net, which is the Apache Tika project brought over to .NET via IKVM.

Apache Tika is a collection of other libraries, including PDFBox. Tika On Dot Net currently ships PDFBox 1.8.10 and also has a NuGet package that makes it easy to add to your project.

I had a project go 1.5 weeks over deadline because all of these issues were discovered halfway through, which required a full rewrite. Just a heads up.
