How to add a separator after each word with ghostscript -sDEVICE=txtwrite


Question

I have used Ghostscript to successfully extract text from PDFs that contain tables.

This simple command works very well:

gswin64c -sDEVICE=txtwrite -o test.txt "c:\reports\sample.pdf"

However, some words get joined together, especially in tables. For example:

  234801111111109-12-2014 16:17:04764030208117034 2883253100.00  Payment
  234801111111109-12-2014 16:18:461088956908117033 2883253400.00 Payment
  234801111111109-12-2014 16:19:48769948208117040 2883253750.00  Payment

when it should actually be:

  2348011111111 09-12-2014 16:17:04 764030208117034 2883253 100.00  Payment
  2348011111111 09-12-2014 16:18:46 1088956908117033 2883253 400.00 Payment
  2348011111111 09-12-2014 16:19:48 769948208117040 2883253 750.00  Payment

Is there a way to add a separator character at the end of each word? That would solve this perfectly.

Recommended answer

No, sorry, this idea simply won't work.

There is no such thing as a 'word' in a PDF file; there is simply a sequence of character codes and positions. The txtwrite code goes to some lengths to reconstruct words by looking at the position of each piece of text and the metrics of the fonts used, but there are no words in the original.
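To make the point concrete, here is a minimal Python sketch of the kind of gap heuristic a text extractor can apply. It is not the txtwrite implementation: the glyph tuples, the `space_ratio` threshold and the function name `rebuild_line` are all hypothetical, chosen only to show why tightly spaced table cells can end up joined.

```python
# Illustrative only: NOT the txtwrite algorithm, just the general idea.
# Each glyph is (character, left_x, right_x), in points, on one text line.

def rebuild_line(glyphs, font_size, space_ratio=0.25):
    """Join glyphs, inserting a space wherever the horizontal gap is wide."""
    threshold = font_size * space_ratio  # hypothetical heuristic
    out = []
    prev_right = None
    for char, left, right in glyphs:
        if prev_right is not None and left - prev_right > threshold:
            out.append(" ")
        out.append(char)
        prev_right = right
    return "".join(out)

# Two table cells, "11" and "09", drawn only 2pt apart in a 10pt font.
cells = [("1", 0.0, 5.0), ("1", 5.0, 10.0), ("0", 12.0, 17.0), ("9", 17.0, 22.0)]

print(rebuild_line(cells, font_size=10))                   # 1109  (gap 2.0 <= 2.5)
print(rebuild_line(cells, font_size=10, space_ratio=0.1))  # 11 09 (gap 2.0 > 1.0)
```

The same glyphs yield one word or two depending only on the threshold, which is why no single setting can be right for every document.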

I don't claim this is perfect. If you'd like me to look at it, you will need to supply the original file. The best solution is to open a bug report and attach the file to it.

This is still an area I'm looking at for a different project (RTF output), so now is a good time to report it. I cannot guarantee being able to resolve it, but it may well be that the 'rebuild the page layout' code is simply being too simple-minded about the location of the text.

You can, however, get a lower-level output: the XML-like output gives you each fragment of text individually, along with its position on the page. You could use that information yourself to rebuild the content.
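As a rough sketch of that approach in Python: the txtwrite device documents a -dTextFormat option, and -dTextFormat=0 selects the XML-like output with position information. The element and attribute names matched below (`<char bbox="x0 y0 x1 y1" c="X"/>`) are an assumption about what that output looks like, so check them against your own file first. The function names, the `min_gap` threshold, the `|` separator and the sample data are all hypothetical.

```python
import html
import re

# ASSUMPTION: each glyph is written as <char bbox="x0 y0 x1 y1" c="X"/>.
# Adjust the pattern if your Ghostscript version emits something different.
CHAR_RE = re.compile(
    r'<char\s+bbox="([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)"\s+c="([^"]*)"\s*/>'
)

def parse_chars(text):
    """Return a list of (x0, y0, x1, y1, char) tuples found in the text."""
    glyphs = []
    for m in CHAR_RE.finditer(text):
        x0, y0, x1, y1 = (float(v) for v in m.groups()[:4])
        glyphs.append((x0, y0, x1, y1, html.unescape(m.group(5))))
    return glyphs

def rebuild(glyphs, min_gap=1.0, sep="|"):
    """Group glyphs into rows by y0, then put sep into every wide gap."""
    rows = {}  # keeps rows in the order they first appear in the file
    for x0, y0, x1, y1, ch in glyphs:
        rows.setdefault(round(y0), []).append((x0, x1, ch))
    lines = []
    for row in rows.values():
        parts = []
        prev_x1 = None
        for x0, x1, ch in sorted(row):  # left to right
            if prev_x1 is not None and x0 - prev_x1 > min_gap:
                parts.append(sep)
            parts.append(ch)
            prev_x1 = x1
        lines.append("".join(parts))
    return lines

# Hypothetical sample: two cells, "11" and "09", separated by a 2-unit gap.
sample = '''
<char bbox="10 100 15 110" c="1"/>
<char bbox="15 100 20 110" c="1"/>
<char bbox="22 100 27 110" c="0"/>
<char bbox="27 100 32 110" c="9"/>
'''
print(rebuild(parse_chars(sample)))  # ['11|09']
```

Because you choose `min_gap` yourself, you can tune it to the column spacing of your particular reports instead of relying on a general-purpose layout heuristic.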

The default option tries to build a simple representation of the page, using space characters to reproduce the layout of the original as far as possible, but I have no illusions that there aren't bugs :-)
