Is it possible to extract table information using Apache Tika?

Views: 384 | Published: 2020/9/4 23:05:14 | Tags: java, apache-tika

Question

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect 2 columns in a key-value format. I checked most of what is available on the net for a solution but could not find anything. Any pointers?

Recommended answer

Well, I went ahead and implemented the MS formats separately using Apache POI, then came back to Tika for PDF. Tika outputs a document as "SAX based XHTML events".

So basically we can write a custom SAX implementation to parse the file.

The structured text output has the following form (metadata omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key. (For my problem I already know the keys and am only looking for the values, so the value is simply a substring of the paragraph text.)

Override public void characters(char[] ch, int start, int length) with this logic.
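As a rough illustration of the idea above, here is a minimal sketch that uses only the JDK's SAX classes, not the answerer's actual code. The class name KeyValueSaxSketch, the handler KeyValueHandler, and the helper extract are hypothetical names invented for this example. One deliberate difference from the description: a SAX parser may deliver the text of one element in several characters() calls, so the sketch only buffers text in characters() and does the key matching in endElement(). It also assumes every paragraph starts with its key and that no known key is a prefix of another.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class KeyValueSaxSketch {

    /** Hypothetical handler: buffers the text of each <p> and splits it on a known key. */
    static class KeyValueHandler extends DefaultHandler {
        private final List<String> knownKeys;
        private final Map<String, String> values = new LinkedHashMap<>();
        private final StringBuilder paragraph = new StringBuilder();
        private boolean inParagraph = false;

        KeyValueHandler(List<String> knownKeys) {
            this.knownKeys = knownKeys;
        }

        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        private static boolean isParagraph(String localName, String qName) {
            return "p".equals(localName.isEmpty() ? qName : localName);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (isParagraph(localName, qName)) {
                inParagraph = true;
                paragraph.setLength(0); // start a fresh buffer for this paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called more than once per text node, so only accumulate here.
            if (inParagraph) {
                paragraph.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!isParagraph(localName, qName)) {
                return;
            }
            inParagraph = false;
            String line = paragraph.toString().trim();
            for (String key : knownKeys) {
                if (line.startsWith(key)) {
                    // The value is the substring that follows the known key.
                    values.put(key, line.substring(key.length()).trim());
                    break;
                }
            }
        }

        Map<String, String> values() {
            return values;
        }
    }

    /** Parses an XHTML string and returns the values found for the given keys. */
    public static Map<String, String> extract(String xhtml, List<String> knownKeys) throws Exception {
        KeyValueHandler handler = new KeyValueHandler(knownKeys);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.values();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<body><div class=\"page\"><p/>"
                + "<p>Key1 Value1 </p><p>Key2 Value2 </p><p>Key3 Value3</p>"
                + "<p/></div></body>";
        // prints {Key1=Value1, Key3=Value3}
        System.out.println(extract(xhtml, List.of("Key1", "Key3")));
    }
}
```

With Tika itself, a handler like this would presumably be passed as the ContentHandler argument of Parser.parse(InputStream, ContentHandler, Metadata, ParseContext) instead of being fed an XHTML string; that wiring is left out here so the sketch runs without the Tika dependency.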

Please note that in my case the structure of the content is fixed and I know which keys to expect, so it was easy to do it this way. This is not a generic solution.
