OneNote parsing - how to get to the Text Blobs in the document?


Problem description

I am writing a parser for the .one file format, which I will contribute to the Apache Tika project once it is finished.

Here is the open source project (Apache License 2.0) I'm creating: https://github.com/nddipiazza/onenote-parser-java

I used the specification document here: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-one/73d22548-a613-4350-8c23-07d15576be50

As a starting point, I ported over the code from this open source C++ project: https://github.com/dropbox/onenote-parser

I have gotten a long way in parsing the documents, but I've hit a roadblock.

Here is the OneNote file I'm using to parse: https://drive.google.com/file/d/1uROTEnKeBKU08CG_K5zdDTGHa178LgBK/view?usp=sharing

The strings Section1TextArea1 and Section1TextArea2 do not appear in my parsed results, so I must be missing some key step in parsing the data.

The text is definitely in the OneNote file itself; I can see it in a hex viewer.

Here is the JSON parse output: https://gist.github.com/nddipiazza/02d2252d357b3b02a6b9ab1050474267

I feel like the spec document is missing some very important information needed in order to parse this proprietary format.

What major element(s) am I missing resulting in me not getting the actual text content?

Solution

I figured it out. It was a matter of understanding that a property value in OneNote can hold any of the following:

  • Binary contents
  • ASCII text contents
  • UTF-16LE text contents

All three kinds are sprinkled throughout the file.
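To make the three cases concrete, here is a minimal sketch of how a raw property value could be classified and decoded. It is a heuristic written for illustration, not the code from the linked project: the class name `PropertyValueSniffer`, the `Kind` enum, and the rule "printable bytes mean ASCII, an even-length buffer that decodes to printable characters means UTF-16LE, anything else is binary" are all assumptions of mine, and short binary values can be misclassified as text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper: guesses how the bytes of a property value are encoded.
final class PropertyValueSniffer {

    enum Kind { ASCII, UTF16LE, BINARY }

    // Tab, newline, carriage return, or any non-control character
    // that is not the decoder's replacement character.
    private static boolean isTextChar(char c) {
        return c == '\t' || c == '\n' || c == '\r'
                || (!Character.isISOControl(c) && c != '\uFFFD');
    }

    // True when every byte is a printable 7-bit character.
    private static boolean isPlainAscii(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x7F || !isTextChar((char) c)) {
                return false;
            }
        }
        return true;
    }

    // Decodes as UTF-16LE and drops trailing NUL terminators, if any.
    private static String decodeUtf16Le(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_16LE);
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '\0') {
            end--;
        }
        return s.substring(0, end);
    }

    static Kind classify(byte[] data) {
        if (data.length == 0) {
            return Kind.BINARY;
        }
        if (isPlainAscii(data)) {
            return Kind.ASCII;
        }
        if (data.length % 2 == 0) {
            String s = decodeUtf16Le(data);
            if (!s.isEmpty() && s.chars().allMatch(c -> isTextChar((char) c))) {
                return Kind.UTF16LE;
            }
        }
        return Kind.BINARY;
    }

    // Returns the decoded text, or empty when the value looks binary.
    static Optional<String> decode(byte[] data) {
        switch (classify(data)) {
            case ASCII:
                return Optional.of(new String(data, StandardCharsets.US_ASCII));
            case UTF16LE:
                return Optional.of(decodeUtf16Le(data));
            default:
                return Optional.empty();
        }
    }
}
```

Calling `PropertyValueSniffer.decode(bytes)` on each property value and keeping only the non-empty results is one way to surface strings like Section1TextArea1 regardless of which of the two text encodings they were stored in.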

I also just went ahead and parsed the entire root file tree. This results in lots of duplicate text, but I don't really care.
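For anyone who does care about the duplicates, here is a small sketch of walking a whole tree and keeping each piece of text only once. The `Node` class is a hypothetical stand-in for whatever a parser produces, not a type from the linked project, and exact string equality is an assumed definition of "duplicate".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TreeTextCollector {

    // Hypothetical parsed node: optional decoded text plus child nodes.
    static final class Node {
        final String text; // null when the node carries no text
        final List<Node> children = new ArrayList<>();

        Node(String text) {
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first walk over the whole tree. LinkedHashSet keeps the
    // first-seen order and silently drops repeated strings.
    static List<String> collectUniqueText(Node root) {
        Set<String> seen = new LinkedHashSet<>();
        walk(root, seen);
        return new ArrayList<>(seen);
    }

    private static void walk(Node node, Set<String> out) {
        if (node.text != null && !node.text.isEmpty()) {
            out.add(node.text);
        }
        for (Node child : node.children) {
            walk(child, out);
        }
    }
}
```

The walk is recursive, so a very deep tree could exhaust the stack; an explicit stack would avoid that at the cost of a few more lines.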

The project is updated with test cases and the fix here: https://github.com/nddipiazza/onenote-parser-java/tree/master/src/main/java/org/apache/tika/onenote

UPDATE:

Just created the Apache Tika PR: https://github.com/apache/tika/pull/300
