Extract text between two links in HTML through Java

Question

I am trying to retrieve the text data from an ePub file using Java. The text of the ePub file lies within an HTML file that is formatted something like this:

<h2 id="pgepubid00001">Chapter I</h2>

<p>Some text</p>
<p>Another line of Text</p>

<br/>

<h2 id="pgepubid00002">Chapter II</h2>

etc..

Before opening this file I already know the id of the Chapter I need to extract and can find the id of the next chapter too. Because of this I thought a logical approach would be to attempt to parse it in a SAX parser and extract the text in each paragraph until I reached the link of the next chapter. But this is proving quite a task.

Of course, everything is dynamic so there is no set link to go to etc. The HTML is semi-strictly formatted so I didn't expect parsing to be so much of a problem. Can anyone recommend a good way to extract the text needed?

The solution needs to be JAVA ONLY, no other languages can be used. I am looking to implement this on an Android device.

Recommended answer

Well, you know ids of the chapters, why not use String.indexOf ?

int start = text.indexOf("<h2 id=\"pgepubid00001\">");
int end = text.indexOf("<h2 id=\"pgepubid00002\">");

// substring takes (beginIndex, endIndex), not (beginIndex, length)
String whatYoureLookingFor = text.substring(start, end);

Keep it simple.
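A slightly more defensive version of the same idea could look like the sketch below. It is not part of the original answer: the class name `ChapterExtractor` and the method `extractChapter` are made up for illustration, and it assumes each chapter heading is written exactly as `<h2 id="...">` with no extra attributes or whitespace. It checks that both headings were found, skips the chapter heading itself, and strips the remaining tags with a crude regex.

```java
final class ChapterExtractor {

    private ChapterExtractor() {}

    /**
     * Returns the plain text between the heading with id startId and the
     * heading with id endId, or null if either heading cannot be found.
     * Hypothetical helper; assumes headings look exactly like <h2 id="...">.
     */
    static String extractChapter(String html, String startId, String endId) {
        String startTag = "<h2 id=\"" + startId + "\">";
        String endTag = "<h2 id=\"" + endId + "\">";

        int start = html.indexOf(startTag);
        if (start < 0) {
            return null;
        }
        // Look for the next chapter only after the current heading.
        int end = html.indexOf(endTag, start + startTag.length());
        if (end < 0) {
            return null;
        }

        // Skip the chapter title by starting after its closing </h2>, if present.
        String closeTag = "</h2>";
        int close = html.indexOf(closeTag, start);
        int bodyStart = (close < 0 || close > end)
                ? start + startTag.length()
                : close + closeTag.length();

        String fragment = html.substring(bodyStart, end);

        // Crude tag stripping: replace every tag with a space, then collapse
        // runs of whitespace. Character entities such as &amp; are not decoded.
        return fragment.replaceAll("<[^>]+>", " ")
                       .replaceAll("\\s+", " ")
                       .trim();
    }
}
```

Calling `ChapterExtractor.extractChapter(text, "pgepubid00001", "pgepubid00002")` on the sample above would give the paragraph text of Chapter I as one whitespace-normalized string. For the last chapter there is no following heading, so a caller would need a fallback such as using `text.length()` as the end index.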