Removing HTML entities while preserving line breaks with JSoup


Problem description


I have been using JSoup to parse lyrics and it has been great until now, but I have run into a problem.

I can use Node.html() to return the full HTML of the desired node, which retains the line breaks, like so:

Gl&oacute;andi augu, silfurn&aacute;tt
<br />Bl&oacute;&eth; alv&ouml;ru, starir &aacute;
<br />&Oacute;&eth;ur hundur er &iacute; v&iacute;gam&oacute;&eth;, &iacute; maga... m&eacute;r
<br />
<br />Kolni&eth;ur gref, kvik sem dreg h&eacute;r
<br />Kolni&eth;ur svart, hvergi bjart n&eacute;

But, as you can see, this has the unfortunate side effect of retaining the HTML entities and tags.

However, if I use Node.text(), I can get a better looking result, free of tags and entities:

Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,

This has another unfortunate side effect: the line breaks are removed and the text is compressed into a single line.

Simply replacing <br /> in the node before calling Node.text() yields the same result. It seems that the method itself compresses the text onto a single line, ignoring newlines.

Is it possible to have the best of both worlds, with tags and entities replaced correctly while preserving the line breaks? Or is there another way of decoding entities and removing tags without having to replace them manually?

Solution

(disclaimer) I haven't used this API ... but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Line breaks could be inserted whenever special tags like <br> are encountered.

The TextNode.getWholeText() call also looks useful.
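
A rough sketch of that idea follows, on the assumption that the jsoup 1.x calls Element.traverse(NodeVisitor), Node.nodeName() and TextNode.getWholeText() behave as documented. The class name LineBreakText and the helper toPlainText are made up for illustration and are not part of JSoup. Entities are already decoded by the parser, so only the tags need handling:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, not part of jsoup.
final class LineBreakText {

    // Walks every descendant node of the parsed fragment, copying text
    // nodes verbatim and emitting '\n' for each <br> element.
    static String toPlainText(String html) {
        Element body = Jsoup.parseBodyFragment(html).body();
        final StringBuilder out = new StringBuilder();

        body.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // getWholeText() returns the decoded text without
                    // the whitespace normalization that text() applies.
                    out.append(((TextNode) node).getWholeText());
                } else if ("br".equals(node.nodeName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return out.toString();
    }
}
```

Calling LineBreakText.toPlainText("Gl&oacute;andi augu<br />Bl&oacute;&eth; alv&ouml;ru") should give the two lyric lines separated by a single newline. One caveat: getWholeText() also keeps any newline characters that are already in the HTML source, so markup like the sample above, where each <br /> sits at the start of a new source line, would come out with doubled line breaks unless those source newlines are stripped from the text nodes first.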
