jsoup - strip all formatting and link tags, keep text only

Views: 122 Published: 2018/6/13 17:54:08 Tags: java html jsoup

Problem description

Let's say I have an HTML fragment like this:

    <p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>

What I want to extract from it is:

    foo bar foobar baz

So my question is: how can I strip all the wrapping tags from an HTML fragment and get only the text, in the same order as it appears in the HTML? As the title says, I want to use jsoup for the parsing.

Example of accented HTML (note the 'á' character):

    <p><strong>Tarthatatlan biztonsági viszonyok</strong></p>
    <p><strong>Tarthatatlan biztonsági viszonyok</strong></p>

What I want:

    Tarthatatlan biztonsági viszonyok
    Tarthatatlan biztonsági viszonyok

This HTML is not static. In general I just want all the text of a generic HTML fragment in decoded, human-readable form, with line breaks.

Solution

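With Jsoup:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    final String html = "<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>";
    Document doc = Jsoup.parse(html);

    // text() returns the combined, whitespace-normalized text of the whole document
    System.out.println(doc.text());

Output:

    foo bar foobar baz

If you want only the text of the p tags, use this instead of doc.text():

    doc.select("p").text();

... or only the body:

    doc.body().text();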
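
The question also asks for text in "decoded human readable form". As a minimal sketch, assuming jsoup is on the classpath, the calls above can be wrapped in a small helper to show that text() decodes HTML entities as well as dropping tags. The class name TextOnly and the method name strip are hypothetical, chosen only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TextOnly {

    // Hypothetical helper: parse an HTML fragment and return only its text,
    // in document order, with tags removed and entities decoded.
    static String strip(String html) {
        Document doc = Jsoup.parse(html);
        return doc.text();
    }

    public static void main(String[] args) {
        // The fragment from the question
        System.out.println(strip("<p> <span> foo </span> <em> bar <a> foobar </a> baz </em> </p>"));
        // prints: foo bar foobar baz

        // An escaped accented character: &aacute; is decoded to 'á'
        System.out.println(strip("<p>biztons&aacute;gi <a>viszonyok</a></p>"));
        // prints: biztonsági viszonyok
    }
}
```

Because text() works on the parsed tree, nested or unknown tags need no special handling; only the text nodes contribute to the result.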

Line breaks:

    // also needs: import org.jsoup.nodes.Element;
    final String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
            + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
    Document doc = Jsoup.parse(html);

    for (Element element : doc.select("p")) {
        // One line of output per <p> element
        System.out.println(element.text());
        // e.g. you can use a StringBuilder and append lines here ...
    }

Output:

    Tarthatatlan biztonsági viszonyok
    Tarthatatlan biztonsági viszonyok
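
The StringBuilder idea mentioned in the loop comment could look roughly like the sketch below. It assumes jsoup is on the classpath and that one line per p element is the desired granularity. The class name ParagraphLines and the method name paragraphsToLines are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphLines {

    // Hypothetical helper: return the text of every <p> element,
    // one paragraph per line, joined with '\n'.
    static String paragraphsToLines(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder sb = new StringBuilder();
        for (Element p : doc.select("p")) {
            if (sb.length() > 0) {
                sb.append('\n'); // separator between paragraphs, not after the last one
            }
            sb.append(p.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>"
                + "<p><strong>Tarthatatlan biztonsági viszonyok</strong></p>";
        System.out.println(paragraphsToLines(html));
    }
}
```

Text that sits outside any p element is ignored by this selector, so a fragment that mixes paragraphs with other block elements would need a broader selector.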
