How do I preserve line breaks when using jsoup to convert html to plain text?

Views: 59 Published: 2021/11/25 15:35:06 Tags: java jsoup

Question

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     // Parses the HTML and returns its text with all tags removed.
     public String noTags(String str){
         return Jsoup.parse(str).text();
     }

     public static void main(String args[]) {
         String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println((text.noTags(strings)));
     }
 }

I get this result:

hello world yo googlez

But I want the line breaks preserved:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

Recommended answer

A solution that actually preserves line breaks looks like this:

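import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // named Safelist in newer jsoup releases

public static String br2nl(String html) {
    if(html==null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
    // insert the literal two-character marker \n (backslash + n) at each break point
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // turn each literal \n marker into a real newline character
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}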

It satisfies the following requirements:

If the original html contains a newline (\n), it is preserved.
If the original html contains br or p tags, they are translated to newlines (\n).
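To make the two requirements concrete, here is a minimal standard-library sketch that does not use jsoup at all. The class name LineBreakSketch, the method toPlainText, and the regular expressions are all hypothetical and exist only for illustration. Regex tag stripping is far cruder than a real parser (it ignores entities, comments, and script or style contents), so treat this as a picture of the expected input-to-output mapping, not as a replacement for the jsoup answer.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only: plain regex, not jsoup.
public class LineBreakSketch {
    // <br>, <br/>, <br /> in any letter case
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // opening <p> with optional attributes; does not match <pre>
    private static final Pattern P_OPEN = Pattern.compile("(?i)<p(\\s[^>]*)?>");
    // any remaining tag
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    public static String toPlainText(String html) {
        if (html == null) {
            return null;
        }
        // br becomes one newline, each opening p becomes two
        String s = BR.matcher(html).replaceAll("\n");
        s = P_OPEN.matcher(s).replaceAll("\n\n");
        // drop every other tag; newlines already in the input are left untouched
        return ANY_TAG.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String body = "<p><b>hello world</b></p><p><br><b>yo</b> "
                + "<a href=\"http://google.com\">googlez</a></p>";
        System.out.println(toPlainText(body));
    }
}
```

In this sketch the second paragraph is preceded by three newlines (two from the p tag, one from the br), so callers who want exactly one break between lines would still need to collapse repeated newlines afterwards.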
