How to keep line breaks when using Jsoup.parse?

Question

This is not a duplicate. There was a similar question, but none of its answers can deal with a real HTML file. Anyone can save any HTML page, even this one, and try to run any of the solutions from those answers: none of them solves the problem completely.

The problem is:

I have a saved .htm file on my desktop and need to get the plain text from it. However, I do need to keep the line breaks, so that the text does not end up on just one or a couple of lines.

I tried the following, along with all the methods from here:

        FileInputStream in = new FileInputStream("C:\\...myfile.htm");
        String htmlText = IOUtils.toString(in);
        for (String line : htmlText.split("\n")) {
            String stripped = Jsoup.parse(line).text();
            System.out.println(stripped);
        }

This only preserves the line breaks of the HTML file itself. The text is still messed up, because tags such as <br> and <p> are removed. How can I parse the file so that the text keeps all of its natural line breaks?

Recommended answer

This is a difference I've noticed between jsoup and, say, Selenium: Selenium keeps the line breaks when extracting text and jsoup does not. With that said, I think the best route is to get the inner HTML of the node you are extracting text from, then do a replaceAll on that inner HTML to replace <br> and <p> with line breaks.
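A rough sketch of that replaceAll idea, using only the Java standard library. The input string stands in for the inner HTML of a node (what jsoup's Element.html() returns). The class and method names are made up for illustration. Stripping the leftover tags with a second regex is a crude stand-in for a real parser, not part of the original suggestion: it does not decode entities such as &amp;amp; and will trip over script blocks or comments. Note that feeding the result back through Jsoup.parse(...).text() would normalize the inserted newlines away again.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class BreakMarkers {
    // Matches <br>, <br/>, </br>, and opening or closing <p> tags (any case),
    // but not other tags that merely start with "p" or "b", such as <pre> or <b>.
    private static final Pattern BREAK_TAGS =
            Pattern.compile("(?i)<\\s*/?\\s*(br|p)\\b[^>]*>");

    // Crude assumption: everything else between < and > is a tag to drop.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toTextWithBreaks(String innerHtml) {
        // Step 1: turn the tags that imply a line break into real newlines.
        String withBreaks = BREAK_TAGS.matcher(innerHtml).replaceAll("\n");
        // Step 2: remove whatever markup is left.
        return ANY_TAG.matcher(withBreaks).replaceAll("");
    }

    public static void main(String[] args) {
        String innerHtml = "<div>Hello<br>world<p>next <b>para</b></p></div>";
        System.out.println(toTextWithBreaks(innerHtml));
    }
}
```

Because both the opening and the closing <p> become a newline, adjacent paragraphs end up separated by a blank line; collapse consecutive newlines afterwards if that is not wanted.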

As a more complete solution: instead of reading the file line by line, could you traverse the HTML more natively? Your best bet would be to walk the tree with something like a recursive function. When you hit a TextNode, append its text to the stripped variable from your example; when you hit a <p> or <br> element, add a line feed as needed.

Something like:

Document doc = Jsoup.parse(htmlText);

Then pass that to a recursive function that visits each child node:

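import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String getText(Element parentElement) {
     StringBuilder working = new StringBuilder();
     for (Node child : parentElement.childNodes()) {
          if (child instanceof TextNode) {
              // Node itself has no text() method, so cast to TextNode first
              working.append(((TextNode) child).text());
          } else if (child instanceof Element) {
              Element childElement = (Element) child;
              // do more of these for p or other tags you want a new line for
              if (childElement.tag().getName().equalsIgnoreCase("br")) {
                   working.append("\n");
              }
              // recurse so that nested text is collected in document order
              working.append(getText(childElement));
          }
     }

     return working.toString();
 }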

Then you can just call the function to strip the text:

 String strippedText = getText(doc);

Not the simplest solution, but one I can think of that should work if you want to extract all the text from an HTML document. I haven't run this code, I just wrote it now, so I apologize if I missed something. But it should give you the general idea.
