Parsing HTML into formatted plaintext using jsoup


Question

I am working on a Maven project that parses HTML data from a website. I was able to parse it using the code below:

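import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public void parseData() {
    String url = "http://stackoverflow.com/help/on-topic";
    try {
        // Fetch the page and parse it into a DOM
        Document doc = Jsoup.connect(url).get();
        // Take the first <div class="col-section">
        Element essay = doc.select("div.col-section").first();
        // text() returns the combined text of all children as one string
        String essayText = essay.text();
        jTextAreaAdem.setText(essayText);
    } catch (IOException ex) {
        Logger.getLogger(formAdem.class.getName()).log(Level.SEVERE, null, ex);
    }
}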

So far I have no problems: I can parse the HTML. I use jsoup's select method with "div.col-section", which looks for a div element whose class is col-section, and I print the result in a text area. The result is one huge paragraph, even though the content on the website spans several paragraphs. How can I parse the data so that it keeps the layout it has on the website?

Recommended answer

The text is not formatted because the formatting lives in the HTML, in tags such as <p> and <ol>. Calling .text() on a block element returns the combined text of all its children with whitespace normalized, so that formatting is lost.

jsoup has an example HTML to plain text converter, which you can adapt to your needs by providing the div element as the focus.
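To give a feel for what such an adaptation might look like, here is a simplified sketch of the same idea: walk the node tree under the div and emit a newline at block boundaries. It is not jsoup's bundled example. The class name BlockAwareText, the method toPlainText, the sample HTML, and the choice of which tags produce line breaks are all assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class BlockAwareText {
    // Hypothetical helper: walks the subtree under root and builds plain text,
    // adding a newline after paragraphs, headings and <br>, and a bullet per <li>.
    static String toPlainText(Element root) {
        final StringBuilder sb = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    sb.append(((TextNode) node).text());
                } else if (node.nodeName().equals("li")) {
                    sb.append("\n * ");
                }
            }

            @Override
            public void tail(Node node, int depth) {
                String name = node.nodeName();
                // Assumed set of tags that should end a line
                if (name.equals("p") || name.equals("br") || name.matches("h[1-6]")) {
                    sb.append("\n");
                }
            }
        });
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Sample markup standing in for the real page
        String html = "<div class=\"col-section\"><p>First paragraph.</p>"
                + "<p>Second paragraph.</p><ol><li>One</li><li>Two</li></ol></div>";
        Element focus = Jsoup.parse(html).select("div.col-section").first();
        System.out.println(toPlainText(focus));
    }
}
```

In the question's code, the returned string could be passed to jTextAreaAdem.setText in place of essay.text().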

Alternatively, you could select "div.col-section > *", iterate through each Element, and print out its text followed by a newline.
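A minimal sketch of this second approach is shown here. The class name ChildTextJoiner, the method childrenAsLines, and the sample HTML are hypothetical; only select, text, and Jsoup.parse come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ChildTextJoiner {
    // Hypothetical helper: collects the text of every direct child of the
    // elements matching containerSelector, one child per line.
    static String childrenAsLines(Document doc, String containerSelector) {
        StringBuilder sb = new StringBuilder();
        for (Element child : doc.select(containerSelector + " > *")) {
            String text = child.text();
            if (!text.isEmpty()) {
                sb.append(text).append('\n');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Sample markup standing in for the real page
        String html = "<div class=\"col-section\"><p>First paragraph.</p>"
                + "<p>Second paragraph.</p><ol><li>One</li><li>Two</li></ol></div>";
        Document doc = Jsoup.parse(html);
        System.out.println(childrenAsLines(doc, "div.col-section"));
    }
}
```

Because text() is still called on each child, anything nested inside a child (for example the items of an <ol>) ends up on a single line. If that matters, a deeper selector or a tree walk is probably the better fit.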
