JSoup - Parse HTML tag by tag


Problem description


I'm developing a text parser in Java and I was asked to enhance it so that it can also parse HTML. The parser's purpose is to split the parsed file into 3 other files: one with all the words contained in the file, one with all the sentences, and one with all the questions.

The *.txt part works perfectly, but I have a problem when parsing HTML.

I create a temporary file with a *.txt extension and pass it to my text parser, but if I pass a URL that points to an HTML file formed like this:

<!DOCTYPE html>
<html>
    <head>
        ... some HTML here ...
    </head>
    <body>
        <ul class="some_menu">
            <li class="some_menu_item">n1</li>
            <li class="some_menu_item">n2</li>
            <li class="some_menu_item">n3</li>
        </ul>
        <div>
            This is a question ?
            This is a sentence .
            ... some other text ...
        </div>
    </body>
</html>

the question file will be filled with: n1 n2 n3 This is a question

So I was wondering: is there a way to parse tag by tag with JSoup, so that I can add a line feed each time a block is closed?

If you need more information, don't hesitate to ask!

Edit: For this example, I should get 3 output files:

  1. One with all the words

    n1
    n2
    n3
    This
    is
    a
    question
    sentence
    ... some other words ...
    

  2. One with all the sentences

    This is a sentence
    

  3. One with all the questions

    This is a question
    

TimmyM

Solution

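To get all the text in an HTML body, you can use:

// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.select.Elements.
Document doc = Jsoup.connect(url).get();
Elements body = doc.select("body");
// Elements is a list, not an array, so use first() (or get(0)) instead of [0].
String allText = body.first().text();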

You can then split the text to separate the words. To get the text in the div tag, you can use:

Elements div = doc.select("div");
// first() returns the first matching div.
String divText = div.first().text();

You can then split the divText to get each sentence.
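
The splitting step itself needs no JSoup. Here is a minimal sketch in plain Java, assuming that a sentence ends with a period and a question ends with a question mark, and that the words file should list each word once in order of first appearance (as the expected output in the question suggests). The class and method names (TextSplitter, unitsEndingWith, words) are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextSplitter {
    // A run of non-terminator characters followed by exactly one terminator (. ? !).
    private static final Pattern UNIT = Pattern.compile("([^.?!]+)([.?!])");

    /** Returns the pieces of text that end with the given terminator, without the terminator. */
    public static List<String> unitsEndingWith(String text, char terminator) {
        List<String> result = new ArrayList<>();
        Matcher m = UNIT.matcher(text);
        while (m.find()) {
            String body = m.group(1).trim();
            if (!body.isEmpty() && m.group(2).charAt(0) == terminator) {
                result.add(body);
            }
        }
        return result;
    }

    /** Returns the distinct words, in order of first appearance (assumed requirement). */
    public static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        // Split on anything that is neither a letter nor a digit.
        for (String word : text.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String divText = "This is a question ? This is a sentence .";
        System.out.println(unitsEndingWith(divText, '?')); // questions
        System.out.println(unitsEndingWith(divText, '.')); // sentences
        System.out.println(words(divText));                // distinct words
    }
}
```

Calling unitsEndingWith(divText, '?') gives the questions and unitsEndingWith(divText, '.') gives the sentences, so one pass over the same string can feed all three output files.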

Note that the return type of the select query is actually a list of Element, i.e. Elements. That's because more than one element can match your select query. In this case, since only one element matches in each case, we take the first element of the returned list (index 0).
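
When a query matches several elements, such as the three menu items in the question, each match can be read separately instead of taking only the first one. The sketch below assumes the JSoup library is on the classpath and parses an HTML string with Jsoup.parse to stay self-contained. The class name MenuItems and the method textsOf are hypothetical.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class MenuItems {
    /** Returns the text of every element matched by the CSS query, one entry per element. */
    public static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        // The result may hold zero, one, or many elements.
        Elements matches = doc.select(cssQuery);
        // eachText() leaves out matched elements that have no text.
        return matches.eachText();
    }

    public static void main(String[] args) {
        String html = "<ul class=\"some_menu\">"
                + "<li class=\"some_menu_item\">n1</li>"
                + "<li class=\"some_menu_item\">n2</li>"
                + "<li class=\"some_menu_item\">n3</li>"
                + "</ul>";
        for (String item : textsOf(html, "li.some_menu_item")) {
            System.out.println(item);
        }
    }
}
```

Keep in mind that first() returns null when nothing matched, so code that relies on it should check the result before calling text().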

Edit: To iterate through all elements, check this answer. Basically:

Elements elements = doc.body().select("*");

for (Element element : elements) {
    System.out.println(element.text());
}

Some elements may have no text, though, so you may want to add a check for that.
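
One thing to watch with the loop above: element.text() returns the combined text of an element and all of its children, so iterating over every element prints nested text more than once. A possible alternative, offered as a sketch rather than as the method this answer describes, is to walk the node tree with JSoup's NodeVisitor and emit a line feed at block boundaries. It breaks both when a block element opens and when it closes, so that text preceding a nested block is not glued to it. The class name BlockText and the method textWithLineFeeds are made up for illustration. The example parses an HTML string, and with a URL you would start from Jsoup.connect(url).get() instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class BlockText {
    /** Returns the body text, one line per run of text between block boundaries. */
    public static String textWithLineFeeds(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder raw = new StringBuilder();

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() collapses runs of whitespace inside the text node.
                    raw.append(((TextNode) node).text());
                } else if (node instanceof Element && ((Element) node).isBlock()) {
                    raw.append('\n'); // a block element opens
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && ((Element) node).isBlock()) {
                    raw.append('\n'); // a block element closes
                }
            }
        });

        // Trim each line and drop the empty ones left by adjacent boundaries.
        StringBuilder out = new StringBuilder();
        for (String line : raw.toString().split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                out.append(trimmed).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<ul><li>n1</li><li>n2</li><li>n3</li></ul>"
                + "<div>This is a question ? This is a sentence .</div>";
        System.out.print(textWithLineFeeds(html));
    }
}
```

With this approach the menu items n1, n2 and n3 each land on their own line, so they are no longer merged into the first question when the temporary *.txt file is handed to the text parser.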
