Jsoup CSS selector code (XPath code included)


Question

I am trying to parse the HTML below using jsoup, but I can't work out the right selector syntax for it.

<div class="info"><strong>Line 1:</strong> some text 1<br>
  <b>some text 2</b><br>
  <strong>Line 3:</strong> some text 3<br>
</div>

I need to capture "some text 1", "some text 2" and "some text 3" in three separate variables.

I have the XPath for the first line (the one for line 3 should be similar), but I can't work out the equivalent CSS selector.

//div[@class='info']/strong[1]/following::text()

Please help.

On a separate note, I have a few hundred HTML files and need to parse them and extract data to store in a database. Is jsoup the best choice for this?

I am re-opening this question because I still haven't found a solution. Please help.

Solution

It really looks like jsoup can't get the text out of an element with mixed content. Here is a solution that applies the XPath you formulated, using XOM and TagSoup:

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        // TagSoup's SAX parser tolerates real-world HTML (e.g. unclosed <br>) and reports it as XHTML.
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        // build(String, String) parses the string itself; no base URI is needed here.
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        // TagSoup puts elements in the XHTML namespace, so the XPath needs a bound prefix.
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", new XPathContext("xhtml", root.getNamespaceURI()));
        // Print every text node that follows the first <strong>, in document order.
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}

This outputs:

 some text 1
some text 2
Line 3:
 some text 3

Without knowing more specifics of what you're trying to do, though, I'm not sure if this is exactly what you want.
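
If the goal really is three separate variables, one option might be to narrow the XPath instead of taking every following text node: `following::text()` also matches the text inside later elements (which is why "Line 3:" shows up), whereas `following-sibling::text()[1]` picks only the first text node that directly follows a given `<strong>`. The sketch below is not part of the original answer. It uses the JDK's built-in `javax.xml.xpath` on a hand-made, well-formed copy of the snippet (`<br/>` instead of `<br>`, which is an assumption), and the class name `InfoTextXPath` and the helper `extract` are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class InfoTextXPath {
    // Assumption: a well-formed variant of the question's snippet. <br> is
    // written as <br/> so the JDK's XML parser accepts it; real HTML would
    // first need a tidying parser such as TagSoup.
    static final String XHTML =
        "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
        + "<b>some text 2</b><br/>"
        + "<strong>Line 3:</strong> some text 3<br/></div>";

    // Hypothetical helper: evaluates an XPath expression against XHTML and
    // returns the trimmed string value of the first matching node.
    static String extract(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XHTML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // First text node directly after the first <strong>.
        String text1 = extract("//div[@class='info']/strong[1]/following-sibling::text()[1]");
        // The <b> element's own text.
        String text2 = extract("//div[@class='info']/b");
        // First text node directly after the second <strong>.
        String text3 = extract("//div[@class='info']/strong[2]/following-sibling::text()[1]");
        System.out.println(text1);
        System.out.println(text2);
        System.out.println(text3);
    }
}
```

With the XOM and TagSoup setup above, the same expressions would presumably need the `xhtml:` prefix on each element name and the same `XPathContext`.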
