Jsoup CSS selector code (XPath code included)
Question
I am trying to parse the HTML below using jsoup, but I can't get the syntax right.
<div class="info"><strong>Line 1:</strong> some text 1<br> <b>some text 2</b><br> <strong>Line 3:</strong> some text 3<br> </div>
I need to capture "some text 1", "some text 2" and "some text 3" in three different variables.
I have the XPath for the first line (the one for line 3 should be similar), but I can't work out the equivalent CSS selector.
//div[@class='info']/strong[1]/following::text()
Please help.
On a separate note, I have a few hundred HTML files and need to parse them and extract data to store in a database. Is Jsoup the best choice for this?
I am trying to re-open this question as I still haven't found the solution. Please help.
Solution
It really looks like Jsoup can't handle getting text out of an element with mixed content. Here is a solution based on the XPath you formulated, using XOM and TagSoup:
import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";

        // TagSoup turns the loose HTML into well-formed XHTML that XOM can build a tree from.
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();

        // Bind the "xhtml" prefix to the root element's namespace so the XPath can match the elements.
        final Nodes textElements = root.query(
                "//xhtml:div[@class='info']/xhtml:strong[1]/following::text()",
                new XPathContext("xhtml", root.getNamespaceURI()));

        // Print every text node that follows the first <strong>.
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}
This outputs:
some text 1
some text 2
Line 3:
some text 3
Without knowing more specifics of what you're trying to do though, I'm not sure if this is exactly what you want.
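If the goal is only the three values, the label "Line 3:" in the output above is unwanted, because following::text() returns every text node after the first strong element. The sketch below is one possible way to narrow that down. It is not part of the original answer and it makes several assumptions: it uses only the JDK's built-in DOM and XPath classes instead of XOM and TagSoup, it requires well-formed markup (so the br tags are written as <br/>), and the class name InfoTextExtractor, the method extractValues and the XPath filter are illustrative choices.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not from the original answer.
class InfoTextExtractor {

    // Assumption: the snippet has been made well-formed (<br/> instead of <br>),
    // because the JDK XML parser rejects unclosed tags.
    static final String XHTML =
            "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
            + "<b>some text 2</b><br/>"
            + "<strong>Line 3:</strong> some text 3<br/></div>";

    // Returns the trimmed text nodes inside div.info, skipping the <strong> labels
    // and any whitespace-only nodes.
    static List<String> extractValues(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // not(parent::strong) drops "Line 1:" and "Line 3:";
            // [normalize-space()] drops nodes that contain only whitespace.
            String expr = "//div[@class='info']//text()[not(parent::strong)][normalize-space()]";
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);

            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract text from the snippet", e);
        }
    }

    public static void main(String[] args) {
        List<String> values = extractValues(XHTML);
        // The three variables the question asks for.
        String text1 = values.get(0);
        String text2 = values.get(1);
        String text3 = values.get(2);
        System.out.println(text1);
        System.out.println(text2);
        System.out.println(text3);
    }
}
```

For real-world HTML files that are not well-formed, the same XPath filter could presumably be passed to the XOM query in the answer above (with the xhtml: prefix added to the element names), since TagSoup takes care of the cleanup there.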