Extract unidentified HTML content from between two tags using jsoup? Regex?

Views: 95 · Published: 2018/6/22 20:02:41 · Tags: java html parsing jsoup wikipedia


Problem description

I want to get the names of all those links from between the two h2 tags there:

    <h2><span class="mw-headline" id="People">People</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=1" title="Edit section: People">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
    <ul>
    <li><a href="/wiki/George_H._W._Bush" title="George H. W. Bush">George H. W. Bush</a> (born 1924), the 41st president of the United States of America</li>
    <li><a href="/wiki/George_W._Bush" title="George W. Bush">George W. Bush</a> (born 1946), the 43rd president of the United States of America</li>
    <li><a href="/wiki/Jeb_Bush" title="Jeb Bush">Jeb Bush</a> (born 1953), the former governor of Florida and also a member of the Bush family</li>
    <li><a href="/wiki/Bush_family" title="Bush family">Bush family</a>, the political family that includes both presidents</li>
    <li><a href="/wiki/Bush_(surname)" title="Bush (surname)">Bush (surname)</a>, a surname (including a list of people with the name) </li>
    </ul>
    <h2><span class="mw-headline" id="Places.2C_United_States">Places, United States</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=2" title="Edit section: Places, United States">edit</a><span class="mw-editsection-bracket">]</span></span></h2>

Neither this

    Elements h2next = docx.select("span.mw-headline#People");
    do {
        ul = h2next.select("ul").first();
        System.out.println(ul.text());
    } while (h2next != null && ul == null);

nor

    //String content = docx.getElementById("People").outerHtml();

works.

It seems like this guy has the right idea, but I can't adapt it to my situation.

Maybe I should just use regex?

It seems Wikipedia's HTML is kind of "unstructured" and hard to work with.

From the Wikipedia disambiguation page I want to grab the different senses in which Bush (or whatever ambiguous name I'm considering) could refer to a person.

I've tried all kinds of ways to grab this data using jsoup but I've not been able to figure it out.

I tried this:

    Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();
    Element contentDiv = docx.select("span#mw-headlinePeople").first();
    String printMe = contentDiv.toString(); // The result

since I noticed that the data I want lives in a section marked up as:

    <h2><span class="mw-headline" id="People">

But that outputs nothing.

I tried some variations on that based on previous questions, like this one:

    .select("span#mw-headlinePeople");

but still nothing.

How can I get at that info?

Ideally, what I'd like is something like this:

    George H. W. Bush
    George W. Bush
    Jeb Bush

I know I'll probably also get Bush family and Bush (surname) at first, since they're part of that section, but I guess I can just remove them later.

Also, is it faster to use this:

    Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();

or this:

    URL site_two = new URL("https://en.wikipedia.org/wiki/Bush");
    URLConnection ycb = site_two.openConnection();
    BufferedReader inb = new BufferedReader(
            new InputStreamReader(ycb.getInputStream()));
    StringBuilder sb = new StringBuilder();
    String inputLine;
    // readLine() is called only in the loop condition; calling it again
    // inside the body would silently skip every other line.
    while ((inputLine = inb.readLine()) != null) {
        // get the disambig
        // System.out.println(inputLine);
        sb.append(inputLine);
        sb.append(System.lineSeparator());
    }

I tried using this site, but it turned out not to be very useful. Someone should make a jsoup site like all those regex sites.

Solution

One possible way is to select both all headlines (span.mw-headline) and all links (the best selector I found was li > a).

If you select both with one selector by combining them with a comma (,), the results will be in the order they appear on the page. You can therefore keep track of whether you are in the "People" section while looping through the results, like this:

    Elements elements = docx.select("span.mw-headline, li > a");
    boolean inPeopleSection = false;
    for (Element elem : elements) {
        if (elem.className().equals("mw-headline")) {
            // It's a headline
            inPeopleSection = elem.id().equals("People");
        } else {
            // It's a link
            if (inPeopleSection) {
                System.out.println(elem.text());
            }
        }
    }

Output:

    George H. W. Bush
    George W. Bush
    Jeb Bush
    Bush family
    Bush (surname)

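The output still contains the two non-person entries the question expected to remove later. One way to do that might be to look at the full text of each list item instead of just the link. The following is only a sketch: it assumes that a person entry has the form "Name (born YYYY), ...", as in the HTML from the question, so it would miss people listed with other date formats. The class name `PeopleFilter` and the method `keepPeople` are made up for illustration. With jsoup, the list item text for a link could be obtained with `elem.parent().text()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PeopleFilter {
    // Assumption: a person entry looks like "Name (born 1924), description".
    private static final Pattern PERSON =
            Pattern.compile("^(.+?) \\(born \\d{4}\\)");

    /** Returns the names of the entries that match the "(born YYYY)" pattern. */
    public static List<String> keepPeople(List<String> listItemTexts) {
        List<String> names = new ArrayList<>();
        for (String text : listItemTexts) {
            Matcher m = PERSON.matcher(text.trim());
            if (m.find()) {
                names.add(m.group(1)); // the part before " (born YYYY)"
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // List item texts taken from the HTML in the question.
        List<String> items = Arrays.asList(
                "George H. W. Bush (born 1924), the 41st president of the United States of America",
                "George W. Bush (born 1946), the 43rd president of the United States of America",
                "Jeb Bush (born 1953), the former governor of Florida and also a member of the Bush family",
                "Bush family, the political family that includes both presidents",
                "Bush (surname), a surname (including a list of people with the name)");
        for (String name : keepPeople(items)) {
            System.out.println(name);
        }
    }
}
```

Filtering on the list item text rather than on a hard-coded list of names keeps the approach usable for other disambiguation pages, as long as the "(born YYYY)" assumption holds there.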
Regarding performance, I wouldn't expect it to make any difference at all; just go with the simpler version (although I have very limited jsoup experience, so don't take my word for it).
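As a possible alternative to the combined selector, the section can also be walked directly. Two details of the question's attempts matter here. The selector `span#mw-headlinePeople` looks for an element whose id is `mw-headlinePeople`, while in the markup `mw-headline` is the class and `People` is the id. And calling `select("ul")` on the span searches inside the span, whereas the `ul` is a sibling of the enclosing `h2`. The sketch below assumes the structure shown in the question (an `h2` followed by sibling `ul` elements until the next `h2`); the live Wikipedia markup may differ. The class `SectionLinks`, the method `linksInSection`, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SectionLinks {
    /**
     * Collects the text of every "li > a" link located between the h2 that
     * contains the element with the given id and the next h2.
     */
    public static List<String> linksInSection(Document doc, String headlineId) {
        List<String> names = new ArrayList<>();
        // The id sits on the span inside the h2, so look it up and step up to the h2.
        Element headline = doc.getElementById(headlineId);
        if (headline == null || headline.parent() == null) {
            return names;
        }
        Element h2 = headline.parent();
        // Walk the siblings that follow the h2 and stop at the next h2.
        for (Element sib = h2.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            for (Element link : sib.select("li > a")) {
                names.add(link.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical sample in the shape of the question's HTML.
        String html = "<h2><span class=\"mw-headline\" id=\"People\">People</span></h2>"
                + "<ul><li><a href=\"/wiki/Jeb_Bush\">Jeb Bush</a> (born 1953)</li>"
                + "<li><a href=\"/wiki/Bush_family\">Bush family</a></li></ul>"
                + "<h2><span class=\"mw-headline\" id=\"Places\">Places</span></h2>"
                + "<ul><li><a href=\"/wiki/Example_place\">Example place</a></li></ul>";
        Document doc = Jsoup.parse(html);
        for (String name : linksInSection(doc, "People")) {
            System.out.println(name);
        }
    }
}
```

On the sample HTML this prints the two links of the People section and skips the link under Places. To run it against the real page, the document could be loaded with `Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get()` as in the question.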
