Extract loosely structured Wikipedia text (HTML)


Question

Some of the HTML on Wikipedia disambiguation pages is, shall we say, ambiguous: the links that point to specific people named Corzine are difficult to capture with jsoup because they are not explicitly structured, nor do they live in a particular section as in this example. See the Corzine page here.

How can I get hold of them? Is jsoup a suitable tool for this task?

Perhaps I should use a regex, but I'm wary of doing that because I want the solution to be generalizable.

</b> may refer to:</p> 
 <ul> 
  <li><a href

^ This part is standard, so maybe I could use a regex to match it?

<p><b>Corzine</b> may refer to:</p> 
 <ul> 
  <li><a href="/wiki/Dave_Corzine" title="Dave Corzine">Dave Corzine</a> (born 1956), basketball player</li> 
  <li><a href="/wiki/Jon_Corzine" title="Jon Corzine">Jon Corzine</a> (born 1947), former CEO of <a href="/wiki/MF_Global" title="MF Global">MF Global</a>, former Governor on New Jersey, former CEO of <a href="/wiki/Goldman_Sachs" title="Goldman Sachs">Goldman Sachs</a></li> 
 </ul> 
 <table id="setindexbox" class="metadata plainlinks dmbox dmbox-setindex" style="" role="presentation"> 

The ideal output would be:

Dave Corzine
Jon Corzine

Maybe it would be possible to match the section </b> may refer to:</p> and also <table id="setindexbox", and extract everything in between. I guess <table id="setindexbox" could be matched easily enough in jsoup, but </b> may refer to:</p> should be more difficult because <b> and <p> are not very distinctive.

I tried:

      Elements table = docx.select("ul");
      Elements links = table.select("li");



    Pattern ppp = Pattern.compile("table id=\"setindexbox\" ");
    Matcher mmm = ppp.matcher(inputLine);

    Pattern pp = Pattern.compile("</b> may refer to:</p>");
    Matcher mm = pp.matcher(inputLine);
    if (mm.matches()) 
    {
    while(!mmm.matches())
      for (Element link: links) 
      {
          String url = link.attr("href");
          String text = link.text();
          System.out.println(text + ", " + url);
      }
    }

But it didn't work.

Recommended answer

This selector works:

Elements els = doc.select("p ~ ul a:eq(0)");

See: http://try.jsoup.org/~yPvgR0pxvA3oWQSJte4Rfm-lS2Y

That looks for an A element that is the first child of its parent (a:eq(0)), inside a ul that is a later sibling of a p. You could also use p:contains(corzine) ~ ul a:eq(0) if there were other conflicts.

Or, more generally: :contains(may refer to) ~ ul a:eq(0)
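As a minimal sketch of how the general selector could be used end to end: the class name DisambiguationLinks, the helper extractNames, and the SAMPLE constant are made up for illustration, and SAMPLE is an abbreviated copy of the HTML from the question (single-quoted attributes, title attributes dropped). It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DisambiguationLinks {

    // Abbreviated copy of the markup from the question (an assumption, not the live page).
    static final String SAMPLE =
          "<p><b>Corzine</b> may refer to:</p>"
        + "<ul>"
        + "<li><a href='/wiki/Dave_Corzine'>Dave Corzine</a> (born 1956), basketball player</li>"
        + "<li><a href='/wiki/Jon_Corzine'>Jon Corzine</a> (born 1947), former CEO of "
        + "<a href='/wiki/MF_Global'>MF Global</a></li>"
        + "</ul>"
        + "<table id='setindexbox'></table>";

    // Selector from the answer: <a> elements that are the first child of their
    // parent, inside a <ul> that follows an element containing "may refer to".
    static final String SELECTOR = ":contains(may refer to) ~ ul a:eq(0)";

    // Hypothetical helper: returns the link text of every matching <a>.
    static List<String> extractNames(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element link : doc.select(SELECTOR)) {
            names.add(link.text());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one name per line.
        for (String name : extractNames(SAMPLE)) {
            System.out.println(name);
        }
    }
}
```

On the sample above this should print Dave Corzine and Jon Corzine; the MF Global link is skipped because it is the second child of its li, not the first.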

It's hard to generalize across Wikipedia because its markup is only loosely structured. But IMHO it's easier to use a parser and CSS selectors than regexes, particularly over time as templates change.
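The question's idea of collecting everything between the "may refer to" paragraph and the setindexbox table can also be done with the parser instead of a regex. The following is only a sketch under assumptions: the class DisambiguationWalker, the helper entryLinks, and the base URI https://en.wikipedia.org/wiki/Corzine are illustrative choices, SAMPLE is an abbreviated copy of the question's HTML, and selectFirst requires a reasonably recent jsoup release.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DisambiguationWalker {

    // Abbreviated copy of the markup from the question (an assumption, not the live page).
    static final String SAMPLE =
          "<p><b>Corzine</b> may refer to:</p>"
        + "<ul>"
        + "<li><a href='/wiki/Dave_Corzine'>Dave Corzine</a> (born 1956), basketball player</li>"
        + "<li><a href='/wiki/Jon_Corzine'>Jon Corzine</a> (born 1947), former CEO of "
        + "<a href='/wiki/MF_Global'>MF Global</a></li>"
        + "</ul>"
        + "<table id='setindexbox'></table>";

    // Hypothetical helper: maps each list entry's first link text to its absolute URL.
    static Map<String, String> entryLinks(String html, String baseUri) {
        // The base URI lets absUrl() resolve relative hrefs such as /wiki/Dave_Corzine.
        Document doc = Jsoup.parse(html, baseUri);
        Map<String, String> result = new LinkedHashMap<>();

        Element intro = doc.selectFirst("p:contains(may refer to)");
        if (intro == null) {
            return result; // not a page of the expected shape
        }

        // Walk the siblings after the intro paragraph until the setindexbox table.
        for (Element sib = intro.nextElementSibling(); sib != null; sib = sib.nextElementSibling()) {
            if ("setindexbox".equals(sib.id())) {
                break;
            }
            for (Element item : sib.select("li")) {
                // First link in document order inside this list item.
                Element first = item.selectFirst("a[href]");
                if (first != null) {
                    result.put(first.text(), first.absUrl("href"));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        entryLinks(SAMPLE, "https://en.wikipedia.org/wiki/Corzine")
            .forEach((name, url) -> System.out.println(name + ", " + url));
    }
}
```

Unlike a:eq(0), this takes the first link anywhere inside each li, so it would still pick up an entry whose link is wrapped in another element. Whether that is preferable depends on the pages being scraped.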
