Get first lines of Wikipedia Article

Question

I have a Wikipedia article and I want to fetch the first z lines (or the first x characters, or the first y words, it doesn't matter which) from it.

The problem: I can get either the source wikitext (via the API) or the parsed HTML (via a direct HTTP request, possibly on the print version), but how can I find the first lines that are actually displayed? Normally the source (both HTML and wikitext) starts with the infoboxes and images, and the first real text to display is somewhere further down in the code.

For example: Albert Einstein on Wikipedia (print version). Looking at the code, the first line of real text, "Albert Einstein (pronounced /ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɐt ˈaɪ̯nʃtaɪ̯n]; 14 March 1879–18 April 1955) was a theoretical physicist.", is not at the start. The same applies to the wiki source, which starts with the same infobox and so on.

So how would you accomplish this task? The programming language is Java, but that shouldn't matter.

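One solution that came to mind was to use an XPath query, but such a query would be rather complicated if it had to handle all the edge cases. [update]It wasn't that complicated, see my solution below![/update]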

Thanks!

Recommended answer

I worked out the following solution: use an XPath query on the XHTML source code (I took the print version because it is shorter, but it also works on the normal version).

//html/body//div[@id='bodyContent']/p[1]

This works on the German and the English Wikipedia, and I haven't found an article where it doesn't output the first paragraph. The solution is also quite fast. I also thought of taking only the first x characters of the XHTML, but that would render the XHTML invalid.
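To illustrate why the query skips the infobox, here is a minimal sketch that runs the same XPath expression against a small inline document instead of a live page. The class name FirstParagraphDemo and the SAMPLE markup are hypothetical and heavily simplified; a real article page is far larger and may be structured differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class FirstParagraphDemo {
    // Hypothetical stand-in for an article page: an infobox table and an
    // image box come before the first real paragraph.
    static final String SAMPLE =
        "<html><body><div id='content'><div id='bodyContent'>"
        + "<table class='infobox'><tr><td><p>Born: 14 March 1879</p></td></tr></table>"
        + "<div class='thumb'>image caption</div>"
        + "<p>Albert Einstein was a theoretical physicist.</p>"
        + "<p>Second paragraph.</p>"
        + "</div></div></body></html>";

    static String firstParagraph(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "/p[1]" selects the first <p> that is a direct child of the div,
        // so the <p> nested inside the infobox table is not matched.
        // evaluate(String, Object) returns the text content of that node.
        return xpath.evaluate("//html/body//div[@id='bodyContent']/p[1]", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstParagraph(SAMPLE));
    }
}
```

Because the paragraph inside the infobox is a child of a table cell rather than of the bodyContent div, the query returns the text of the first top-level paragraph.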

If someone is looking for the Java code, here it is:

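import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// The members below belong inside a class. "logger" is not defined in this
// snippet; it is assumed to be a log4j-style logger declared in that class.

private static DocumentBuilderFactory dbf;
static {
    dbf = DocumentBuilderFactory.newInstance();
    // Do not fetch the external XHTML DTD while parsing
    dbf.setAttribute("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
}
private static XPathFactory xpathf = XPathFactory.newInstance();
private static String xexpr = "//html/body//div[@id='bodyContent']/p[1]";

private static String getPlainSummary(String url) {
    try {
        // Open the wiki page
        URL u = new URL(url);
        URLConnection uc = u.openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1) Gecko/20090616 Firefox/3.5");

        // try-with-resources closes the stream even if parsing fails
        try (InputStream uio = uc.getInputStream()) {
            InputSource src = new InputSource(uio);

            // Construct the builder and parse the page as XML
            DocumentBuilder builder = dbf.newDocumentBuilder();
            Document docXML = builder.parse(src);

            // Apply the XPath; the result is the text content of the first paragraph
            XPath xpath = xpathf.newXPath();
            XPathExpression xpathe = xpath.compile(xexpr);
            String s = xpathe.evaluate(docXML);

            // Return null when nothing matched
            return s.isEmpty() ? null : s;
        }
    }
    catch (IOException ioe) {
        logger.error("Can't get XML", ioe);
        return null;
    }
    catch (ParserConfigurationException pce) {
        logger.error("Can't get DocumentBuilder", pce);
        return null;
    }
    catch (SAXException se) {
        logger.error("Can't parse XML", se);
        return null;
    }
    catch (XPathExpressionException xpee) {
        logger.error("Can't parse XPath", xpee);
        return null;
    }
}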

It is used by calling getPlainSummary("http://de.wikipedia.org/wiki/Uma_Thurman");
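
The question also asked for the first x characters or y words. Since getPlainSummary returns plain text rather than markup, it can be shortened afterwards without the invalid-XHTML problem mentioned above. The following is only a sketch under that assumption; the class SummaryTruncator and the method firstWords are hypothetical names, not part of the original answer.

```java
import java.util.Arrays;

public class SummaryTruncator {
    // Returns at most maxWords whitespace-separated words of text,
    // appending "..." only when something was cut off.
    static String firstWords(String text, int maxWords) {
        if (text == null || maxWords <= 0) {
            return "";
        }
        String trimmed = text.trim();
        String[] words = trimmed.split("\\s+");
        if (words.length <= maxWords) {
            return trimmed;
        }
        return String.join(" ", Arrays.copyOfRange(words, 0, maxWords)) + "...";
    }

    public static void main(String[] args) {
        String summary = "Albert Einstein was a theoretical physicist.";
        System.out.println(firstWords(summary, 3));
    }
}
```

Cutting on word boundaries is a design choice for this sketch; a character limit could be applied the same way with substring.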
