How to scrape the first paragraph from a Wikipedia page?

Views: 112 Published: 2017/6/25 2:49:56 Tags: php dom xpath web-crawler


Question

Let's say I want to grab the first paragraph of this Wikipedia page. How do I get the main text between the title and the contents box using XPath or DOM with PHP, or something similar?

Is there a PHP library for that? I don't want to use the API because it's a bit complex.

Note: I just need this to add a widget under my pages that displays related info from Wikipedia.

Recommended answer

Use the following XPath expression:

/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]

Here the prefix h: is bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").
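The expression selects the h1 plus every node that both follows the h1 and precedes the contents table; the predicate count(. | $set) = count($set) is true exactly when the current node is already a member of $set. As a rough illustration of that logic only, here is a Python sketch using the standard library's xml.etree.ElementTree, which does not support the following:: or preceding:: axes, so the two node sets are built by hand. The SAMPLE markup and the function name between_h1_and_toc are hypothetical, only the first h1 is considered, and only elements (not text or comment nodes) are handled.

```python
import xml.etree.ElementTree as ET

# The prefix "h" is arbitrary; the namespace URI is what must match.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical, heavily simplified stand-in for a Wikipedia article page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <h1>Example</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
      <table id="toc"><tr><td>Contents</td></tr></table>
      <p>After the contents box.</p>
    </div>
  </body>
</html>"""


def between_h1_and_toc(root):
    """Return the first h1 plus every element after it and before table#toc."""
    order = list(root.iter())  # all elements in document order
    parent = {child: p for p in order for child in p}
    h1 = root.find(".//h:h1", NS)
    toc = root.find(".//h:table[@id='toc']", NS)

    # Rough equivalent of following::* from the h1:
    # later in document order, excluding the h1's own descendants.
    h1_descendants = set(h1.iter()) - {h1}
    following = [e for e in order[order.index(h1) + 1:] if e not in h1_descendants]

    # Rough equivalent of preceding::* from the toc table:
    # earlier in document order, excluding the table's ancestors.
    ancestors = set()
    node = toc
    while node in parent:
        node = parent[node]
        ancestors.add(node)
    preceding = {e for e in order[:order.index(toc)] if e not in ancestors}

    # Same membership test as the XPath predicate: count(. | $set) = count($set)
    return [h1] + [e for e in following if len({e} | preceding) == len(preceding)]


root = ET.fromstring(SAMPLE)
print([el.text for el in between_h1_and_toc(root)])
# prints ['Example', 'First paragraph.', 'Second paragraph.']
```

The paragraph after the contents table is excluded because it is not in the preceding set, which is the same filtering the XPath predicate performs.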

This transformation shows that the wanted result is really produced:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml"
 >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/h:body//h:h1
  |
   /*/h:body//h:h1/following::node()
      [count(. | //h:table[@id='toc']
                  /preceding::node()
             )
      =
       count(//h:table[@id='toc']
                  /preceding::node()
             )
       ]
  "/>
 </xsl:template>
</xsl:stylesheet>

When run on the XHTML document of the Wikipedia article (you also need to define two entities, &nbsp; and &reg;, for this document), the wanted result is produced.
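The remark about entities matters because a plain XML parser rejects any named entity that has not been declared. A minimal sketch of one way to declare them, assuming an internal DTD subset is prepended to the document before parsing: the BODY fragment and the helper try_parse are hypothetical, and Python's standard xml.etree.ElementTree is used only to illustrate the behaviour, not as the tool the answer itself used.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring the two entities as character references.
DOCTYPE = '<!DOCTYPE html [<!ENTITY nbsp "&#160;"><!ENTITY reg "&#174;">]>'

# Hypothetical fragment standing in for the article markup.
BODY = '<p xmlns="http://www.w3.org/1999/xhtml">Wikipedia&reg;&nbsp;text</p>'


def try_parse(xml_text):
    """Return the parsed root element, or None if the XML parser rejects the input."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return None


undeclared = try_parse(BODY)           # None: the entities are not declared
declared = try_parse(DOCTYPE + BODY)   # parses: entities expand to U+00AE and U+00A0

print(undeclared is None)
print(repr(declared.text))
```

Only entities that actually occur in the document need a declaration; the two named in the answer are the ones it reports for that article.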
