How to scrape the first paragraph from a Wikipedia page?
Question
Let's say I want to grab the first paragraph in this Wikipedia page. How do I get the principal text between the title and the contents box using XPath or DOM & PHP or something similar?
Is there any PHP library for that? I don't want to use the API because it's a bit complex.
Note: I just need that to add a widget under my pages that displays related info from Wikipedia.
Recommended answer
Use the following XPath expression:
/*/h:body//h:h1
|
/*/h:body//h:h1/following::node()
[count(. | //h:table[@id='toc']
/preceding::node()
)
=
count(//h:table[@id='toc']
/preceding::node()
)
]
Here the prefix h: is bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").
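The predicate count(. | $set) = count($set) is true exactly when the context node already belongs to $set, so the expression keeps the h1 plus every node that both follows the h1 and precedes the table with id 'toc'. As an illustration only, the same selection can be sketched with Python's standard library. This is a sketch under assumptions, not the answer's method: the function name select_intro and the SAMPLE document are hypothetical, and ElementTree exposes only element nodes, so the text and comment nodes that node() would also match are not returned.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

def select_intro(root):
    """Return the first h1 plus the elements that follow it and precede table#toc."""
    h1 = root.find(f".//{{{XHTML}}}h1")
    toc = root.find(f".//{{{XHTML}}}table[@id='toc']")
    if h1 is None or toc is None:
        return []
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    # Ancestors of the toc table are not on its preceding:: axis.
    toc_ancestors = set()
    node = toc
    while node in parent:
        node = parent[node]
        toc_ancestors.add(node)
    # Descendants of h1 are not on its following:: axis.
    h1_subtree = set(h1.iter())
    result, seen_h1 = [h1], False
    for el in root.iter():  # iter() walks the tree in document order
        if el is toc:
            break
        if el is h1:
            seen_h1 = True
            continue
        if seen_h1 and el not in h1_subtree and el not in toc_ancestors:
            result.append(el)
    return result

# Hypothetical miniature page, used only to exercise the function.
SAMPLE = f"""<html xmlns="{XHTML}"><body>
<h1>Title <i>x</i></h1>
<div><p>First paragraph.</p><table id="toc"/><p>Later.</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)
tags = [el.tag.split("}")[1] for el in select_intro(root)]
print(tags)  # ['h1', 'p']
```

In the sample, the div is skipped because it is an ancestor of the toc table, and the i element is skipped because it is a descendant of the h1, which matches how the following:: and preceding:: axes behave.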
This transformation shows that the wanted result is really produced:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:h="http://www.w3.org/1999/xhtml"
>
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/h:body//h:h1
|
/*/h:body//h:h1/following::node()
[count(. | //h:table[@id='toc']
/preceding::node()
)
=
count(//h:table[@id='toc']
/preceding::node()
)
]
"/>
</xsl:template>
</xsl:stylesheet>
When run on the XHTML document of the Wikipedia article (you also need to define two entities, &nbsp; and &reg;, for this document), the wanted result is produced.
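One possible way to define the two entities is an internal DTD subset placed in front of the document element. The sketch below assumes the page has no DOCTYPE of its own (an existing one would have to be replaced rather than prepended to), and the page string is a hypothetical stand-in for the real article. It uses Python's xml.etree.ElementTree only to show that a parser then accepts the entity references.

```python
import xml.etree.ElementTree as ET

# Internal subset declaring the two entities the article uses.
DOCTYPE = """<!DOCTYPE html [
  <!ENTITY nbsp "&#160;">
  <!ENTITY reg  "&#174;">
]>
"""

# Hypothetical stand-in for the downloaded XHTML page.
page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>Wikipedia&reg;&nbsp;page</p>"
    "</body></html>"
)

# Without the DOCTYPE, parsing fails with an "undefined entity" error.
root = ET.fromstring(DOCTYPE + page)
text = "".join(root.itertext())
print(repr(text))
```

The parser expands &reg; to U+00AE and &nbsp; to U+00A0, so the XPath expression or the transformation can then be run on the parsed tree.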