Getting HTML elements via XPath in bash
Question
I was trying to parse a page (Kaggle Competitions) with xpath on macOS, as described in another SO question:
curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'
That just gets the href of a link in a table.
But instead of returning the value, xpath starts validating the .html file and returns errors like "undefined entity at line 89, column 13, byte 2964".
Since man xpath doesn't exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions relate to the xpath shipped with GNU distributions, not the one on macOS.
Is there a correct way of getting HTML elements via XPath in bash?
Recommended answer
Getting HTML elements via XPath in bash from an HTML file (that is not valid XML)
One possibility is to use xsltproc (I hope it is available on macOS). xsltproc has an option, --html, to accept HTML as input. With that approach, however, you need an XSLT stylesheet:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" />
<xsl:template match="/*">
<xsl:value-of select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
</xsl:template>
</xsl:stylesheet>
Notice that the XPath has changed: there is no tbody in the input file.
Call xsltproc:
xsltproc --html test.xsl competitions.html 2> /dev/null
Here the errors that xsltproc reports about the HTML are ignored (sent to /dev/null).
The output is: /c/R
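The point about tbody can be checked before editing the XPath. XPaths copied from a browser's inspector often include tbody because browsers add that element to the DOM even when the markup omits it, whereas the stylesheet here matches only what xsltproc finds after parsing the file itself. The following is a minimal sketch, assuming a plain text search for the tag is good enough; has_tbody is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is illustrative): succeeds if the file
# contains a literal <tbody tag, in any letter case.
has_tbody() {
    grep -qi '<tbody' "$1"
}

# Usage, with competitions.html being the file saved by curl in the question:
#   if has_tbody competitions.html; then
#       echo "markup contains tbody: keep /tbody in the XPath"
#   else
#       echo "markup has no tbody: drop /tbody from the XPath"
#   fi
```

This only inspects the raw markup; it does not prove how a particular parser builds its tree, so treat the result as a hint.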
To use a different XPath expression from the command line, you can use an XSLT template that contains a placeholder and substitute the expression for it.
For example, an XSLT template:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" />
<xsl:template match="/*">
<xsl:value-of select="__xpath__" />
</xsl:template>
</xsl:stylesheet>
Then use, for example, sed for the replacement:
sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
xsltproc --html test.xsl competitions.html 2> /dev/null
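The template-and-substitute steps above could be wrapped in a small bash function so that the XPath is passed as an argument. This is only a sketch under a few assumptions: make_stylesheet and html_xpath are hypothetical names, the stylesheet is generated from a here-document instead of a separate template file, and the expression is escaped so it can sit inside a double-quoted XML attribute. It relies only on the --html option already used in this answer.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not part of xsltproc.

# make_stylesheet EXPR: print a stylesheet that outputs the string value
# of the first node matched by EXPR.
make_stylesheet() {
    local expr
    # Escape characters that are special inside a double-quoted XML attribute.
    expr=$(printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/"/\&quot;/g')
    cat <<EOF
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" />
<xsl:template match="/*">
<xsl:value-of select="$expr" />
</xsl:template>
</xsl:stylesheet>
EOF
}

# html_xpath EXPR FILE: evaluate EXPR against an HTML file with xsltproc,
# discarding the parser's complaints about invalid markup.
html_xpath() {
    local xsl status
    xsl=$(mktemp) || return 1
    make_stylesheet "$1" > "$xsl"
    xsltproc --html "$xsl" "$2" 2> /dev/null
    status=$?
    rm -f "$xsl"
    return "$status"
}
```

With these definitions, the call from the answer becomes: html_xpath "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html. As with the stylesheet above, xsl:value-of returns only the first matching node.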