Getting HTML elements via XPath in bash

Question

I was trying to parse a page (Kaggle Competitions) with xpath on macOS, as described in another SO question:

curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'

That just gets the href of a link in a table.

But instead of returning the value, xpath starts validating the .html file and returns errors such as "undefined entity at line 89, column 13, byte 2964".

Since man xpath doesn't exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions rely on the xpath shipped with GNU distributions, not the one on macOS.

Is there a correct way of getting HTML elements via XPath in bash?

Recommended answer

Getting HTML elements via XPath in bash

from an HTML file (that is not valid XML)

One possibility is to use xsltproc (I hope it is available on macOS). xsltproc has an --html option that makes it accept HTML as input. With it, however, you need an XSLT stylesheet.

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 

  <xsl:template match="/*">
    <xsl:value-of  select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>

</xsl:stylesheet>

Notice that the XPath has changed: there is no tbody in the input file. Call xsltproc:

xsltproc --html  test.xsl competitions.html 2> /dev/null

Here xsltproc's complaints about errors in the HTML are ignored (sent to /dev/null).

The output is: /c/R
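
The stylesheet-plus-xsltproc steps above can be wrapped in a small shell function so that no stylesheet file has to be kept around. This is only a sketch: the names make_xsl, xml_attr_escape and html_xpath are hypothetical helpers, not part of xsltproc, and it assumes xsltproc, sed and mktemp are on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of xsltproc); a sketch, not a hardened tool.

# Escape &, < and " so the XPath can sit inside a double-quoted XML attribute.
xml_attr_escape() {
  printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/"/\&quot;/g'
}

# Print an XSLT 1.0 stylesheet that outputs the string value of the XPath in $1.
make_xsl() {
  local xpath
  xpath=$(xml_attr_escape "$1")
  cat <<EOF
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/*">
    <xsl:value-of select="$xpath"/>
  </xsl:template>
</xsl:stylesheet>
EOF
}

# Usage: html_xpath XPATH FILE.html
# Writes the stylesheet to a temporary file, runs xsltproc in HTML mode,
# and discards the HTML parser warnings, as in the command above.
html_xpath() {
  local xpath=$1 html=$2 xsl status
  xsl=$(mktemp "${TMPDIR:-/tmp}/xpath.XXXXXX") || return 1
  make_xsl "$xpath" > "$xsl"
  xsltproc --html "$xsl" "$html" 2>/dev/null
  status=$?
  rm -f "$xsl"
  return "$status"
}

# Example call (same expression as in the answer):
# html_xpath "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html
```

Note that xsl:value-of in XSLT 1.0 prints only the first matching node, so an expression that matches many links still yields a single value.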

To use a different XPath expression from the command line, you can use an XSLT template and replace the __xpath__ placeholder.

For example, the XSLT template:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of  select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

Then use (e.g.) sed for the replacement:

 sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
 xsltproc --html  test.xsl competitions.html 2> /dev/null
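
One caveat with the sed substitution: it uses a comma as the delimiter, so an XPath that itself contains a comma (for example contains(@class,'x')), an ampersand or a backslash would break or alter the replacement. A possible workaround is to escape those characters first. In the sketch below, fill_xpath is a hypothetical helper; the placeholder name is passed as an argument, so it should match whatever the template file uses, and it is assumed to contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a template on stdin, write the filled-in stylesheet to stdout.
# Usage: fill_xpath PLACEHOLDER XPATH < template > stylesheet
fill_xpath() {
  local placeholder=$1 xpath=$2 escaped
  # Backslash-escape the characters that are special on the replacement side
  # of s,...,...,: the backslash itself, the ',' delimiter and '&'.
  escaped=$(printf '%s\n' "$xpath" | sed -e 's/[\\,&]/\\&/g')
  sed -e "s,${placeholder},${escaped},g"
}

# Example call (file names as in the answer):
# fill_xpath '__xpath__' "//a[contains(@class,'x')]/@href" < test.xslt.tmpl > test.xsl
```

This only handles sed's own escaping. An XPath containing < or a double quote would still need XML escaping before it is placed in the select attribute.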
