Getting HTML elements via XPath in bash

Views: 106 | Published: 2021/4/14 20:51:09 | Tags: html xml bash xpath kaggle


Question

I was trying to parse a page (Kaggle Competitions) with xpath on macOS, as described in another SO question:
curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'

That just gets the href of a link in a table.
But instead of returning the value, xpath starts validating the .html file and returns errors such as undefined entity at line 89, column 13, byte 2964.
Since man xpath doesn't exist and xpath --help prints nothing, I'm stuck. Also, many similar solutions relate to the xpath from GNU distributions, not the one on macOS.
Is there a correct way of getting HTML elements via XPath in bash?
Recommended answer

Getting HTML elements via XPath in bash from an HTML file (that is not valid XML)

One possibility is to use xsltproc (I hope it is available on macOS). xsltproc has a --html option that makes it read the input as HTML. With this approach, however, you need an XSLT stylesheet:
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 

  <xsl:template match="/*">
    <xsl:value-of  select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>

</xsl:stylesheet>

Notice that the XPath has changed: there is no tbody in the input file.
Call xsltproc:
xsltproc --html  test.xsl competitions.html 2> /dev/null

Here the errors that xsltproc reports about the HTML are discarded (sent to /dev/null).
The output is: /c/R
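
The approach above can be tried on a small, self-contained input. The following is a minimal sketch, not the original poster's setup: the sample markup, the file names, and the `/c/first` and `/c/second` links are all made up for illustration. It assumes xsltproc is installed and uses `//tr[2]` (descendant axis) so the query does not depend on whether the parser adds a tbody.

```shell
#!/usr/bin/env bash
# Sketch: query HTML that is not well-formed XML with xsltproc --html.
# Every file name and the sample markup below are hypothetical.
workdir=$(mktemp -d)

# The &nbsp; entity and the unclosed <br> would make a strict XML parser fail.
cat > "$workdir/sample.html" <<'EOF'
<html><body>
<h1>Competitions&nbsp;list</h1><br>
<table id="competitions-table">
  <tr><td><a href="/c/first">First</a></td></tr>
  <tr><td><a href="/c/second">Second</a></td></tr>
</table>
</body></html>
EOF

# Stylesheet that prints the href of the link in the second table row as text.
cat > "$workdir/query.xsl" <<'EOF'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="//*[@id='competitions-table']//tr[2]/td[1]/a/@href" />
  </xsl:template>
</xsl:stylesheet>
EOF

href=""
if command -v xsltproc >/dev/null 2>&1; then
  # --html switches to the HTML parser; parser warnings go to /dev/null.
  href=$(xsltproc --html "$workdir/query.xsl" "$workdir/sample.html" 2>/dev/null)
  echo "$href"
else
  echo "xsltproc not found; install libxslt to run this sketch" >&2
fi
```

The temporary directory in `$workdir` is left in place so the generated files can be inspected; remove it with `rm -r "$workdir"` when done.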
To use a different XPath expression from the command line, you can use an XSLT template and replace the __xpath__ placeholder.
For example, the XSLT template (test.xslt.tmpl):
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

Then use (e.g.) sed for the replacement:
sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
xsltproc --html test.xsl competitions.html 2> /dev/null
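
The template-and-sed steps can be wrapped in a small helper so the XPath is passed as an argument. This is only a sketch under stated assumptions: the function names `make_stylesheet` and `html_xpath` are hypothetical, the XPath is assumed to contain no double quotes or `<` (it is pasted into a double-quoted XML attribute), and xsltproc is assumed to be installed. Unlike the one-liner above, it escapes `\`, `&`, and the `,` delimiter before handing the expression to sed, since those are special on the replacement side.

```shell
#!/usr/bin/env bash
# Sketch of a reusable wrapper; make_stylesheet and html_xpath are hypothetical names.

# Stylesheet template with a placeholder for the XPath expression.
tmpl=$(mktemp)
cat > "$tmpl" <<'EOF'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" />
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>
EOF

# make_stylesheet XPATH: print the template with the placeholder filled in.
make_stylesheet() {
  local escaped
  # Escape backslash, ampersand and the comma delimiter for sed's replacement.
  escaped=$(printf '%s' "$1" | sed -e 's/[\\&,]/\\&/g')
  sed -e "s,__xpath__,$escaped," "$tmpl"
}

# html_xpath XPATH FILE: evaluate XPATH against an HTML file and print the result.
html_xpath() {
  local xsl rc
  xsl=$(mktemp)
  make_stylesheet "$1" > "$xsl"
  xsltproc --html "$xsl" "$2" 2>/dev/null
  rc=$?
  rm -f "$xsl"
  return "$rc"
}
```

With the helper sourced, the query from the answer would be run as `html_xpath "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html`.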

