How to parse HTML using XPath with Saxon-HE on the command line?
Question
I use Saxon-HE 9.6, and it's great for playing with XPath 3 when parsing well-formed XML files.
But I would like to know how to combine expath-http-client (or any other working solution) with Saxon so that I can parse realLife©®™ (possibly broken) HTML. (Java is not my strongest skill.)
I searched Google for many hours without finding a working solution. I tried something like this:
xquery_file.xsl:
xquery version "1.0";
declare namespace http="http://expath.org/ns/http-client";
let $url := 'http://stackoverflow.com'
let $response := http:send-request(
<http:request href="{$url}" method="get"/>
) return
<echo-results>
{$response}
</echo-results>
Shell command taken from the README of expath-http-client-saxon-0.10.0:
saxon --repo /usr/share/java/expath/repo -xsl:sample/simple-get.xsl -it:main
or
saxon --repo /usr/share/java/expath/repo -xsl:xquery_file.xsl -it:main
Neither works. I get: Transformation failed: Unknown configuration property http://saxon.sf.net/feature/repo
What I ideally want in the end is to query a URL directly from the command line, using an XPath expression rather than an XQuery file (if possible). I'm pretty sure some XML/Java/XPath guru out there has the solution I'm looking for.
/usr/share/java/expath/repo contains:
/usr/share/java/expath/repo
├── expath-http-client-saxon-0.10.0
│   ├── cxan.xml
│   ├── expath-http-client-saxon
│   │   ├── jar
│   │   │   ├── expath-http-client-java.jar
│   │   │   └── expath-http-client-saxon.jar
│   │   ├── lib
│   │   │   ├── apache-mime4j-0.6.jar
│   │   │   ├── commons-codec-1.4.jar
│   │   │   ├── commons-logging-1.1.1.jar
│   │   │   ├── httpclient-4.0.1.jar
│   │   │   ├── httpcore-4.0.1.jar
│   │   │   └── tagsoup-1.2.jar
│   │   ├── xq
│   │   │   └── expath-http-client-saxon.xq
│   │   └── xsl
│   │       └── expath-http-client-saxon.xsl
│   ├── expath-pkg.xml
│   └── saxon.xml
└── hello-1.1
    ├── expath-pkg.xml
    └── hello
        ├── hello.xq
        └── hello.xsl
My best attempt (Linux-based solution):
java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
net.sf.saxon.Query \
-x:org.ccil.cowan.tagsoup.Parser \
-s:myrealLife.html \
'-qs://*:body'
This works, but now I am trying to figure out how to set the default namespace so that I can query, for example, //a directly.
I have created a whole GitHub project based on this post; see https://github.com/sputnick-dev/saxon-lint
Recommended answer
I don't think you need an HTTP client for this. You can read the file using the doc() function, or supply it as the primary input document, provided you configure Saxon to parse it with an HTML SAX parser rather than an XML parser. If you put John Cowan's TagSoup on the classpath, then invoking Saxon with
-x:org.ccil.cowan.tagsoup.Parser -s:myrealLife.html
should do the trick.
I think you can also use validator.nu, which is rather more up to date with HTML5 than TagSoup, but I haven't tried it myself.
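On the follow-up about querying //a without the *: wildcard: one possible approach, sketched here rather than taken from the answer, is to prepend an XQuery default element namespace declaration to the query string. This assumes TagSoup puts the parsed elements in the XHTML namespace (which would explain why //*:body matches and a bare //body would not). The function name html_xpath and the JAVA override variable are hypothetical; the jar paths are the ones from the question.

```shell
# Hypothetical helper: run an XPath/XQuery expression against an HTML file.
# Assumes tagsoup-1.2.jar and saxon9he.jar are in the current directory,
# as in the question's own command line.
html_xpath() {
    file=$1
    expr=$2
    # Assumption: TagSoup emits elements in the XHTML namespace, so binding it
    # as the default element namespace lets unprefixed names such as //a match.
    query="declare default element namespace \"http://www.w3.org/1999/xhtml\"; $expr"
    # JAVA can be overridden (for example JAVA=echo) to print the command
    # instead of running it.
    ${JAVA:-java} -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
        net.sf.saxon.Query \
        -x:org.ccil.cowan.tagsoup.Parser \
        -s:"$file" \
        -qs:"$query"
}
```

With the function defined, html_xpath myrealLife.html '//a' would run the same command as in the question, with the namespace declaration added in front of the expression.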