How to parse HTML using XPath with Saxon-HE on the command line?


Question

I use Saxon-HE 9.6, and it is great for experimenting with XPath 3 when parsing well-formed XML files.

But I would like to know how to combine expath-http-client (or any other working solution) with Saxon so that I can parse realLife©®™ (possibly broken) HTML. (Java is not my strongest skill.)

I searched Google for many hours without finding a working solution. I tried something like this:

xquery_file.xsl:

xquery version "1.0";

declare namespace http="http://expath.org/ns/http-client";

let $url := 'http://stackoverflow.com'
let $response := http:send-request(
   <http:request href="{$url}" method="get"/>
) return
    <echo-results>
        {$response}
    </echo-results>

Shell command taken from the README of expath-http-client-saxon-0.10.0:

saxon --repo /usr/share/java/expath/repo -xsl:sample/simple-get.xsl -it:main

saxon --repo /usr/share/java/expath/repo -xsl:xquery_file.xsl -it:main

No success. I get: Transformation failed: Unknown configuration property http://saxon.sf.net/feature/repo

What I ideally want to do in the end is query a URL directly from the command line, without an XQuery file, using just an XPath expression (if possible). I'm pretty sure some XML/Java/XPath guru out there has the solution I'm looking for.

/usr/share/java/expath/repo contains:

/usr/share/java/expath/repo
├── expath-http-client-saxon-0.10.0
│   ├── cxan.xml
│   ├── expath-http-client-saxon
│   │   ├── jar
│   │   │   ├── expath-http-client-java.jar
│   │   │   └── expath-http-client-saxon.jar
│   │   ├── lib
│   │   │   ├── apache-mime4j-0.6.jar
│   │   │   ├── commons-codec-1.4.jar
│   │   │   ├── commons-logging-1.1.1.jar
│   │   │   ├── httpclient-4.0.1.jar
│   │   │   ├── httpcore-4.0.1.jar
│   │   │   └── tagsoup-1.2.jar
│   │   ├── xq
│   │   │   └── expath-http-client-saxon.xq
│   │   └── xsl
│   │       └── expath-http-client-saxon.xsl
│   ├── expath-pkg.xml
│   └── saxon.xml
└── hello-1.1
    ├── expath-pkg.xml
    └── hello
        ├── hello.xq
        └── hello.xsl

My best attempt (Linux-based solution)

java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
    net.sf.saxon.Query \
   -x:org.ccil.cowan.tagsoup.Parser \
   -s:myrealLife.html \
   -qs://*:body

This works, but now I am trying to figure out how to set the default namespace so that I can query directly with, for example, //a

I have created a whole GitHub project based on this post; see https://github.com/sputnick-dev/saxon-lint

Recommended answer

I don't think you need any HTTP client for this. You can read the file using the doc() function, or supply it as the primary input document, provided you configure Saxon to parse it with an HTML SAX parser rather than an XML parser. If you put John Cowan's TagSoup on the classpath, then invoking Saxon with

-x:org.ccil.cowan.tagsoup.Parser -s:myrealLife.html

should do the trick.
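To illustrate the doc() route mentioned above, here is a minimal sketch rather than a tested recipe. It assumes the same two jars as in the question's command sit in the current directory, and that the -x parser is also applied to documents loaded through doc(), which is how I understand Saxon's -x option. The function names doc_query and html_doc_xpath are hypothetical helpers made up for this example.

```shell
# Hypothetical helper: build an XPath expression that loads a document
# through doc() and then applies a path to it.
# Note: the URI must not contain a single quote.
doc_query() {
    printf "doc('%s')%s" "$1" "$2"
}

# Hypothetical helper: html_doc_xpath URI PATH
# Assumes tagsoup-1.2.jar and saxon9he.jar are in the current directory,
# as in the question's own command.
html_doc_xpath() {
    java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
        net.sf.saxon.Query \
        -x:org.ccil.cowan.tagsoup.Parser \
        -qs:"$(doc_query "$1" "$2")"
}

# Usage (not run here):
#   html_doc_xpath 'http://stackoverflow.com' '//*:title'
```

Whether a given URL can actually be fetched this way depends on the JVM's URL handling (redirects, HTTPS, and so on), so treat the usage line as an illustration only.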

I think you can also use validator.nu, which is rather more up to date with HTML5 than TagSoup, but I haven't tried it myself.
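On the follow-up about querying //a directly: one possible approach is to put an XQuery prolog declaration in front of the expression passed to -qs. The sketch below assumes that TagSoup reports elements in the XHTML namespace http://www.w3.org/1999/xhtml (the fact that //*:body matches suggests the elements are namespaced) and that -qs accepts a full query including a prolog. The function names xhtml_query and html_xpath are hypothetical helpers for this example.

```shell
# Hypothetical helper: prefix an expression with a default element
# namespace declaration, so unprefixed names such as "a" match XHTML elements.
# Assumption: the parsed elements are in the XHTML namespace.
xhtml_query() {
    printf 'declare default element namespace "%s"; %s' \
        "http://www.w3.org/1999/xhtml" "$1"
}

# Hypothetical helper: html_xpath FILE XPATH
# Assumes tagsoup-1.2.jar and saxon9he.jar are in the current directory,
# as in the question's own command.
html_xpath() {
    java -classpath "./tagsoup-1.2.jar:./saxon9he.jar" \
        net.sf.saxon.Query \
        -x:org.ccil.cowan.tagsoup.Parser \
        -s:"$1" \
        -qs:"$(xhtml_query "$2")"
}

# Usage (not run here):
#   html_xpath myrealLife.html '//a'
```

Quoting the expression in single quotes on the command line also keeps the shell from interpreting characters such as * before Saxon sees them.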
