How to extract an XPath from an HTML page with the BaseX command line


Question

I would like to extract the element matched by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it with a single command line, using one of the better parsers such as BaseX or Saxon-PE.

So far, the shortest solution that I (seem to) have found is these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"

but all it returns is an empty line instead of the block of HTML code I expected.

I have two questions:

Solution

There are two problems with your query:

  1. Tagsoup adds namespaces

    Either register the namespace (it seems reasonable to declare the default namespace as you're probably only dealing with XHTML):

    basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
    

    or use * as namespace indicator for each element:

    basex -ipage.xhtml "//*:div[@id='ps-content']"
    

  2. XML/XQuery is case sensitive

    I already corrected it in my queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) already yield the expected result.
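
Both points can be reproduced without BaseX. The following is a minimal sketch in Python's standard-library `xml.etree.ElementTree`, which is not BaseX and supports only a subset of XPath; it merely follows the same name-matching rules. The `XHTML` string is a hypothetical stand-in for what TagSoup writes to page.xhtml, not the actual Amazon page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for TagSoup output: lowercase element names
# placed in the XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="ps-content"><p>Book description</p></div>
  </body>
</html>"""

root = ET.fromstring(XHTML)
ns = {"x": "http://www.w3.org/1999/xhtml"}

# No namespace in the path: the namespaced <div> is not matched.
no_namespace = root.findall(".//div[@id='ps-content']")

# Namespace bound to a prefix, but uppercase name: names are case sensitive.
wrong_case = root.findall(".//x:DIV[@id='ps-content']", ns)

# Namespace bound to a prefix and lowercase name: the element is found.
with_prefix = root.findall(".//x:div[@id='ps-content']", ns)

# Namespace wildcard, similar in spirit to *:div (requires Python 3.8+).
with_wildcard = root.findall(".//{*}div[@id='ps-content']")

print(len(no_namespace), len(wrong_case), len(with_prefix), len(with_wildcard))
```

Only the last two lookups return the element, which mirrors why the original //DIV query came back empty and why the two corrected queries in (1) work.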


Tagsoup can also be used from within BaseX, so you do not have to call it separately for HTML input. Make sure tagsoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

You can even query the HTML page directly from BaseX if you want to:

basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'

Using -i together with tagsoup didn't work for me, but you can use doc(...) instead.
