How to extract an XPath from an HTML page with the BaseX command line

Published: 2021/10/1 18:39:12 | Tags: xml xpath xhtml basex

Problem description

I would like to extract the element selected by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it with a single command line, using one of the best parsers, such as BaseX or Saxon-PE.

So far, the shortest solution I (seem to) have found is these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"

but all it returns is an empty line instead of the block of HTML code I expected.

I have two questions:

1. What is wrong with my command lines? Why don't they return the block of HTML code selected by my XPath?
2. Since BaseX has embedded TagSoup support (see http://docs.basex.org/wiki/Parsers#HTML_Parser), how can I combine my two lines into a single line?

Solution

There are two problems with your query:

1. TagSoup adds namespaces

Every element in TagSoup's output is in the XHTML namespace, so an unqualified name such as div matches nothing. Either declare the namespace (a default element namespace seems reasonable, as you are probably only dealing with XHTML):
```
basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
```
or use * as a namespace wildcard on each element:
```
basex -ipage.xhtml "//*:div[@id='ps-content']"
```
2. XML/XQuery is case-sensitive

I already corrected this in the queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) yield the expected result.
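
Both effects can be reproduced without BaseX. The following is only an illustrative sketch using Python's standard `xml.etree.ElementTree` in place of BaseX; the `PAGE` string is a made-up stand-in for TagSoup's output (lowercase element names in the XHTML namespace), not the real Amazon page, and the `count` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for TagSoup's output: lowercase element names,
# all placed in the XHTML namespace.
PAGE = f"""<html xmlns="{XHTML_NS}">
  <body>
    <div id="ps-content"><p>Product description</p></div>
  </body>
</html>"""

root = ET.fromstring(PAGE)


def count(path, namespaces=None):
    """Return how many elements the path selects below the root."""
    return len(root.findall(path, namespaces))


# Wrong case and no namespace: matches nothing (the situation in the question).
upper_no_ns = count(".//DIV[@id='ps-content']")
# Right case but still no namespace: matches nothing either.
lower_no_ns = count(".//div[@id='ps-content']")
# Namespace bound to a prefix: matches.
lower_prefixed = count(".//x:div[@id='ps-content']", {"x": XHTML_NS})
# Namespace wildcard, analogous to *:div in XQuery (Python 3.8+): matches.
lower_wildcard = count(".//{*}div[@id='ps-content']")
# Namespace wildcard but wrong case: names are case-sensitive, so no match.
upper_wildcard = count(".//{*}DIV[@id='ps-content']")

print(upper_no_ns, lower_no_ns, lower_prefixed, lower_wildcard, upper_wildcard)
```

Only the paths that use both the lowercase name and the namespace (by prefix or by wildcard) select the element, which mirrors why the original BaseX query printed an empty line.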

TagSoup can be used from within BaseX, so you do not have to call it separately for HTML input. Make sure TagSoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

You can even query the HTML page directly from BaseX if you want to:

basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'

Using -i together with TagSoup did not work for me, but you can use doc(...) instead.
