How to extract an XPath from an HTML page with the BaseX command line
Problem description
I would like to extract the XPath //DIV[@id="ps-content"] from this web page (saved as a local file): http://www.amazon.com/dp/1449319432
I would like to do it in a single command line, using one of the better parsers such as BaseX or Saxon-PE.
So far the shortest solution I (seem to have) found takes these two lines:
java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"
but all it returns is an empty line instead of the block of HTML code I expected.
I have two questions:
- What is wrong with my command lines? Why don't they return the block of HTML code selected by my XPath?
- Since BaseX has TagSoup support built in (see http://docs.basex.org/wiki/Parsers#HTML_Parser), how can I merge my two lines into a single one?
There are two problems with your query:
1. TagSoup adds namespaces
Either register the namespace (declaring it as the default element namespace seems reasonable, as you're probably only dealing with XHTML):
basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
or use * as a namespace wildcard for each element:
basex -ipage.xhtml "//*:div[@id='ps-content']"
2. XML/XQuery is case sensitive
<div/> is not the same as <DIV/>. I already corrected this in my queries in (1), so both of them yield the expected result.
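The two problems can also be reproduced without BaseX. The following is a minimal sketch in Python's standard library, not part of the original answer: the XHTML string is a hypothetical stand-in for what TagSoup writes (lowercase element names in the XHTML namespace), and ElementTree's limited XPath subset stands in for a full XQuery processor. It assumes Python 3.8 or later for the {*} namespace wildcard.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for TagSoup output: lowercase tags, XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body><div id="ps-content"><p>Hello</p></div></body>
</html>"""

root = ET.fromstring(XHTML)
ns = {"x": "http://www.w3.org/1999/xhtml"}

# Uppercase name, no namespace: both problems at once, nothing matches.
no_ns_upper = root.findall(".//DIV[@id='ps-content']")

# Correct case, but the element lives in a namespace: still nothing.
no_ns_lower = root.findall(".//div[@id='ps-content']")

# Namespace bound to a prefix: the element is found.
with_ns = root.findall(".//x:div[@id='ps-content']", ns)

# Namespace wildcard, comparable to *:div in XQuery (Python 3.8+).
wildcard = root.findall(".//{*}div[@id='ps-content']")

print(len(no_ns_upper), len(no_ns_lower), len(with_ns), len(wildcard))
```

Only the last two lookups return the element, which mirrors why the original //DIV query printed an empty line.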
TagSoup can be used from within BaseX, so you do not have to call it separately for HTML input. Make sure TagSoup is on your default Java classpath, e.g. by installing libtagsoup-java in Debian.
basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
You can even query the HTML page directly from BaseX if you want to:
basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'
Using -i didn't work for me with TagSoup, but you can use doc(...) instead.