How to extract an XPath from an HTML page with the Saxon-PE command line
Question
I would like to extract the element matched by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).
I would like to do it with a single command line, using one of the better parsers such as Saxon-PE or BaseX.
So far, the shortest solution I have found (or so it seemed) takes these two lines:
java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//DIV[@id='ps-content']"
but all it returns is the following, which is not the block of HTML I expected:
<?xml version="1.0" encoding="UTF-8"?>
I have two questions:
- What is wrong with my command lines? Why don't they return the block of HTML selected by my XPath?
- Since Saxon-PE has embedded TagSoup capability (see https://www.odesk.com/leaving-odesk?ref=http%253A%252F%252Fsaxonica.com%252Fdocumentation9.4-demo%252Fhtml%252Fextensions%252Ffunctions%252Fparse-html.html), how can I combine my two lines into a single one?
Recommended answer
I found the correct command line to run the query without TagSoup:
java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:"//*:div[@id='ps-content']"
Note that swapping the two kinds of quotes, as below, does not work (on Windows 7):
java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:'//*:div[@id="ps-content"]'
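A likely reason why //*:div matches where //DIV did not: XPath names are case-sensitive, and XHTML (including what TagSoup emits by default) uses lowercase element names in the http://www.w3.org/1999/xhtml namespace, so an unprefixed uppercase DIV selects nothing. The sketch below illustrates this with Python's standard library rather than Saxon. The sample document and the find_divs helper are hypothetical stand-ins. ElementTree supports only a subset of XPath and has no *:div syntax, so a namespace-qualified name plays that role here.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for a TagSoup-style result: lowercase element
# names, all placed in the XHTML namespace.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div id="ps-content"><p>Product description</p></div>'
    '<div id="other">Something else</div></body></html>'
)


def find_divs(xml_text, path):
    """Return the elements matched by an ElementTree path expression."""
    root = ET.fromstring(xml_text)
    return root.findall(path)


# Uppercase name, no namespace: matches nothing (like //DIV in the question).
no_match = find_divs(SAMPLE, ".//DIV[@id='ps-content']")

# Lowercase name but still no namespace: also matches nothing.
still_no_match = find_divs(SAMPLE, ".//div[@id='ps-content']")

# Lowercase name qualified with the XHTML namespace: matches the element.
match = find_divs(SAMPLE, ".//{%s}div[@id='ps-content']" % XHTML_NS)

print(len(no_match), len(still_no_match), len(match))  # prints: 0 0 1
```

With Saxon the same effect is obtained either with the *:div wildcard used above or by declaring the XHTML namespace in the query and using a prefixed name.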
Does anyone know how to add the TagSoup preprocessing step to the same command line?