How to extract an XPath from an HTML page with the Saxon-PE command line


Question

I would like to extract the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it in a single command line, using one of the best parsers, such as Saxon-PE or BaseX.

So far, the shortest solution I have (apparently) found uses these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//DIV[@id='ps-content']"

But all it returns is the following, which is not the block of HTML code I expected:

<?xml version="1.0" encoding="UTF-8"?>

My question is twofold:

  • What is wrong with my command lines? Why don't they return the block of HTML code selected by my XPath?
  • Since Saxon-PE has embedded TagSoup capability (see https://www.odesk.com/leaving-odesk?ref=http%253A%252F%252Fsaxonica.com%252Fdocumentation9.4-demo%252Fhtml%252Fextensions%252Ffunctions%252Fparse-html.html), how can I combine my two lines into a single one?

Recommended answer

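I found the correct command line to run the query without TagSoup. The name test has to be *:div rather than DIV, because XPath names are case-sensitive and the div elements of an XHTML document are lowercase and in the XHTML namespace: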

java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:"//*:div[@id='ps-content']"

Note that swapping the two kinds of quotes, as below, does not work (on Windows 7), since cmd.exe does not treat single quotes as quoting characters:

java -cp saxon9pe.jar net.sf.saxon.Query -s:"test.xhtm" -qs:'//*:div[@id="ps-content"]'

Does anyone know how to add the TagSoup preprocessing step to the same command line?
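Short of using Saxon's built-in HTML parsing, one simple way to get both steps into a single invocation is to chain the two commands already shown. The sketch below is only an illustration, not a tested Saxon recipe: it assumes a POSIX shell, that tagsoup-1.2.1.jar and saxon9pe.jar are in the current directory, and the helper name extract_div is hypothetical.

```shell
# Hypothetical helper: convert HTML to XHTML with TagSoup, then query it with Saxon.
# Assumes tagsoup-1.2.1.jar and saxon9pe.jar are in the current directory.
extract_div() {
  # page.html -> page.xhtml (strip the last extension, append .xhtml)
  xhtml="${1%.*}.xhtml"
  # The query runs only if the TagSoup conversion succeeded (&&).
  java -jar tagsoup-1.2.1.jar < "$1" > "$xhtml" &&
    java -cp saxon9pe.jar net.sf.saxon.Query -s:"$xhtml" -qs:"//*:div[@id='ps-content']"
}
```

It would be called as extract_div page.html. In cmd.exe, the same two commands can be joined on one line with && instead of wrapping them in a function.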
