Using xmllint and xpath with a less-than-perfect HTML document?

Question

I have an HTML page that is generated by an existing tool - I cannot change the output of this tool.

However, I want to use xmllint with the --xpath option to pick out a few specific pieces of information from the downloaded webpage. The problem is that the page starts with:

<html lang=en><head>...

xmllint raises an error almost immediately:

html.out:2: parser error : AttValue: " or ' expected
<html lang=en><head>
           ^

The cause is clearly the missing quotation marks around the value of the lang attribute. The same kind of problem appears sporadically throughout the page.

Nearly every browser can parse this just fine - how can I convince xmllint to do so as well? I would like to avoid having to inject an intermediate step to "fix" the file. Instead, I would like to either:

1) Find a flag, validation option, etc. that helps the parser along, or

2) Use some other tool. (But which? xmllint has always been my go-to for XPath queries on the command line.)

Further, using just the xpath command results in:

> xpath html.out '//myquery...'

not well-formed (invalid token) at line 2, column 11, ...

Recommended answer

You can enable the HTML parser in xmllint with the --html command-line option. It is more lenient than the default XML parser, so it accepts markup such as unquoted attribute values, and --xpath can then be used on the parsed document.
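As a rough sketch of how this might look in practice, the script below writes a small sample file with unquoted attribute values and queries it with --html and --xpath together. The sample markup, the two XPath expressions, and the temporary file are assumptions made up for illustration, not taken from the question; redirecting stderr is optional and only hides any warnings the HTML parser may print.

```shell
#!/bin/sh
# Hypothetical sample file imitating the unquoted-attribute markup from the question.
tmp=$(mktemp)
printf '%s\n' '<html lang=en><head><title>Demo</title></head><body><p class=note>hello</p></body></html>' > "$tmp"

if command -v xmllint >/dev/null 2>&1; then
    # --html selects libxml2's HTML parser instead of the strict XML parser.
    # 2>/dev/null discards parser warnings so only the XPath result is captured.
    title=$(xmllint --html --xpath 'string(//title)' "$tmp" 2>/dev/null)
    note=$(xmllint --html --xpath 'string(//p[@class="note"])' "$tmp" 2>/dev/null)
    echo "title: $title"
    echo "note: $note"
else
    echo "xmllint not found on this system" >&2
fi

rm -f "$tmp"
```

Wrapping the query in string(...) returns plain text rather than serialized nodes, which tends to be easier to use in shell pipelines. For the file from the question, the same pattern would be xmllint --html --xpath with the real query against html.out.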
