Using xmllint and xpath with a less-than-perfect HTML document?
Question
I have an HTML page that is generated by an existing tool - I cannot change the output of this tool.
However, I want to use xmllint with the --xpath option to pick out a few specific pieces of information from the downloaded webpage. The problem is that the page starts with:
<html lang=en><head>...
xmllint raises an error almost immediately:
html.out:2: parser error : AttValue: " or ' expected
<html lang=en><head>
^
The cause certainly seems to be the missing quotation marks around the value of the lang attribute. The page has this kind of issue throughout, though only sporadically.
Nearly every browser can parse this just fine. How can I convince xmllint to do so as well? I would like to avoid injecting an intermediate step to "fix" the file. Instead, I would like to either:
1) Find a flag, validation option, etc. that helps the parser along, or:
2) Use some other tool. (But which? xmllint is always my go-to for command line XPath queries.)
Further, using just xpath results in:
> xpath html.out '//myquery...'
not well-formed (invalid token) at line 2, column 11, ...
Recommended answer
You can enable the HTML parser in xmllint using the --html command line option. That way, you will be able to process HTML documents.
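As a minimal sketch of how this might look in practice, the script below writes a small sample file that reproduces the unquoted lang=en attribute and then queries it with --html and --xpath together. The sample content, the temporary file, and the string(//title) query are illustrative assumptions, not part of the original page. The 2>/dev/null redirect is an optional extra that hides any warnings the HTML parser may still print for other sloppy markup.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the unquoted attribute from the question.
sample=$(mktemp)
printf '%s\n' '<html lang=en><head><title>Demo</title></head><body><p>Hello</p></body></html>' > "$sample"

# --html switches xmllint to libxml2's lenient HTML parser, so the
# unquoted attribute value no longer aborts parsing.
# 2>/dev/null discards parser warnings; drop it while debugging a query.
title=$(xmllint --html --xpath 'string(//title)' "$sample" 2>/dev/null)
echo "$title"

# Clean up the temporary sample file.
rm -f "$sample"
```

For the real page, replace "$sample" with html.out and the query with the actual XPath expression.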