Using xmllint to parse an HTML file


Question

I am trying to extract the text between specific tags in various HTML files on a Mac. I am looking for the first <H1> heading in the body. Example:

<BODY>
<H1>Dublin</H1>

I believe using regular expressions for this is an anti-pattern, so I used xmllint and XPath instead.

xmllint --nowarning --xpath '/HTML/BODY/H1[0]'

The problem is that some of the HTML files contain badly formed tags, so I get errors along the lines of

 parser error : Opening and ending tag mismatch: UL line 261 and LI
</LI>

I can't simply append 2>/dev/null, because then I lose those files altogether. Is there any way to use an XPath expression here and tell xmllint to relax if the XML isn't perfect and just give me the text inside the first H1 heading?

Recommended answer

Try the --html option. Without it, xmllint parses your document as XML, which is a lot stricter than HTML. Also note that XPath indices are 1-based and that HTML tag names are converted to lowercase during parsing. The command

xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF

prints

<h1>Dublin</h1>
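
If only the heading text is wanted rather than the whole element, one possible variation is to wrap the path in the XPath string() function. The sketch below is a minimal illustration, not part of the original answer: the helper name first_h1 is hypothetical, and it assumes an xmllint build that supports --xpath. Because the HTML parser recovers from mismatched tags, redirecting stderr here should only hide the parser messages while the result on stdout is kept.

```shell
# Hypothetical helper: print the text of the first <h1> in the body.
# Reads the file given as $1, or stdin when no argument is passed.
first_h1() {
  # --html selects the lenient HTML parser, which recovers from bad markup.
  # string(...) yields the text content instead of the serialized element.
  # 2>/dev/null drops parser messages; stdout is left untouched.
  xmllint --html --xpath 'string(/html/body/h1[1])' "${1:--}" 2>/dev/null
}

# Sample input with a mismatched </LI>, similar to the error in the question.
printf '<BODY>\n<H1>Dublin</H1>\n<UL><LI>one</UL></LI>\n</BODY>\n' | first_h1
```

To process many files, the helper could be called in a loop, for example: for f in *.html; do printf '%s: %s\n' "$f" "$(first_h1 "$f")"; done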
