Using xmllint to parse an HTML file
Question
I was trying to extract the text between specific tags in various HTML files on a Mac. I was looking for the first <H1> heading in the body. Example:
<BODY>
<H1>Dublin</H1>
I believe using regular expressions for this is an anti-pattern, so I used xmllint and XPath instead.
xmllint --nowarning --xpath '/HTML/BODY/H1[0]'
The problem is that some of the HTML files contain badly formed tags, so I get errors along the lines of
parser error : Opening and ending tag mismatch: UL line 261 and LI
</LI>
The problem is that I can't just add 2>/dev/null, because then I lose those files altogether. Is there any way I can use an XPath expression here and tell xmllint to relax if the XML isn't perfect, and just give me the text inside the first H1 heading?
Recommended answer
Try the --html option. Otherwise, xmllint parses your document as XML, which is a lot stricter than HTML. Also note that XPath indices are 1-based and that HTML tags are converted to lowercase when parsing. The command
xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF
prints
<h1>Dublin</h1>
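To get only the text rather than the whole element, and to run the same query over many files, something like the following sketch may help. It assumes that wrapping the path in the XPath string() function is acceptable for your use case, and that it is safe to discard stderr once --html is in place, since the HTML parser still produces a result for malformed markup. The function name first_h1 is made up for illustration and is not part of xmllint.

```shell
#!/bin/sh
# first_h1 is a hypothetical helper name, not part of xmllint.
# Prints the text of the first <h1> in the body of the given HTML file.
first_h1() {
    # --html      : use the forgiving HTML parser instead of the XML parser.
    # string(...) : return the text content of the first matching node.
    # 2>/dev/null : discard parser complaints about malformed markup.
    xmllint --html --xpath 'string(/html/body/h1[1])' "$1" 2>/dev/null
}

# Usage: print "file<TAB>heading" for every file named on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(first_h1 "$f")"
done
```

Saved under a name of your choosing and invoked with a list of HTML files as arguments, this prints one line per file. A file with no H1 in its body yields an empty heading instead of an error, because string() of an empty node set is the empty string.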