xmllint to parse an HTML file

Views: 348 | Published: 2020/7/15 2:55:27 | Tags: bash, macos, xpath, xmllint

Problem description

I am trying to extract the text between specific tags from various HTML files on a Mac. I am looking for the first <H1> heading in the body. Example:

<BODY>
<H1>Dublin</H1>

I believe using regular expressions for this is an anti-pattern, so I used xmllint and XPath instead.

xmllint --nowarning --xpath '/HTML/BODY/H1[0]'

The problem is that some of the HTML files contain badly formed tags, so I get errors along the lines of

 parser error : Opening and ending tag mismatch: UL line 261 and LI
</LI>

I can't simply add 2>/dev/null, because then I lose those files altogether. Is there a way to keep using an XPath expression but tell xmllint to relax when the XML isn't perfect, and just give me the text of the first H1 heading?

Recommended answer

Try the --html option. Otherwise, xmllint parses your document as XML, which is a lot stricter than HTML. Also note that XPath indices are 1-based, and that HTML tag names are converted to lowercase when parsing. The command

xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF

prints

<h1>Dublin</h1>
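
To apply the same idea across many files, the command can be wrapped in a small shell function. The following is only a sketch under some assumptions: the helper name first_h1 and the *.html glob are made up for illustration, string((//h1)[1]) is swapped in for /html/body/h1[1] so that only the text of the first h1 in document order is printed rather than the whole element, and it relies on the HTML parser recovering from malformed markup so that 2>/dev/null only hides the diagnostics.

```shell
# Hypothetical helper: print the text of the first <h1> in an HTML file.
#   --html        parse with the lenient HTML parser instead of the XML parser
#   2>/dev/null   hide parser complaints about malformed markup; stdout is kept
first_h1() {
  xmllint --html --xpath 'string((//h1)[1])' "$1" 2>/dev/null
}

# Example: report the first heading of every .html file in the current directory.
for f in *.html; do
  [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
  printf '%s: %s\n' "$f" "$(first_h1 "$f")"
done
```

Wrapping the expression in string() yields the heading text without the surrounding tags, and the parentheses in (//h1)[1] pick the first h1 in the whole document rather than the first h1 under each parent.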
