libxml2 HTML chunk parsing

Question

I'm downloading HTML from a website. The file can be quite large, so while it is downloading I want to parse the chunks of HTML that are already available, so that the process appears faster to the end user of my program. I don't have control over how the chunks are generated, so a chunk can begin in the middle of a word, e.g. like so:

chunk 1 --->  <div class="storyti
chunk 2 --->  tle"><a href="htt
chunk 3 --->  p://www.xkcd.com/">XKCD</a>
...and so on.

I have seen an example where libxml2 was used to parse XML chunks exactly as I described. Can libxml2 also parse HTML chunks? I ran tidy on the HTML files I'm going to download; it reports warnings but no errors. Can libxml2 parse those HTML chunks as well?

Recommended answer

libxml2 has an HTML parser that supports malformed/broken HTML. Please check the link here.
