How to read and parse the html file?

Problem description

I have an HTML file and need to read it and access some of its values:

myHtml = 'toto.html';
readFile = fileread(myHtml);

Now, to parse the HTML file: do you know if it's possible to convert the HTML to XML and then use XPath?

Recommended answer

I would not recommend attempting to convert HTML to XML. They are different formats, and you are likely to get burned. HTML parsers exist, so we can use those directly.

Also, just for completeness, don't try to parse HTML with regex. There are Stack Overflow questions about parsing HTML in Matlab in which the answers recommend regex. Do innocent kittens a favor and tune them out.

Unfortunately, it doesn't look like Matlab has an HTML parser as part of its library.

Fortunately, you can leverage Java code with ease in Matlab!
With that, Java HTML parsers are fair game. Look into jsoup or jtidy. Poke around this question.
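
As a rough illustration of calling a Java HTML parser from Matlab, here is a minimal sketch using jsoup. It assumes the jsoup jar has been downloaded separately; the jar file name `jsoup.jar` and the CSS selector `a[href]` are placeholders for illustration, not something given in the question. Treat it as a starting point rather than a tested recipe for any particular Matlab or jsoup version.

```matlab
% Assumption: the jsoup jar sits in the current folder under this
% hypothetical name. Adjust the path to wherever the jar actually is.
javaaddpath('jsoup.jar');

myHtml = 'toto.html';              % file name taken from the question
readFile = fileread(myHtml);       % raw HTML as a char vector

% Parse the HTML string into a jsoup Document object.
parsed = org.jsoup.Jsoup.parse(readFile);

% Read the page title; char() makes sure we hold a Matlab char array.
pageTitle = char(parsed.title());
fprintf('Title: %s\n', pageTitle);

% Select elements with a CSS selector. 'a[href]' (all links that have
% an href attribute) is only an example; use whatever matches the
% values you need.
links = parsed.select('a[href]');
for k = 0:links.size()-1           % Java collections are zero-based
    link = links.get(k);
    fprintf('%s -> %s\n', char(link.text()), char(link.attr('href')));
end
```

jsoup selects with CSS-style selectors rather than XPath, so the HTML never has to be converted to XML first.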

Actually, looking at that question, plus the Comparison of HTML parsers Wikipedia article (thanks @Daniel R!), it looks like HTMLCleaner or JTidy might clean HTML to XML. Again, I wouldn't bother and would simply parse the HTML directly.
