How to extract data from a raw HTML file?

Question

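Is there a way to extract the data I need from badly written raw HTML that has no IDs or classes? For example, say I have an HTML file of a saved web page (a profile) and I want to extract data such as "hobbies". Can this be done with PHP?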

Answer

Use regex! I kid, I kid. If you know the state of the page, and the format is guaranteed to remain similar enough, then you can try writing a manual parser. Alternatively, there are a lot of libraries out there that will parse HTML for you. I'm not familiar enough with PHP to recommend one, but I'm sure some Googling could take you a long way. I've had luck with John Resig's pure JavaScript HTML parser before.
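To make the "manual parser" idea concrete, here is a minimal sketch in Python (chosen only for illustration; the question asks about PHP, where the same idea would need a lenient HTML parser of its own). It leans on the standard library's `html.parser`, which tolerates unclosed and mismatched tags, and ignores the markup entirely: it collects the visible text nodes in order and returns whatever follows a known label. The names `TextCollector`, `value_after_label`, and `SAMPLE` are hypothetical, and the whole approach assumes the label text (for example "Hobbies") appears literally on the page with its value right after it.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip:
            self.chunks.append(text)


def value_after_label(html, label):
    """Return the text that follows `label`, or None if the label is absent.

    Assumption: the value is either in the same text node after a colon
    ("Hobbies: chess") or in the very next text node.
    """
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    wanted = label.lower().rstrip(":")
    for i, chunk in enumerate(parser.chunks):
        head, _, tail = chunk.partition(":")
        if head.strip().lower() == wanted:
            if tail.strip():
                return tail.strip()
            if i + 1 < len(parser.chunks):
                return parser.chunks[i + 1]
    return None


# Hypothetical saved profile: no ids, no classes, unclosed <td>/<tr> tags.
SAMPLE = (
    "<table><tr><td><b>Name</b><td>Ann"
    "<tr><td><B>Hobbies:</b><td> chess,\n hiking </table>"
)

print(value_after_label(SAMPLE, "Hobbies"))  # chess, hiking
```

This only holds up while the page keeps the same label wording and ordering, which is exactly the caveat above: if the format drifts, the lookup silently returns the wrong neighbour or nothing at all.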

At the end of the day, if you need semantic information from an HTML page that isn't constructed semantically, you're probably doomed programmatically, and your best bet may be a mechanical turk.
