How to extract data from a raw HTML file?
Question
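Is there a way to extract the data I need from raw, poorly written HTML that has no IDs or classes? I mean, suppose there is an HTML file of a saved web page (a profile), and I want to extract data such as "hobbies". Is it possible to do this with PHP?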
Solution
Use regex! I kid, I kid. If you know how the page is laid out, and the format is guaranteed to stay similar enough, then you can try writing a manual parser. Alternatively, there are a lot of libraries out there that will parse HTML for you. I'm not familiar enough with PHP to recommend one, but I'm sure some Googling could take you a long way. I've had luck with John Resig's pure JavaScript HTML parser before.
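To make the "manual parser" idea a bit more concrete, here is a rough sketch in Python (not PHP, and not something from the original answer) using only the standard library's html.parser, which tolerates unclosed tags. It assumes the page shows a visible label such as "Hobbies:" followed by the value, either in the same text run or in the next one. The names TextCollector and extract_after_label and the sample markup are made up for illustration; the same approach should carry over to whatever HTML parsing library you end up using.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the visible text chunks of a page in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_after_label(html, label):
    """Return the text that follows a visible label, or None if not found.

    Assumption: the value is either on the same text run as the label
    ("Hobbies: chess") or in the next non-empty text run after it.
    """
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser

    wanted = label.strip().rstrip(":").lower()
    for i, chunk in enumerate(parser.chunks):
        head, _, tail = chunk.partition(":")
        if head.strip().lower() != wanted:
            continue
        if tail.strip():
            return tail.strip()
        if i + 1 < len(parser.chunks):
            return parser.chunks[i + 1]
    return None


# Hypothetical sloppy markup: no IDs, no classes, unclosed <td> and <tr>.
sample = """
<table><tr><td><b>Name:</b><td>Alice
<tr><td><b>Hobbies:</b><td>chess, hiking
</table>
"""
print(extract_after_label(sample, "Hobbies"))
```

This only works as long as the label text and its position relative to the value stay stable, which is exactly the "format is guaranteed to stay similar enough" caveat above.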
At the end of the day, if you need semantic information from an HTML page that isn't constructed semantically, you're probably doomed programmatically, and your best bet may be a Mechanical Turk.