How to parse an HTML page including hidden tags
Question
I'm trying to parse some web pages for future use. To do this, I've used several modules, such as urllib, lxml, BeautifulSoup, and HTMLParser.
I had no problems parsing web pages until I ran into hidden tags.
When I opened the page in Chrome and used the developer tools to inspect its elements, I could see the <embed> part of the code:
<embed type="..." src="..." ID="..." >
I could simply copy and paste it manually.
I need to parse the ID from this hidden tag. Why can't I parse this part of the site using Python? Is there any way to parse these hidden parts?
I know that server-side code such as PHP and ASP is not visible in the HTML source, but I don't think that is the case here.
Recommended answer
This "hidden" code is probably generated by JavaScript at runtime, which is why it shows up in the browser's developer tools but not in the HTML that Python downloads.
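One way to check this guess is to compare the HTML the server sends with the markup copied from the developer tools. The sketch below uses only the standard-library `html.parser`. The names `EmbedFinder` and `find_embeds` and both sample strings are made up for illustration; on a real page the static HTML would come from something like `urllib.request.urlopen(url).read()`.

```python
from html.parser import HTMLParser


class EmbedFinder(HTMLParser):
    """Collects the attributes of every <embed> tag it sees."""

    def __init__(self):
        super().__init__()
        self.embeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so ID becomes "id".
        if tag == "embed":
            self.embeds.append(dict(attrs))


def find_embeds(html):
    """Return a list of attribute dicts, one per <embed> tag in html."""
    finder = EmbedFinder()
    finder.feed(html)
    return finder.embeds


# Hypothetical sample: what the server might send before any script runs.
static_html = (
    '<html><body><div id="player"></div>'
    '<script src="/js/player.js"></script></body></html>'
)

# Hypothetical sample: what the developer tools might show after scripts ran.
rendered_html = (
    '<html><body><div id="player">'
    '<embed type="application/x-shockwave-flash" src="/media/clip.swf" ID="clip42">'
    '</div></body></html>'
)

print(find_embeds(static_html))             # []
print(find_embeds(rendered_html)[0]["id"])  # clip42
```

If the first call returns an empty list for the downloaded page while the tag is visible in the browser, the tag was most likely added by a script after the page loaded.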
You might have better luck finding out how the JavaScript works and where it gets its data (the URLs) than trying to have something run the script and then parse the resulting DOM tree.
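As a starting point for finding those URLs, the sketch below lists the external script files a page loads and any quoted URL-like strings inside its inline scripts. It is a rough heuristic under stated assumptions, not a JavaScript parser: the class name `ScriptCollector`, the regular expression, and the sample page are all invented for illustration, and URLs that a script assembles from pieces will not be found this way.

```python
import re
from html.parser import HTMLParser

# Heuristic: quoted strings that look like absolute, protocol-relative,
# or root-relative URLs.
URL_PATTERN = re.compile(r'''["']((?:https?:)?//[^"'\s]+|/[^"'\s]+)["']''')


class ScriptCollector(HTMLParser):
    """Collects script src attributes and URL-like strings in inline scripts."""

    def __init__(self):
        super().__init__()
        self.external = []     # values of <script src="...">
        self.inline_urls = []  # URL-like strings found in inline script bodies
        self._in_inline_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external.append(src)
            else:
                self._in_inline_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline_script = False

    def handle_data(self, data):
        if self._in_inline_script:
            self.inline_urls.extend(URL_PATTERN.findall(data))


# Hypothetical page used only to demonstrate the collector.
sample_html = (
    '<html><head><script src="/static/player.js"></script></head>'
    '<body><script>var cfg = {feed: "/api/media?id=42"}; load(cfg);</script>'
    '</body></html>'
)

collector = ScriptCollector()
collector.feed(sample_html)
print(collector.external)     # ['/static/player.js']
print(collector.inline_urls)  # ['/api/media?id=42']
```

Each candidate URL can then be fetched with `urllib.request.urlopen` and inspected for the value that ends up in the <embed> tag. The Network tab of the browser's developer tools shows the same requests and is often the quicker way to confirm which one carries the data.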