Get web page content (not from source code)

Problem description

When I inspect the page in the browser's developer tools, I can see the data. However, when I view the page source, I cannot find it.

I am using urllib2 and BeautifulSoup from bs4.

This is my code:

import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"

# Download the raw HTML exactly as the server sends it (no JavaScript is run)
r = urllib2.urlopen(link)
soup = BeautifulSoup(r, "html.parser")
print(soup.find_all("td", class_="td1_normal_class"))
# I also tried this one
# print(soup.find_all("div", class_="dataTable"))

I got an empty list.

My question is: how can I get the page content as rendered in the browser, rather than the page source code?

Recommended answer

If you cannot find the div in the source, it means that the div you are looking for is generated after the page loads. The page could be using a JS framework like Angular, or just jQuery. If you want to browse through the rendered HTML, you have to use a browser that runs the JS code included in the page.
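To make the difference between "view source" and "inspect" concrete, here is a minimal sketch in Python 3 using only the standard library. The two HTML strings are hypothetical stand-ins (the real page's markup and values are not reproduced here); only the class names dataTable and td1_normal_class come from the question. CellCollector and find_cells are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> that carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.class_name in classes:
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def find_cells(html, class_name):
    """Return the text of all matching <td> cells found in `html`."""
    parser = CellCollector(class_name)
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical markup: what the server sends ("view source").
# The table exists only as a string inside a script, not as real elements.
STATIC_SOURCE = """
<html><body>
<div class="dataTable"></div>
<script>
  document.querySelector(".dataTable").innerHTML =
      '<table><tr><td class="td1_normal_class">12.3</td></tr></table>';
</script>
</body></html>
"""

# Hypothetical markup: the DOM after a browser has run the script ("inspect").
RENDERED_DOM = """
<html><body>
<div class="dataTable"><table><tr><td class="td1_normal_class">12.3</td></tr></table></div>
</body></html>
"""

print(find_cells(STATIC_SOURCE, "td1_normal_class"))  # []
print(find_cells(RENDERED_DOM, "td1_normal_class"))   # ['12.3']
```

A parser that does not execute JavaScript treats the script body as plain text, so the cells are only found once a browser has run the script and produced the second form.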

Try using Selenium.

See also: How to parse a website using Selenium and BeautifulSoup in Python?

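from bs4 import BeautifulSoup
from selenium import webdriver

# Start a real Firefox instance so that the page's JavaScript is executed
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')

# page_source returns the DOM as the browser currently has it
html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("td", class_="td1_normal_class"))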

Note, however, that using Selenium slows the process down considerably, since it has to start a full browser for every run (webdriver.Firefox() opens a regular Firefox window unless headless mode is configured).
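If the visible window is unwanted, or the table has not been filled in yet when page_source is read, one option is to run Firefox headless and wait explicitly for the cells. The following is a sketch rather than a tested recipe: it assumes Python 3, Selenium 4, and geckodriver available on the PATH. build_url and fetch_rendered_html are hypothetical helper names, the 10-second timeout is arbitrary, and reading y and m as year and month is inferred from the link in the question.

```python
BASE_URL = "http://www.hko.gov.hk/cis/dailyExtract_e.htm"


def build_url(year, month):
    """Build the page URL (assumes y = year and m = month, as in the question)."""
    return "{}?y={}&m={}".format(BASE_URL, year, month)


def fetch_rendered_html(url, css_selector, timeout=10):
    """Return the page source once an element matching `css_selector` exists."""
    # Imported here so this module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Raises TimeoutException if no matching element appears in time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even after an error
```

The string returned by fetch_rendered_html(build_url(2015, 1), "td.td1_normal_class") can then be passed to BeautifulSoup(html, "html.parser"), and the same find_all call as above runs against the rendered DOM. Headless mode removes the window but still starts a browser process, so the startup cost largely remains.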
