Get web page content (not from source code)
Question
I want to get data from a web page. When I am in inspect mode, I can see the data. However, when I view the page source, I cannot find it.
I am using urllib2 and BeautifulSoup from bs4.
This is my code:
import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
r = urllib2.urlopen(link)
soup = BeautifulSoup(r)
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print soup.find_all("div", class_="dataTable")
I got an empty list.
My question is: how can I get the page content as displayed, rather than the page source code?
Recommended answer
If you cannot find the div in the source, the div you are looking for is generated after the page loads. The page could be using a JS framework such as Angular, or just jQuery. To browse the rendered HTML, you have to use a browser that runs the JS code included in the page.
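To make the difference concrete, here is a minimal sketch that uses only Python's standard library (so it is not the asker's urllib2/BeautifulSoup setup). `RAW_SOURCE` and `RENDERED_DOM` are made-up stand-ins, not the real HKO page: the first imitates what an HTTP client downloads, the second imitates what the inspector shows once the script has run. The helper name `count_cells` is hypothetical too.

```python
from html.parser import HTMLParser

# Made-up example of the HTML an HTTP client receives: the container is
# empty, and a script is responsible for creating the table cells later.
RAW_SOURCE = """
<html><body>
<div class="dataTable" id="t"></div>
<script>
  var td = document.createElement("td");
  td.className = "td1_normal_class";
  td.textContent = "12.3";
  document.getElementById("t").appendChild(td);
</script>
</body></html>
"""

# Made-up example of the DOM after a browser has executed the script.
RENDERED_DOM = """
<html><body>
<div class="dataTable" id="t"><table><tr>
<td class="td1_normal_class">12.3</td>
</tr></table></div>
</body></html>
"""


class CellCounter(HTMLParser):
    """Counts <td> start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and self.css_class in classes:
            self.count += 1


def count_cells(html, css_class="td1_normal_class"):
    """Parse html without running any script and count matching cells."""
    parser = CellCounter(css_class)
    parser.feed(html)
    parser.close()
    return parser.count


print(count_cells(RAW_SOURCE))    # 0: the script text is never executed
print(count_cells(RENDERED_DOM))  # 1: the cell exists only after rendering
```

A parser only sees the markup it is given. Script bodies are treated as plain text, which is why the search in the question comes back empty.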
Try using Selenium.
See also: How can I parse a website using Selenium and BeautifulSoup in Python?
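from bs4 import BeautifulSoup
from selenium import webdriver

# Start a real browser so that the page's JavaScript actually runs
driver = webdriver.Firefox()
try:
    driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')
    # page_source returns the page as the browser currently holds it
    html = driver.page_source
finally:
    driver.quit()  # always close the browser, even if loading fails
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("td", class_="td1_normal_class"))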
However, note that using Selenium considerably slows down the process, since it has to start up a browser (headless or not) for every run.
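If the cells are filled in some time after the initial load, reading `page_source` immediately may still miss them. The sketch below is one possible way to handle that, assuming Selenium 4 and a local Firefox with geckodriver. It runs Firefox without a window and waits until at least one matching cell exists. The function name `fetch_rendered_html` and the 15-second timeout are illustrative choices, and the selector is simply the class name from the question. Whether that page really loads its data asynchronously is an assumption.

```python
def fetch_rendered_html(url, cell_selector="td.td1_normal_class", timeout=15):
    """Return the page's HTML once JavaScript has produced the target cells."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Block until a matching element is present in the DOM;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, cell_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # release the browser even if the wait times out
```

The returned string can be passed to BeautifulSoup in the same way as `html` in the answer's code. Running headless avoids drawing a window, but the browser still has to start, so the slowdown does not go away.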