How can scraped HTML be different from the source code?

Question

I'm scraping a list of restaurants from a website (with permission) and I have a problem. The HTML that Python scrapes from the website is different from the HTML in the source code. Fewer than half of the restaurants on their site are found in the HTML in Python. This is what my code looks like:

import requests
from bs4 import BeautifulSoup
from tempfile import TemporaryFile
import xlwt

url = 'https://www.example.com'

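# Download the page and parse it (naming the parser avoids a bs4 warning)
r = requests.get(url)
data = BeautifulSoup(r.text, 'html.parser')
# Collect every <span class="restaurant_name"> in the downloaded HTML
soup = data.find_all('span', {'class': 'restaurant_name'})
print(soup)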

Now I know it's inconvenient, but I can't show the HTML since the company won't let me. I'm just wondering whether you know, in general, how the HTML downloaded by Python can be different from the one in the source code, and what I can do about it.

Thanks in advance!

Recommended answer

You can use Selenium for this purpose. It renders the web page at run time, just like your browser does. You can use Selenium with Firefox, Chrome, or PhantomJS.

We basically use Selenium to fully render the web page, since many sites are built with modern JavaScript frameworks. A plain requests.get only downloads the HTML the server sends and does not run any JavaScript, so elements that scripts add afterwards are missing from it. Selenium is mostly used to develop crawlers and scrapers that gather data from different pages of a website, and it is also used for web automation.
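To make the difference concrete, here is a small sketch that uses only the standard library. The HTML string, the restaurant names, and the `NameCollector` class are all made up for illustration and have nothing to do with the real site. It shows that parsing the markup as served finds only the spans that are literally in it; a span that a script would create in the browser never shows up.

```python
from html.parser import HTMLParser

# Hypothetical page: two restaurants are in the served HTML, and a script
# would add a third one once it runs in a browser.
SERVED_HTML = """
<html><body>
  <span class="restaurant_name">Alpha Diner</span>
  <span class="restaurant_name">Beta Bistro</span>
  <script>
    var s = document.createElement('span');
    s.className = 'restaurant_name';
    s.textContent = 'Gamma Grill';
    document.body.appendChild(s);
  </script>
</body></html>
"""


class NameCollector(HTMLParser):
    """Collects the text of every <span class="restaurant_name">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "restaurant_name" in classes:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        # Script source is delivered here as plain text, but _inside is
        # False at that point, so it is ignored.
        if self._inside and data.strip():
            self.names.append(data.strip())


collector = NameCollector()
collector.feed(SERVED_HTML)
print(collector.names)  # ['Alpha Diner', 'Beta Bistro']
```

'Gamma Grill' is missing because nothing executed the script. A browser, or Selenium driving one, would run it and end up with three spans.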

For more on Selenium, read http://selenium-python.readthedocs.io/. I also have a blog post on Selenium for beginners: http://blog.hassanmehmood.com/creating-your-first-crawler-in-python/

Example

from selenium import webdriver

profile_link = 'http://hassanmehmood.com'


class TitleScrapper(object):

    def __init__(self):
        # Selenium 4 style: preferences are set on FirefoxOptions
        # (the older firefox_profile= keyword is deprecated)
        options = webdriver.FirefoxOptions()
        options.set_preference("browser.startup.homepage_override.mstone", "ignore")  # Avoid startup screen
        options.set_preference("startup.homepage_welcome_url.additional", "about:blank")

        self.driver = webdriver.Firefox(options=options)
        self.driver.set_window_size(1120, 550)

    def scrape_profile(self):
        # Load the page in a real browser so that its JavaScript runs
        self.driver.get(profile_link)
        print(self.driver.title)
        # quit() ends the whole browser session, not just the current window
        self.driver.quit()

    def scrape(self):
        self.scrape_profile()


if __name__ == '__main__':
    scraper = TitleScrapper()
    scraper.scrape()
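
Applied to the question, the same idea might look like the sketch below. It assumes Selenium 4 with Firefox and a driver available, and it assumes the restaurant names really are in `span.restaurant_name` elements, as in the asker's code. The function names `clean_names` and `scrape_restaurant_names` and the 10-second timeout are hypothetical choices, not anything from the original answer.

```python
def clean_names(raw_texts):
    """Strip surrounding whitespace and drop blank entries, keeping order."""
    names = []
    for text in raw_texts:
        name = text.strip()
        if name:
            names.append(name)
    return names


def scrape_restaurant_names(url, timeout=10):
    """Open `url` in Firefox, wait for the name spans, and return their text."""
    # Imported here so clean_names() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Blocks until at least one matching element is in the DOM, or
        # raises TimeoutException after `timeout` seconds.
        spans = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "span.restaurant_name")
            )
        )
        return clean_names(span.text for span in spans)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `scrape_restaurant_names('https://www.example.com')` would start a real browser. The wait returns as soon as the first matching span appears, so if the site loads more entries on scroll or click, a single wait like this may still miss some of them.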
