How can scraped HTML be different from the source code?

Question

I'm scraping a list of restaurants from a website (with permission) and I've run into a problem. The HTML that Python scrapes from the website is different from the HTML in the page's source code: fewer than half of the restaurants on the site appear in the HTML that Python downloads. This is what my code looks like:

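import requests
from bs4 import BeautifulSoup
from tempfile import TemporaryFile
import xlwt

url = 'https://www.example.com'

# Fetch the page and parse the HTML that comes back
r = requests.get(url)
data = BeautifulSoup(r.text, 'html.parser')
# Collect every <span class="restaurant_name"> element
soup = data.find_all('span', {'class': 'restaurant_name'})
print(soup)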

I know it's inconvenient, but I can't show the HTML because the company won't let me. I'm just wondering whether anyone knows, in general, how the HTML downloaded by Python can differ from the HTML in the source code, and what I can do about it.

Thanks in advance!

Recommended answer

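You can use Selenium for this. The most likely cause is that requests downloads only the HTML the server sends and does not execute JavaScript, so anything the page adds with scripts after loading is missing from r.text. Selenium drives a real browser, so it renders the page at run time just as your browser does. You can use Selenium with Firefox, Chrome, or PhantomJS (PhantomJS is no longer supported in current Selenium releases; headless Firefox or Chrome fills the same role).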

We use Selenium mainly to render the page completely, since many sites are built with modern JavaScript frameworks. It is mostly used to build crawlers and scrapers that gather data from different pages of a website, and it is also used for web automation.
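
Before switching tools, it can help to confirm that JavaScript rendering really is the cause. One rough check is to count the target elements in the raw response (r.text in the question's code) and compare that number with what the browser shows. The following is a minimal sketch using only the standard library; the class SpanClassCounter, the helper count_spans_with_class, and the sample markup are made up for illustration and are not from the question.

```python
from html.parser import HTMLParser


class SpanClassCounter(HTMLParser):
    """Counts <span> tags that carry a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_spans_with_class(html, class_name="restaurant_name"):
    """Return how many <span> tags in html have class_name among their classes."""
    parser = SpanClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-ins for the HTML from requests and the HTML after rendering.
raw_html = (
    '<div><span class="restaurant_name">A</span></div>'
    '<script>/* loads more */</script>'
)
rendered_html = (
    '<div><span class="restaurant_name">A</span>'
    '<span class="restaurant_name featured">B</span></div>'
)

print(count_spans_with_class(raw_html))       # 1
print(count_spans_with_class(rendered_html))  # 2
```

If the count for the raw response is lower than the number of entries visible in the browser, the missing entries are probably being added by scripts after the page loads.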

For more on Selenium, read http://selenium-python.readthedocs.io/. I also have a blog post on Selenium for beginners: http://blog.hassanmehmood.com/creating-your-first-crawler-in-python/

Example

from selenium import webdriver

profile_link = 'http://hassanmehmood.com'


class TitleScrapper(object):

    def __init__(self):
        # Firefox preferences that suppress the first-run/welcome pages.
        # Selenium 4 takes these through an options object.
        options = webdriver.FirefoxOptions()
        options.set_preference("browser.startup.homepage_override.mstone", "ignore")
        options.set_preference("startup.homepage_welcome_url.additional", "about:blank")

        self.driver = webdriver.Firefox(options=options)
        self.driver.set_window_size(1120, 550)

    def scrape_profile(self):
        # Load the page in a real browser so its scripts run, then read the title.
        self.driver.get(profile_link)
        print(self.driver.title)
        self.driver.quit()  # end the browser session

    def scrape(self):
        self.scrape_profile()


if __name__ == '__main__':
    scraper = TitleScrapper()
    scraper.scrape()
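
Applied to the restaurant list in the question, the same idea might look like the sketch below. It assumes the names sit in span elements with class restaurant_name, as in the question's code, and it waits for those elements to appear before reading them. The function names fetch_rendered_names and clean_names, the default timeout, and the choice of Firefox are assumptions for illustration, not part of the original answer.

```python
def clean_names(texts):
    """Strip whitespace, drop empty strings, and remove duplicates (order kept)."""
    seen = set()
    names = []
    for text in texts:
        name = text.strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


def fetch_rendered_names(url, css_selector="span.restaurant_name", timeout=10):
    """Open url in Firefox, wait for matching elements, and return their text."""
    # Imported here so clean_names can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until at least one matching element exists;
        # raises TimeoutException if none appears within timeout seconds.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, css_selector))
        )
        return clean_names(element.text for element in elements)
    finally:
        # Always end the browser session, even if the wait fails.
        driver.quit()
```

Calling fetch_rendered_names('https://www.example.com') needs Firefox and a matching driver on the machine. The wait returns as soon as the first match is present, so a page that loads results in batches or on scroll may still need extra handling.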
