How to speed up web scraping in python

Problem description

I'm working on a project for school and I am trying to get data about movies. I've managed to write a script to get the data I need from IMDbPY and the Open Movie DB API (omdbapi.com). The challenge I'm experiencing is that I'm trying to get data for 22,305 movies and each request takes about 0.7 seconds. Essentially my current script will take about 8 hours to complete. Looking for any way to maybe use multiple requests at the same time or any other suggestions to significantly speed up the process of getting this data.

import urllib2
import json
import pandas as pd
import time
import imdb

start_time = time.time() #record time at beginning of script

#used to make imdb.com think we are getting this data from a browser
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }

#Open Movie Database Query url for IMDb IDs
url = 'http://www.omdbapi.com/?tomatoes=true&i='

#read the ids from the imdb_id csv file
imdb_ids = pd.read_csv('ids.csv')

cols = [u'Plot', u'Rated', u'tomatoImage', u'Title', u'DVD', u'tomatoMeter',
 u'Writer', u'tomatoUserRating', u'Production', u'Actors', u'tomatoFresh',
 u'Type', u'imdbVotes', u'Website', u'tomatoConsensus', u'Poster', u'tomatoRotten',
 u'Director', u'Released', u'tomatoUserReviews', u'Awards', u'Genre', u'tomatoUserMeter',
 u'imdbRating', u'Language', u'Country', u'imdbpy_budget', u'BoxOffice', u'Runtime',
 u'tomatoReviews', u'imdbID', u'Metascore', u'Response', u'tomatoRating', u'Year',
 u'imdbpy_gross']

#create movies dataframe
movies = pd.DataFrame(columns=cols)

i=0
for i in range(len(imdb_ids)-1):

    start = time.time()
    req = urllib2.Request(url + str(imdb_ids.ix[i,0]), None, headers) #request page
    response = urllib2.urlopen(req) #actually call the html request
    the_page = response.read() #read the json from the omdbapi query
    movie_json = json.loads(the_page) #convert the json to a dict

    #get the gross revenue and budget from IMDbPy
    data = imdb.IMDb()
    movie_id = imdb_ids.ix[i,['imdb_id']]
    movie_id = movie_id.to_string()
    movie_id = int(movie_id[-7:])
    data = data.get_movie_business(movie_id)
    data = data['data']
    data = data['business']

    #get the budget $ amount out of the budget IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0]) 
    except:
        None

    #get the gross $ amount out of the gross IMDbPy string
    try:
        gross = data['gross']
        gross = gross[0]
        gross = gross.replace('$', '')
        gross = gross.replace(',', '')
        gross = gross.split(' ')
        gross = str(gross[0])
    except:
        None

    #add gross to the movies dict 
    try:
        movie_json[u'imdbpy_gross'] = gross
    except:
        movie_json[u'imdbpy_gross'] = 0

    #add gross to the movies dict    
    try:
        movie_json[u'imdbpy_budget'] = budget
    except:
        movie_json[u'imdbpy_budget'] = 0

    #create new dataframe that can be merged to movies DF    
    tempDF = pd.DataFrame.from_dict(movie_json, orient='index')
    tempDF = tempDF.T

    #add the new movie to the movies dataframe
    movies = movies.append(tempDF, ignore_index=True)
    end = time.time()
    time_took = round(end-start, 2)
    percentage = round(((i+1) / float(len(imdb_ids))) * 100,1)
    print i+1,"of",len(imdb_ids),"(" + str(percentage)+'%)','completed',time_took,'sec'
    #increment counter
    i+=1  

#save the dataframe to a csv file            
movies.to_csv('movie_data.csv', index=False)
end_time = time.time()
print round((end_time-start_time)/60,1), "min"

Recommended answer

Use the Eventlet library to fetch concurrently

As advised in the comments, you should fetch your feeds concurrently. This can be done with threading, multiprocessing, or with eventlet.

$ pip install eventlet

Try the web crawler sample from eventlet

See: http://eventlet.net/doc/examples.html#web-crawler
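
For reference, that crawler sample boils down to something like the following minimal sketch (the two OMDb urls are placeholders; eventlet.green.urllib2 is the cooperative drop-in replacement for urllib2):

import eventlet
from eventlet.green import urllib2

urls = [
    'http://www.omdbapi.com/?tomatoes=true&i=tt0468569',
    'http://www.omdbapi.com/?tomatoes=true&i=tt0137523',
]

def fetch(url):
    # blocks only this green thread, the others keep running
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool(20)  # up to 20 concurrent fetches
for body in pool.imap(fetch, urls):
    print "got body of length", len(body)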

With threading, the system takes care of switching between your threads. This brings a big problem if you have to access some common data structures, as you never know which other thread is currently accessing your data. You then start playing with synchronized blocks, locks, and semaphores - just to synchronize access to your shared data structures.

With eventlet it goes much simpler - you always run only one thread and jump between them only at I/O instructions or at other eventlet calls. The rest of your code runs uninterrupted and without the risk that another thread would mess up your data.

You only have to take care of the following:

  • all I/O operations must be non-blocking (this is mostly easy, eventlet provides non-blocking versions for most of the I/O you need - see the sketch after this list)

  • your remaining code must not be CPU expensive, as it would block switching between "green" threads for a longer time and the power of "green" multithreading would be gone
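
For example, to keep the standard-library I/O non-blocking you can either monkey-patch the standard library up front or import the green drop-in modules explicitly; a minimal sketch of both options (which one you pick is a matter of taste, neither is prescribed here):

import eventlet

# Option 1: patch the standard library so socket, urllib2, time, etc.
# become cooperative "green" versions everywhere in this process.
eventlet.monkey_patch()

# Option 2: import the green drop-in explicitly where you need it.
from eventlet.green import urllib2

def fetch(url):
    # while this call waits on the network, other green threads run
    return urllib2.urlopen(url).read()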

The great advantage of eventlet is that it allows you to write code in a straightforward way without spoiling it (too) much with locks, semaphores etc.

If I understand it correctly, the list of urls to fetch is known in advance and the order of their processing in your analysis is not important. This would allow an almost direct copy of the example from eventlet. I see that the index i has some significance, so you might consider combining the url and the index into a tuple and processing them as independent jobs.
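
A sketch of that idea, running a green thread pool over (index, url) tuples; the OMDb url prefix is taken from the question's script and the id list is just a placeholder:

import json
import eventlet
from eventlet.green import urllib2

# placeholder ids; in the real script these come from ids.csv
imdb_id_list = ['tt0468569', 'tt0137523']

# pair each id with its index so results can be matched back later
jobs = [(i, 'http://www.omdbapi.com/?tomatoes=true&i=' + str(imdb_id))
        for i, imdb_id in enumerate(imdb_id_list)]

def fetch_one(job):
    index, url = job
    body = urllib2.urlopen(url).read()
    return index, json.loads(body)

pool = eventlet.GreenPool(20)  # 20 concurrent green threads
for index, movie_json in pool.imap(fetch_one, jobs):
    # index tells us which source record this result belongs to
    print "finished record", index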

There are definitely other methods, but personally I have found eventlet really easy to use compared to other techniques while getting really good results (especially with fetching feeds). You just have to grasp the main concepts and be a bit careful to follow the eventlet requirements (keep being non-blocking).

There are various packages for asynchronous processing with requests, one of them using eventlet and named erequests, see https://github.com/saghul/erequests

import erequests

# have list of urls to fetch
urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]
# erequests.async.get(url) creates asynchronous request
async_reqs = [erequests.async.get(url) for url in urls]
# each async request is ready to go, but not yet performed

# erequests.map fires each async request
# and yields each processed request `req` as it completes
for req in erequests.map(async_reqs):
    if req.ok:
        content = req.content
        # process it here
        print "processing data from:", req.url
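
Applied to the question's data, the same pattern might look roughly like the sketch below (the url prefix comes from the question's script and the id list is a placeholder; note that responses arrive in completion order, which is exactly the matching problem discussed next):

import json
import erequests

# placeholder ids; in the real script these come from ids.csv
imdb_id_list = ['tt0468569', 'tt0137523']
urls = ['http://www.omdbapi.com/?tomatoes=true&i=' + str(imdb_id)
        for imdb_id in imdb_id_list]

async_reqs = [erequests.async.get(url) for url in urls]
for req in erequests.map(async_reqs):
    if req.ok:
        movie_json = json.loads(req.content)
        # build the dataframe row here, as in the original script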

Problems with processing this specific question

We are able to fetch and somehow process all the urls we need. But in this question the processing is bound to a particular record in the source data, so we need to match each processed request with the index of the record it belongs to in order to get further details for the final processing.

As we will see later, asynchronous processing does not honour the order of requests: some are processed sooner, some later, and map yields whatever is completed.

One option is to attach the index of the given url to the request and use it later when processing the returned data.

Note: the following sample is rather complex; if you can live with the solution provided above, skip this. But make sure you are not running into the problems detected and resolved below (urls being modified, requests following redirects).

import erequests
from itertools import count, izip
from functools import partial

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

def print_url_index(index, req, *args, **kwargs):
    content_length = req.headers.get("content-length", None)
    todo = "PROCESS" if req.status_code == 200 else "WAIT, NOT YET READY"
    print "{todo}: index: {index}: status: {req.status_code}: length: {content_length}, {req.url}".format(**locals())

async_reqs = (erequests.async.get(url, hooks={"response": partial(print_url_index, i)}) for i, url in izip(count(), urls))

for req in erequests.map(async_reqs):
    pass

Attaching hooks to requests

requests (and erequests too) allows defining hooks for the event called response. Each time a request gets a response, this hook function is called and can do something or even modify the response.

The following line defines a hook for the response event:

erequests.async.get(url, hooks={"response": partial(print_url_index, i)})

Passing the url index to the hook function

The signature of any hook should be func(req, *args, **kwargs).

But we also need to pass the index of the url we are processing into the hook function.

For this purpose we use functools.partial, which allows creating simplified functions by fixing some of the parameters to a specific value. This is exactly what we need: if you look at the print_url_index signature, we only need to fix the value of index, and the rest will fit the requirements for a hook function.

In our call we use partial with the name of the simplified function, print_url_index, and provide each url with its unique index.

The index could be provided in the loop by enumerate; in case of a larger number of parameters we can work in a more memory-efficient way and use count, which generates incrementing numbers, starting by default from 0.
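
As a tiny standalone illustration of what partial and count contribute here (the show function is only for demonstration):

from itertools import count, izip
from functools import partial

def show(index, req):
    print "got response for url number", index

# partial pins index to a fixed value; the resulting callable only
# needs `req`, which matches what the response hook will receive
hook = partial(show, 3)
hook("fake response object")  # prints: got response for url number 3

# count() lazily yields 0, 1, 2, ... starting from 0 by default
for i, url in izip(count(), ['http://httpbin.org', 'http://python-requests.org']):
    print i, url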

$ python ereq.py
WAIT, NOT YET READY: index: 3: status: 301: length: 66, http://python-requests.org/
WAIT, NOT YET READY: index: 4: status: 301: length: 58, http://kennethreitz.com/
WAIT, NOT YET READY: index: 0: status: 301: length: None, http://www.heroku.com/
PROCESS: index: 2: status: 200: length: 7700, http://httpbin.org/
WAIT, NOT YET READY: index: 1: status: 301: length: 64, http://python-tablib.org/
WAIT, NOT YET READY: index: 4: status: 301: length: None, http://kennethreitz.org
WAIT, NOT YET READY: index: 3: status: 302: length: 0, http://docs.python-requests.org
WAIT, NOT YET READY: index: 1: status: 302: length: 0, http://docs.python-tablib.org
PROCESS: index: 3: status: 200: length: None, http://docs.python-requests.org/en/latest/
PROCESS: index: 1: status: 200: length: None, http://docs.python-tablib.org/en/latest/
PROCESS: index: 0: status: 200: length: 12064, https://www.heroku.com/
PROCESS: index: 4: status: 200: length: 10478, http://www.kennethreitz.org/

This shows that:

  • requests are not processed in the order they were generated
  • some requests follow redirection, so the hook function is called multiple times
  • carefully inspecting the url values, we can see that no url from the original list urls is reported by a response; even for index 2 we got an extra / appended. That is why a simple lookup of the response url in the original list of urls would not help us.
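
One way to put these observations together (a sketch, not part of the original answer): let the hook store only the final response, keyed by the index it was given, so each body can be matched back to its record in the source data regardless of completion order:

import erequests
from itertools import count, izip
from functools import partial

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
]
results = {}

def store_by_index(index, req, *args, **kwargs):
    # redirect responses (301/302) fire the hook too; keep only the final one
    if req.status_code == 200:
        results[index] = req.content

async_reqs = (erequests.async.get(url, hooks={"response": partial(store_by_index, i)})
              for i, url in izip(count(), urls))

for req in erequests.map(async_reqs):
    pass

# results[i] now holds the body fetched for urls[i]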
