Multiprocessing in python/beautifulsoup issues


Problem Description


Hi guys, I'm fairly new to Python. What I'm trying to do is move my old code over to multiprocessing, but I'm running into some errors and hope someone can help me out. My code checks a few thousand links, given in a text file, for certain tags, and outputs a link once the tags are found. Since there are a few thousand links to check, speed is an issue, hence the move to multiprocessing.


Update: I'm getting HTTP 503 errors back. Am I sending too many requests, or am I missing something?

Multiprocessing code:

from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

br = Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

no_stock = []

def main(lines):
    done = False
    tries = 1
    while tries and not done:
        try:
            r = br.open(lines, timeout=15)
            r = r.read()
            soup = BeautifulSoup(r,'html.parser')
            done = True # exit the loop
        except socket.timeout:
            print('Failed socket retrying')
            tries -= 1 # to exit when tries == 0
        except Exception as e: 
            print '%s: %s' % (e.__class__.__name__, e)
            print sys.exc_info()[0]
            tries -= 1 # to exit when tries == 0
    if not done:
        print('Failed for {}\n'.format(lines))
    table = soup.find_all('div', {'class' : "empty_result"})
    results = soup.find_all('strong', style = 'color: red;')
    if table or results:
        no_stock.append(lines)

if __name__ == "__main__":
    r = br.open('http://www.randomweb.com/') #avoid redirection
    fileName = "url.txt"
    pool = Pool(processes=2)
    with open(fileName, "r+") as f:
        lines = pool.map(main, f)
    with open('no_stock.txt','w') as f :
        f.write('No. of out of stock items : '+str(len(no_stock))+'\n'+'\n')
        for i in no_stock:
            f.write(i + '\n')

Traceback:

Traceback (most recent call last):
  File "test2.py", line 43, in <module>
    lines = pool.map(main, f)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
UnboundLocalError: local variable 'soup' referenced before assignment

My txt file looks something like this:

http://www.randomweb.com/item.htm?uuid=44733096229
http://www.randomweb.com/item.htm?uuid=4473309622789
http://www.randomweb.com/item.htm?uuid=447330962291
....etc

Recommended Answer

from mechanize import Browser
from bs4 import BeautifulSoup
import sys
import socket
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

br = Browser()

no_stock = []

def main(line):
    done = False
    tries = 3
    while tries and not done:
        try:
            r = br.open(line, timeout=15)
            r = r.read()
            soup = BeautifulSoup(r,'html.parser')
            done = True # exit the loop
        except socket.timeout:
            print('Failed socket retrying')
            tries -= 1 # to exit when tries == 0
        except:
            print('Random fail retrying')
            print sys.exc_info()[0]
            tries -= 1 # to exit when tries == 0
    if not done:
        print('Failed for {}\n'.format(line))
        return  # nothing was fetched, so skip the soup lookups below
    table = soup.find_all('div', {'class' : "empty_result"})
    results = soup.find_all('strong', style = 'color: red;')
    if table or results:
        no_stock.append(line)

if __name__ == "__main__":
    fileName = "url.txt"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        lines = pool.map(main, f)
    with open('no_stock.txt','w') as f :
        f.write('No. of out of stock items : '+str(len(no_stock))+'\n'+'\n')
        for i in no_stock:
            f.write(i + '\n')
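
As for the update about HTTP 503 errors: a 503 means the server is refusing to serve the request, often because it is being hit too quickly. A minimal sketch of one mitigation, throttling and retrying with a growing pause, assuming mechanize's Browser and that the 503s are rate-related (fetch_with_backoff is an illustrative helper, not part of the original code):

import time
from mechanize import Browser

br = Browser()
br.set_handle_robots(False)

def fetch_with_backoff(url, max_tries=3, delay=5):
    # Retry with a growing pause between attempts; assumes the 503s
    # come from hitting the server too quickly.
    for attempt in range(max_tries):
        try:
            return br.open(url, timeout=15).read()
        except Exception as e:
            if getattr(e, 'code', None) == 503:  # HTTP 503: back off, then retry
                time.sleep(delay * (attempt + 1))
            else:
                raise
    return None  # every attempt was refused

Shrinking the pool (for example Pool(2) instead of cpu_count() * 2) also lowers the request rate.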


pool.map takes two parameters: the first is a function (in your code, main); the other is an iterable, and each item of the iterable becomes an argument to the function (in your code, each line of the file).
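
A minimal, self-contained illustration of that signature, using a toy function rather than the original main:

from multiprocessing.dummy import Pool  # same thread-based Pool as above

def square(n):
    return n * n

pool = Pool(2)
print(pool.map(square, [1, 2, 3, 4]))  # prints [1, 4, 9, 16]
pool.close()
pool.join()

Note that pool.map also collects each call's return value, so since main returns nothing, the lines variable above ends up as a list of None values; the actual results are gathered through the no_stock list.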
