Multi threaded web scraper using urlretrieve on a cookie-enabled site


Problem Description


I am trying to write my first Python script, and with lots of Googling, I think that I am just about done. However, I will need some help getting myself across the finish line.

I need to write a script that logs onto a cookie-enabled site, scrapes a bunch of links, and then spawns a few processes to download the files. I have the program running single-threaded, so I know the code works. But when I tried to create a pool of download workers, I ran into a wall.

#manager.py
import Fetch # the module name where worker lives
import multiprocessing

def FetchReports(links,Username,Password,VendorID):
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,links)
    pool.close()
    pool.join()


#worker.py
import mechanize
import atexit

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    Login(User,Password)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    atexit.register(Logout)

def DownloadJob(link):
    mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)
    return True

In this revision, the code fails because the cookies have not been transferred to the worker for urlretrieve to use. No problem, I was able to use mechanize's .cookiejar class to save the cookies in the manager, and pass them to the worker.

#worker.py
import mechanize
import atexit

from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()

    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)  # note I pass the opener to Login so it can catch the cookies.

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

    atexit.register(Logout)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))

    file = open(DataPath+'\\'+filename, "wb")
    file.write(opener.open(mechanize.urljoin(SiteBase, link)).read())
    file.close()

But THAT fails because the opener (I think) wants to move the binary file back to the manager for processing, and I get an "unable to pickle object" error message referring to the webpage it is trying to read into the file.
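
For what it's worth, here is a toy check (my own sketch, not code from the scraper) of the kind of object that trips multiprocessing up: anything shipped between processes has to survive pickling, and open file handles or urlopen-style response objects do not.

import pickle

# Stand-in for a response/file object (illustrative only, not from the scraper)
response_like = open(__file__)
try:
    pickle.dumps(response_like)
except TypeError as err:
    print(err)   # e.g. "can't pickle file objects" on Python 2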

The obvious solution is to read the cookies in from the cookie jar and manually add them to the header when making the urlretrieve request, but I am trying to avoid that, and that is why I am fishing for suggestions.
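
For reference, that manual fallback would look roughly like the sketch below. The function name and its arguments are made up for illustration; mechanize.Request, mechanize.urlopen, and the cookie jar's add_cookie_header method are standard.

import mechanize

def DownloadWithManualCookies(cj, url, destination):
    # cj is an LWPCookieJar already reverted from the per-process cookies.txt
    request = mechanize.Request(url)
    cj.add_cookie_header(request)   # copies the matching cookies into the request headers
    response = mechanize.urlopen(request)
    f = open(destination, 'wb')
    f.write(response.read())
    f.close()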

Solution

After working on this for most of the day, it turns out that mechanize was not the problem; it looks more like a coding error. After extensive tweaking and cursing, I have gotten the code to work properly.

For future Googlers like myself, I am providing the updated code below:

#manager.py [unchanged from original]
def FetchReports(links,Username,Password,VendorID):
    import Fetch
    import multiprocessing

    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,_SplitLinksArray(links))
    pool.close()
    pool.join()


#worker.py
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))

    mechanize.urlretrieve(url=mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

Because I am just downloading links from a list, the non-threadsafe nature of mechanize doesn't seem to be a problem [full disclosure: I have run this process exactly three times, so a problem may appear under further testing]. The multiprocessing module and its worker pool do all the heavy lifting. Maintaining cookies in files was important for me because the webserver I am downloading from has to give each thread its own session ID, but other people implementing this code may not need to do that. I did notice that it seems to "forget" variables between the init call and the run call, so the cookiejar may not make the jump.
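
As a sanity check on the per-process cookie files, the sketch below (my own, not part of the scraper) shows that the pool initializer and every job that later runs in the same worker report the same current_process().name, which is what lets DownloadJob find the file that _ProcessStart saved.

import multiprocessing
from multiprocessing import current_process

def init():
    # runs once per worker process, like _ProcessStart above
    print('initializing %s' % current_process().name)

def job(i):
    # runs repeatedly in the same workers, like DownloadJob above
    return '%s handled %d' % (current_process().name, i)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2, initializer=init)
    print(pool.map(job, range(4)))
    pool.close()
    pool.join()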
