How to add proxies to BeautifulSoup crawler


Question

These are the definitions in the Python crawler:

from __future__ import with_statement

from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime

How do I add a rotating proxy (one proxy per open thread) to a recursive crawler working on BeautifulSoup?

I know how to add proxies if I were using mechanize's Browser:

from mechanize import Browser

br = Browser()
# mechanize takes a proxies mapping directly on the Browser object
br.set_proxies({'http': 'http://username:password@proxy:port',
                'https': 'https://username:password@proxy:port'})

but I would like to know specifically what kind of solution BeautifulSoup would require.

Thanks a lot for your help!

Answer

Heads up that there is a less complex solution to this available now, shared here:

import requests

proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}

requests.get("http://example.org", proxies=proxies)

Then do your BeautifulSoup parsing as normal from the request response.
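
For example, a minimal sketch of that (reusing the placeholder proxy address from the snippet above):

import requests
from bs4 import BeautifulSoup

# Placeholder proxy address from above; substitute your own.
proxies = {"http": "http://10.10.1.10:3128"}

response = requests.get("http://example.org", proxies=proxies)
# BeautifulSoup only parses the HTML; the proxying is handled by requests.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)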

So if you want separate threads with different proxies, you just call different dictionary entries for each request (e.g. from a list of dicts).
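
A minimal sketch of that rotation, assuming a hypothetical list of proxy dicts cycled with itertools.cycle:

import itertools
import requests
from bs4 import BeautifulSoup

# Hypothetical proxy addresses; substitute your own list of dicts.
proxy_list = [
    {"http": "http://10.10.1.10:3128"},
    {"http": "http://10.10.1.11:3128"},
]
proxy_pool = itertools.cycle(proxy_list)

def fetch(url):
    # Each call picks the next proxies dict from the rotation.
    response = requests.get(url, proxies=next(proxy_pool))
    return BeautifulSoup(response.text, "html.parser")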

This seems more straightforward to implement when your existing stack is already requests / bs4, since it is just an extra proxies keyword argument on your existing requests.get() call. You don't have to initialize/install/open separate urllib handlers for each thread.
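
For contrast, a rough sketch of what that per-thread urllib2 handler approach would look like with the question's original eventlet imports (Python 2; the proxy string is a placeholder):

from eventlet.green import urllib2

def fetch(url, proxy):
    # Each green thread builds its own opener around its own ProxyHandler.
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
    return opener.open(url).read()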
