How to add proxies to BeautifulSoup crawler
Question
These are the definitions in the Python crawler:
from __future__ import with_statement
from eventlet.green import urllib2
import eventlet
import re
import urlparse
from bs4 import BeautifulSoup, SoupStrainer
import sqlite3
import datetime
How do I add a rotating proxy (one proxy per open thread) to a recursive crawler working with BeautifulSoup?
I know how to add proxies if I were using mechanize's Browser:
br = Browser()
br.set_proxies({'http': 'http://username:password@proxy:port',
                'https': 'https://username:password@proxy:port'})
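For completeness, the urllib2 family the crawler already imports can also route through a proxy without mechanize, via a ProxyHandler. A minimal sketch, written against Python 3's urllib.request (the modern name for urllib2; the same API shape applies to the eventlet-green version) and using placeholder proxy addresses:

```python
import urllib.request  # Python 3 successor to urllib2

# Placeholder proxy addresses; substitute your own proxy:port values.
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
})
opener = urllib.request.build_opener(proxy_handler)
# opener.open(url) now routes requests through the proxy;
# urllib.request.install_opener(opener) would make it the global default.
```

In urllib2-era code the calls are urllib2.ProxyHandler and urllib2.build_opener with the same arguments.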
but I would like to know specifically what kind of solution BeautifulSoup would require.
Thanks a lot for your help!
Answer
Heads up that a less complex solution to this is available now:
import requests
proxies = {"http": "http://10.10.1.10:3128",
           "https": "http://10.10.1.10:1080"}
requests.get("http://example.org", proxies=proxies)
Then do your BeautifulSoup parsing as normal on the response.
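That parsing step might look like the following. A minimal sketch: the HTML here is a static stand-in for resp.text from the requests.get() call above, so it runs without network access, and it uses the stdlib "html.parser" backend (lxml also works if installed):

```python
from bs4 import BeautifulSoup

# Stand-in for resp.text from the requests.get() call above.
html = '<html><body><a href="http://example.org/next">next page</a></body></html>'

soup = BeautifulSoup(html, "html.parser")
# Collect outgoing links, as a recursive crawler would.
links = [a["href"] for a in soup.find_all("a", href=True)]
```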
So if you want separate threads with different proxies, you just pass a different dictionary entry for each request (e.g. from a list of dicts).
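One way to sketch that "list of dicts" rotation: cycle through the proxy pool round-robin, handing each worker thread the next dict. The proxy addresses are placeholders, and the fetch function is a stub standing in for requests.get(url, proxies=proxy) so the sketch runs offline:

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical proxy pool; addresses are placeholders.
PROXIES = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]
_cycle = itertools.cycle(PROXIES)
_lock = threading.Lock()  # itertools.cycle is not thread-safe by itself

def next_proxy():
    """Return the next proxy dict, round-robin across all threads."""
    with _lock:
        return next(_cycle)

def fetch(url):
    proxy = next_proxy()
    # A real crawler would do: return requests.get(url, proxies=proxy)
    return url, proxy  # stub so the sketch runs without network access

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, ["http://example.org/%d" % i for i in range(4)]))
```

With two proxies and four URLs, each proxy ends up used exactly twice, regardless of which thread grabs which URL.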
This seems more straightforward to implement when your existing packages are already requests/bs4, since it is just an extra keyword argument added to your existing requests.get() call. You don't have to initialize/install/open separate urllib handlers for each thread.