Multithreaded crawler while using Tor proxy


Problem Description

I am trying to build a multi-threaded crawler that uses Tor proxies. I establish the Tor connection with the following:

import random
import socket
import socks  # SocksiPy / PySocks
from stem import Signal
from stem.control import Controller

controller = Controller.from_port(port=9151)  # Tor Browser's control port

def connectTor():
    # Monkey-patch every new socket to go through the Tor SOCKS proxy
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
    socket.socket = socks.socksocket


def renew_tor():
    # Rotate the request headers; BROWSERS is a list of User-Agent strings
    # defined elsewhere in the question's code
    global request_headers
    request_headers = {
        "Accept-Language": "en-US,en;q=0.5",
        "User-Agent": random.choice(BROWSERS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "http://thewebsite2.com",
        "Connection": "close"
    }

    # Ask Tor for a new identity (fresh circuit, new exit IP)
    controller.authenticate()
    controller.signal(Signal.NEWNYM)
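One caveat with renew_tor: Tor rate-limits NEWNYM signals (a new circuit is only honored roughly every ten seconds), so having 200 threads call it on every captcha will mostly be wasted work. A minimal throttle sketch is below; the names `should_renew` and `throttled_renew` and the 10-second interval are assumptions for illustration (stem also exposes `controller.is_newnym_available()` and `controller.get_newnym_wait()` for this purpose):

```python
import threading
import time

_newnym_lock = threading.Lock()
_last_newnym = float("-inf")  # monotonic timestamp of the last NEWNYM sent

def should_renew(now, last, min_interval=10.0):
    # Pure helper: True if enough time has passed since the last NEWNYM
    return (now - last) >= min_interval

def throttled_renew(renew_fn, min_interval=10.0):
    # Call renew_fn (e.g. the question's renew_tor) at most once per
    # min_interval, no matter how many threads ask at the same time
    global _last_newnym
    with _newnym_lock:
        now = time.monotonic()
        if should_renew(now, _last_newnym, min_interval):
            _last_newnym = now
            renew_fn()
            return True
        return False
```

Threads would then call `throttled_renew(renew_tor)` instead of `renew_tor()` directly; only the first caller in each window actually signals the controller.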

Here is the URL fetcher:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    while True:
        try:
            connectTor()
            r = requests.Session()
            response = r.get(url, headers=request_headers)
            the_page = response.content.decode('utf-8', errors='ignore')
            the_soup = BeautifulSoup(the_page, 'html.parser')
            if "captcha" in the_page.lower():
                print("flag condition matched while url: ", url)
                renew_tor()  # hit a captcha: request a new Tor circuit and retry
            else:
                return the_soup
        except Exception as e:
            print("Error while URL :", url, str(e))

I then create the multi-threaded fetch job:

with futures.ThreadPoolExecutor(200) as executor:
    for url in zurls:
        future = executor.submit(fetchjob, url)
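As a side note, the loop above drops every Future it creates, so any exception raised inside fetchjob disappears silently. A sketch of collecting the futures and surfacing errors is below; `fetchjob` and `zurls` are the question's names, and the body of `fetchjob` here is only a stand-in for the real fetch:

```python
from concurrent import futures

def fetchjob(url):
    # Stand-in for the real fetch; the question's version calls get_soup(url)
    return len(url)

zurls = ["http://example.com/a", "http://example.com/bb"]

results = {}
with futures.ThreadPoolExecutor(max_workers=8) as executor:
    # Keep a handle on each future so results and exceptions can be collected
    future_to_url = {executor.submit(fetchjob, url): url for url in zurls}
    for fut in futures.as_completed(future_to_url):
        url = future_to_url[fut]
        try:
            results[url] = fut.result()  # re-raises any exception from fetchjob
        except Exception as e:
            print("Error while URL :", url, e)
```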

I then get the following error, which I do not see when I use multiprocessing:

 Socket connection failed (Socket error: 0x01: General SOCKS server failure)

I would appreciate any advice on avoiding the SOCKS error and on improving the performance of the crawling method so that it is multi-threaded.

Answer

This is a perfect example of why monkey patching socket.socket is bad.

This replaces the socket used by all socket connections (which is nearly everything) with the SOCKS socket.

When you later connect to the controller, it attempts to communicate using the SOCKS protocol instead of establishing a direct connection.

Since you're already using requests, I'd suggest getting rid of SocksiPy and the socket.socket = socks.socksocket code, and using the SOCKS proxy functionality built into requests:

# 'socks5h' resolves hostnames through the proxy; SOCKS support in
# requests needs the PySocks extra: pip install requests[socks]
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}

response = r.get(url, headers=request_headers, proxies=proxies)
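With the monkey patch gone, the fetcher passes the proxies explicitly, so only the crawl traffic goes through Tor while stem's Controller connection to the control port stays direct. A minimal sketch of a fetch helper under this approach (the port 9050 address is the answer's; `fetch_page` is a hypothetical name, and `requests[socks]` must be installed):

```python
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_page(url, headers=None):
    # Only this request is routed through Tor; nothing global is patched,
    # so other connections (e.g. stem's control-port socket) are unaffected
    with requests.Session() as s:
        response = s.get(url, headers=headers, proxies=TOR_PROXIES, timeout=30)
        return response.content.decode("utf-8", errors="ignore")
```

The question's get_soup could then build its soup with `BeautifulSoup(fetch_page(url), 'html.parser')` and keep its captcha/renew_tor retry loop unchanged.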
