Can't modify a function to work independently instead of depending on a returned result

Problem Description

I've written a script in Python that uses proxies while sending requests to some links in order to parse the product name from each page. My current attempt does the job flawlessly. The parse_product() function is fully dependent on the returned result (the proxy) in order to reuse the same proxy correctly. I'm trying to modify parse_product() so that it does not depend on a previous call to itself in order to reuse a working proxy until it becomes invalid. To be clearer, I expect the main function to look more like the snippet below. Once this is solved, I'll use multiprocessing to make the script run faster:

if __name__ == '__main__':
    for url in linklist:
        parse_product(url)

and still expect the script to work as it does now.

Here is what I've tried (the working version):

import random
import requests
from random import choice
from urllib.parse import urljoin
from bs4 import BeautifulSoup

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO', 
    'https://www.amazon.com/dp/B00TPKOPWA', 
    'https://www.amazon.com/dp/B00TH42HWE' 
]

proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']

def process_proxy(proxy):
    global proxyVault
    if not proxy:
        proxy_url = choice(proxyVault)
        proxy = {'https': f'http://{proxy_url}'}
    else:
        proxy_pattern = proxy.get("https").split("//")[-1]
        if proxy_pattern in proxyVault:
            proxyVault.remove(proxy_pattern)
        random.shuffle(proxyVault)
        proxy_url = choice(proxyVault)
        proxy = {'https': f'http://{proxy_url}'}
    return proxy


def parse_product(link,proxy):
    try:
        if not proxy:raise
        print("checking the proxy:",proxy)
        res = requests.get(link,proxies=proxy,timeout=5)
        soup = BeautifulSoup(res.text,"html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception: product_name = ""

        return proxy, product_name

    except Exception:
        """the following line when hit produces new proxy and remove the bad one that passes through process_proxy(proxy)"""
        proxy_link = process_proxy(proxy)
        return parse_product(link,proxy_link)


if __name__ == '__main__':
    proxy = None
    for url in linklist:
        result = parse_product(url,proxy)
        proxy = result[0]
        print(result)

Note: The parse_product() function returns both a proxy and a product name. The proxy it returns is reused within parse_product() itself until it becomes invalid.

By the way, the proxies listed in proxyVault are just placeholders.

Recommended Answer

If you don't need multithreading support (your edits suggest you don't), you can make it work with the following minor changes. After shuffling the list once, proxyVault holds both the entire proxy pool and the active proxy (the last element); your code used both shuffle and choice, but one of them is enough. pop()-ing from the list changes the active proxy until there are none left.

import random
import requests
from random import choice
from urllib.parse import urljoin
from bs4 import BeautifulSoup

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE'
]

proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
random.shuffle(proxyVault)


class NoMoreProxies(Exception):
    pass


def skip_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxyVault.pop()


def get_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxy_url = proxyVault[-1]
    proxy = {'https': f'http://{proxy_url}'}
    return proxy


def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""

        return product_name

    except Exception:
        """the following line when hit produces new proxy and remove the bad one that passes through process_proxy(proxy)"""
        skip_proxy()
        return parse_product(link)


if __name__ == '__main__':
    for url in linklist:
        result = parse_product(url)
        print(result)
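
As a side note, the approach above assumes a single thread, matching the caveat at the top of this answer. If you later share proxyVault between threads, its reads and pops would need to be guarded; below is a minimal sketch of that idea (assuming threads rather than separate processes, since a plain global list is not shared across processes; proxy_lock is a name introduced here purely for illustration):

import threading

proxy_lock = threading.Lock()  # guards every access to the global proxyVault

def skip_proxy():
    global proxyVault
    with proxy_lock:
        if len(proxyVault) == 0:
            raise NoMoreProxies()
        proxyVault.pop()

def get_proxy():
    global proxyVault
    with proxy_lock:
        if len(proxyVault) == 0:
            raise NoMoreProxies()
        proxy_url = proxyVault[-1]
    return {'https': f'http://{proxy_url}'}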

I would also suggest changing the last try/except clause to catch requests.exceptions.RequestException instead of the broad Exception.
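
For example, the tail of parse_product() could look like this (a minimal sketch of that suggestion, reusing get_proxy() and skip_proxy() from the code above; the rest of the script stays unchanged):

from requests.exceptions import RequestException

def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""

        return product_name

    except RequestException:
        # only network/proxy failures (timeouts, proxy errors, connection errors)
        # trigger a proxy switch; other bugs now surface instead of being swallowed
        skip_proxy()
        return parse_product(link)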
