Can't modify a function to work independently instead of depending on a returned result
Question
I've written a script in Python that uses proxies while sending requests to some links in order to parse the product names from there. My current attempt does the job flawlessly, but the parse_product() function is fully dependent on the returned result (the proxy) in order to reuse the same proxy correctly. I'm trying to modify parse_product() so that it does not depend on a previous call to itself in order to reuse a working proxy until it becomes invalid. To be clearer, I'm expecting the main section to look like the snippet below. (Once this is solved, I'll use multiprocessing to make the script run faster.)
if __name__ == '__main__':
    for url in linklist:
        parse_product(url)
and still expect the script to work as it does now.
I've tried with (working version):
import random
import requests
from random import choice
from urllib.parse import urljoin
from bs4 import BeautifulSoup

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE'
]

proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']

def process_proxy(proxy):
    global proxyVault
    if not proxy:
        proxy_url = choice(proxyVault)
        proxy = {'https': f'http://{proxy_url}'}
    else:
        proxy_pattern = proxy.get("https").split("//")[-1]
        if proxy_pattern in proxyVault:
            proxyVault.remove(proxy_pattern)
        random.shuffle(proxyVault)
        proxy_url = choice(proxyVault)
        proxy = {'https': f'http://{proxy_url}'}
    return proxy

def parse_product(link, proxy):
    try:
        if not proxy: raise
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return proxy, product_name
    except Exception:
        # the following line, when hit, produces a new proxy and removes
        # the bad one via process_proxy(proxy)
        proxy_link = process_proxy(proxy)
        return parse_product(link, proxy_link)

if __name__ == '__main__':
    proxy = None
    for url in linklist:
        result = parse_product(url, proxy)
        proxy = result[0]
        print(result)
Note: parse_product() returns a proxy and a product name. The proxy that the function returns gets reused within parse_product() itself until it becomes invalid.
By the way, the proxies used within proxyVault are just placeholders.
Recommended Answer
If you don't need multithreading support (your edits suggest you don't), you can make it work with the following minor changes. After shuffling the list once (your code had both shuffle and choice, but one of them is enough), proxyVault keeps both the entire proxy pool and the active proxy, which is the last element. pop()-ing from the list changes the active proxy, until there are none left.
import random
import requests
from bs4 import BeautifulSoup

linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE'
]

proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
random.shuffle(proxyVault)

class NoMoreProxies(Exception):
    pass

def skip_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxyVault.pop()

def get_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxy_url = proxyVault[-1]
    proxy = {'https': f'http://{proxy_url}'}
    return proxy

def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return product_name
    except Exception:
        # when this is hit, drop the bad proxy via skip_proxy() and
        # retry with the next one
        skip_proxy()
        return parse_product(link)

if __name__ == '__main__':
    for url in linklist:
        result = parse_product(url)
        print(result)
I would also suggest changing the last try/except clause to catch a RequestException instead of Exception.
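A minimal sketch of what that narrower handler could look like, kept separate from the code above (the function name fetch_title and the None return value are illustrative, not part of the original script):

```python
import requests
from requests.exceptions import RequestException

def fetch_title(link, proxy):
    # Catch only network/proxy failures (timeouts, connection errors,
    # bad proxies); programming errors such as AttributeError from the
    # parsing step would still propagate and stay visible.
    try:
        res = requests.get(link, proxies=proxy, timeout=5)
        res.raise_for_status()
        return res.text
    except RequestException:
        # Signal the caller that this proxy (or the request) failed,
        # so it can rotate to the next one.
        return None
```

Catching RequestException keeps genuine bugs from being silently swallowed by the retry loop, while still rotating proxies on any network-level failure.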