Building a proxy rotator with specific URL and script


Problem description

I am struggling to build a proxy rotator with existing code that is structured for a different URL.

The URLs I want are provided in the code example below. I am trying to have the provided script call the desired URLs and get ALL the 'IP:PORT' pairs (the current script limits itself to ten) when the proxy type is "HTTPS".
It can be done in XPath or bs4. I am more familiar with bs4, though.

I understand the logic, but I am failing on how to structure this. To start, I've tried stripping strings and trying to call specific td elements, but it's not working.
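
For illustration (a sketch of my own, since the question does not show the failed attempt), grabbing specific td elements from the first desired URL might look like the snippet below; it fails to yield 'IP:PORT' because the ports on that page are written by JavaScript rather than stored as plain cell text:

import requests
from bs4 import BeautifulSoup as bs

# Illustrative sketch only (my assumption of what the attempt looked like;
# the question does not show it): dump the text of every td cell on the
# first desired URL. The ports there are generated client-side by
# JavaScript, so plain cell extraction cannot recover 'IP:PORT' pairs.
r = requests.get('http://spys.one/free-proxy-list/US/')
soup = bs(r.content, 'lxml')
cells = [td.get_text(strip=True) for td in soup.select('td')]
print(cells[:20])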

#URLs I want
url_list = ['http://spys.one/free-proxy-list/US/','http://spys.one/free-proxy-list/US/1/']

#code I have
from lxml.html import fromstring
import requests
from itertools import cycle
import traceback

def get_proxies():
    url = 'https://free-proxy-list.net/'
    response = requests.get(url)
    parser = fromstring(response.text)
    proxies = set()
    # only the first ten rows are inspected; td[7] is the "Https" column
    for i in parser.xpath('//tbody/tr')[:10]:
        if i.xpath('.//td[7][contains(text(),"yes")]'):
            proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
            proxies.add(proxy)
    return proxies

proxies = get_proxies()
proxy_pool = cycle(proxies)
proxy = next(proxy_pool)
# NOTE: url here must be whatever target you want to fetch through the
# proxy; as written it is only defined inside get_proxies()
response = requests.get(url, proxies={"http": proxy, "https": proxy})
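
A minimal sketch (my addition, not part of the question) of how such a pool is typically consumed, rotating to the next proxy whenever a request fails; https://httpbin.org/ip is an arbitrary example target, not one of the URLs from the question:

import requests
from itertools import cycle

proxies = get_proxies()
proxy_pool = cycle(proxies)

for _ in range(len(proxies)):
    proxy = next(proxy_pool)
    try:
        # httpbin.org/ip simply echoes the caller's IP, which makes it a
        # convenient way to verify the proxy is actually being used
        response = requests.get('https://httpbin.org/ip',
                                proxies={'http': proxy, 'https': proxy},
                                timeout=10)
        print(proxy, response.json())
        break  # stop at the first proxy that works
    except requests.exceptions.RequestException:
        print(proxy, 'failed, rotating to the next proxy')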

I hope to learn how the provided code should be structured for the 2 desired URLs, so that it returns all IP:PORT numbers when the proxy type is HTTPS.

Recommended answer

One way is to issue port-specific POST requests in a loop. You could amend this to add everything to one final list. The endpoint is already HTTPS-specific.

import re
import requests

def get_proxies(number, port, p):
    # POST to the https/ssl listing; xf4 selects the port filter matching
    # the ports list below, and xpp = 5 requests the largest page size
    r = requests.post('http://spys.one/en/https-ssl-proxy/', data={'xpp': 5, 'xf4': number})
    # the page writes ports via JavaScript, so pair each regex-matched IP
    # with the port implied by the current filter
    proxies = [':'.join([str(i), port]) for i in p.findall(r.text)]
    return proxies

ports = ['3128', '8080', '80']
p = re.compile(r'spy14>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})<script')
proxies = []

for number, port in enumerate(ports, 1):
    proxies += get_proxies(number, port, p)

print(proxies)
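
As a possible follow-up (my addition, not part of the original answer), the combined list can be de-duplicated and fed straight into the same cycle-based rotator used in the question:

from itertools import cycle

# drop any duplicates collected across the three port filters, then
# rotate exactly as in the question's original snippet
proxy_pool = cycle(set(proxies))
proxy = next(proxy_pool)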


Sample of results:

For a specific country:

import re
import requests
from bs4 import BeautifulSoup as bs

def get_proxies(number, port, p, country):
    r = requests.post('http://spys.one/en/https-ssl-proxy/', data={'xpp': 5, 'xf4': number})
    soup = bs(r.content, 'lxml')
    # keep only rows whose country cell matches, then regex the IP out of
    # the cell that carries the port-writing script
    proxies = [':'.join([p.findall(i.text)[0], port]) for i in soup.select('table table tr:has(.spy14:contains("' + country + '")) td:has(script) .spy14')]
    return proxies

ports = ['3128', '8080', '80']
p = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})document')
proxies = []

for number, port in enumerate(ports, 1):
    proxies += get_proxies(number, port, p, 'United States')

print(proxies)
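
Note that the :has()/:contains() selectors above require BeautifulSoup 4.7+ (which bundles the soupsieve selector engine). The same helper should work for any country label as displayed on the page; 'Canada' below is my assumed example, not from the original answer:

canada = []
for number, port in enumerate(ports, 1):
    canada += get_proxies(number, port, p, 'Canada')
print(canada)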


For the one you said is already written, I will refer to my original answer:

from bs4 import BeautifulSoup as bs
import requests

def get_proxies():
    r = requests.get('https://free-proxy-list.net/')
    soup = bs(r.content, 'lxml')
    # no [:10] slice here: every row whose Https column (.hx) says "yes" is kept
    proxies = {tr.td.text + ':' + tr.td.next_sibling.text for tr in soup.select('tr:has(.hx:contains(yes))')}
    return proxies

get_proxies()
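
A small usage sketch (my addition): print the size of the returned set to confirm the ten-row cap from the question's script is gone:

https_proxies = get_proxies()
print(len(https_proxies), 'HTTPS proxies:', https_proxies)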
