Login, Navigate and Retrieve data behind a proxy with Python


Question

With a Python script, I want to log in to a website and retrieve some data, all from behind my company's proxy.

I know this question looks like a duplicate of others you can find by searching, but it isn't.

I have already tried the solutions proposed in the answers to those questions, but they didn't work. I need not only a piece of code that logs in and gets a specific web page, but also the concepts behind how this whole mechanism works.

Here is a description of what I want to be able to do:

Log into a website > Get to page X > Insert data into a form on page X and push the "Calculate" button > Capture the results of my query

Once I have the results, I'll see how to sort the data.

How can I achieve this behind a proxy? Every time I try to log in with the "requests" library it fails, saying I am unable to get page X because I did not authenticate. Or worse, I cannot even reach the site because I didn't set up the proxy first.

Recommended answer

Clarification of Requirements

First, make sure you understand the context in which you get the results of your calculation.

(F12 opens DevTools in Chrome or Firebug in Firefox, where you can find most of the details discussed below.)

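  • Can you access the target page from your web browser?
  • Is the proxy really necessary? If so, test it in the browser and note exactly which proxy to use.
  • Which kind of authentication must you use to access the target web app? The options are "basic", "digest", or something custom that requires filling in a form and keeping something in cookies, etc.
  • When you open the calculation form in your browser, does pressing the "Calculate" button result in a visible HTTP request? Is it a POST? What is the content of the request?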

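It is very likely that your situation allows the use of simple HTTP communication. I will assume the following situation:

  • A proxy is used, and you know the proxy URL and, if required, the user name and password.
  • All pages of the target web application require either basic or digest authentication.
  • The Calculate button uses a classic HTML form and results in an HTTP POST request with all data carried in form parameters.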

There is some chance that part of the interaction needed to get your result depends on JavaScript code doing something on the page. Often this can be converted into the HTTP scenario by investigating what the final HTTP requests are, but here I will assume this is not feasible or possible, and we will emulate the interaction using a real browser.

For this scenario I will assume:

  • You can perform the task yourself in a web browser and have all the required information at hand:
    • proxy URL
    • proxy user name and password, if required
    • URL for logging in
    • user name and password to fill into the login form to get in
    • knowledge of where to go after login to reach your calculation form

For the simple HTTP scenario, Python provides the excellent requests package, which will serve our needs:

Assuming the proxy is at http://10.10.1.10:3128, with user name user and password pass:

    import requests
    proxies = {
        "http": "http://user:pass@10.10.1.10:3128/",
    }
    #ready for `req = requests.get(url, proxies=proxies)`
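
The proxy mapping above covers only plain `http://` targets and embeds the credentials literally. As a minimal sketch, assuming the same proxy should also serve `https://` targets and that the password may contain characters such as `@` or `:`, a small helper can build the mapping. The name `build_proxies` is made up for this illustration and is not part of requests.

```python
from urllib.parse import quote


def build_proxies(host, port, user=None, password=None):
    """Return a mapping in the form accepted by the `proxies=` argument of requests."""
    credentials = ""
    if user is not None:
        # Percent-encode so characters such as "@" or ":" cannot break the URL.
        credentials = quote(user, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    proxy_url = f"http://{credentials}{host}:{port}/"
    # Assumption: the same proxy handles both schemes. Without an "https" key,
    # requests to https:// URLs would not use the proxy configured here.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("10.10.1.10", 3128, "user", "pass")
print(proxies["http"])  # http://user:pass@10.10.1.10:3128/
```

The result is passed on exactly as before, for example `requests.get(url, proxies=proxies)`.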
    

Basic Authentication

Assuming the web application allows access for user appuser with password apppass:

    url = "http://example.com/form"
    auth=("appuser", "apppass")
    req = requests.get(url, auth=auth)
    

or using HTTPBasicAuth explicitly:

    from requests.auth import HTTPBasicAuth
    url = "http://example.com/path"
    auth = HTTPBasicAuth("appuser", "apppass")
    req = requests.get(url, auth=auth)
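
Since the question also asks for the concepts behind the mechanism, the following sketch shows what Basic authentication amounts to on the wire: an `Authorization` header carrying the Base64-encoded `user:password` pair. The helper `basic_auth_header` is illustrative only, and UTF-8 encoding of the credentials is an assumption made here.

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header value used by HTTP Basic authentication."""
    # Assumption: credentials are encoded as UTF-8 before Base64 encoding.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


print(basic_auth_header("appuser", "apppass"))  # Basic YXBwdXNlcjphcHBwYXNz
```

The value is only encoded, not encrypted, so anyone who can read the request can recover the password. Over plain `http://` that includes the proxy.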
    

Digest authentication differs only in the class name, which is HTTPDigestAuth.

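Other authentication methods are documented on the requests page.

With the parameters a and b as form data, the calculation itself is submitted as an HTTP POST: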

    import requests
    a = 4
    b = 5
    data = {"a": a, "b": b}
    url = "http://example.com/formaction/url"
    req = requests.post(url, data=data)
    

Note that this URL is not the URL of the page showing the form, but the URL of the "action" taken when you press the submit button.
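
One way to find that action URL without clicking through DevTools is to read it from the form's HTML. The sketch below uses only the standard library. It assumes the page contains a classic `<form>` element, and the sample markup and the names `FormInspector` and `inspect_form` are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormInspector(HTMLParser):
    """Collect the action, method and field names of the first <form> on a page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"  # HTML default when the attribute is missing
        self.fields = []
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action", "")
            self.method = (attrs.get("method") or "get").lower()
        elif self._in_form and tag in ("input", "select", "textarea"):
            if attrs.get("name"):
                self.fields.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def inspect_form(page_url, html):
    parser = FormInspector()
    parser.feed(html)
    # A relative action is resolved against the URL of the page holding the form.
    return urljoin(page_url, parser.action or ""), parser.method, parser.fields


sample = (
    '<form action="/formaction/url" method="post">'
    '<input name="a"><input name="b">'
    '<input type="submit" value="Calculate"></form>'
)
print(inspect_form("http://example.com/form", sample))
# ('http://example.com/formaction/url', 'post', ['a', 'b'])
```

The returned field names are the keys to use in the `data` dictionary of the POST request.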

Users often reach the final HTML form in two steps: first they log in, then they navigate to the form.

However, web applications typically allow direct access if you know the form URL. Authentication then happens in the same step, and this is the approach described below.

Note: If this does not work, you will have to use sessions with requests, which is possible, but I will not elaborate on that here.
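
For reference, a minimal sketch of that session-based variant. It assumes the site logs users in through an HTML form and tracks them with cookies. The function `login_and_calculate`, the login URL and the field names `username` and `password` are placeholders, not taken from any real site. The function accepts any object that behaves like `requests.Session`.

```python
def login_and_calculate(session, login_url, credentials, action_url, data):
    """Log in through an HTML form, then submit the calculation form.

    `session` is expected to behave like requests.Session: cookies set by the
    login response are sent automatically with the following request.
    """
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()  # stop early on 4xx/5xx answers

    result = session.post(action_url, data=data)
    result.raise_for_status()
    return result.text


# Intended use (not executed here; URLs and field names are placeholders):
#
#   import requests
#   with requests.Session() as session:
#       session.proxies.update({"http": "http://user:pass@10.10.1.10:3128/"})
#       html = login_and_calculate(
#           session,
#           "http://example.com/login",
#           {"username": "appuser", "password": "apppass"},
#           "http://example.com/formaction/url",
#           {"a": 4, "b": 5},
#       )
```

Many sites answer a failed login with status 200 and an error page, so `raise_for_status()` alone may not detect wrong credentials. Checking the returned HTML for a known marker is a reasonable extra step.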

Putting the proxy, authentication, and form submission together:

    import requests
    from requests.auth import HTTPBasicAuth
    proxies = {
        "http": "http://user:pass@10.10.1.10:3128/",
    }
    auth = HTTPBasicAuth("appuser", "apppass")
    a = 4
    b = 5
    data = {"a": a, "b": b}
    url = "http://example.com/formaction/url"
    req = requests.post(url, data=data, proxies=proxies, auth=auth)
    

By now you should have your result available via req, and you are done.
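
The response body itself is available as `req.text`, after `req.raise_for_status()` has confirmed that the request succeeded. To capture just the calculated value from that HTML, something like the following standard-library sketch may be enough. It assumes the result sits in an element with a known id. The id `result`, the sample markup and the names `ElementText` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementText(HTMLParser):
    """Collect the text inside the first element carrying a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # greater than 0 while inside the wanted element
        self.chunks = []
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text(html, element_id):
    """Return the text of the element with `element_id`, or None if absent."""
    parser = ElementText(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip() if parser.found else None


sample = '<html><body><div id="result">Sum: <b>9</b><br> done</div><p>other</p></body></html>'
print(extract_text(sample, "result"))  # Sum: 9 done
```

With a real response this would be called as `extract_text(req.text, "result")`.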

For the browser emulation scenario, the Selenium documentation on configuring a proxy recommends configuring the proxy in your web browser. The same page explains how to set up the proxy from your script, but here I will assume that you use Firefox and have already (during manual testing) succeeded in configuring the proxy.

The following modified snippet originates from an SO answer by Mimi and uses Basic Authentication:

    from selenium import webdriver
    
    profile = webdriver.FirefoxProfile()
    profile.set_preference('network.http.phishy-userpass-length', 255)
    driver = webdriver.Firefox(firefox_profile=profile)
    driver.get("https://appuser:apppass@somewebsite.com/")
    

Note that Selenium does not seem to provide a complete solution for Basic/Digest authentication. The sample above is likely to work, but if it does not, you may check the Selenium Developer Activity Google Group thread on this topic and see that you are not alone. Some of the solutions there might work for you.

The situation with Digest Authentication seems even worse than with Basic. Some people report success with AutoIT or with blindly sending keys; the discussion referenced above shows some attempts.

If the web site allows logging in by entering credentials into a form, you are in luck, as this is a rather easy task with Selenium. For more, see the following part about filling in forms.

In contrast to authentication, filling data into forms, clicking buttons, and similar activities are where Selenium works very well.

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    a = 4
    b = 5
    url = "http://example.com/form"
    # formactionurl = "http://example.com/formaction/url"  # not relevant in Selenium

    # Start up Firefox
    browser = webdriver.Firefox()

    # Assume you are somehow authenticated at this point.
    # You might succeed with Basic Authentication by using
    # url = "http://appuser:apppass@example.com/form"

    # Navigate to your url
    browser.get(url)

    # Find the element whose id is "param_a" and fill it in.
    # find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium releases removed.
    inputElement = browser.find_element(By.ID, "param_a")
    inputElement.send_keys(str(a))
    # Repeat for "b"
    inputElement = browser.find_element(By.ID, "param_b")
    inputElement.send_keys(str(b))

    # Submit the form (if this causes problems, set inputElement to the Submit button instead)
    inputElement.submit()

    time.sleep(10)  # wait 10 seconds (better methods can be used)

    page_text = browser.page_source
    # now you have what you asked for
    browser.quit()
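
The fixed 10-second sleep can be replaced by polling until the result actually appears. Selenium ships explicit waits for this purpose (`WebDriverWait` in `selenium.webdriver.support.ui`). The idea behind them can be sketched with the standard library alone. The helper `wait_until` below is an illustration, not Selenium's API.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Intended use with the browser object from the snippet above
# (the marker text "Result" is a placeholder for something the result page contains):
#
#   wait_until(lambda: "Result" in browser.page_source, timeout=10)
```

Compared with a fixed sleep, this returns as soon as the page is ready and fails loudly when it never becomes ready.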
    

Conclusions

The information provided in the question describes what is to be done in a rather general manner, but lacks the specific details that would allow a tailored solution. That is why this answer focuses on proposing a general approach.

There are two scenarios: one is HTTP based, the second uses an emulated browser.

The HTTP solution is preferable, despite the fact that it requires a bit more preparation to find out which HTTP requests to use. Its big advantage is that in production it is much faster, requires much less memory, and should be more robust.

In rare cases, when there is some essential JavaScript activity in the browser, we may use the browser emulation solution. However, this is much more complex to set up and has major problems at the authentication step.
