Python urllib2 or requests post method


Problem description


I understand in general how to make a POST request using urllib2 (encoding the data, etc.), but the problem is that all the online tutorials use completely useless made-up example URLs to show how to do it (someserver.com, coolsite.org, etc.), so I can't see the specific HTML that corresponds to the example code they use. Even python.org's own tutorial is totally useless in this regard.

I need to make a POST request to this url:

https://patentscope.wipo.int/search/en/search.jsf

The relevant part of the code is this (I think):

<form id="simpleSearchSearchForm" name="simpleSearchSearchForm" method="post" action="/search/en/search.jsf" enctype="application/x-www-form-urlencoded" style="display:inline">
<input type="hidden" name="simpleSearchSearchForm" value="simpleSearchSearchForm" />
<div class="rf-p " id="simpleSearchSearchForm:sSearchPanel" style="text-align:left;z-index:-1;"><div class="rf-p-hdr " id="simpleSearchSearchForm:sSearchPanel_header">

Or maybe it's this:

<input id="simpleSearchSearchForm:fpSearch" type="text" name="simpleSearchSearchForm:fpSearch" class="formInput" dir="ltr" style="width: 400px; height: 15px; text-align: left; background-image: url(&quot;https://patentscope.wipo.int/search/org.richfaces.resources/javax.faces.resource/org.richfaces.staticResource/4.5.5.Final/PackedCompressed/classic/org.richfaces.images/inputBackgroundImage.png&quot;); background-position: 1px 1px; background-repeat: no-repeat;">

If I want to encode JP2014084003 as the search term, what is the corresponding value in the HTML to use? The input id? The name? The value?

Addendum: this answer does not answer my question, because it just repeats the information I've already looked at in the python docs page.

UPDATE:

I found this, and tried out the code in there, specifically:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'name': 'simpleSearchSearchForm:fpSearch', 'value': '2014084003'}
link    = 'https://patentscope.wipo.int/search/en/search.jsf'
session = requests.Session()
resp    = session.get(link, headers=headers)
cookies = requests.utils.cookiejar_from_dict(requests.utils.dict_from_cookiejar(session.cookies))
resp    = session.post(link, headers=headers, data=payload, cookies=cookies)

r = session.get(link)

with open('htmltext.txt', 'w') as f:
    f.write(r.text)

I get a successful response (200), but the data is once again simply the data from the original page. So I don't know whether I'm posting to the form correctly and something else is needed to get it to return the search results page, or whether I'm still posting the data wrong.

And yes, I realize that this uses requests instead of urllib2, but all I want to be able to do is get the data.

Solution

This is not the most straightforward POST request. If you look in the developer tools or Firebug, you can see the form data that a successful browser POST sends.

All of that is pretty straightforward, bar the fact that you see some `:` embedded in the keys, which may be a bit confusing: simpleSearchSearchForm:commandSimpleFPSearch is the key and Search is the value.
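The colons are just ordinary characters inside the field names (JSF uses them to separate a component's id from its enclosing form's id), so in Python they need no special handling; they are simply part of the dictionary key and get percent-encoded in the request body:

```python
from urllib.parse import urlencode

# The ':' is part of the key string itself, not Python syntax.
payload = {"simpleSearchSearchForm:commandSimpleFPSearch": "Search"}

# In the x-www-form-urlencoded body the ':' becomes %3A.
print(urlencode(payload))
# simpleSearchSearchForm%3AcommandSimpleFPSearch=Search
```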

The only thing that you cannot hard code is javax.faces.ViewState, we need to make a request to the site and then parse that value which we can do with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = "https://patentscope.wipo.int/search/en/search.jsf"

data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
        "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
        "simpleSearchSearchForm:fpSearch": "automata",
        "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
        "simpleSearchSearchForm:j_idt406": "workaround"}
head = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

with requests.Session() as s:
    # Get the cookies and the source to parse the Viewstate token
    init = s.get(url)
    soup = BeautifulSoup(init.text, "lxml")
    val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
    # update post data dict
    data["javax.faces.ViewState"] = val
    r = s.post(url, data=data, headers=head)
    print(r.text)

If we run the code above:

In [13]: import requests

In [14]: from bs4 import BeautifulSoup

In [15]: url = "https://patentscope.wipo.int/search/en/search.jsf"

In [16]: data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
   ....:         "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
   ....:         "simpleSearchSearchForm:fpSearch": "automata",
   ....:         "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
   ....:         "simpleSearchSearchForm:j_idt406": "workaround"}

In [17]: head = {
   ....:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

In [18]: with requests.Session() as s:
   ....:         init = s.get(url)
   ....:         soup = BeautifulSoup(init.text, "lxml")
   ....:         val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
   ....:         data["javax.faces.ViewState"] = val
   ....:         r = s.post(url, data=data, headers=head)
   ....:         print("\n".join([s.text.strip() for s in BeautifulSoup(r.text,"lxml").select("span.trans-section")]))
   ....:     

Fuzzy genetic learning automata classifier
Fuzzy genetic learning automata classifier
FINITE AUTOMATA MANAGER
CELLULAR AUTOMATA MUSIC GENERATOR
CELLULAR AUTOMATA MUSIC GENERATOR
ANALOG LOGIC AUTOMATA
Incremental automata verification
Cellular automata music generator
Analog logic automata
Symbolic finite automata

You will see it matches the webpage. If you want to scrape sites, you need to get familiar with the developer tools, Firebug, etc. to watch how the requests are made, and then try to mimic them. To open Firebug, right click on the page, select inspect element, click the network tab and submit your request. You then just have to select the request from the list and pick whichever tab you want info on, i.e. the params for our POST request.

You may also find this answer useful on how to approach posting to a site.
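Since the question title also mentions urllib2, here is a rough standard-library sketch of the same request construction (shown with Python 3's urllib.request, the successor to urllib2). The field names are taken from the answer above, and javax.faces.ViewState would still have to be parsed from a prior GET exactly as in the requests version:

```python
from urllib.parse import urlencode
from urllib.request import Request, build_opener, HTTPCookieProcessor
from http.cookiejar import CookieJar

url = "https://patentscope.wipo.int/search/en/search.jsf"
data = {
    "simpleSearchSearchForm": "simpleSearchSearchForm",
    "simpleSearchSearchForm:fpSearch": "automata",
    "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
    # "javax.faces.ViewState" must be added here after parsing it
    # from an initial GET of the page, as in the requests example.
}

# POST bodies must be bytes in application/x-www-form-urlencoded form.
body = urlencode(data).encode("utf-8")
req = Request(url, data=body, headers={"User-Agent": "Mozilla/5.0"})

# A cookie-aware opener plays the role of requests.Session():
opener = build_opener(HTTPCookieProcessor(CookieJar()))
# Sending it would be: resp = opener.open(req); html = resp.read().decode()

print(req.get_method())  # a Request with a data payload defaults to POST
```

This is only the request-building half; requests.Session handles the cookie jar and encoding for you, which is why the answer above is shorter.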
