Python Scraping Web with Session Cookie
Question
Hi, I am trying to scrape some data from this URL:
http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1
As you may have noticed, if the cookie and session data are not yet set, you will be redirected to the site's base URL (http://www.21cineplex.com/).
I tried this:
import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        # Build an opener that stores cookies, then visit the base URL first
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)
        print splitSource
    except Exception, e:
        # Report the actual error instead of discarding it
        print "Error occurred in main block:", str(e)
However, I could not scrape that particular URL.
A quick inspection reveals that the website sets a session ID (PHPSESSID) and copies it to the client's cookies.
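As a rough illustration of that mechanism, the sketch below parses a Set-Cookie header with Python 3's standard `http.cookies` module and builds the Cookie header a client would send back on later requests. The header value `PHPSESSID=abc123; path=/` is a made-up example; the real site's session ID and cookie attributes will differ.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value; not captured from the real site.
set_cookie_header = "PHPSESSID=abc123; path=/"

# Parse the header the way a client would when it receives the response.
cookie = SimpleCookie()
cookie.load(set_cookie_header)

session_id = cookie["PHPSESSID"].value

# On later requests the client sends the session ID back in a Cookie header.
cookie_header = "PHPSESSID=" + session_id
print(cookie_header)
```

A cookie-aware client (such as an opener built with a cookie processor) does this bookkeeping automatically; the sketch only shows what is being stored and replayed.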
The question is: how do I work around this?
P.S. I tried to install request (via pip), but it gives me a 404:
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Getting page https://pypi.python.org/simple/
URLs to search for versions for request:
* https://pypi.python.org/simple/request/
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Could not find any downloads that satisfy the requirement request
Cleaning up...
Recommended answer
Thanks to @Chainik, I got it to work. I ended up modifying my code like this:
import urllib2
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'
# Visit the base URL first so the session cookie is stored in the jar
opener.open(baseurl)
urllib2.install_opener(opener)
# Request the listing page with the base URL as the Referer
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)
requestData = urllib2.urlopen(request)
htmlText = requestData.read()
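The code above is Python 2 (`urllib2`, `cookielib`). For readers on Python 3, the same idea can be sketched with `urllib.request` and `http.cookiejar`. This is a minimal sketch, not a tested scraper for this site: the helper names (`make_opener`, `make_listing_request`, `fetch_listing`), the 30-second timeout, and the UTF-8 decoding are assumptions introduced here for illustration.

```python
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://www.21cineplex.com/"
LISTING_URL = "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"


def make_opener():
    """Return an opener that stores and resends cookies, plus its jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_listing_request(url=LISTING_URL, referer=BASE_URL):
    """Build the listing request with the base URL as its Referer."""
    request = urllib.request.Request(url)
    request.add_header("Referer", referer)
    return request


def fetch_listing():
    """Visit the base URL to pick up the session cookie, then fetch the listing."""
    opener, _jar = make_opener()
    opener.open(BASE_URL, timeout=30).close()  # first visit fills the cookie jar
    with opener.open(make_listing_request(), timeout=30) as response:
        # Assumption: the page decodes as UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_listing()` performs the two requests in order and returns the page text. Using `opener.open` directly, rather than installing the opener globally, keeps the cookie handling local to this code.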
Once the HTML text is retrieved, all that remains is parsing its content.
Cheers
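As one possible way to do that parsing step, the sketch below applies the `<ul class="w462">` pattern from the code above to a small sample string (Python 3). The sample HTML and the `extract_lists` helper are made up for illustration, and the `re.DOTALL` flag is an addition here so that a list spanning several lines still matches; the real page's markup may differ.

```python
import re

# Same pattern as in the answer; DOTALL lets "." match newlines as well.
LIST_PATTERN = re.compile(r'<ul class="w462">(.*?)</ul>', re.DOTALL)


def extract_lists(html_text):
    """Return the inner HTML of every <ul class="w462"> block, trimmed."""
    return [block.strip() for block in LIST_PATTERN.findall(html_text)]


# Hypothetical sample standing in for the retrieved page.
sample_html = '''
<ul class="w462">
  <li>Movie A</li>
</ul>
<ul class="other"><li>ignored</li></ul>
<ul class="w462"><li>Movie B</li></ul>
'''

print(extract_lists(sample_html))
```

A regular expression is enough for a fixed, simple structure like this one; for pages with nested lists or irregular markup, an HTML parser is usually the more robust choice.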