Python Scraping Web with Session Cookie

Problem description

Hi, I am trying to scrape some data from this URL:

http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1

As you may have noticed, if the cookies and session data are not yet set, you are redirected to the base URL (http://www.21cineplex.com/).

I tried doing this:

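import re
import urllib2
from cookielib import CookieJar


def main():
    try:
        # Visit the base URL first so the server can set its session cookie.
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)

        # Install the cookie-aware opener so urllib2.urlopen() reuses the cookies.
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()

        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)

        print splitSource

    except Exception as e:
        # Report the actual error instead of discarding it.
        print "Error occurred in main block: %s" % e


if __name__ == "__main__":
    main()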

However, I still failed to scrape that particular URL.

A quick inspection reveals that the website sets a session ID (PHPSESSID) and stores a copy of it in a client-side cookie.

The question is: how do I work around this?

P.S. I tried to install request (via pip), but it fails with a 404:

  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Getting page https://pypi.python.org/simple/
  URLs to search for versions for request:
  * https://pypi.python.org/simple/request/
  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Could not find any downloads that satisfy the requirement request

Cleaning up...

Recommended answer

Thanks to @Chainik, I got it to work. I ended up modifying my code like this:

import urllib2
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'

# Hit the base URL first so the session cookie (PHPSESSID) lands in the jar.
opener.open(baseurl)
urllib2.install_opener(opener)

# Send a Referer header pointing at the base URL with the real request.
request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)

requestData = urllib2.urlopen(request)
htmlText = requestData.read()
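For readers on Python 3, where urllib2 and cookielib no longer exist, the same two-step approach (visit the base URL to collect the session cookie, then request the listing page with a Referer header) might look roughly like the sketch below. The helper names build_session_opener, build_listing_request and fetch_listing_html are hypothetical, the UTF-8 decoding is an assumption about the page encoding, and the code has not been checked against the live site.

```python
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://www.21cineplex.com/"
LISTING_URL = "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"


def build_session_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_listing_request(url=LISTING_URL, referer=BASE_URL):
    """Build the listing-page request with a Referer header, as in the answer."""
    request = urllib.request.Request(url)
    request.add_header("Referer", referer)
    return request


def fetch_listing_html(timeout=10):
    """Visit the base URL to obtain the session cookie, then fetch the listing."""
    opener, _jar = build_session_opener()
    # First request: lets the server set PHPSESSID in the cookie jar.
    opener.open(BASE_URL, timeout=timeout).close()
    # Second request: the opener resends the stored cookie automatically.
    with opener.open(build_listing_request(), timeout=timeout) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_listing_html() performs both network requests. As a side note on the failed pip install in the question, the widely used third-party HTTP library is published on PyPI as requests (plural), and its Session object keeps cookies between requests in a similar way.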

Once the HTML text is retrieved, it's all about parsing its content.
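As a rough illustration of that parsing step, the sketch below applies the same pattern used in the code above to a small, made-up HTML fragment. SAMPLE_HTML, extract_lists and extract_items are hypothetical names, and the real page markup may differ. One assumption worth flagging: if the list spans several lines, the pattern likely needs re.DOTALL, because "." does not match newlines by default.

```python
import re

# Same pattern as in the answer, compiled with DOTALL so "." also matches newlines.
LIST_PATTERN = re.compile(r'<ul class="w462">(.*?)</ul>', re.DOTALL)
ITEM_PATTERN = re.compile(r'<li>(.*?)</li>', re.DOTALL)

# Hypothetical fragment standing in for the retrieved page; not the real markup.
SAMPLE_HTML = (
    '<div>\n'
    '<ul class="w462">\n'
    '<li>Movie A</li>\n'
    '<li>Movie B</li>\n'
    '</ul>\n'
    '<ul class="other"><li>ignored</li></ul>\n'
    '</div>'
)


def extract_lists(html_text):
    """Return the inner HTML of every <ul class="w462"> block."""
    return [block.strip() for block in LIST_PATTERN.findall(html_text)]


def extract_items(html_text):
    """Return the text of each <li> found inside the matching lists."""
    items = []
    for block in extract_lists(html_text):
        items.extend(item.strip() for item in ITEM_PATTERN.findall(block))
    return items


if __name__ == "__main__":
    print(extract_items(SAMPLE_HTML))
```

Regular expressions are kept here only because the original code uses them. For markup that nests or varies, an HTML parser such as the standard library's html.parser is usually the more robust choice.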

Cheers
