Python Scraper - Socket Error breaks script if target is 404'd

Problem Description

I encountered an error while building a web scraper to compile data and output it into XLS format; when testing against the list of domains I want to scrape, the program falters when it receives a socket error. I'm hoping for an 'if' statement that would skip parsing a broken website and continue through my while loop. Any ideas?

workingList = xlrd.open_workbook(listSelection)
workingSheet = workingList.sheet_by_index(0)
destinationList = xlwt.Workbook()
destinationSheet = destinationList.add_sheet('Gathered')
startX = 1
startY = 0
while startX != 21:
    workingCell = workingSheet.cell(startX,startY).value
    print ''
    print ''
    print ''
    print workingCell
    #Setup
    preSite = 'http://www.'+workingCell
    theSite = urlopen(preSite).read()
    currentSite = BeautifulSoup(theSite)
    destinationSheet.write(startX,0,workingCell)

And here's the error:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    homeMenu()
  File "C:\Python27\farming.py", line 31, in homeMenu
    openList()
  File "C:\Python27\farming.py", line 79, in openList
    openList()
  File "C:\Python27\farming.py", line 83, in openList
    openList()
  File "C:\Python27\farming.py", line 86, in openList
    homeMenu()
  File "C:\Python27\farming.py", line 34, in homeMenu
    startScrape()
  File "C:\Python27\farming.py", line 112, in startScrape
    theSite = urlopen(preSite).read()
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 342, in open_http
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 811, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 773, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 754, in connect
    self.timeout, self.source_address)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

Solution

Hmm, that looks like the error I get when my internet connection is down: [Errno 11004] getaddrinfo failed means the host name couldn't even be resolved. HTTP 404 errors are what you get when you do have a connection but the URL that you specify can't be found.

There's no if statement to handle exceptions; you need to "catch" them using the try/except construct.

Update: Here's a demonstration:

import urllib

def getconn(url):
    # urlopen raises IOError for DNS/socket-level failures, but for an
    # HTTP error such as 404 it still returns a connection object.
    try:
        conn = urllib.urlopen(url)
        return conn, None
    except IOError as e:
        return None, e

urls = """
    qwerty
    http://www.foo.bar.net
    http://www.google.com
    http://www.google.com/nonesuch
    """
for url in urls.split():
    print
    print url
    conn, exc = getconn(url)
    if conn:
        print "connected; HTTP response is", conn.getcode()
    else:
        print "failed"
        print exc.__class__.__name__
        print str(exc)
        print exc.args

Output:

qwerty
failed
IOError
[Errno 2] The system cannot find the file specified: 'qwerty'
(2, 'The system cannot find the file specified')

http://www.foo.bar.net
failed
IOError
[Errno socket error] [Errno 11004] getaddrinfo failed
('socket error', gaierror(11004, 'getaddrinfo failed'))

http://www.google.com
connected; HTTP response is 200

http://www.google.com/nonesuch
connected; HTTP response is 404

Note that so far we have just opened the connection. Now what you need to do is check the HTTP response code and decide whether there is anything worth retrieving using conn.read().
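
For example, one way to fold that check back into the question's while loop might look like the sketch below (Python 2, using the same urllib.urlopen call as the demonstration). The fetch_site() helper, its printed messages, and the "only accept HTTP 200" rule are my own illustration of the idea, not part of the original script.

import urllib

def fetch_site(domain):
    # Return the HTML of http://www.<domain>, or None if it can't be fetched.
    url = 'http://www.' + domain
    try:
        conn = urllib.urlopen(url)      # raises IOError on DNS/socket failures
    except IOError as e:
        print 'could not connect to', url, '-', e
        return None
    if conn.getcode() != 200:           # connected, but e.g. 404: nothing to scrape
        print url, 'answered with HTTP', conn.getcode(), '- skipping'
        return None
    return conn.read()                  # response looks usable, so read the body

In the question's loop you would then call theSite = fetch_site(workingCell); if it comes back as None, increment startX and continue, otherwise hand it to BeautifulSoup and write the row as before.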
