Python Scraper - Socket Error breaks script if target is 404'd
Encountered an error while building a web scraper to compile data and output it in XLS format; when testing against a list of domains I wish to scrape, the program falters when it receives a socket error. Hoping to find an 'if' statement that would skip parsing a broken website and continue through my while-loop. Any ideas?
import xlrd
import xlwt
from urllib import urlopen
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

workingList = xlrd.open_workbook(listSelection)
workingSheet = workingList.sheet_by_index(0)
destinationList = xlwt.Workbook()
destinationSheet = destinationList.add_sheet('Gathered')
startX = 1
startY = 0
while startX != 21:
    workingCell = workingSheet.cell(startX, startY).value
    print ''
    print ''
    print ''
    print workingCell
    # Setup
    preSite = 'http://www.' + workingCell
    theSite = urlopen(preSite).read()
    currentSite = BeautifulSoup(theSite)
    destinationSheet.write(startX, 0, workingCell)
    startX += 1
And here's the error:
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    homeMenu()
  File "C:\Python27\farming.py", line 31, in homeMenu
    openList()
  File "C:\Python27\farming.py", line 79, in openList
    openList()
  File "C:\Python27\farming.py", line 83, in openList
    openList()
  File "C:\Python27\farming.py", line 86, in openList
    homeMenu()
  File "C:\Python27\farming.py", line 34, in homeMenu
    startScrape()
  File "C:\Python27\farming.py", line 112, in startScrape
    theSite = urlopen(preSite).read()
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 342, in open_http
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 811, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 773, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 754, in connect
    self.timeout, self.source_address)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
Ummm that looks like the error I get when my internet connection is down. HTTP 404 errors are what you get when you do have a connection but the URL that you specify can't be found.
There's no if statement to handle exceptions; you need to "catch" them using the try/except construct.
Update: Here's a demonstration:
import urllib

def getconn(url):
    try:
        conn = urllib.urlopen(url)
        return conn, None
    except IOError as e:
        return None, e

urls = """
    qwerty
    http://www.foo.bar.net
    http://www.google.com
    http://www.google.com/nonesuch
    """

for url in urls.split():
    print
    print url
    conn, exc = getconn(url)
    if conn:
        print "connected; HTTP response is", conn.getcode()
    else:
        print "failed"
        print exc.__class__.__name__
        print str(exc)
        print exc.args
Output:

qwerty
failed
IOError
[Errno 2] The system cannot find the file specified: 'qwerty'
(2, 'The system cannot find the file specified')

http://www.foo.bar.net
failed
IOError
[Errno socket error] [Errno 11004] getaddrinfo failed
('socket error', gaierror(11004, 'getaddrinfo failed'))

http://www.google.com
connected; HTTP response is 200

http://www.google.com/nonesuch
connected; HTTP response is 404
Note that so far we have just opened the connection. Now what you need to do is check the HTTP response code and decide whether there is anything worth retrieving using conn.read().
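Putting that together with the original while-loop idea, here is a minimal sketch of the skip-and-continue pattern. Note this is an illustration, not the asker's actual code: `scrape_domains` and its `open_url` parameter are hypothetical names, and `open_url` stands in for `urllib.urlopen` so the error-handling logic is visible on its own.

```python
def scrape_domains(open_url, domains):
    """Visit each domain, skipping any that raise IOError or that
    do not answer with HTTP 200; return (domain, body) pairs."""
    results = []
    for domain in domains:
        try:
            conn = open_url('http://www.' + domain)
        except IOError:
            continue  # DNS failure / dead socket: skip, keep looping
        if conn.getcode() != 200:
            continue  # reachable but e.g. 404: nothing worth reading
        results.append((domain, conn.read()))
    return results
```

The key design point is that the try/except wraps only the call that can raise, and `continue` moves the loop on to the next domain instead of letting one broken site kill the whole run.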