HTML link parsing using BeautifulSoup
Question
Here is the Python code I'm using to extract specific HTML from the page links I pass as a parameter. I'm using BeautifulSoup. The code works fine sometimes, and sometimes it gets stuck!
import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):
    # iterate over the paginated URLs and accumulate the content
    sock = urllib.urlopen(url + str(i))
    html = sock.read()
    sock.close()
    rawHtml += html
    print i
Here I'm printing the loop variable to find out where it gets stuck. It shows that it gets stuck randomly at any point in the loop.
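One pragmatic mitigation for intermittent socket errors like this is to retry the failing fetch a few times with a short pause before giving up. A minimal sketch of such a helper (the name, attempt count, and delay are illustrative; pass in whatever fetch callable your urllib version provides):

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on IOError/OSError (which covers the
    'getaddrinfo failed' socket error) with a growing pause between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (IOError, OSError) as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # linear backoff
    raise last_error
```

In the loop above you would then call `fetch_with_retry(lambda u: urllib.urlopen(u).read(), url + str(i))` instead of calling `urllib.urlopen` directly, so a single transient DNS failure no longer kills the whole run.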
soup = BeautifulSoup(rawHtml, 'html.parser')
t = ''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    # t += str(link) + "</br>"

f = open("Link.txt", 'w+')
f.write(t)
f.close()
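As an aside, when only the href attributes are needed, the same extraction can be sketched with the standard library's HTML parser, with no BeautifulSoup dependency (Python 3 module path shown; in Python 2 the module is spelled HTMLParser):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="/page1">one</a> <a href="/page2">two</a>')
print(collector.links)  # ['/page1', '/page2']
```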
What could the issue be? Is it a problem with the socket configuration, or something else?
This is the error I got. I checked these links for a solution: python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed, but I didn't find them very helpful.
d:\python>python ext.py
Traceback (most recent call last):
File "ext.py", line 8, in <module>
sock = urllib.urlopen(url+ str(i))
File "d:\python\lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "d:\python\lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "d:\python\lib\urllib.py", line 350, in open_http
h.endheaders(data)
File "d:\python\lib\httplib.py", line 1049, in endheaders
self._send_output(message_body)
File "d:\python\lib\httplib.py", line 893, in _send_output
self.send(msg)
File "d:\python\lib\httplib.py", line 855, in send
self.connect()
File "d:\python\lib\httplib.py", line 832, in connect
self.timeout, self.source_address)
File "d:\python\lib\socket.py", line 557, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed
It runs perfectly fine on my personal laptop, but gives this error on my office desktop. Also, my Python version is 2.7. Hope this information helps.
Answer
Finally, guys... it worked! The same script worked when I checked it on other PCs too. So the problem was probably the firewall settings or proxy settings of my office desktop, which were blocking this website.
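If a proxy does turn out to be the cause on a locked-down machine, urllib can be pointed at it explicitly. A minimal sketch, assuming a hypothetical proxy address (urllib in both Python 2 and 3 honors these environment variables when opening a connection):

```python
import os

# Hypothetical proxy address for illustration; substitute your office proxy.
PROXY = 'http://proxy.example.com:8080'

# urllib reads these variables when building a connection, so setting them
# before the first urlopen routes traffic through the proxy instead of
# attempting a direct (possibly firewalled) connection.
os.environ['http_proxy'] = PROXY
os.environ['https_proxy'] = PROXY
```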