HTML Link parsing using BeautifulSoup


Question

Here is the Python code I'm using to extract specific HTML from the page links I pass in as a parameter. I'm using BeautifulSoup. The code works fine sometimes, and sometimes it gets stuck!

import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):
    # iterate over the paginated URLs and accumulate each page's HTML
    sock = urllib.urlopen(url + str(i))
    html = sock.read()
    sock.close()
    rawHtml += html
    print i
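Since the loop stalls at random iterations, one common way to keep it from hanging forever is to put a timeout on the socket and retry failed requests. Below is a minimal sketch of the retry idea (`fetch_with_retry` is a hypothetical helper, not part of the original script, and it is written to be version-agnostic rather than being the asker's exact code):

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    # Call fetch(url) up to `retries` times, pausing `delay` seconds
    # between attempts; re-raise the last IOError if all attempts fail.
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

In the real script you would pass a function that calls `urllib.urlopen`, after setting `socket.setdefaulttimeout(10)` (or similar) so a stalled DNS lookup or connection raises an error instead of blocking indefinitely.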

Here I'm printing the loop variable to find out where it gets stuck. It shows that it gets stuck randomly, at any point in the loop.

soup = BeautifulSoup(rawHtml, 'html.parser')
t=''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)
f.close()
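As a side note, `link.get('href')` returns `None` for anchors without an `href` attribute, so the output file can end up with literal `None` entries unless those are filtered out. If BeautifulSoup were unavailable, the standard library's `html.parser` could collect hrefs the same way - a minimal sketch (`LinkCollector` is a made-up name, and this is a stdlib alternative rather than the asker's approach):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Gather the href of every <a> tag fed to the parser,
    # skipping anchors that have no href attribute.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<a href="/x">x</a><a>no href</a><a href="/y">y</a>')
# collector.links now holds ['/x', '/y']
```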

What could be the possible issue? Is it a problem with the socket configuration, or something else?

This is the error I got. I checked these links - python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed - for a solution, but I didn't find them very helpful.

 d:\python>python ext.py
Traceback (most recent call last):
  File "ext.py", line 8, in <module>
    sock = urllib.urlopen(url+ str(i))
  File "d:\python\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "d:\python\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "d:\python\lib\urllib.py", line 350, in open_http
    h.endheaders(data)
  File "d:\python\lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "d:\python\lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "d:\python\lib\httplib.py", line 855, in send
    self.connect()
  File "d:\python\lib\httplib.py", line 832, in connect
    self.timeout, self.source_address)
  File "d:\python\lib\socket.py", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

It runs perfectly fine on my personal laptop, but it gives this error on my office desktop. Also, my Python version is 2.7. I hope this information helps.

Answer

Finally, guys... it worked! The same script worked when I checked on other PCs too. So the problem was probably the firewall settings or proxy settings of my office desktop, which were blocking this website.
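If a corporate proxy is indeed the culprit, urllib can also be pointed at it explicitly instead of relying on system settings. A minimal sketch using Python 3's `urllib.request` (the proxy address is a placeholder, not a real server):

```python
import urllib.request

# Placeholder address - substitute the office proxy's actual host and port.
proxy = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib.request.build_opener(proxy)
# opener.open(url) would now route HTTP requests through the proxy.
```

In the asker's Python 2, `urllib.urlopen` accepts a `proxies` keyword argument (e.g. `urllib.urlopen(url, proxies={'http': 'http://proxy.example.com:8080'})`) that serves the same purpose.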
