HTML Link parsing using BeautifulSoup


Question

Here is the Python code I'm using to extract specific HTML from the page links I pass in as a parameter. I'm using BeautifulSoup. The code works fine sometimes, and sometimes it gets stuck!

import urllib
from bs4 import BeautifulSoup

rawHtml = ''
url = r'http://iasexamportal.com/civilservices/tag/voice-notes?page='
for i in range(1, 49):
    # iterate over the paginated URLs and accumulate each page's HTML
    sock = urllib.urlopen(url + str(i))
    html = sock.read()
    sock.close()
    rawHtml += html
    print i
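Since the loop stalls at random iterations, one common way to keep it from hanging forever is to put a timeout on the socket and retry failed requests. Below is a minimal sketch of the retry idea (`fetch_with_retry` is a hypothetical helper, not part of the original script, and it is written to be version-agnostic rather than being the asker's exact code):

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    # Call fetch(url) up to `retries` times, pausing `delay` seconds
    # between attempts; re-raise the last IOError if all attempts fail.
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

In the real script you would pass a function that calls `urllib.urlopen`, after setting `socket.setdefaulttimeout(10)` (or similar) so a stalled DNS lookup or connection raises an error instead of blocking indefinitely.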

Here I'm printing the loop variable to find out where it gets stuck. It shows that it gets stuck randomly, at any point in the loop.

soup = BeautifulSoup(rawHtml, 'html.parser')
t=''
for link in soup.find_all('a'):
    t += str(link.get('href')) + "</br>"
    #t += str(link) + "</br>"
f = open("Link.txt", 'w+')
f.write(t)
f.close()
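As a side note, `link.get('href')` returns `None` for anchors without an `href` attribute, so the output file can end up with literal `None` entries unless those are filtered out. If BeautifulSoup were unavailable, the standard library's `html.parser` could collect hrefs the same way - a minimal sketch (`LinkCollector` is a made-up name, and this is a stdlib alternative rather than the asker's approach):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Gather the href of every <a> tag fed to the parser,
    # skipping anchors that have no href attribute.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<a href="/x">x</a><a>no href</a><a href="/y">y</a>')
# collector.links now holds ['/x', '/y']
```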

What could be the possible issue? Is it a problem with the socket configuration, or something else?

This is the error I got. I checked these links - python-gaierror-errno-11004 and ioerror-errno-socket-error-errno-11004-getaddrinfo-failed - for a solution, but I didn't find them very helpful.

 d:\python>python ext.py
Traceback (most recent call last):
  File "ext.py", line 8, in <module>
    sock = urllib.urlopen(url+ str(i))
  File "d:\python\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "d:\python\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "d:\python\lib\urllib.py", line 350, in open_http
    h.endheaders(data)
  File "d:\python\lib\httplib.py", line 1049, in endheaders
    self._send_output(message_body)
  File "d:\python\lib\httplib.py", line 893, in _send_output
    self.send(msg)
  File "d:\python\lib\httplib.py", line 855, in send
    self.connect()
  File "d:\python\lib\httplib.py", line 832, in connect
    self.timeout, self.source_address)
  File "d:\python\lib\socket.py", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

It runs perfectly fine on my personal laptop, but it gives this error on my office desktop. Also, my Python version is 2.7. I hope this information helps.

Answer

Finally, guys... it worked! The same script worked when I checked on other PCs too. So the problem was probably the firewall settings or proxy settings of my office desktop, which were blocking this website.
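If a corporate proxy is indeed the culprit, urllib can also be pointed at it explicitly instead of relying on system settings. A minimal sketch using Python 3's `urllib.request` (the proxy address is a placeholder, not a real server):

```python
import urllib.request

# Placeholder address - substitute the office proxy's actual host and port.
proxy = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:8080'})
opener = urllib.request.build_opener(proxy)
# opener.open(url) would now route HTTP requests through the proxy.
```

In the asker's Python 2, `urllib.urlopen` accepts a `proxies` keyword argument (e.g. `urllib.urlopen(url, proxies={'http': 'http://proxy.example.com:8080'})`) that serves the same purpose.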
