Python: urllib.error.HTTPError: HTTP Error 404: Not Found


I wrote a script to find spelling mistakes in SO questions' titles. I used it for about a month, and it was working fine.

But now, when I try to run it, I am getting this.

Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

This is my code:

import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker

my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []


def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})


print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']

UPDATE

This is the URL causing the problem, even though the URL itself exists:

https://stackoverflow.com/questions?page=298314&sort=active

I tried changing the range to some lower values. It works fine now.

Why does this happen with the above URL?

Solution

Apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes beyond the number of available pages at 50 questions per page. The range should be kept within the total number of pages, with 50 questions each.
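As a rough sanity check before looping (assuming 50 questions per listing page, which may not match the site's current defaults), the highest valid page number can be estimated from the total question count. `last_valid_page` here is a hypothetical helper, not part of the original script:

```python
import math

def last_valid_page(total_questions, per_page=50):
    # Pages are 1-indexed; ceil accounts for a partially filled last page.
    return math.ceil(total_questions / per_page)

# With roughly 12 million questions at 50 per page, page numbers
# above last_valid_page(...) would return 404.
print(last_valid_page(12_000_000))
```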

This code will catch the 404 error (which is why you got an exception) and ignore it, in case the loop goes past the last available page.

from urllib.error import HTTPError
from urllib.request import urlopen

def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError:
        # Page number is past the last available listing page; skip it.
        pass

print("Please Wait.. it will take some time")
for i in range(298314, 298346):
    find_bad_qn(i)
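To keep the original scraping logic rather than discard it, the same guard can wrap just the `urlopen` call so that out-of-range pages are skipped while other HTTP errors still surface. This is a sketch; `fetch_page` is a hypothetical helper name:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch_page(url):
    """Return the open response for url, or None when the server answers 404."""
    try:
        return urlopen(url)
    except HTTPError as e:
        if e.code == 404:
            return None  # page past the last listing page: skip it
        raise  # any other HTTP error is a real problem

# In find_bad_qn, parsing would then run only when a page was fetched:
# html = fetch_page(url)
# if html is None:
#     return
# bsObj = BeautifulSoup(html, "html5lib")
```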
