Why is the slash at the end of lxml.html.parse() important?


Problem description

I am using lxml to scrape HTML. This code works:

lxml.html.parse( "http://google.com/" )

This code does not:

lxml.html.parse( "http://google.com" )

Why does the slash at the end of the URL matter? Thank you.

To be clear, here is the error log that Python gives me for the latter code:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/davidfaux/epd-7.2-2-rh5-x86/lib/python2.7/site-packages/lxml/html/__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2953, in lxml.etree.parse (src/lxml/lxml.etree.c:56204)
  File "parser.pxi", line 1533, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:82287)
  File "parser.pxi", line 1562, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:82580)
  File "parser.pxi", line 1462, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:81619)
  File "parser.pxi", line 1002, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:78528)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 588, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74665)
IOError: Error reading file 'http://google.com': failed to load HTTP resource

Recommended answer

Because without the slash, Google isn't sending you a page; it's sending you a redirect. In fact, it's redirecting you to the URL with the slash! The body of the redirect is probably empty, which would leave lxml with nothing to parse.
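To make the redirect explanation concrete, here is a minimal Python 3 sketch using only the standard library. It does not contact Google. Instead it starts a hypothetical local server (`RedirectingHandler`, with made-up paths `/page` and `/page/`) that answers the slash-less URL with an empty-bodied 301, which is the behaviour the answer suspects. The helper names `raw_get` and `fetch_following_redirects` are illustrative, not part of lxml or of the original post.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that redirects "/page" to "/page/"."""

    def do_GET(self):
        if self.path == "/page":
            # Assumption: the redirect carries no body, as the answer guesses.
            self.send_response(301)
            self.send_header("Location", "/page/")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"<html><body><p>hello</p></body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def raw_get(host, port, path):
    """Issue one GET without following redirects; return (status, body)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def fetch_following_redirects(url):
    """urllib follows 301/302 by default; return (final_url, body)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.geturl(), resp.read()


# Start the local server on a free port in a background thread.
server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# A client that does not follow redirects sees only the empty 301.
status, body = raw_get("127.0.0.1", port, "/page")

# A client that follows redirects ends up at the URL with the slash.
final_url, html = fetch_following_redirects("http://127.0.0.1:%d/page" % port)

server.shutdown()
server.server_close()
```

If the redirect really is the cause, one possible workaround is to fetch the page with a client that follows redirects and then hand the result to lxml, for example by passing the downloaded bytes to `lxml.html.fromstring()`, or by passing the open response object to `lxml.html.parse()`, which also accepts file-like objects.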
