Python RE does not return anything after /ref=

This article covers how to deal with a Python RE that does not return anything after /ref=; the question and answer below should be a useful reference for anyone hitting the same problem.

Problem description


    I am trying to retrieve the URL and category name from Amazon's best sellers list. For some reason the RE I'm using stops when it encounters /ref=, and I truly don't see why. I'm using Python 2.7 on a Windows 7 box.

    A typical record is

    <li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>
    

    and my RE is

    Regex = "<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>"
    Category = re.compile(Regex)
    

    which returns a tuple

    [][0] http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps
    [][1] Appstore for Android
    

    I do get all the right records but as you can see, the URL is missing /ref=zg_bs_nav_0.

    Other levels in the category hierarchy exhibit the same issue; everything in the URL, starting with and including /ref= is missing.
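
    One quick way to narrow this down is to check whether /ref= appears in the raw page source at all, i.e. whether the regex is dropping it or the server simply never sent it. A minimal sketch of that check, reusing the urllib fetch from the fragment below (the pattern and the Hrefs name are only for illustration; the pattern accepts either quote style around the attribute value):

    import re
    import urllib

    # Fetch the page the same way the fragment below does and inspect the raw source
    HTMLText = urllib.urlopen("http://www.amazon.ca/gp/bestsellers").read()

    # If this prints 0, the server never sent any /ref= suffix to begin with
    print HTMLText.count('/ref=')

    # List a few raw Best-Sellers hrefs without relying on the capturing groups
    Hrefs = re.findall(r'''href=["'](http://www\.amazon\.ca/Best-Sellers[^"']*)''', HTMLText)
    print Hrefs[:3]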

    Here is my code fragment after I took Martijn's suggestion

    import urllib
    from bs4 import BeautifulSoup

    # First page of the list of Best Sellers categories
    URL = "http://www.amazon.ca/gp/bestsellers"
    
    # Retrieve the page source
    HTMLFile = urllib.urlopen(URL)
    HTMLText = HTMLFile.read()
    
    soup = BeautifulSoup(HTMLText)
    for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
        print link['href']
        print link.get_text()
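
    If the end goal is the same (URL, category name) pairs the regex produced, the same loop can collect them instead of printing; a minimal sketch building on the snippet above (the Categories name is only illustrative):

    # Collect (url, category name) pairs instead of printing them one by one
    Categories = []
    for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
        Categories.append((link['href'], link.get_text()))
    print Categories[:3]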
    

    Solution

    You are using a regular expression, but matching HTML with such expressions gets too complicated too fast. Don't do that.

    Use an HTML parser instead; Python has several to choose from. BeautifulSoup and lxml in particular handle malformed HTML quite gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml under the hood as the parser of choice if it is installed.

    BeautifulSoup example:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(htmlsource)
    for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
        print link['href'], link.get_text()
    

    This uses a CSS selector to find all <a> elements contained directly in a <li> element where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.

    Demo:

    >>> from bs4 import BeautifulSoup
    >>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
    >>> soup = BeautifulSoup(htmlsource)
    >>> for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    ...     print link['href'], link.get_text()
    ... 
    http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android
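
    Since lxml is mentioned above as an alternative, here is a minimal sketch of the same selection done with lxml.html instead (assumptions: the cssselect package is installed, and htmlsource is the same placeholder page source used in the demo; note that cssselect wants the attribute value quoted):

    import lxml.html

    # Parse the source and apply the equivalent CSS selector via cssselect
    tree = lxml.html.fromstring(htmlsource)
    for link in tree.cssselect('li > a[href^="http://www.amazon.ca/Best-Sellers"]'):
        print link.get('href'), link.text_content()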
    

    Note that Amazon alters the response based on the headers:

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
    >>> soup = BeautifulSoup(r.content)
    >>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
    <a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
    >>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
    ...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
    >>> soup = BeautifulSoup(r.content)
    >>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
    <a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
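
    Applied to the urllib-based fragment from the question, the same fix can be made with urllib2, which (unlike plain urllib.urlopen) lets you pass request headers; a minimal sketch, with the User-Agent string copied from the requests demo above:

    import urllib2
    from bs4 import BeautifulSoup

    # Send a browser-like User-Agent so Amazon serves the page with the full /ref= URLs
    request = urllib2.Request(
        'http://www.amazon.ca/gp/bestsellers',
        headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                               'AppleWebKit/537.36 (KHTML, like Gecko) '
                               'Chrome/35.0.1916.153 Safari/537.36'})
    soup = BeautifulSoup(urllib2.urlopen(request).read())
    for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
        print link['href'], link.get_text()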
    

    This concludes this article on Python RE not returning anything after /ref=. We hope the answer above is helpful, and we hope you will continue to support IT屋!
