Python RE does not return anything after /ref=
I am trying to retrieve the URL and category name from Amazon's best sellers list. For some reason the RE I'm using stops when it encounters /ref=
and I truly don't see why. I'm using Python 2.7 on a Windows 7 box.
A typical record is
<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>
and my RE is
Regex = "<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>"
Category = re.compile(Regex)
which returns a tuple
[0] http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps
[1] Appstore for Android
I do get all the right records but, as you can see, the URL is missing /ref=zg_bs_nav_0.
Other levels in the category hierarchy exhibit the same issue; everything in the URL, starting with and including /ref= is missing.
Here is my code fragment after I took Martijn's suggestion:
import urllib
from bs4 import BeautifulSoup

# First page of the list of Best Sellers categories
URL = "http://www.amazon.ca/gp/bestsellers"
# Retrieve the page source
HTMLFile = urllib.urlopen(URL)
HTMLText = HTMLFile.read()
soup = BeautifulSoup(HTMLText)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href']
    print link.get_text()
You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
- ElementTree is part of the standard library
- BeautifulSoup is a popular 3rd party library
- lxml is a fast and feature-rich C-based library.
The latter two also handle malformed HTML quite gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml
under the hood as the parser of choice if it is installed.
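For well-formed snippets, the standard-library option looks like this — a minimal Python 3 sketch (note that ElementTree, unlike the other two, rejects malformed markup, so a real Amazon page would likely need one of the forgiving parsers):

```python
import xml.etree.ElementTree as ET

snippet = ('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
           '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')

# fromstring() parses the snippet into an Element tree rooted at <li>.
li = ET.fromstring(snippet)
for a in li.findall('a'):
    # The full href, /ref= suffix included, survives parsing.
    print(a.get('href'), a.text)
```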
BeautifulSoup example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlsource)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href'], link.get_text()
This uses a CSS selector to find all <a> elements contained directly in a <li> element where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.
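If installing BeautifulSoup is not an option, the same href-prefix match can be mimicked with the standard library's html.parser — a rough Python 3 sketch (it only checks that the <a> occurs inside an <li>, not the strict direct-child nesting that li > a enforces):

```python
from html.parser import HTMLParser

PREFIX = 'http://www.amazon.ca/Best-Sellers'

class BestSellerLinks(HTMLParser):
    """Collect (href, text) for <a> tags inside an <li> whose href starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self.in_li = False   # currently inside an <li>?
        self._href = None    # href of the <a> being collected, if any
        self._text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True
        elif tag == 'a' and self.in_li:
            href = dict(attrs).get('href', '')
            if href.startswith(PREFIX):
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None
        elif tag == 'li':
            self.in_li = False

parser = BestSellerLinks()
parser.feed('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
            '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')
print(parser.links)
```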
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
>>> soup = BeautifulSoup(htmlsource)
>>> for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
... print link['href'], link.get_text()
...
http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android
Note that Amazon alters the response based on the headers:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
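If you would rather stay with the standard library than pull in requests, the same User-Agent override can be attached to a urllib request. A sketch using Python 3's urllib.request (the header value is just the example browser string from above; the actual fetch is shown commented out):

```python
import urllib.request

UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36')

req = urllib.request.Request('http://www.amazon.ca/gp/bestsellers',
                             headers={'User-Agent': UA})

# urllib normalizes stored header names to 'Xxxx-yyy' capitalization.
print(req.get_header('User-agent'))

# Fetching and parsing would then be:
#   html = urllib.request.urlopen(req).read()
#   soup = BeautifulSoup(html)
```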