Python RE does not return anything after /ref=
I am trying to retrieve the URL and category name from Amazon's best sellers list. For some reason the RE I'm using stops when it encounters /ref=
and I truly don't see why. I'm using Python 2.7 on a Windows 7 box.
A typical record is
<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>
and my RE is
Regex = "<li><a href='(http://www.amazon.ca/Best-Sellers.*?)'>(.*?)</a></li>"
Category = re.compile(Regex)
which returns a tuple
[0] http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps
[1] Appstore for Android
I do get all the right records but, as you can see, the URL is missing /ref=zg_bs_nav_0.
Other levels in the category hierarchy exhibit the same issue; everything in the URL, starting with and including /ref= is missing.
Here is my code fragment after I took Martijn's suggestion:
import urllib
from bs4 import BeautifulSoup

# First page of the list of Best Sellers categories
URL = "http://www.amazon.ca/gp/bestsellers"
# Retrieve the page source
HTMLFile = urllib.urlopen(URL)
HTMLText = HTMLFile.read()
soup = BeautifulSoup(HTMLText)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href']
    print link.get_text()
You are using a regular expression, but matching HTML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
- ElementTree is part of the standard library
- BeautifulSoup is a popular 3rd party library
- lxml is a fast and feature-rich C-based library.
The latter two also handle malformed HTML quite gracefully, making decent sense of many a botched website. In fact, BeautifulSoup 4 uses lxml
under the hood as the parser of choice if it is installed.
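For well-formed snippets, the standard-library option looks like this — a minimal Python 3 sketch (note that ElementTree, unlike the other two, rejects malformed markup, so a real Amazon page would likely need one of the forgiving parsers):

```python
import xml.etree.ElementTree as ET

snippet = ('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
           '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')

# fromstring() parses the snippet into an Element tree rooted at <li>.
li = ET.fromstring(snippet)
for a in li.findall('a'):
    # The full href, /ref= suffix included, survives parsing.
    print(a.get('href'), a.text)
```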
BeautifulSoup example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlsource)
for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
    print link['href'], link.get_text()
This uses a CSS selector to find all <a> elements contained directly in a <li> element where the href attribute starts with the text http://www.amazon.ca/Best-Sellers.
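If installing BeautifulSoup is not an option, the same href-prefix match can be mimicked with the standard library's html.parser — a rough Python 3 sketch (it only checks that the <a> occurs inside an <li>, not the strict direct-child nesting that li > a enforces):

```python
from html.parser import HTMLParser

PREFIX = 'http://www.amazon.ca/Best-Sellers'

class BestSellerLinks(HTMLParser):
    """Collect (href, text) for <a> tags inside an <li> whose href starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self.in_li = False   # currently inside an <li>?
        self._href = None    # href of the <a> being collected, if any
        self._text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True
        elif tag == 'a' and self.in_li:
            href = dict(attrs).get('href', '')
            if href.startswith(PREFIX):
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None
        elif tag == 'li':
            self.in_li = False

parser = BestSellerLinks()
parser.feed('<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android'
            '/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>')
print(parser.links)
```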
Demo:
>>> from bs4 import BeautifulSoup
>>> htmlsource = '<li><a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0">Appstore for Android</a></li>'
>>> soup = BeautifulSoup(htmlsource)
>>> for link in soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]'):
... print link['href'], link.get_text()
...
http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0 Appstore for Android
Note that Amazon alters the response based on the headers:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers')
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps">Appstore for Android</a>
>>> r = requests.get('http://www.amazon.ca/gp/bestsellers', headers={
... 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})
>>> soup = BeautifulSoup(r.content)
>>> soup.select('li > a[href^=http://www.amazon.ca/Best-Sellers]')[0]
<a href="http://www.amazon.ca/Best-Sellers-Appstore-Android/zgbs/mobile-apps/ref=zg_bs_nav_0/185-3312534-9864113">Appstore for Android</a>
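If you would rather stay with the standard library than pull in requests, the same User-Agent override can be attached to a urllib request. A sketch using Python 3's urllib.request (the header value is just the example browser string from above; the actual fetch is shown commented out):

```python
import urllib.request

UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36')

req = urllib.request.Request('http://www.amazon.ca/gp/bestsellers',
                             headers={'User-Agent': UA})

# urllib normalizes stored header names to 'Xxxx-yyy' capitalization.
print(req.get_header('User-agent'))

# Fetching and parsing would then be:
#   html = urllib.request.urlopen(req).read()
#   soup = BeautifulSoup(html)
```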