Scraping multiple web pages with Python returns the same results as the first page

Problem description

I am trying to get the product names from the CME Group website. However, why is the code unable to access the next page even though I change the URL in the loop? Any ideas or opinions on this? Thanks in advance.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

for i in range(1, 6):
    url = 'http://www.cmegroup.com/trading/products/#pageNumber=' + str(i) + '&sortAsc=false'

    # Send a browser-like User-Agent header with the request.
    CMEacess = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    print(url)
    print('page: ' + str(i))

    CMEpage = urlopen(CMEacess).read()
    CMEsoup = BeautifulSoup(CMEpage, 'html.parser')

    # Select the <th> cells whose class is cmeTableLeft.
    namelist = CMEsoup.find_all('th', attrs={'class': 'cmeTableLeft'})

    for name in namelist:
        print(name.get_text())

    print('\n')

Recommended answer

You could try using the requests library rather than urllib. I just accessed page 5 successfully using code similar to yours, with that one difference.

Note that the literal 'D3' appears on page five but not on page one.

>>> import requests
>>> i = 5
>>> url='http://www.cmegroup.com/trading/products/#pageNumber='+str(i)+'&sortAsc=false'
>>> page = requests.get(url).content
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> soup.find_all(string='D3')
['D3', 'D3']
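
One thing that may be worth checking with either library: everything after the `#` in a URL is a fragment, and HTTP clients do not send the fragment to the server. If the site picks the page from that fragment with JavaScript in the browser, a plain HTTP fetch could return the same HTML for every value of `pageNumber`. The sketch below uses only the standard library and makes no network request; `build_url` and `request_target` are hypothetical helper names introduced here for illustration, not part of the original code.

```python
from urllib.parse import urlsplit

BASE = 'http://www.cmegroup.com/trading/products/'


def build_url(page):
    """Rebuild the URL from the question's loop for a given page number."""
    return BASE + '#pageNumber=' + str(page) + '&sortAsc=false'


def request_target(url):
    """Return the URL without its fragment, i.e. the part a client requests."""
    parts = urlsplit(url)
    target = parts.scheme + '://' + parts.netloc + parts.path
    if parts.query:
        target += '?' + parts.query
    return target


urls = [build_url(i) for i in range(1, 6)]
fragments = [urlsplit(u).fragment for u in urls]
targets = {request_target(u) for u in urls}

print(fragments)  # the page number lives only in the fragment
print(targets)    # the set of URLs actually requested
```

If `targets` contains a single entry, all five iterations request the same resource, so comparing the fetched HTML of two pages (for example by checking for a string such as 'D3') is a reasonable way to confirm whether the content really changes.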
