Scraping multiple web pages has the same results as the first page using Python
Question
I am trying to get the product names from the CME Group website. However, the code does not seem to reach the next page, even though I change the URL on every loop iteration. Why is that? Any ideas or opinions? Thanks in advance.
from urllib.request import Request
from urllib.request import urlopen
from bs4 import BeautifulSoup

for i in range(1,6):
    url='http://www.cmegroup.com/trading/products/#pageNumber='+str(i)+'&sortAsc=false'
    CMEacess=Request(url,headers={'User-Agent':'Mozilla/5.0'})
    print(url)
    print('page: '+str(i))
    CMEpage=urlopen(CMEacess).read()
    CMEsoup=BeautifulSoup(CMEpage,'html.parser')
    namelist=CMEsoup.findAll('th',attrs={'class','cmeTableLeft'})
    for name in namelist:
        print(name.get_text())
    print('\n')
Recommended answer
You could try using the requests library rather than urllib. I just accessed page 5 successfully using code similar to yours, with that one difference.
Note that the literal 'D3' appears on page five but not on page one.
>>> import requests
>>> i = 5
>>> url='http://www.cmegroup.com/trading/products/#pageNumber='+str(i)+'&sortAsc=false'
>>> page = requests.get(url).content
>>> import bs4
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> soup.find_all(string='D3')
['D3', 'D3']
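One thing worth checking with either library: in the URL used above, the page number sits after the '#', which makes it part of the URL fragment rather than the query string. In standard URL handling the fragment is not sent to the server as part of the HTTP request. So if the site picks the page from the fragment in client-side JavaScript (an assumption here, not verified against the site), every loop iteration would fetch the same document. The sketch below uses only the standard library to show what remains of each URL once the fragment is removed. BASE and build_url are names introduced for illustration only.

```python
from urllib.parse import urldefrag, urlsplit

# Hypothetical helper names; the URL pattern itself is the one from the question.
BASE = 'http://www.cmegroup.com/trading/products/'

def build_url(page):
    """Build the per-page URL exactly as the question's loop does."""
    return BASE + '#pageNumber=' + str(page) + '&sortAsc=false'

urls = [build_url(i) for i in range(1, 6)]

# urldefrag() separates the part before '#' from the fragment after it.
requested = {urldefrag(u).url for u in urls}

# urlsplit() exposes the fragment and the (here empty) query string.
fragments = [urlsplit(u).fragment for u in urls]
queries = [urlsplit(u).query for u in urls]

print(requested)   # the distinct URLs left once fragments are dropped
print(fragments)   # the page number appears only here
print(queries)     # nothing is passed as a query parameter
```

If the set printed first holds a single URL, comparing the raw response bodies for two different page numbers is a quick way to confirm whether the server really returns different content per page.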