Not iterating the list in web scraping

Question

From a link, I am trying to create two lists: one for countries and one for currencies. However, I'm stuck: my code only gives me the first country name and doesn't iterate over the full list of countries. Any help on how to fix this would be appreciated. Thanks in advance.

Here is my attempt:

from bs4 import BeautifulSoup
import urllib.request

url = "http://www.worldatlas.com/aatlas/infopage/currency.htm"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read()

soup = BeautifulSoup(html, "html.parser")
attr = {"class" : "miscTxt"}

countries = soup.find_all("div", attrs=attr)
countries_list = [tr.td.string for tr in countries]

for country in countries_list:
    print(country)

Recommended answer

Try this script. It should give you the country names along with the corresponding currencies. You don't need to send custom headers for this site.

from bs4 import BeautifulSoup
import urllib.request

url = "http://www.worldatlas.com/aatlas/infopage/currency.htm"
resp = urllib.request.urlopen(urllib.request.Request(url)).read()
soup = BeautifulSoup(resp, "lxml")

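for item in soup.select("table tr"):
    try:
        # The first cell of the row holds the country name.
        country = item.select("td")[0].text.strip()
    except IndexError:
        # Rows without <td> cells (e.g. header rows) have no country.
        country = ""
    try:
        # The next sibling of the first cell holds the currency.
        currency = item.select("td")[0].find_next_sibling().text.strip()
    except (IndexError, AttributeError):
        # AttributeError: find_next_sibling() returned None (single-cell row).
        currency = ""
    print(country, currency)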

Partial output:

Afghanistan afghani
Algeria dinar
Andorra euro
Argentina peso
Australia dollar
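
Since the question asked for two separate lists, here is a minimal sketch of how the row-based approach could collect them, and of why the original attempt likely stops at the first name. It runs against a small inline HTML sample rather than the live page. The sample markup (a single `div` with class `miscTxt` wrapping the table) is an assumption about the page layout, not something confirmed here, and the names `SAMPLE_HTML`, `first_only`, `countries` and `currencies` are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one wrapper div around a table.
SAMPLE_HTML = """
<div class="miscTxt">
  <table>
    <tr><th>Country</th><th>Currency</th></tr>
    <tr><td>Afghanistan</td><td>afghani</td></tr>
    <tr><td>Algeria</td><td>dinar</td></tr>
    <tr><td>Andorra</td><td>euro</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The original attempt loops over the matching divs, not over the rows.
# With a single wrapper div, div.td is only the first <td> inside it.
divs = soup.find_all("div", attrs={"class": "miscTxt"})
first_only = [div.td.string for div in divs]

# Iterating over the table rows visits every country/currency pair.
countries = []
currencies = []
for row in soup.select("table tr"):
    cells = row.find_all("td")
    if len(cells) < 2:
        continue  # skip the header row and any incomplete rows
    countries.append(cells[0].get_text(strip=True))
    currencies.append(cells[1].get_text(strip=True))

print(first_only)
print(countries)
print(currencies)
```

In this sample, `first_only` ends up with a single name while `countries` and `currencies` each hold one entry per data row. Skipping rows with fewer than two cells is a design choice that keeps the two lists aligned by index. Against the real page, the selectors may need adjusting to its actual markup.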
