Web scraping with BeautifulSoup

Question

I want to scrape the country names and country capitals from this link: https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order

From the HTML, I'm collecting all of the <td> elements:

from bs4 import BeautifulSoup
import requests

BASE_URL = "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"

html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
countries = soup.find_all("td")

print(countries)

But I don't know how to actually get the text between the tags, especially since some of the cells are empty.

I feel like it's pretty simple, but I can't really follow the tutorials, since they select elements by class and the cells in this wiki page's table don't have classes.

Recommended answer

You just need to add some code to iterate over the table rows and read the cells in each one, as follows:

from bs4 import BeautifulSoup
import requests

BASE_URL = "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"

capitals_countries = []

html = requests.get(BASE_URL).text
soup = BeautifulSoup(html, "html.parser")
country_table = soup.find('table', {"class" : "wikitable sortable"})

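for row in country_table.find_all('tr'):
    # A header row made of <th> cells yields no <td> cells and is skipped below.
    cols = row.find_all('td')

    # Keep only rows with exactly three cells; the first two are read as capital and country.
    if len(cols) == 3:
        capitals_countries.append((cols[0].text.strip(), cols[1].text.strip()))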

for capital, country in capitals_countries:
    print('{:35} {}'.format(capital, country))

This would display the capital and country pairs starting as follows:

Abu Dhabi                           United Arab Emirates
Abuja                               Nigeria
Accra                               Ghana
Adamstown                           Pitcairn Islands
Addis Ababa                         Ethiopia
Algiers                             Algeria
Alofi                               Niue
Amman                               Jordan
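
To try the same row-by-row approach without hitting the network, here is a minimal sketch that runs against a small inline table. The SAMPLE_HTML markup and the extract_pairs helper are made up for illustration and are not the real Wikipedia page, whose layout may differ or change. It also shows what happens with the empty cells mentioned in the question: get_text(strip=True) simply returns an empty string for them, and rows that don't have three <td> cells are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table;
# the real page's layout may differ and can change over time.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>City</th><th>Country</th><th>Notes</th></tr>
  <tr><td>Abu Dhabi</td><td>United Arab Emirates</td><td></td></tr>
  <tr><td><a href="/wiki/Abuja">Abuja</a></td><td>Nigeria</td><td>sample note</td></tr>
  <tr><td colspan="3">row with a single merged cell</td></tr>
</table>
"""


def extract_pairs(html):
    """Return (capital, country) tuples from the first 'wikitable' in html."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="wikitable" matches any table whose class list contains "wikitable".
    table = soup.find("table", class_="wikitable")
    if table is None:
        # No matching table: return an empty list instead of raising later.
        return []

    pairs = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        # Skip header rows (only <th> cells) and rows with an unexpected shape.
        if len(cells) != 3:
            continue
        # get_text(strip=True) returns the text inside the tag, including text
        # nested in child tags such as <a>, with surrounding whitespace removed.
        capital = cells[0].get_text(strip=True)
        country = cells[1].get_text(strip=True)
        pairs.append((capital, country))
    return pairs


for capital, country in extract_pairs(SAMPLE_HTML):
    print("{:35} {}".format(capital, country))
```

Checking for a missing table before looping avoids an AttributeError on None if the page's markup changes, which is the most likely way the original snippet would break.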
