How to scrape data from different Wikipedia pages?


Question


I've scraped the Wikipedia table using Python BeautifulSoup (https://en.wikipedia.org/wiki/Districts_of_Hong_Kong). But besides the data provided there (i.e. population, area, density and region), I would like to get the location coordinates for each district. That data should come from another page for each district (the table contains hyperlinks to them).


Take the first district, 'Central and Western District', for example: the DMS coordinates (22°17′12″N 114°09′18″E) can be found on its page. By further clicking the link, I can get the decimal coordinates (22.28666, 114.15497).
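As background, the DMS-to-decimal conversion that the geohack link performs is straightforward: degrees plus minutes/60 plus seconds/3600, negated for the S and W hemispheres. A minimal sketch (the function name and regex are illustrative, not from the original post):

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a DMS string such as 22°17′12″N to decimal degrees."""
    m = re.match(r"(\d+)°(\d+)′(\d+)″([NSEW])", dms)
    if m is None:
        raise ValueError("not a DMS string: {!r}".format(dms))
    deg, minutes, seconds, hemi = m.groups()
    value = int(deg) + int(minutes) / 60 + int(seconds) / 3600
    # South and West are negative by convention.
    return -value if hemi in "SW" else value

print(round(dms_to_decimal("22°17′12″N"), 5))
print(round(dms_to_decimal("114°09′18″E"), 5))
```

Note that 22°17′12″N converts to roughly 22.2867, which matches the decimal value on the geohack page up to rounding.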


So, is it possible to create a table with Latitude and Longitude for each district?


New to the programming world, sorry if the question is stupid...

References:

DMS coordinates: https://en.wikipedia.org/wiki/Central_and_Western_District

Decimal coordinates: https://tools.wmflabs.org/geohack/geohack.php?pagename=Central_and_Western_District&params=22.28666_N_114.15497_E_type:adm2nd_region:HK

Answer

import requests
from bs4 import BeautifulSoup

res = requests.get('https://en.wikipedia.org/wiki/Districts_of_Hong_Kong')
result = {}
soup = BeautifulSoup(res.content, 'lxml')
# The first wikitable on the page lists the 18 districts.
tables = soup.find_all('table', {'class': 'wikitable'})
table = tables[0].find('tbody')
districtLinks = table.find_all('a', href=True)

for link in districtLinks:
    # Keep only links whose text matches their title attribute,
    # i.e. the district links themselves (skips citation and flag links).
    title = link.attrs.get('title', '')
    if not title:
        continue
    if link.getText() not in title and title not in link.getText():
        continue
    district = title
    url = link.attrs.get('href', '')
    try:
        res = requests.get("https://en.wikipedia.org{}".format(url))
    except requests.RequestException:
        continue
    soup = BeautifulSoup(res.content, 'lxml')
    # The coordinates live in the geography infobox of each district page.
    tables = soup.find_all('table', {'class': 'infobox geography vcard'})
    if not tables:
        continue
    for row in tables[0].find_all('tr', {'class': 'mergedbottomrow'}):
        # The hidden <span class="geo"> holds "lat; lon" in decimal degrees.
        geoLink = row.find('span', {'class': 'geo'})
        if geoLink is None:
            continue
        latitude, longitude = geoLink.getText().split("; ")
        result.update({district: {"Latitude ": latitude, "Longitude": longitude}})

print(result)

Result:

{'Central and Western District': {'Latitude ': '22.28666', 'Longitude': '114.15497'}, 'Eastern District, Hong Kong': {'Latitude ': '22.28411', 'Longitude': '114.22414'}, 'Southern District, Hong Kong': {'Latitude ': '22.24725', 'Longitude': '114.15884'}, 'Wan Chai District': {'Latitude ': '22.27968', 'Longitude': '114.17168'}, 'Sham Shui Po District': {'Latitude ': '22.33074', 'Longitude': '114.16220'}, 'Kowloon City District': {'Latitude ': '22.32820', 'Longitude': '114.19155'}, 'Kwun Tong District': {'Latitude ': '22.31326', 'Longitude': '114.22581'}, 'Wong Tai Sin District': {'Latitude ': '22.33353', 'Longitude': '114.19686'}, 'Yau Tsim Mong District': {'Latitude ': '22.32138', 'Longitude': '114.17260'}, 'Islands District, Hong Kong': {'Latitude ': '22.26114', 'Longitude': '113.94608'}, 'Kwai Tsing District': {'Latitude ': '22.35488', 'Longitude': '114.08401'}, 'North District, Hong Kong': {'Latitude ': '22.49471', 'Longitude': '114.13812'}, 'Sai Kung District': {'Latitude ': '22.38143', 'Longitude': '114.27052'}, 'Sha Tin District': {'Latitude ': '22.38715', 'Longitude': '114.19534'}, 'Tai Po District': {'Latitude ': '22.45085', 'Longitude': '114.16422'}, 'Tsuen Wan District': {'Latitude ': '22.36281', 'Longitude': '114.12907'}, 'Tuen Mun District': {'Latitude ': '22.39163', 'Longitude': '113.9770885'}, 'Yuen Long District': {'Latitude ': '22.44559', 'Longitude': '114.02218'}}
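To get the table of latitudes and longitudes the question asks for, the `result` dict can be written out as CSV with Python's standard `csv` module. A minimal sketch, using two sample rows of the same shape as the scraped dict (the sample data and variable names are illustrative):

```python
import csv
import io

# Two sample entries in the same shape as the scraped `result` dict.
result = {
    'Central and Western District': {'Latitude': '22.28666', 'Longitude': '114.15497'},
    'Eastern District, Hong Kong': {'Latitude': '22.28411', 'Longitude': '114.22414'},
}

buf = io.StringIO()  # swap in open('districts.csv', 'w', newline='') to write a file
writer = csv.writer(buf)
writer.writerow(['District', 'Latitude', 'Longitude'])
for district, coords in result.items():
    writer.writerow([district, coords['Latitude'], coords['Longitude']])

print(buf.getvalue())
```

The same dict also loads directly into a pandas DataFrame via `pd.DataFrame.from_dict(result, orient='index')` if a richer table is needed.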

