Web data (wiki) scraping with Python


Question

I am trying to obtain the latitude and longitude of some universities from Wikipedia. I have a base url = 'https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien' with a list of universities, and from each href I fetch the wiki page of each university to get the lat/long present on that page. I am getting the error "'NoneType' object has no attribute 'text'" and I am unable to rectify it. Where am I going wrong?

import time
import csv
from bs4 import BeautifulSoup
import re
import requests
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien')
html = driver.page_source
base_url = 'https://de.wikipedia.org'
url = 'https://de.wikipedia.org/wiki/Liste_altsprachlicher_Gymnasien'
res = requests.get(url)
soup = BeautifulSoup(res.text)

university = []
while True:
    res = requests.get(url)
    soup = BeautifulSoup(res.text)
    links = soup.find_all('a', href=re.compile('.*\/wiki\/.*'))
    for l in links:
        full_link = base_url + l['href']
        town = l['title']
        res = requests.get(full_link)
        soup = BeautifulSoup(res.text)
        info = soup.find('span', attrs={"title":["Breitengrad","Längengrad"]})
        latlong = info.text
        university.append(dict(town_name=town, lat_long=latlong))
        print(university)

Edit 1

Thanks to @rll I made this edit:

if info is not None:
    latlong = info.text
    university.append(dict(town_name=town, postal_code=latlong))
    print(university)

Now the code works, but I only see the latitude, not the longitude.

Sample output: {'postal_code': '49°\xa072\xa036,73\xa0N', 'town_name': 'Schönborn-Gymnasium Bruchsal'}, {'postal_code': '49°\xa072\xa030,73\xa0N', 'town_name': 'St. Paulusheim'}

Any idea how to format this output so I get the longitude as well, and how to clean up the values? Sorry, I am poor at regex.
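A minimal sketch of grabbing both coordinate spans and normalizing the non-breaking spaces (`\xa0`) that show up in the output. The HTML snippet here is hypothetical, just mimicking the `Breitengrad`/`Längengrad` spans on a German Wikipedia page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the coordinate spans on a German Wikipedia page
html = '''
<span title="Breitengrad">48°&nbsp;45′&nbsp;46,9″&nbsp;N</span>
<span title="Längengrad">8°&nbsp;14′&nbsp;44,8″&nbsp;O</span>
'''
soup = BeautifulSoup(html, 'html.parser')

lat_span = soup.find('span', attrs={'title': 'Breitengrad'})
lon_span = soup.find('span', attrs={'title': 'Längengrad'})

# Guard against pages without coordinates, then swap \xa0 for plain spaces
if lat_span is not None and lon_span is not None:
    lat = lat_span.text.replace('\xa0', ' ')
    lon = lon_span.text.replace('\xa0', ' ')
    print(lat, lon)
```

The `&nbsp;` entities are what become `\xa0` once BeautifulSoup decodes the page, so a simple `str.replace` is enough to make the strings readable.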

Edit 2

I worked out how to get the longitude as well with this updated code:

info = soup.find('span', attrs={"title": "Breitengrad"})
info1 = soup.find('span', attrs={"title": "Längengrad"})
if info is not None:
    latlong = info.text
    longitude = info1.text
    university.append(dict(town_name=town, postal_code=latlong, postal_code1=longitude))
    print(university)

Now my output looks like:

{'postal_code': '48°\xa045′\xa046,9″\xa0N',
  'postal_code1': '8°\xa014′\xa044,8″\xa0O',
  'town_name': 'Gymnasium Hohenbaden'},

So I need help formatting the lat and long, since I cannot figure out how to convert, for example, '48°\xa045′\xa046,9″\xa0N' into 48°45′46,9″N. Thanks.
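One way to handle that conversion is to normalize the `\xa0` characters and pull the pieces out with a regex, remembering the German decimal comma. This is a sketch; `parse_dms` is a hypothetical helper, not part of the code above:

```python
import re

def parse_dms(s):
    """Parse a German-Wikipedia-style coordinate string such as
    '48°\xa045′\xa046,9″\xa0N' into (deg, min, sec, hemisphere).
    The decimal comma in the seconds is swapped for a dot."""
    m = re.match(r"(\d+)°\s*(\d+)′\s*([\d,]+)″\s*([NSOW])", s.replace('\xa0', ' '))
    if m is None:
        return None
    deg, minute, sec, hemi = m.groups()
    return int(deg), int(minute), float(sec.replace(',', '.')), hemi

print(parse_dms('48°\xa045′\xa046,9″\xa0N'))  # (48, 45, 46.9, 'N')
```

The hemisphere class includes `O` because German pages use O (Ost) rather than E for east.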

Answer

Sorry for not answering directly, but I always prefer to use MediaWiki's API. We're lucky to have mwclient in Python, which makes working with the API even easier.

So, for what it's worth, here's how I would do it with mwclient:

import re
import mwclient

site = mwclient.Site('de.wikipedia.org')
start_page = site.Pages['Liste_altsprachlicher_Gymnasien']

results = {}
for link in start_page.links():
    page = site.Pages[link['title']]
    text = page.text()

    try:
        # German Wikipedia infoboxes store coordinates as slash-separated DMS fields
        pattern = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
        breiten = [float(b) for b in pattern.search(text).group(1).split('/')]

        pattern = re.compile(r'Längengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/E')
        langen = [float(b) for b in pattern.search(text).group(1).split('/')]
    except AttributeError:
        # pattern.search() returned None: no coordinates on this page
        continue

    results[link['title']] = breiten, langen

This gives a tuple of [deg, min, sec] lists for each link on which it succeeds in finding coordinates:

>>> results

{'Akademisches Gymnasium (Wien)': ([48.0, 12.0, 5.0], [16.0, 22.0, 34.0]),
 'Akademisches Gymnasium Salzburg': ([47.0, 47.0, 39.9], [13.0, 2.0, 2.9]),
 'Albertus-Magnus-Gymnasium (Friesoythe)': ([53.0, 1.0, 19.13], [7.0, 51.0, 46.44]),
 'Albertus-Magnus-Gymnasium Regensburg': ([49.0, 1.0, 23.95], [12.0, 4.0, 32.88]),
 'Albertus-Magnus-Gymnasium Viersen-Dülken': ([51.0, 14.0, 46.29], [6.0, 19.0, 42.1]),
 ...
}
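To see what those regexes are matching, here is a standalone check against a hypothetical wikitext snippet in the slash-separated shape the patterns expect:

```python
import re

# Hypothetical infobox wikitext in the shape the patterns expect
text = "| Breitengrad = 48/45/46.9/N\n| Längengrad = 8/14/44.8/E"

# Same pattern as above: capture the deg/min/sec run before the trailing /N
pattern = re.compile(r'Breitengrad.+?([0-9]+/[0-9]+/[\.0-9]+)/N')
breiten = [float(b) for b in pattern.search(text).group(1).split('/')]
print(breiten)  # [48.0, 45.0, 46.9]
```

The lazy `.+?` skips the `= ` between the field name and the numbers, and the capture group stops right before the hemisphere marker.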

You can format this any way you like:

for uni, location in results.items():
    lat, lon = location
    string = """University {} is at {}˚{}'{}"N, {}˚{}'{}"E"""
    print(string.format(uni, *lat+lon))

Or convert the DMS coordinates to decimal degrees:

def dms_to_dec(coord):
    d, m, s = coord
    return d + m/60 + s/(60*60)

decimal = {uni: (dms_to_dec(b), dms_to_dec(l)) for uni, (b, l) in results.items()}
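As a quick sanity check of `dms_to_dec` (repeated here so the snippet runs on its own), the Gymnasium Hohenbaden latitude from the earlier output comes out near 48.763 degrees:

```python
def dms_to_dec(coord):
    # Degrees plus minutes/60 plus seconds/3600
    d, m, s = coord
    return d + m/60 + s/(60*60)

lat = dms_to_dec([48.0, 45.0, 46.9])   # 48° 45′ 46,9″ N
print(round(lat, 4))  # 48.763
```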

Note that not all of the linked pages may be universities; I didn't check that carefully.

