Scraping Wikipedia information (table)
Problem description
I would need to scrape the information in Elenco dei comuni per regione on Wikipedia. I would like to create an array that allows me to associate each comune with its corresponding region, i.e. something like this:
'Abbateggio': 'Pescara' -> Abruzzo
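For concreteness, here is a minimal sketch of the kind of lookup structure this implies (the name comune_index is hypothetical, not from the original post):

```python
# Hypothetical target structure: comune -> (province, region)
comune_index = {
    'Abbateggio': ('Pescara', 'Abruzzo'),
}

province, region = comune_index['Abbateggio']
print(region)  # Abruzzo
```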
I tried to get the information using BeautifulSoup and requests as follows:
```python
from bs4 import BeautifulSoup as bs
import requests

with requests.Session() as s:  # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia')
    soup = bs(r.text, 'html.parser')
    for ele in soup.find_all('h3')[:6]:
        tx = bs(str(ele), 'html.parser').find('span', attrs={'class': "mw-headline"})
        if tx is not None:
            print(tx['id'])
```
However, it does not work (it returns an empty list). The information I have looked at using Inspect in Google Chrome is the following:
<span class="mw-headline" id="Elenco_dei_comuni_per_regione">Elenco dei comuni per regione</span> (table)
<a href="/wiki/Comuni_dell%27Abruzzo" title="Comuni dell'Abruzzo">Comuni dell'Abruzzo</a>
(this field should change for each region)
and then <table class="wikitable sortable jquery-tablesorter">
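Given that markup, a minimal sketch of an alternative selection, assuming the live page keeps the mw-headline spans shown above:

```python
from bs4 import BeautifulSoup
import requests

r = requests.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia',
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

# Look for headline spans at any heading level, not only inside <h3>,
# since the section level in the live markup may differ.
for span in soup.select('span.mw-headline'):
    print(span.get('id'))
```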
Could you please give me advice on how to get such results? Any help and suggestions will be appreciated.
Example:
I have a word: comunediabbateggio. This word contains Abbateggio. I would like to know which region can be associated with that city, if it exists. The information from Wikipedia is needed to create a dataset that allows me to check the field and associate each comune/city with a region. What I should expect is:
WORD                  REGION/STATE
comunediabbateggio    Pescara
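A minimal sketch of that lookup step, assuming a dataset shaped like the hypothetical comune_index above (here simplified to lowercase comune -> region):

```python
# Hypothetical dataset: lowercase comune name -> region
comune_index = {'abbateggio': 'Abruzzo'}

word = 'comunediabbateggio'
matches = [region for name, region in comune_index.items() if name in word]
print(matches)  # ['Abruzzo']
```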
I hope this helps. Sorry if it was not clear. Another example, which might be slightly clearer for English speakers, is the following:
Instead of the Italian link above, you can also consider the following: https://en.wikipedia.org/wiki/List_of_comuni_of_Italy. For each region (Lombardia, Veneto, Sicily, ...) I would need to collect information about the list of communes of the provinces. If you click a link such as List of communes of ..., there is a table that lists the comuni, e.g. https://en.wikipedia.org/wiki/List_of_communes_of_the_Province_of_Agrigento.
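As a quick check, such a table can usually be read directly with pandas; the table index and column layout here are assumptions and may differ between provinces:

```python
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_communes_of_the_Province_of_Agrigento'
tables = pd.read_html(url)  # requires lxml or html5lib to be installed
print(tables[0].head())     # first table on the page, assumed to list the comuni
```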
Answer
```python
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

target = "https://en.wikipedia.org/wiki/List_of_comuni_of_Italy"


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Region/section names: the TOC entries numbered "1.1", "1.2", ...
        # hold the per-region sections; the text sits in the span that
        # follows the number span.
        provinces = [item.find_next("span").text for item in soup.findAll(
            "span", class_="tocnumber", text=re.compile(r"\d[.]\d"))]
        # Section ids use underscores where the TOC text has spaces.
        search = [item.replace(
            " ", "_") if " " in item else item for item in provinces]
        nested = []
        for item in search:
            for a in soup.findAll("span", id=item):
                # Each section heading is followed by a <ul> of
                # "List of communes of the Province of X" links;
                # keep only the part after the last "of ".
                goes = [b.text.split("of ")[-1]
                        for b in a.find_next("ul").findAll("a")]
                nested.append(goes)
        dictionary = dict(zip(provinces, nested))
        # url[:24] == "https://en.wikipedia.org", the site root for the
        # relative hrefs collected from each section's link list.
        urls = [f'{url[:24]}{b.get("href")}' for item in search for a in soup.findAll(
            "span", id=item) for b in a.find_next("ul").findAll("a")]
    return urls, dictionary


def parser():
    links, dics = main(target)
    com = []
    for link in tqdm(links):
        try:
            # The second column of the first table on each province page
            # holds the comune names; drop the trailing totals row.
            df = pd.read_html(link)[0]
            com.append(df[df.columns[1]].to_list()[:-1])
        except ValueError:
            # Page without a parseable table.
            com.append(["N/A"])
    # zip() below consumes one comune list per province, in order.
    com = iter(com)
    for x in dics:
        b = dics[x]
        dics[x] = dict(zip(b, com))
    print(dics)


parser()
```
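If parser() is changed to return dics instead of printing it, the nested result can be flattened into the comune -> region lookup the question asks for. A sketch, assuming the result has the shape {region: {province: [comune, ...]}} built by the code above:

```python
def flatten(dics):
    # Flatten {region: {province: [comune, ...]}} into comune -> region.
    lookup = {}
    for region, provinces in dics.items():
        for province, comuni in provinces.items():
            for comune in comuni:
                lookup[str(comune)] = region
    return lookup

# Example usage (shape assumed from the code above):
# lookup = flatten(parser())
# print(lookup.get('Abbateggio'))  # -> 'Abruzzo'
```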