How to scrape links from Wikipedia with Python

Problem description

I am trying to scrape all of the links to battles from the "List of naval battles" page on Wikipedia using Python. The trouble is that I cannot figure out how to export every link containing the string "/wiki/Battle" to my CSV file. I am used to C++, so Python is somewhat foreign to me. Any ideas? Here is what I have so far:

from bs4 import BeautifulSoup
import urllib2

rootUrl = "https://en.wikipedia.org/wiki/List_of_naval_battles"


def get_soup(url,header):
    return BeautifulSoup(
        urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')

# soup settings    
url = rootUrl + item
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}

soup = get_soup(url,header)

battle = soup.findAll("/wiki/Battle")

Recommended answer

Try this:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://en.wikipedia.org/wiki/List_of_naval_battles")
soup = bs(res.text, "html.parser")
naval_battles = {}  # maps link text -> href
# find_all("a") returns every anchor tag; filter on the href attribute
for link in soup.find_all("a"):
    url = link.get("href", "")  # "" when the tag has no href
    if "/wiki/Battle" in url:
        naval_battles[link.text.strip()] = url

print(naval_battles)
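
The question also asks for CSV output, which the snippet above stops short of. Here is a minimal sketch of that step, assuming the naval_battles dict built by the loop above and assuming its hrefs are site-relative (for example "/wiki/Battle_of_Actium"). The names write_battles_csv and BASE_URL and the column headers title and url are illustrative choices, not part of the original answer:

```python
import csv
from urllib.parse import urljoin

# Assumption: hrefs collected from the page are site-relative,
# so they are resolved against the Wikipedia host.
BASE_URL = "https://en.wikipedia.org"


def write_battles_csv(battles, path, base_url=BASE_URL):
    """Write a {link text: href} mapping to a two-column CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])  # hypothetical header names
        for title, href in battles.items():
            # urljoin turns a relative href into an absolute URL and
            # leaves an already absolute URL unchanged.
            writer.writerow([title, urljoin(base_url, href)])
```

Calling write_battles_csv(naval_battles, "naval_battles.csv") after the loop should produce one row per battle link. Because naval_battles is a dict keyed by link text, two links with the same text collapse into one entry; collecting (text, href) tuples in a list instead would keep both.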
