Web scraping data using Python?

Question

I just started learning web scraping using Python. However, I've already run into some problems.

My goal is to scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon).

The problem: I'm unable to extract all of the species names.

This is what I have so far:

import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

soup = BeautifulSoup(html_doc)

spans = soup.find_all(

From here, I don't know how I would go about extracting the species names. I've thought of using a regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tags...

Any input will be highly appreciated!

Recommended answer

What jozek suggests is the correct approach, but I couldn't get his snippet to work (possibly because I am not running the BeautifulSoup 4 beta). What worked for me was:

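# Python 2 with BeautifulSoup 3 (the "BeautifulSoup" package, not "bs4")
import urllib2
from BeautifulSoup import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)

# Parse the downloaded page
soup = BeautifulSoup(page)

# Collect the text of every <i> tag inside the first table on the page
scientific_names = [it.text for it in soup.table.findAll('i')]

print scientific_names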
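
For readers on Python 3, the same idea can be sketched with the bs4 package (BeautifulSoup 4). This is only a rough port, not the answerer's code: it assumes, as the snippet above does, that the scientific names sit in <i> tags inside the first table of the page. The function names and SAMPLE_HTML are made up for illustration, and the sample markup is a stand-in so the parsing step can be tried without network access.

```python
# Python 3 sketch using BeautifulSoup 4 (the "bs4" package).
from urllib.request import urlopen

from bs4 import BeautifulSoup

FISH_URL = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'


def extract_scientific_names(html):
    """Return the text of every <i> tag inside the first <table> of the page."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.table  # first <table> in the document, or None if there is none
    if table is None:
        return []
    return [tag.get_text(strip=True) for tag in table.find_all('i')]


def fetch_scientific_names(url=FISH_URL):
    """Download the page and extract the names (requires network access)."""
    with urlopen(url) as page:
        return extract_scientific_names(page.read())


# Hypothetical sample markup standing in for the live page.
SAMPLE_HTML = """
<table>
  <tr><td>Albacore</td><td><i>Thunnus alalunga</i></td></tr>
  <tr><td>Bigeye tuna</td><td><i>Thunnus obesus</i></td></tr>
</table>
"""

if __name__ == '__main__':
    print(extract_scientific_names(SAMPLE_HTML))
```

Keeping the parsing in its own function means it can be checked against saved HTML first; fetch_scientific_names(FISH_URL) would then run the same extraction on the live page, provided its layout still matches the assumption above.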
