Web scraping data using Python?
Question
I just started learning web scraping using Python. However, I've already run into some problems.
My goal is to scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using a regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tag...
Any input will be highly appreciated!
Recommended answer
What jozek suggests is the correct approach, but I couldn't get his snippet to work (possibly because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names
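For readers on Python 3 with Beautiful Soup 4 (the `bs4` package), the same idea might be sketched roughly as below. This is an assumption-laden adaptation, not the answerer's code: the helper names `extract_scientific_names` and `fetch_names` are made up for illustration, and `SAMPLE_HTML` is hypothetical markup standing in for the real page so the example runs offline. The live page's layout may differ from what is assumed here.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

FISH_URL = "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna"


def extract_scientific_names(html):
    """Return the text of every <i> element inside the first <table>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # The page has no table, so the layout is not what we assumed.
        return []
    return [tag.get_text(strip=True) for tag in table.find_all("i")]


def fetch_names(url=FISH_URL):
    """Download the page and extract the names (hypothetical helper, needs network)."""
    with urlopen(url) as response:
        return extract_scientific_names(response.read())


# Hypothetical sample markup; the real page's structure is an assumption.
SAMPLE_HTML = """
<table>
  <tr><td>Albacore tuna</td><td><i>Thunnus alalunga</i></td></tr>
  <tr><td>Bigeye tuna</td><td><i>Thunnus obesus</i></td></tr>
</table>
"""

sample_names = extract_scientific_names(SAMPLE_HTML)
print(sample_names)
```

Calling `fetch_names()` would run the same extraction against the live site. Keeping the parsing in its own function means it can be checked against saved or sample HTML without a network connection.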