Extracting data from HTML with Python


I have the following text, produced by my Python code:

<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>

Could you advise me how to extract the data from within the <td>? My idea is to put it in a CSV file with the following format: some link, some data 1, some data 2, some data 3.

I expect it might be hard without regular expressions, but honestly I still struggle with regular expressions.

I used my code more or less in the following manner:

tabulka = subpage.find("table")

for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    print(col[0])

Ideally, I would get the content of each td into some array. The HTML above is output from my Python code.

Solution

Get BeautifulSoup and just use it. It's great.

$> easy_install pip
$> pip install BeautifulSoup
$> python
>>> from BeautifulSoup import BeautifulSoup as BS
>>> import urllib2
>>> html = urllib2.urlopen(your_site_here)
>>> soup = BS(html)
>>> elem = soup.findAll('a', {'title': 'title here'})
>>> elem[0].text
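
The session above uses Python 2 (urllib2) and the old BeautifulSoup 3 package, and only pulls out the link text. As a rough sketch of how the whole <td> from the question could be turned into a CSV row on Python 3, the following assumes the newer bs4 package is installed (pip install beautifulsoup4). The HTML string is the fragment from the question wrapped in a table; the name cell_fields and the in-memory buffer are illustrative choices, not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup  # assumption: installed via `pip install beautifulsoup4`

# The <td> fragment from the question, wrapped in a table row for parsing.
HTML = """
<table><tr><td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td></tr></table>
"""


def cell_fields(td):
    """Return the non-empty text fragments of one <td>, in document order."""
    # stripped_strings yields every text node with surrounding whitespace
    # removed and skips nodes that are whitespace only.
    return list(td.stripped_strings)


soup = BeautifulSoup(HTML, "html.parser")
rows = [cell_fields(td) for td in soup.find_all("td")]

# Written to an in-memory buffer here; a real script would open a file
# with open("out.csv", "w", newline="") and pass that to csv.writer.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each <td> becomes one list of strings, which is the "array per td" the question asks for, and csv.writer takes care of quoting if a field ever contains a comma. No regular expressions are needed.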
