Extracting data from HTML with Python
Problem description
I have the following text, produced by my Python code:
<td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td>
Could you advise me how to extract the data from within the <td>?
My idea is to put it in a CSV file with the following format: some link, some data 1, some data 2, some data 3.
I expect that it might be hard without regular expressions, but honestly I still struggle with regular expressions.
My code works more or less in the following manner:
tabulka = subpage.find("table")
for row in tabulka.findAll('tr'):
    col = row.findAll('td')
    print col[0]
Ideally, I would like to get the content of each td into some array. The HTML above is the output of that code.
Solution
Get BeautifulSoup and just use it. It's great.
$> easy_install pip
$> pip install BeautifulSoup
$> python
>>> from BeautifulSoup import BeautifulSoup as BS
>>> import urllib2
>>> html = urllib2.urlopen(your_site_here)
>>> soup = BS(html)
>>> elem = soup.findAll('a', {'title': 'title here'})
>>> elem[0].text
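The session above only pulls out the link text. To get every piece of text in the cell and write it as one CSV row, something like the following might work. It is a minimal sketch, not the answer's own code: it assumes Python 3 and the newer beautifulsoup4 package (imported as bs4) rather than the BeautifulSoup 3 and urllib2 shown above, and the names HTML, rows_from_table, and buffer are made up for illustration. The sample cell from the question is wrapped in a table so that it parses like a full page.

```python
import csv
import io

from bs4 import BeautifulSoup  # assumption: installed with `pip install beautifulsoup4`

# The cell from the question, wrapped in a table row (illustrative input).
HTML = """
<table><tr><td>
<a href="http://www.linktosomewhere.net" title="title here">some link</a>
<br />
some data 1<br />
some data 2<br />
some data 3</td></tr></table>
"""


def rows_from_table(html):
    """Return one list of text fragments per <tr> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        fields = []
        for td in tr.find_all("td"):
            # stripped_strings yields each text node with surrounding
            # whitespace trimmed and skips whitespace-only nodes.
            fields.extend(td.stripped_strings)
        rows.append(fields)
    return rows


rows = rows_from_table(HTML)

# Written to an in-memory buffer here; for a real file, pass a file object
# opened with newline="" to csv.writer instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

No regular expressions are needed: the parser splits the cell at the <br /> tags, and the csv module takes care of quoting if a field happens to contain a comma.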