How to collect data from HTML with Beautiful Soup and put it in a list
Question
I want to collect data from HTML files and put some of it into variables or lists. But I don't understand Beautiful Soup very well, especially how to navigate the document structure.
What is the best way to get the src URL attribute here?
<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>
How do I navigate this and put the p class values in a list?
<p class="bioheading">value</p>
<div class="biodata">value</div>
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata"><a href"http://url.com/month=01&year=2018&day=02">January 01, 1900</a> (117 years old)</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
The same for this:
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
<p class="bioheading">value</p>
<p class="biodata">value</p>
</div>
How do I get the gender value here?
<input name="Gender" value="m" type="hidden">
Note that this HTML may be malformed. Sorry for the beginner question.
Best regards.
k=0
a_table=[]
bday1=''
for link in soup.findAll('a'):
    a_table.append(str(link.get('href')))
    #out.write(str(i)+'\t'+str(p.text)+'\n')
    if re.match(regs4,str(link.get('href')),re.M) != None:
        bday1 = re.search(regs1,str(link.get('href')),re.M)
    else:
        bday1 = 'http://url.com/calendar.asp?calmonth=01&calyear=2018&calday=01'
    k=k+1
I tried this to collect each a href and check, with a regex, whether it is the URL I want. .find_All() does not work and raises this error:
builtins.TypeError: 'NoneType' object is not callable
So I am using .findAll()
This does not work either, because there are several input elements:
for _input in soup.findAll('input'):
    if str(_input.attrs['name']) == 'Gender':
        if str(_input.attrs['value']) == 'f':
            out.write('F')
        elif str(_input.attrs['value']) == 'm':
            out.write('M')
        else:
            out.write('—')
I get this error:
builtins.KeyError: 'name'
Recommended answer
Just some modifications/improvements to Bill's answer:
- You can use .select_one() instead of .select()[0] to find a single element by a CSS selector.
- You don't need attrs; you can use dictionary-like access to tag attributes:
  soup.select_one('#headshot img')['src']
- .get_text() is a bit more robust than accessing .text directly.
- You can improve the CSS selector used to get the p elements by using the fact that the class names start with bio:
  #vitalbox #home p[class^=bio]
- You should use find_all() rather than the deprecated findAll().
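
The points above can be put together in a minimal sketch. It assumes the fragments from the question sit on one page roughly as assembled in the HTML string below; the sample values ("Name", "Someone", "Born"), the second input without a name attribute, and the variable names are made up for illustration. It also uses .get() for the input attributes, which returns None instead of raising KeyError when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Sample markup assembled from the question's fragments. How the pieces fit
# together on the real page, and the placeholder values, are assumptions.
HTML = """
<div id="headshot">
<img title="Photo of someone" alt="Photo of someone" src="url/file.jpg">
</div>
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
<div class="row">
<div class="col-xs-12 col-sm-4">
<p class="bioheading">Name</p>
<p class="biodata">Someone</p>
<p class="bioheading">Born</p>
<p class="biodata"><a href="http://url.com/month=01&year=2018&day=02">January 01, 1900</a> (117 years old)</p>
</div>
</div>
</div>
</div>
<input name="Gender" value="m" type="hidden">
<input value="no-name-attribute" type="hidden">
"""

soup = BeautifulSoup(HTML, "html.parser")

# Single element by CSS selector, then dictionary-like attribute access.
img = soup.select_one("#headshot img")
src = img["src"] if img is not None else None

# Every <p> whose class attribute starts with "bio", in document order.
# The " " separator keeps a space between the link text and the text after it.
values = [
    p.get_text(" ", strip=True)
    for p in soup.select("#vitalbox #home p[class^=bio]")
]

# Select the Gender input directly instead of looping over every <input>.
gender_input = soup.select_one('input[name="Gender"]')
gender = gender_input.get("value") if gender_input is not None else None

# When looping is needed, find_all() plus .get() tolerates inputs that
# have no name attribute.
input_names = [tag.get("name") for tag in soup.find_all("input")]

print(src)
print(values)
print(gender)
print(input_names)
```

If the headings and data always alternate, the values list can be split into pairs with values[0::2] and values[1::2]; that alternation is an assumption about the page, not something Beautiful Soup guarantees.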