How to collect data from HTML with Beautiful Soup and put it in a list

Question

I want to collect data from HTML files and put some of it into variables or lists, but I don't understand Beautiful Soup very well, especially how to navigate the document structure.

What is the best way to get the src URL attribute here?

<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>

How do I navigate this and put the values of the p elements, selected by class, into a list?

                <p class="bioheading">value</p>
                <div class="biodata">value</div>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>
                <p class="bioheading">value</p>
                <p class="biodata"><a href"http://url.com/month=01&amp;year=2018&amp;day=02">January 01, 1900</a> (117 years old)</p>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>
                <p class="bioheading">value</p>
                <p class="biodata">value</p>

The same goes for this:

<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
    <div class="row">
        <div class="col-xs-12 col-sm-4">
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
            <p class="bioheading">value</p>
            <p class="biodata">value</p>
        </div>

How do I get the Gender value here?

<input name="Gender" value="m" type="hidden">

Note that this HTML can be malformed. Sorry for the beginner question.

Best regards.

k=0
a_table=[]
bday1=''
for link in soup.findAll('a'):
    a_table.append(str(link.get('href')))
    #out.write(str(i)+'\t'+str(p.text)+'\n')
    if re.match(regs4,str(link.get('href')),re.M) != None:
        bday1 = re.search(regs1,str(link.get('href')),re.M)
    else:
        bday1 = 'http://url.com/calendar.asp?calmonth=01&amp;calyear=2018&amp;calday=01'
    k=k+1

I tried this to collect the href attributes and check each one against the wanted URL with a regex. .find_All() does not work and gives this error:

builtins.TypeError: 'NoneType' object is not callable

So I am using .findAll() instead.

This does not work either, because there are several input elements:

for _input in soup.findAll('input'):
    if str(_input.attrs['name']) == 'Gender':
        if str(_input.attrs['value']) == 'f':
            out.write('F') 
        elif str(_input.attrs['value']) == 'm':
            out.write('M')
        else:
            out.write('—')

I get this error:

builtins.KeyError: 'name'

Recommended answer

Just some modifications/improvements to Bill's answer:

  • you can use .select_one() instead of .select()[0] to find a single element by a CSS selector
  • you don't need attrs; you can use dictionary-like access to tag attributes:

soup.select_one('#headshot img')['src']
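
As a minimal sketch of the two points above, assuming the #headshot markup from the question is parsed with the built-in html.parser (the variable names are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

html = """
<div id="headshot">
<img title="Photo of someone" alt="Photo of somenone" src="url/file.jpg">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching element, or None if nothing matches.
img = soup.select_one("#headshot img")

# Dictionary-style access raises KeyError when the attribute is absent;
# .get() returns None instead, which is gentler on malformed pages.
src = img["src"] if img is not None else None
width = img.get("width") if img is not None else None

print(src)
```

Checking for None before indexing is a design choice for pages that may lack the element; on markup you control, the one-liner above is enough.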

  • .get_text() is a bit more robust than accessing .text directly
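
One way to see the difference, sketched on a simplified version of the biodata paragraph from the question (the href here is shortened and written with a valid `=`, which is an assumption): .text takes no arguments, while .get_text() accepts a separator and a strip flag.

```python
from bs4 import BeautifulSoup

html = '<p class="biodata"> <a href="http://url.com/">January 01, 1900</a> (117 years old) </p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.select_one("p.biodata")

# .text is a property equivalent to get_text() with no arguments.
raw = p.text

# strip=True trims each text fragment and drops empty ones before joining.
clean = p.get_text(strip=True)

# A separator keeps the fragments apart once they have been stripped.
spaced = p.get_text(" ", strip=True)

print(repr(raw))
print(repr(clean))
print(repr(spaced))
```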

  • you can improve the CSS selector used to get the p elements by relying on the fact that the class names start with bio:

    #vitalbox #home p[class^=bio]
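
A short sketch of how that selector could fill a list, assuming the #vitalbox structure from the question. The text values ("Name", "Someone", and so on) are made-up placeholders standing in for the question's generic "value" entries:

```python
from bs4 import BeautifulSoup

html = """
<div id="vitalbox" class="tab-content">
<div role="tabpanel" class="tab-pane active" id="home">
    <div class="row">
        <div class="col-xs-12 col-sm-4">
            <p class="bioheading">Name</p>
            <p class="biodata">Someone</p>
            <p class="bioheading">Born</p>
            <p class="biodata">January 01, 1900</p>
        </div>
    </div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# All p elements whose class attribute starts with "bio", in document order.
values = [p.get_text(strip=True)
          for p in soup.select("#vitalbox #home p[class^=bio]")]

# Headings and data collected separately, then paired by position.
headings = [p.get_text(strip=True)
            for p in soup.select("#vitalbox #home p.bioheading")]
data = [p.get_text(strip=True)
        for p in soup.select("#vitalbox #home p.biodata")]
pairs = dict(zip(headings, data))

print(values)
print(pairs)
```

Pairing by position assumes every heading is followed by exactly one data element. In the question's first snippet one biodata value sits in a div rather than a p, so dropping the tag name from the selector ([class^=bio]) may be the safer choice there.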
    

  • you should be using find_all() rather than the deprecated findAll()
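
For the Gender input from the question, a hedged sketch along the same lines. The nameless submit input is an assumption added to reproduce the KeyError: 'name' situation, and the "-" fallback is illustrative. As for the TypeError in the question: Beautiful Soup treats an unknown attribute such as find_All as a lookup for a tag with that name, so soup.find_All evaluates to None and calling it fails.

```python
from bs4 import BeautifulSoup

html = """
<form>
<input type="submit">
<input name="Gender" value="m" type="hidden">
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute, so inputs without a
# name attribute no longer raise KeyError.
names = [tag.get("name") for tag in soup.find_all("input")]

# "name" is already the tag-name parameter of find(), so the HTML
# attribute called "name" has to be passed through attrs.
gender_input = soup.find("input", attrs={"name": "Gender"})
gender = gender_input.get("value") if gender_input is not None else None

# Map the raw value to the label written in the question's loop.
labels = {"f": "F", "m": "M"}
label = labels.get(gender, "-")

print(names)
print(label)
```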
