Storing information from td tags with a specific width, in Python


Question

I am trying to store all the information from the td tags that have width="82", or maybe there is a more efficient method.

<a name="AAKER"> </a> 
<table border="" width="100%" cellpadding="5"><tbody><tr><td bgcolor="#FFFFFF"><b>AAKER</b> 
    <small>(<a href="http://google.com">Soundex
    A260</a>)
    — <i>See also</i> 
    <a href="http://google.com">ACKER</a>,
    <a href="http://google.com">KEAR</a>,
    <a href="http://google.com">TAAKE</a>.
    </small>
</td></tr></tbody></table><br clear="all">


<table align="left" cellpadding="5"> 
    
    <tbody><tr><td width="82" align="right" valign="top">&nbsp;</td><td valign="top">
        <img src="rd.gif" width="13" height="13">
        <b><a name="954.35.65">Aaker, Casper Drengman</a> (b.1883)</b>
        &nbsp;— also known as 
        <b>Casper D. Aaker</b>&nbsp;— of Minot, 
        <a href="http://google.com">WardCounty</a> , N.Dak. Born in Ridgeway, 
        <a href="http://google.com">Winneshiek County</a> , Iowa, August, 
        <a href="http://google.com">1883</a>. Republican. 
        <a href="http://google.com">Lawyer</a>; organizer, Trinity 
        <a href="http://google.com">Hospital</a>,
        1922; delegate to Republican National Convention from North Dakota.
        
        <table width="100%" align="left">
            <tbody>
                <tr><td width="20">&nbsp;</td> 
                    <td width="26" valign="top"><img src="hand.gif" width="26" height="17"></td>
                    <td valign="top">
                        <span style="font-size:8pt;"><i>Relatives:</i> 
                            Son of Drengman Aaker and Christine (Ellefson) Aaker; married, 
                        <a href="http://google.com">December 15, 1914</a>, 
                        to Leda Mansfield.</span>
                    </td>
                </tr>
            </tbody>
        </table> 
        </td></tr> 
    
        
        <tr><td width="82" align="right" valign="top">&nbsp;</td>
        <td valign="top"><img src="rd.gif" width="13" height="13">
            <b><a name="949.93.45">Aaker, H. H.</a></b>&nbsp;— of 
            <a href="http://google.com">Norman County</a>
            , Minn. Prohibition candidate for 
            <a href="http://google.com">secretary of state of Minnesota</a>
            , 1892.
            <a href="http://google.com">Burial location unknown</a>.
        </td></tr> 

    </tbody>
</table><br clear="all"><br> 

<a name="AALL"> </a> 
    <table border="" width="100%" cellpadding="5">
        <tbody><tr><td bgcolor="#FFFFFF"><b>AALL</b> <small>(
            <a href="http://google.com">SoundexA400</a>
            )— <i>See also</i> 
            <a href="http://google.com">AHL</a>,
            <a href="http://google.com">AL</a>,
            <a href="http://google.com">ALL</a>,
            </small>
            </td></tr>
        </tbody></table><br clear="all"> 

<tbody><tr><td width="82" align="right" valign="top">&nbsp;</td>
    <td valign="top"><img src="rd.gif" width="13" height="13">
        <b><a name="961.32.34">Aamodt, Gary</a></b>&nbsp;— of Madison, 
        <a href="http://google.com">Dane County</a>, Wis.
        Democrat. Delegate to Democratic National Convention from Wisconsin,
        <a href="http://google.com">1976</a>. Still living as of 1976. 
    </td></tr> 
    
    <tr><td width="82" align="right" valign="top">&nbsp;</td>
        <td valign="top"><img src="rd.gif" width="13" height="13">
            <b><a name="030.75.75">Aamodt, Marjorie M.</a></b>&nbsp;— 
            Democrat. Candidate for 
            <a href="http://google.com">Pennsylvania
            state house of representatives</a> 13th District, 1980.
            <a href="http://google.com">Female</a>. 
            Still living as of 1980. 
        </td>
    </tr> 
    
</tbody></table><br clear="all"><br> 

So far I have tried defining an object:

ta = driver.find_element_by_tag_name('tbody').get_attribute('innerHTML') 
pd.read_html(ta)

But I want every pd.read_html(ta)[i] stored in a single DataFrame, ignoring the tables with width="100%".

Recommended answer

You can .extract() the tables with width="100%" from the soup and then get all the remaining rows.

For example (txt contains the HTML snippet from the question):

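from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(txt, 'html.parser')

# Remove every table with width="100%" (the section headers and the nested "Relatives" tables)
for t in soup.select('table[width="100%"]'):
    t.extract()

all_data = []
# The remaining rows are the entries; split each on the first '—' into name and description
for row in soup.select('tr'):
    name, desc = row.get_text(strip=True, separator=' ').split('—', maxsplit=1)
    all_data.append([name, desc.strip()])

df = pd.DataFrame(all_data, columns=['name', 'description'])
print(df)

df.to_csv('data.csv')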

Prints:

                               name                                        description
0  Aaker, Casper Drengman (b.1883)   also known as Casper D. Aaker — of Minot, Ward...
1                     Aaker, H. H.   of Norman County , Minn. Prohibition candidate...
2                     Aamodt, Gary   of Madison, Dane County , Wis.\n        Democr...
3              Aamodt, Marjorie M.   Democrat. Candidate for Pennsylvania\n        ...
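
One caveat about the split in the code above: unpacking the result of .split('—', maxsplit=1) into two names raises ValueError for any row whose text contains no '—'. A minimal sketch of a more forgiving variant follows, using str.partition. The helper name split_entry and the sample strings are hypothetical and only illustrate the idea; they are not part of the original answer.

```python
def split_entry(text, sep="—"):
    """Split 'name — description' on the first separator.

    Returns (name, description). The description is '' when the
    separator is missing, so no exception is raised for such rows.
    """
    name, _, desc = text.partition(sep)
    return name.strip(), desc.strip()


# Hypothetical row texts, shaped like the output of row.get_text(...) above.
rows = [
    "Aaker, H. H. — of Norman County , Minn. Prohibition candidate",
    "Aaker, Casper Drengman (b.1883) — also known as Casper D. Aaker — of Minot",
    "Row without a separator",
]

all_data = [split_entry(r) for r in rows]
for name, desc in all_data:
    print(repr(name), repr(desc))
```

Only the first '—' is used as the boundary, so a description that itself contains the separator stays intact, as in the original code.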

The df.to_csv('data.csv') call also saves the result as data.csv (shown as a LibreOffice screenshot in the original answer).
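
As an alternative that targets the width="82" cells directly, as the question asks, you could select each of those spacer cells and read the cell that follows it. The sketch below is only an illustration under a few assumptions: the html string is a shortened, made-up sample in the same shape as the question's snippet, the name entries is hypothetical, and BeautifulSoup (bs4) is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample shaped like the HTML in the question.
html = """
<table border="" width="100%"><tbody><tr><td><b>AAKER</b></td></tr></tbody></table>
<table align="left" cellpadding="5"><tbody>
<tr><td width="82" align="right" valign="top">&nbsp;</td>
    <td valign="top"><b><a name="954.35.65">Aaker, Casper Drengman</a> (b.1883)</b>
        &nbsp;— of Minot, N.Dak.
        <table width="100%" align="left"><tbody>
            <tr><td width="20">&nbsp;</td><td><i>Relatives:</i> Son of Drengman Aaker.</td></tr>
        </tbody></table>
    </td></tr>
<tr><td width="82" align="right" valign="top">&nbsp;</td>
    <td valign="top"><b><a name="949.93.45">Aaker, H. H.</a></b>&nbsp;— of Norman County, Minn.</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

entries = []
# Each entry row starts with a spacer cell that has width="82";
# the td right after it holds the actual text.
for spacer in soup.select('td[width="82"]'):
    content = spacer.find_next_sibling("td")
    if content is None:
        continue
    # Drop nested tables (e.g. the "Relatives" block) from this cell only.
    for nested in content.find_all("table"):
        nested.extract()
    entries.append(content.get_text(" ", strip=True))

for entry in entries:
    print(entry)
```

Compared with removing every width="100%" table up front, this variant only looks at rows that actually contain a width="82" cell, which may be safer if the page holds other tables you do not want to touch.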
