BeautifulSoup: get a value in a table


Question

I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)" value. What I have works, but it is really ugly and, I am sure, not the best approach, so I am looking for a better way. Here is what I have:

soup = BeautifulSoup(url_opener.open(url))            
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>

    <tr>

    <td>PILCHER DONALD L                         </td>
    </tr>

    </tbody></table>
</td>

There are lots of questions about BeautifulSoup. I looked through them but didn't find an answer that helped me, so hopefully this is not a duplicate.

Recommended answer

(Edit: apparently the HTML the OP posted is misleading. There is in fact no tbody tag to look for, even though the posted HTML makes a point of including one. So the code now looks for table instead of tbody.)

As there may be several table rows you want (e.g., see the sibling URL to the one you give, with the last digit, 4, changed to 5), I suggest a loop such as the following:

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table':
    cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for x in cell.findAll(text=lambda s: s.strip() and not owner.match(s)):
    print(x)

This is reasonably robust to minor changes in page structure: having located the cell of interest, it walks up through its parents until it finds the table tag, then loops over all navigable strings within that table that are not empty (or just whitespace), excluding the owner header.
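
For readers on Python 3, the same idea can be sketched with the bs4 package (Beautiful Soup 4). This is only a sketch under assumptions: the function name owner_names and the inline HTML constant are hypothetical, the sample markup is the snippet from the question with tbody removed (as the edit note says the real page has none), and the built-in "html.parser" backend is assumed rather than whatever parser the original code used.

```python
import re
from bs4 import BeautifulSoup

# Sample markup adapted from the question (tbody removed); not fetched live.
HTML = """
<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>
    <tr>
    <td>PILCHER DONALD L                         </td>
    </tr>
    </table>
</td>
"""

OWNER = re.compile("Owner Name")


def owner_names(html):
    """Return the non-empty strings in the table that holds the owner header.

    Hypothetical helper: returns an empty list if the header text or an
    enclosing table cannot be found.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Locate the string node containing the header text.
    header = soup.find(string=OWNER)
    if header is None:
        return []
    # Walk up to the nearest enclosing <table>.
    table = header.find_parent("table")
    if table is None:
        return []
    # Collect every string in the table, skipping whitespace and the header.
    return [
        s.strip()
        for s in table.find_all(string=True)
        if s.strip() and not OWNER.search(s)
    ]


if __name__ == "__main__":
    for name in owner_names(HTML):
        print("And the owner is", name)
```

Returning a list instead of printing inside the loop keeps the case of several owner rows easy to handle, and find_parent replaces the manual while loop over .parent.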
