BeautifulSoup: get value in table
Question
I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and I am sure it is not the best approach, so I am looking for a better way. Here is what I have:
soup = BeautifulSoup(url_opener.open(url))
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next
The relevant HTML is:
<td valign="top">
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tbody><tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L </td>
</tr>
</tbody></table>
</td>
Wow, there are lots of questions about BeautifulSoup. I looked through them but didn't find an answer that helped me; hopefully this is not a duplicate question.
Recommended answer
(Edit: apparently the HTML the OP posted lies: there is in fact no tbody tag to look for, even though he made a point of including it in that HTML. So, changing to use table instead of tbody.)
As there may be several table-rows you want (e.g., see the sibling URL to the one you give, with the last digit, 4, changed into a 5), I suggest a loop such as the following:
import re

# locate the table containing a cell with the given text
# (`soup` is the BeautifulSoup object built in the question)
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table':
    cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for x in cell.findAll(text=lambda x: x.strip() and not owner.match(x)):
    print(x)
This is reasonably robust to minor changes in page structure: having located the cell of interest, it walks up through its parents until it finds the table tag, then loops over all navigable strings within that table that aren't empty (or just whitespace), excluding the owner header.
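For readers on a newer stack, here is a self-contained sketch of the same idea. It assumes Python 3 and BeautifulSoup 4 (the `bs4` package) rather than the BeautifulSoup 3 API used above, and it parses the HTML fragment from the question instead of fetching the live page. The names `HTML` and `owner_names` are made up for illustration, and `find_parent("table")` stands in for the manual walk up the parents.

```python
import re
from bs4 import BeautifulSoup

# HTML fragment copied from the question (tbody kept as posted there).
HTML = """
<td valign="top">
<table border="1" cellpadding="1" cellspacing="0" align="right">
<tbody><tr class="tableheaders">
<td>Owner Name(s)</td>
</tr>
<tr>
<td>PILCHER DONALD L </td>
</tr>
</tbody></table>
</td>
"""


def owner_names(html, header_pattern=r"Owner Name"):
    """Return the non-empty strings of the table holding the header cell.

    Hypothetical helper: the name and signature are illustrative only.
    """
    header = re.compile(header_pattern)
    soup = BeautifulSoup(html, "html.parser")

    # Find the text node matching the header, wherever it sits in the page.
    node = soup.find(string=header)
    if node is None:
        return []

    # Walk up to the nearest enclosing <table>, however deeply nested.
    table = node.find_parent("table")
    if table is None:
        return []

    # Keep every string in the table that is not blank and not the header.
    return [
        s.strip()
        for s in table.find_all(string=True)
        if s.strip() and not header.search(s)
    ]


print(owner_names(HTML))
```

Returning a list rather than printing inside the loop makes it easier to handle pages with several owner rows, such as the sibling URL mentioned above.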