How to select some URLs with BeautifulSoup?

Views: 173 Published: 2016/8/5 19:05:42 Tags: python screen-scraping beautifulsoup web-scraping


Question

I want to scrape the following information, except for the last row and the class="Region" row:

...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td> 
<td bgcolor="" align="left">New York</td> 
<td bgcolor="" align="left" class="Region">N/A</td> 
<td bgcolor="" align="left">1,863</td> 
<td bgcolor="" align="left">565</td> 
<td bgcolor="" align="left">1,133</td> 
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF">
...

I tested with this handler:

class TestUrlOpen(webapp.RequestHandler):
    def get(self):
        soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
        link_list = []
        for a in soup.findAll('a',href=True):
            link_list.append(a["href"])
        self.response.out.write("""<p>link_list: %s</p>""" % link_list)

This works, but it also gets the "View Profile" link, which I don't want:

link_list: [u'http://www.ilrg.com/', u'http://www.ilrg.com/', u'http://www.ilrg.com/nations/', u'http://www.ilrg.com/gov.html', ......]

I can easily remove the u'http://www.ilrg.com/' entries after scraping the site, but it would be nice to have a list without them. What is the best way to do this? Thanks.

Recommended answer

I think this may be what you are looking for. The attrs argument can be helpful for isolating the sections you want.

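# Python 2 with BeautifulSoup 3, matching the handler in the question
from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))

# Each firm is listed in a <tr class="small"> row
rows = soup.findAll(name='tr', attrs={'class': 'small'})
for row in rows:
    # The first cell of the row holds the rank number
    number = row.find('td').text
    # Only the data cells are left-aligned; the "View Profile" cell is centered
    tds = row.findAll(name='td', attrs={'align': 'left'})
    link = tds[0].find('a')['href']
    firm = tds[0].text
    office = tds[1].text
    # tds[2] is the class="Region" cell, which is skipped
    attorneys = tds[3].text
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, firm, office, attorneys, partners, associates, salary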
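
For readers on Python 3, the same idea can be sketched with the bs4 package. This is a minimal illustration, not the answer's original code: it parses an inline sample (SAMPLE_HTML, modeled on the row in the question, plus a made-up footer row) instead of fetching the live page, and the function name extract_firms and the dictionary keys are assumptions chosen for the example. It filters out the class="Region" cell by its class rather than by position, and skips any row that has no firm link.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the row in the question. The second row is a
# made-up footer row, included only to show how non-data rows are skipped.
SAMPLE_HTML = """
<table>
<tr class="small" bgcolor="#FFFFFF">
<td>7</td>
<td align="left"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
<td align="left">New York</td>
<td align="left" class="Region">N/A</td>
<td align="left">1,863</td>
<td align="left">565</td>
<td align="left">1,133</td>
<td align="left">$160,000</td>
<td align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td>
</tr>
<tr class="small"><td colspan="9">Footer row without firm data</td></tr>
</table>
"""


def extract_firms(html):
    """Return one dict per data row, leaving out Region cells and profile links."""
    soup = BeautifulSoup(html, "html.parser")
    firms = []
    for row in soup.find_all("tr", class_="small"):
        # Keep left-aligned cells only, so the centered "View Profile" cell
        # is excluded, then drop the class="Region" cell.
        cells = [td for td in row.find_all("td", align="left")
                 if "Region" not in td.get("class", [])]
        link = cells[0].find("a", href=True) if cells else None
        if link is None or len(cells) < 6:
            continue  # not a data row (for example, a footer row)
        firms.append({
            "url": link["href"],
            "firm": link.get_text(strip=True),
            "office": cells[1].get_text(strip=True),
            "attorneys": cells[2].get_text(strip=True),
            "partners": cells[3].get_text(strip=True),
            "associates": cells[4].get_text(strip=True),
            "salary": cells[5].get_text(strip=True),
        })
    return firms


if __name__ == "__main__":
    for firm in extract_firms(SAMPLE_HTML):
        print(firm)
```

To run this against the real site, the page source would be passed to extract_firms in place of SAMPLE_HTML. Whether the live markup still matches the snippet in the question is not something this sketch can confirm.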
