Fetching lawyer details from multiple links using bs4 in Python


Problem Description

I am an absolute beginner to web scraping with Python and know very little about Python programming. I am trying to extract information about lawyers in the Tennessee location. On the webpage there are multiple links, within which there are further links to the categories of lawyers, and within those are the lawyers' details.

I have already extracted the links of the various cities into a list, and have also extracted the various categories of lawyers available in each of the city links. Now I am trying to fetch each lawyer's profile link under each category of every city, from which I will retrieve the lawyers' details. But an empty list is being returned. What can be done? Please suggest if possible.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

cities = [item['href'] for item in soup.select('#browse_view a')]
for c in cities:
    r=requests.get(c)
    s1=bs(r.content,'lxml')
    categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
    #print(categories)
    for c1 in categories:
        r1=requests.get(c1)
        s2=bs(r1.content,'lxml')
        lawyers = [item['href'] for item in s2.select('.directory_profile a')]
        print(lawyers)

"I expected the output to be with the links of each of the profile of the lawyers of each category, but it is returning empty list."

"[][][][][][][]"

Recommended Answer

Your first issue is that when you use the class selector you are already at the level of the a tag, so asking for an a tag *inside* that element matches nothing.
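A minimal sketch of the mismatch, using a made-up HTML fragment shaped like the listing page (the class names are from the answer; the href and text are invented): when the a tag itself carries the class, the descendant selector `.directory_profile a` looks for an a *inside* such an element and finds none.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class sits on the <a> itself, not on a parent
html = '<div class="indigo_text"><a class="directory_profile" href="/profile/1">Jane Doe</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# Descendant selector: looks for an <a> nested inside .directory_profile -> empty
print(soup.select('.directory_profile a'))        # []

# Selecting the classed element itself finds the link
print(soup.select('.indigo_text .directory_profile'))
```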

I use a different selector below and handle hrefs which disguise the fact that they point to the same lawyer. I split off the end URL so I can use a set to remove duplicates.
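A quick sketch of that splitting step on invented hrefs (the URLs below are hypothetical stand-ins; the answer's real data embeds the profile URL after a '*'):

```python
# Hypothetical hrefs: one wraps the real profile URL after a '*', one is already plain
hrefs = [
    'https://redirect.example/track*https://profiles.superlawyers.com/jane-doe.html',
    'https://profiles.superlawyers.com/john-roe.html',
]

# Keep only the final URL so duplicates collapse to the same string
cleaned = [h.split('*')[1] if '*' in h else h for h in hrefs]
print(cleaned)
```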

I use a Session for the efficiency of re-using the connection. I append the lawyers' profiles to a list and flatten that list via a set comprehension to remove any duplicates.

import requests
from bs4 import BeautifulSoup as bs

final = []
with requests.Session() as s:  # re-use the underlying connection across requests
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
    # links to each city's listing page
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        # links to each practice-area category within the city
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            # profile links; some hrefs embed the real profile URL after a '*'
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href'] for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
# flatten the per-category lists and de-duplicate in one step
final_list = {item for sublist in final for item in sublist}
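The set comprehension on the last line flattens the list of per-category lists and removes duplicates in one step. On toy data (the filenames below are invented for illustration):

```python
# Toy data standing in for the scraped per-category profile lists
final = [['a.html', 'b.html'], ['b.html', 'c.html']]

# Flatten the nested lists and drop duplicates in one set comprehension
final_list = {item for sublist in final for item in sublist}
print(sorted(final_list))  # ['a.html', 'b.html', 'c.html']
```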
