Fetching lawyer details from different links of a website using bs4 in python

Problem Description

I am an absolute beginner to web scraping using Python, with very little knowledge of Python programming. I am just trying to extract the information of the lawyers in the Tennessee location. On the webpage there are multiple links, within which there are further links to the categories of lawyers, and within those are the lawyers' details.

I have already extracted the links of the various cities into a list and have also extracted the various categories of lawyers available in each of the city links. Furthermore, I have extracted each lawyer's URL per category and stored them in a separate set. Now I have iterated over those URLs to fetch the details of the lawyers and finally write them to an .xls file. But when I run my code, the process keeps executing for an indefinite amount of time. I have to force-stop the process, and no .xls file is created at the location specified in the program. What can be done? Please suggest if possible.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

records = []
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers={'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')

    # Collect the link for every city in Tennessee
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        # Collect the practice-area category links listed on each city page
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            # Some profile hrefs embed the real URL after a '*'; extract it when present
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href']
                       for item in s2.select('.indigo_text .directory_profile')]

            for i in lawyers:
                r2 = s.get(i)
                s3 = bs(r2.content, 'lxml')
                name = s3.select_one('#lawyer_name').text
                category = s3.select_one('#attorney_profile_heading').text
                firm = s3.select_one('#firm_profile_page').text
                address = ' '.join(list(s3.select_one('#poap_postal_addr_block').stripped_strings)[1:])
                practices = ' '.join([item.text for item in s3.select('#pa_list li')])
                records.append({'Names': name, 'Category': category, 'Address': address,
                                'Firm Name': firm, 'Practice Area': practices})

# The column names must match the keys used in records above
df = pd.DataFrame(records, columns=['Names', 'Category', 'Address', 'Firm Name', 'Practice Area'])
df = df.drop_duplicates()
df.to_excel(r'C:\Users\laptop\Desktop\lawyers.xls', sheet_name='MyData2', index=False, header=True)

"I expected the program to complete its execution and create an .xls file, but it is going on executing and even I have no idea for how long will it require to complete it's execution. Is there any possibility that an infinite loop has occoured? If possible suggest."

Recommended Answer

The data you're trying to scrape is a lot, and using BeautifulSoup will take a long time. I tried scraping this website, and even after dividing the search into four different program files it took about 12 hours to complete the execution. I also tried your code for only the city of Brentwood, and it took about an hour. I see no infinite loop in your code! Let it run and have patience.
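
If you want to be sure the loops are advancing rather than stuck, a few progress prints go a long way. Below is a minimal sketch of that idea, reusing the session, headers, and selectors from the question; the fetch helper and its messages are purely illustrative:

import time
import requests
from bs4 import BeautifulSoup as bs

def fetch(session, url):
    # Illustrative helper: fetch one page and report how long it took, so a
    # long-running scrape shows visible progress instead of appearing hung.
    start = time.monotonic()
    res = session.get(url, headers={'User-agent': 'Super Bot 9000'})
    print(f'{res.status_code} {url} ({time.monotonic() - start:.1f}s)')
    return bs(res.content, 'lxml')

with requests.Session() as s:
    soup = fetch(s, 'https://attorneys.superlawyers.com/tennessee/')
    cities = [item['href'] for item in soup.select('#browse_view a')]
    print(f'Found {len(cities)} city links')
    for n, c in enumerate(cities, start=1):
        print(f'[{n}/{len(cities)}] {c}')
        s1 = fetch(s, c)
        # ...continue with the category and lawyer loops as in the question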

Also, your program has not created any .xls file because it has not yet reached that part of the code.
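
Until it does, one way to avoid losing everything on a force-stop is to write each record out as soon as it is scraped, instead of building the whole DataFrame at the end. Below is a minimal sketch of that idea using a CSV file; the field names mirror the question's code, and scrape_lawyers is a hypothetical stand-in for its nested loops:

import csv

FIELDS = ['Names', 'Category', 'Address', 'Firm Name', 'Practice Area']

def scrape_lawyers():
    # Hypothetical stand-in for the question's city/category/lawyer loops;
    # the real script would yield one dict per parsed profile here.
    yield {'Names': 'Jane Doe', 'Category': 'Example Attorney', 'Address': '123 Main St',
           'Firm Name': 'Doe Law', 'Practice Area': 'Family Law'}

with open(r'C:\Users\laptop\Desktop\lawyers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    for record in scrape_lawyers():
        writer.writerow(record)  # one row per lawyer, written immediately
        f.flush()                # flush so the file grows while the scrape runs

A CSV opens in Excel just like the .xls the question targets, and you can still deduplicate it afterwards with pd.read_csv(...).drop_duplicates().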

PS: I know this should be a comment, but I currently don't have enough reputation to do so. Hope this answer helps you, and also helps me gain enough reputation to be able to comment next time.
