How to scrape a table and its links


Problem description


    What I want to do is to take the following website (the TDCJ executed offenders page: https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html)

    And pick the year of execution, enter the Last Statement Link, and retrieve the statement... perhaps I would be creating 2 dictionaries, both with the execution number as key.

    Afterwards, I would classify the statements by length, besides "flagging" the ones that were refused or simply not given.

    Finally, all would be compiled in a SQLite database, and I would display a graph that shows how many messages, clustered by type, have been given each year.

    Beautiful Soup seems to be the path to follow; I'm already having trouble with just printing the year of execution... Of course, I'm not ultimately interested in printing the years of execution, but it seems like a good way of checking whether my code is at least properly locating the tags I want.

    tags = soup('td')
    for tag in tags:
        print(tag.get('href', None))
    

    Why does the previous code only print None?

    Thanks beforehand.

    Solution
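
    First, about your snippet: a `td` element has no `href` attribute of its own; the links live on the `a` anchors nested inside the cells, so `tag.get('href', None)` falls back to `None` every time. A minimal sketch of pulling them out directly, assuming requests and BeautifulSoup as in your snippet:

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get("https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html").text
    soup = BeautifulSoup(page, "html.parser")
    
    # href sits on the <a> anchors inside the cells, not on the <td> tags
    for anchor in soup.select("td a"):
        print(anchor.get("href"))

    That said, you don't need to dig the links out of the HTML at all.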

    Use pandas to get and manipulate the table. The links are static, by which I mean they can easily be recreated from the offender's first and last name.

    Then, you can use requests and BeautifulSoup to scrape each offender's last statement, some of which are quite moving.

    Here's how:

    import requests
    import pandas as pd
    
    def clean(first_and_last_name: list) -> str:
        # join "Last, First" into a single lowercase token with no spaces
        name = "".join(first_and_last_name).replace(" ", "").lower()
        # strip generational suffixes and apostrophes so the slug matches the
        # site's URLs; the patterns must match the already-lowercased,
        # space-free string, hence ",jr." rather than ", Jr."
        return name.replace(",jr.", "").replace(",sr.", "").replace("'", "")
    
    
    base_url = "https://www.tdcj.texas.gov/death_row"
    response = requests.get(f"{base_url}/dr_executed_offenders.html")
    
    # read_html returns a list of DataFrames; concat them into one
    df = pd.read_html(response.text, flavor="bs4")
    df = pd.concat(df)
    df.rename(columns={"Link": "Offender Information", "Link.1": "Last Statement URL"}, inplace=True)
    
    df["Offender Information"] = df[
        ["Last Name", 'First Name']
    ].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)
    
    df["Last Statement URL"] = df[
        ["Last Name", 'First Name']
    ].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)
    
    df.to_csv("offenders.csv", index=False)
    

    This gets you a CSV where each row holds the offender's table data plus two working URLs: the offender information page and the last statement page.
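
    For instance, here is the slug the `clean` helper builds for a made-up name (not a real row from the table):

    print(clean(["Doe, Jr.", "John"]))  # -> "doejohn"
    # ...which becomes https://www.tdcj.texas.gov/death_row/dr_info/doejohn.html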

    EDIT:

    I actually went ahead and added the code that fetches all offenders' last statements.

    import random
    import time
    
    import pandas as pd
    import requests
    from lxml import html
    
    base_url = "https://www.tdcj.texas.gov/death_row"
    response = requests.get(f"{base_url}/dr_executed_offenders.html")
    # the last statement is the sixth paragraph of the page's main content block
    statement_xpath = '//*[@id="content_right"]/p[6]/text()'
    
    
    def clean(first_and_last_name: list) -> str:
        # join "Last, First" into a single lowercase token with no spaces
        name = "".join(first_and_last_name).replace(" ", "").lower()
        # strip generational suffixes and apostrophes so the slug matches the
        # site's URLs; the patterns must match the already-lowercased,
        # space-free string, hence ",jr." rather than ", Jr."
        return name.replace(",jr.", "").replace(",sr.", "").replace("'", "")
    
    
    def get_last_statement(statement_url: str) -> str:
        page = requests.get(statement_url).text
        statement = html.fromstring(page).xpath(statement_xpath)
        # pages without a statement yield an empty list; fall back to ""
        text = next(iter(statement), "")
        # collapse runs of whitespace into single spaces
        return " ".join(text.split())
    
    
    df = pd.read_html(response.text, flavor="bs4")
    df = pd.concat(df)
    
    df.rename(
        columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
        inplace=True,
    )
    
    df["Offender Information"] = df[
        ["Last Name", 'First Name']
    ].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)
    
    df["Last Statement URL"] = df[
        ["Last Name", 'First Name']
    ].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)
    
    # (first name, last name, statement URL) triples to drive the fetch loop
    offender_data = list(
        zip(
            df["First Name"],
            df["Last Name"],
            df["Last Statement URL"],
        )
    )
    
    statements = []
    for item in offender_data:
        *names, url = item
        print(f"Fetching statement for {' '.join(names)}...")
        statements.append(get_last_statement(statement_url=url))
        # pause between requests so the server isn't hammered
        time.sleep(random.randint(1, 4))
    
    df["Last Statement"] = statements
    df.to_csv("offenders_data.csv", index=False)
    
    

    This will take a couple of minutes because the code "sleeps" for anywhere between 1 and 4 seconds, more or less, so the server doesn't get abused.

    Once this is done, you'll end up with a .csv file with all offenders' data and their last statements, where one was given.
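
    From there, the follow-up steps described in the question (flagging refusals, classifying by length, loading into SQLite, and counting statements per year) could look roughly like the sketch below. The "Date" column name, the refusal wording, and the 280-character length cut-off are all assumptions for illustration, not taken from the site:

    import sqlite3
    
    import pandas as pd
    
    df = pd.read_csv("offenders_data.csv")
    df["Last Statement"] = df["Last Statement"].fillna("")
    
    
    def classify(statement: str) -> str:
        # heuristic buckets; the exact refusal wording on the site varies,
        # so this pattern is an assumption, as is the length threshold
        s = statement.strip().lower()
        if not s:
            return "none given"
        if "decline" in s or "no statement" in s:
            return "declined"
        return "short" if len(s) < 280 else "long"
    
    
    df["Statement Type"] = df["Last Statement"].apply(classify)
    # "Date" is assumed to be the execution-date column from the scraped table
    df["Year"] = pd.to_datetime(df["Date"]).dt.year
    
    # persist everything in SQLite for later querying
    with sqlite3.connect("offenders.db") as conn:
        df.to_sql("offenders", conn, if_exists="replace", index=False)
    
    # statements per year, clustered by type (needs matplotlib installed)
    counts = df.groupby(["Year", "Statement Type"]).size().unstack(fill_value=0)
    counts.plot(kind="bar", stacked=True).figure.savefig("statements_per_year.png")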
