How to obtain a link inside of a DIV nested in a TD with BeautifulSoup


Question

Problem: "Easy" way (pd.read_html()) of obtaining table information with Pandas isn't working for my use case.

It's only pulling what I believe is the label text, and it's got this newbie confused. What I need is at least the link (to PDF) text.
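
For context, every cell in that last column is a TD wrapping a DIV that holds the anchor tags, so get_text() flattens it all down to just the link labels. A rough sketch of what I mean (the markup below is illustrative only, not the portal's exact HTML):

from bs4 import BeautifulSoup

# Illustrative markup only; the real cell on the portal is more involved.
snippet = """
<table><tr><td>
  <div class="grid-links">
    <a href="/link/to/docketSheet.pdf">Docket Sheet</a>
    <a href="/link/to/courtSummary.pdf">Court Summary</a>
  </div>
</td></tr></table>
"""

cell = BeautifulSoup(snippet, 'lxml').find('td')
print(cell.get_text(strip=True))  # 'Docket SheetCourt Summary', the label text I am getting
print(cell.find('a')['href'])     # '/link/to/docketSheet.pdf', the link I actually need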

Table was obtained from an ASPX page via Requests/BeautifulSoup. I was able to get that table into a Pandas DataFrame without issue.
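
As an aside I did not end up using: newer pandas (1.5+) can keep the hrefs in read_html() itself via its extract_links argument, which returns each body cell as a (text, href) tuple. A rough sketch against the table HTML the script saves to tmp/test_draft1_table.html:

from io import StringIO
import pandas as pd

# pandas >= 1.5 only: each body cell comes back as a (text, href) tuple,
# with href set to None wherever a cell has no link.
with open('tmp/test_draft1_table.html') as f:
    df_links = pd.read_html(StringIO(f.read()), extract_links='body')[0]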

If you use the link below, please copy and paste it to strip off the referring URL. With my luck, some IT person would change the code and break the script sooner than necessary. lol

Link to the page (you'll have to search manually using the variables defined in the script).

Scraper.py:

import requests
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup as bs

# User-defined variables
SearchBy = 'DateFiled'
FiledStartDate = '2020-01-01'
FiledEndDate = '2020-01-01'
County = 'Luzerne'
MDJSCourtOffice = 'MDJ-11-1-01'

host = "ujsportal.pacourts.us"
base_url = "https://" + host
search_url = base_url + "/CaseSearch"

# Headers are required. Do not change.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,\
               image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': host,
    'Origin': base_url,
    'Referer': search_url,
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) \
                   Gecko/20100101 Firefox/88.0'
}

# Open session request to obtain proper cookies
ses = requests.session()
req = ses.get(search_url, headers=headers)

# Get required hidden token so we can search
tree = html.fromstring(req.content)
veri_token = tree.xpath("/html/body/div[3]/div[2]/div/form/input/@value")[0]
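# NOTE: the absolute XPath above is position dependent. If the page layout
# ever shifts, the same hidden token can also be located by its name
# attribute (assuming the site keeps the standard ASP.NET field name that
# the payload below already relies on), e.g.:
#   soup = bs(req.content, 'lxml')
#   veri_token = soup.find('input', {'name': '__RequestVerificationToken'})['value']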

# Import search criteria from user-defined variables
payload = {
    'SearchBy': SearchBy,
    'AdvanceSearch': 'true',
    'FiledStartDate': FiledStartDate,
    'FiledEndDate': FiledEndDate,
    'County': County,
    'MDJSCourtOffice': MDJSCourtOffice,
    '__RequestVerificationToken': veri_token
}

# Make search request
results = ses.post(
    search_url,
    data=payload,
    headers=headers
)

# Save html page to disk
with open("tmp/test_draft1.html", "w") as f:
    f.write(results.text)

# Open local HTML page for processing
with open("tmp/test_draft1.html") as html:
    page = bs(html, 'lxml')

table = page.find('table', {'id': 'caseSearchResultGrid'})

# Save table as separate HTML for later audit
with open("tmp/test_draft1_table.html", "w") as f:
    f.write(table.prettify())


# Remove unneeded tags so we don't have to do it in Pandas
def clean_tags(table):
    for tag in table.select('div.bubble-text'):
        tag.decompose()
    for tag in table.select('div.modal'):
        tag.decompose()
    for tag in table.find_all(['th', 'tr', 'td'], class_="display-none"):
        tag.decompose()
    for tag in table.select('tfoot'):
        tag.decompose()


clean_tags(table)

# Start constructing dataset
columns = table.find('thead').find_all('th')
column_names = [c.get_text() for c in columns]

table_rows = table.find('tbody').find_all('tr')

case_info = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.get_text() for cell in td]
    case_info.append(row)

# Forward dataset to Pandas for analysis
df = pd.DataFrame(case_info, columns=column_names)
df.columns.values[16] = "Docket URL"

if SearchBy == 'DateFiled':
    df.drop(columns=['Event Type',
            'Event Status', 'Event Date', 'Event Location'], inplace=True)

df
exit("Scrape Complete!")

This does pull the docket PDF link itself into a separate list, but it doesn't correctly update the cells.

for row in table_rows:
    row_processed = []
    cells = row.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        row_processed.append(docket_url)

Current print(df) truncated output snippet:

              Docket Number  ...                 Docket URL
0  MJ-11101-CR-0000001-2020  ...  Docket SheetCourt Summary
1  MJ-11101-CR-0000003-2020  ...  Docket SheetCourt Summary
2  MJ-11101-CR-0000006-2020  ...  Docket SheetCourt Summary
3  MJ-11101-NT-0000081-2020  ...  Docket SheetCourt Summary

Needed print(df) truncated output snippet:

              Docket Number  ...                 Docket URL
0  MJ-11101-CR-0000001-2020  ...  https://link/to/docketPDF
1  MJ-11101-CR-0000003-2020  ...  https://link/to/docketPDF
2  MJ-11101-CR-0000006-2020  ...  https://link/to/docketPDF
3  MJ-11101-NT-0000081-2020  ...  https://link/to/docketPDF

Answer

Ok, with the help and guidance of QHarr in the comments of the original question, I was able to come up with a solution. As with anything coding related, I am sure this isn't the only answer.

Anyway... after trying to get these two iteration loops integrated the way I wanted, I went with concatenating two Pandas DataFrames instead, which works.

Case info data:

case_info = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.get_text() for cell in td]
    case_info.append(row)

Docket URLs: (builds a list containing the URLs)

docket_urls = []
for drow in table_rows:
    docket_sheets = []
    cells = drow.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        docket_sheets.append(docket_url)
    docket_urls.append(docket_sheets)

DataFrames:

# Import case info dataset to Pandas
df_case_info = pd.DataFrame(case_info, columns=column_names)
df_case_info.columns.values[16] = "Docket Text"  # Rename col = easy to drop

df_case_info.drop(columns=['Docket Text'], inplace=True)

if SearchBy == 'DateFiled':
    df_case_info.drop(columns=['Event Type',
            'Event Status', 'Event Date', 'Event Location'], inplace=True)

# Import docket URLs into Pandas
df_docket_urls = pd.DataFrame(docket_urls, columns=['Docket URL'])

# Concatonate both DataFrames into one
df_mdj = pd.concat([df_case_info, df_docket_urls], axis=1)
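
To sanity-check or save the merged frame (just a usage sketch continuing from the code above; the CSV path is arbitrary, and I am assuming the first header comes through as 'Docket Number' like in the output shown earlier):

print(df_mdj[['Docket Number', 'Docket URL']].head())
df_mdj.to_csv('tmp/mdj_cases.csv', index=False)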

I still want to learn how to do this the way I originally planned. But if it's not broke, don't fix it, right?
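
For future reference, here is a sketch of the single-loop version I originally had in mind (untested beyond my own search; it assumes every result row has the full 17 cells and that the last cell always contains at least one anchor):

case_info = []
docket_urls = []
for tr in table_rows:
    cells = tr.find_all('td')
    case_info.append([cell.get_text() for cell in cells])
    docket_urls.append(
        base_url + cells[16].find('a')['href'] if len(cells) == 17 else None
    )

df_mdj = pd.DataFrame(case_info, columns=column_names)
# Swap the label-text column for the collected links in one pass
df_mdj = df_mdj.rename(columns={column_names[16]: 'Docket URL'})
df_mdj['Docket URL'] = docket_urls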

Thanks to those who helped. That's what I call a success. :)
