How to obtain link inside of a DIV nested in a TD with BeautifulSoup
Question
Problem: "Easy" way (pd.read_html()
) of obtaining table information with Pandas isn't working for my use case.
It's only pulling what I believe is the label text, and it's got this newb confuzzled. What I need is at least the link (to pdf) text.
Table was obtained from an ASPX page via Requests/BeautifulSoup. I was able to get that table into a Pandas DataFrame without issue.
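The root issue is that `get_text()` flattens a cell to its visible text and discards any `href` attributes, which is also all `pd.read_html()` keeps. A minimal illustration on a toy table cell (the markup here is a made-up stand-in for the real page):

```python
from bs4 import BeautifulSoup

html = '<td><div><a href="/report.pdf">Docket Sheet</a></div></td>'
cell = BeautifulSoup(html, "html.parser").td

# get_text() returns only the label text, losing the link target
print(cell.get_text())         # Docket Sheet
# The href has to be read off the <a> tag explicitly
print(cell.find("a")["href"])  # /report.pdf
```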
If you use the link below, copy and paste it to strip the referral URL. With my luck, some IT guy would change the code and break the script sooner than necessary. lol
Link to the page (you'll have to run the search manually using the variables defined in the script).
Scraper.py:
import requests
import pandas as pd
from lxml import html
from bs4 import BeautifulSoup as bs

# User-defined variables
SearchBy = 'DateFiled'
FiledStartDate = '2020-01-01'
FiledEndDate = '2020-01-01'
County = 'Luzerne'
MDJSCourtOffice = 'MDJ-11-1-01'

host = "ujsportal.pacourts.us"
base_url = "https://" + host
search_url = base_url + "/CaseSearch"

# Headers are required. Do not change.
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': host,
    'Origin': base_url,
    'Referer': search_url,
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) '
                  'Gecko/20100101 Firefox/88.0'
}

# Open session request to obtain proper cookies
ses = requests.session()
req = ses.get(search_url, headers=headers)

# Get required hidden token so we can search
tree = html.fromstring(req.content)
veri_token = tree.xpath("/html/body/div[3]/div[2]/div/form/input/@value")[0]

# Import search criteria from user-defined variables
payload = {
    'SearchBy': SearchBy,
    'AdvanceSearch': 'true',
    'FiledStartDate': FiledStartDate,
    'FiledEndDate': FiledEndDate,
    'County': County,
    'MDJSCourtOffice': MDJSCourtOffice,
    '__RequestVerificationToken': veri_token
}

# Make search request
results = ses.post(search_url, data=payload, headers=headers)

# Save HTML page to disk
with open("tmp/test_draft1.html", "w") as f:
    f.write(results.text)

# Open local HTML page for processing
with open("tmp/test_draft1.html") as html_file:
    page = bs(html_file, 'lxml')
table = page.find('table', {'id': 'caseSearchResultGrid'})

# Save table as separate HTML for later audit
with open("tmp/test_draft1_table.html", "w") as f:
    f.write(table.prettify())

# Remove unneeded tags so we don't have to do it in Pandas
def clean_tags(table):
    for tag in table.select('div.bubble-text'):
        tag.decompose()
    for tag in table.select('div.modal'):
        tag.decompose()
    for tag in table.find_all(['th', 'tr', 'td'], class_="display-none"):
        tag.decompose()
    for tag in table.select('tfoot'):
        tag.decompose()

clean_tags(table)

# Start constructing dataset
columns = table.find('thead').find_all('th')
column_names = [c.get_text() for c in columns]
table_rows = table.find('tbody').find_all('tr')

case_info = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.get_text() for cell in td]
    case_info.append(row)

# Forward dataset to Pandas for analysis
df = pd.DataFrame(case_info, columns=column_names)
df.columns.values[16] = "Docket URL"
if SearchBy == 'DateFiled':
    df.drop(columns=['Event Type', 'Event Status',
                     'Event Date', 'Event Location'], inplace=True)

print(df)
exit("Scrape Complete!")
This pulls the docket pdf links themselves into a separate list, but it doesn't update the cells correctly:
for row in table_rows:
    row_processed = []
    cells = row.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        row_processed.append(docket_url)
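If the goal is to fix the column in place rather than build a separate list, the gathered hrefs can be assigned straight onto the existing DataFrame column. A sketch with toy data, assuming one link per row and that the hrefs were collected in the same order as the DataFrame's rows:

```python
import pandas as pd

base_url = "https://ujsportal.pacourts.us"

# Toy stand-in for the scraped DataFrame, where the URL column
# currently holds only the label text pulled by get_text()
df = pd.DataFrame({"Docket Number": ["MJ-1", "MJ-2"],
                   "Docket URL": ["Docket Sheet", "Docket Sheet"]})

# Toy stand-in for the relative hrefs gathered by the loop above
docket_hrefs = ["/d1.pdf", "/d2.pdf"]

# Replace the label text with absolute URLs, row for row
df["Docket URL"] = [base_url + h for h in docket_hrefs]
print(df["Docket URL"].tolist())
```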
Current print(df)
truncated output snippet:
Docket Number ... Docket URL
0 MJ-11101-CR-0000001-2020 ... Docket SheetCourt Summary
1 MJ-11101-CR-0000003-2020 ... Docket SheetCourt Summary
2 MJ-11101-CR-0000006-2020 ... Docket SheetCourt Summary
3 MJ-11101-NT-0000081-2020 ... Docket SheetCourt Summary
Needed print(df)
truncated output snippet:
Docket Number ... Docket URL
0 MJ-11101-CR-0000001-2020 ... https://link/to/docketPDF
1 MJ-11101-CR-0000003-2020 ... https://link/to/docketPDF
2 MJ-11101-CR-0000006-2020 ... https://link/to/docketPDF
3 MJ-11101-NT-0000081-2020 ... https://link/to/docketPDF
Accepted answer
Ok, with the help and guidance of QHarr in the comments of the original question, I was able to come up with a solution. As with anything coding related, I am sure this isn't the only answer.
Anyway... after trying and failing to integrate these two iteration loops the way I wanted, concatenating two Pandas DataFrames works.
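In isolation, the column-wise concatenation used below behaves like this (toy data):

```python
import pandas as pd

left = pd.DataFrame({"Docket Number": ["MJ-1", "MJ-2"]})
right = pd.DataFrame({"Docket URL": ["https://a.pdf", "https://b.pdf"]})

# axis=1 glues the two frames side by side, aligning on the row index
merged = pd.concat([left, right], axis=1)
print(merged.columns.tolist())  # ['Docket Number', 'Docket URL']
```

Note that `axis=1` aligns on the index, so both frames must have matching indexes (here the default 0..n-1) or you get NaN-padded rows.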
Case info data:

case_info = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.get_text() for cell in td]
    case_info.append(row)
Docket URLs (builds a list containing the URLs):

docket_urls = []
for drow in table_rows:
    docket_sheets = []
    cells = drow.find_all("td")
    if len(cells) == 17:
        docket_url = base_url + cells[16].find('a')['href']
        docket_sheets.append(docket_url)
    docket_urls.append(docket_sheets)
DataFrames:

# Import case info dataset to Pandas
df_case_info = pd.DataFrame(case_info, columns=column_names)
df_case_info.columns.values[16] = "Docket Text"  # Rename col = easy to drop
df_case_info.drop(columns=['Docket Text'], inplace=True)
if SearchBy == 'DateFiled':
    df_case_info.drop(columns=['Event Type', 'Event Status',
                               'Event Date', 'Event Location'], inplace=True)

# Import docket URLs into Pandas
df_docket_urls = pd.DataFrame(docket_urls, columns=['Docket URL'])

# Concatenate both DataFrames into one
df_mdj = pd.concat([df_case_info, df_docket_urls], axis=1)
I still want to learn how to do this the way I originally planned. But if it's not broke, don't fix it, right?
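For the record, the single-pass version the question was aiming at could look roughly like this. It is a sketch on a made-up two-column table; the real page has 17 columns with the link in cell 16, so the "last cell holds the link" assumption here is an illustration only:

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
  <tr><td>MJ-11101-CR-0000001-2020</td>
      <td><div><a href="/d1.pdf">Docket Sheet</a></div></td></tr>
  <tr><td>MJ-11101-CR-0000003-2020</td>
      <td><div><a href="/d3.pdf">Docket Sheet</a></div></td></tr>
</tbody></table>
"""
base_url = "https://ujsportal.pacourts.us"

rows = []
for tr in BeautifulSoup(html, "html.parser").find_all("tr"):
    cells = tr.find_all("td")
    # Text for every cell except the last, which becomes the absolute URL
    row = [td.get_text(strip=True) for td in cells[:-1]]
    row.append(base_url + cells[-1].find("a")["href"])
    rows.append(row)
print(rows)
```

The point is that text and href are extracted in the same loop, so the list feeds a single DataFrame and no concatenation is needed.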
Thanks to everyone who helped. That's what I call a win. :)