Beautiful Soup AssertionError

Problem description

I am trying to scrape this website into a .CSV and I am getting an error that says: AssertionError: 9 columns passed, passed data had 30 columns. My code is below; it is a little messy because I exported it from a Jupyter Notebook.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'

req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html)

type(soup)  # we see that soup is a BeautifulSoup object

column_headers = [th.getText() for th in 
                  soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers

data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows

type(data_rows)  # now we have a list of table rows

candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
            for i in range(len(data_rows))]

df = pd.DataFrame(candidate_data, columns=column_headers)
df.head()  # head() lets us see the 1st 5 rows of our DataFrame by default

df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)

Solution

The data on the page definitely has a table, and you parse out its column headers and pass them along to your CSV. Visually that table has 8 columns, but you parse 9 headers. At this point you should probably check your data to see what you actually found; it might not be what you expect. But okay, say you check and see that one of them is a spacer column in the table that will be empty or garbage, and you proceed.
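One way to do that check before building the DataFrame is to compare the number of headers with the width of every parsed row. The sketch below uses made-up stand-in data rather than the real page, and `summarize_row_widths` is a hypothetical helper name, not part of pandas or Beautiful Soup.

```python
from collections import Counter


def summarize_row_widths(headers, rows):
    """Return the header count and a Counter mapping row width -> number of rows."""
    return len(headers), Counter(len(row) for row in rows)


# Hypothetical stand-in data: 3 headers, but the rows do not all match.
headers = ["Name", "Office", "Party"]
rows = [
    ["A. Smith", "Governor", "IND"],  # a well-formed data row
    ["x"] * 7,                        # a row picked up from an unrelated layout table
    [],                               # an element that contained no <td> cells at all
]

n_headers, widths = summarize_row_widths(headers, rows)
print("headers:", n_headers)
print("row widths:", dict(widths))
```

Any width other than the header count is a sign that the scrape picked up something besides the table you meant to read.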

These lines:

data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows

type(data_rows)  # now we have a list of table rows

candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
        for i in range(len(data_rows))]

find every <th> instance in the page and then every <td> inside each <th>, and that's where it really goes off the rails. I am guessing you are not a web developer, but tables and their sub-elements (rows aka <tr>, headers aka <th>, and cells aka <td>) are used all over most pages for organizing tons of visual elements and also sometimes for organizing tabular data.

Guess what? You found a lot of tables that are not this visual table because you were searching the whole page for <th> elements.
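To illustrate the effect, here is a small sketch on a made-up page, not the real site. The HTML, the ids `layout` and `results`, and the names in it are all assumptions for the example. A page-wide search for `<th>` returns header cells from every table, and a well-formed `<th>` does not normally contain any `<td>` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one table used purely for layout, one holding the real data.
html = """
<table id="layout">
  <tr><th>Menu</th><th>Search</th></tr>
  <tr><td>Home</td><td>Help</td></tr>
</table>
<table id="results">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>A. Smith</td><td>Governor</td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")

# Searching the whole page returns header cells from BOTH tables.
all_headers = [th.get_text() for th in page.find_all("th")]
print(all_headers)

# Looking for <td> cells inside each <th> finds nothing in this simple markup.
cells_inside_headers = [len(th.find_all("td")) for th in page.find_all("th")]
print(cells_inside_headers)
```

On a real page where layout tables are nested inside header cells, the second search can instead return large numbers of unrelated cells, which would be consistent with the 30-column rows in the error.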

I'd suggest you narrow things down instead of searching the entire soup: first find the <table> or <div> that contains only the tabular data you're interested in, and then search within that scope.
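A minimal sketch of that scoped approach follows, again on made-up HTML. The id `results` is an assumption for the example; on the real page you would need to inspect the markup to find an id, class, or other attribute that singles out the table you want.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a layout table and the data table of interest.
html = """
<table id="layout">
  <tr><th>Menu</th><th>Search</th></tr>
  <tr><td>Home</td><td>Help</td></tr>
</table>
<table id="results">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>A. Smith</td><td>Governor</td></tr>
  <tr><td>B. Jones</td><td>Treasurer</td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")

# Narrow the scope first: every later search runs inside this one table only.
table = page.find("table", id="results")

# Header cells of the chosen table.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# One list per <tr> that actually contains <td> cells (header-only rows are skipped).
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td")
]

print(headers)
print(rows)
```

Because every row now comes from the same table as the headers, the widths line up, and `rows` and `headers` can be handed to `pd.DataFrame(rows, columns=headers)` as in the original code.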
