Python scrape website w/BeautifulSoup4 showing attribute error for table with class name


Problem description

I am following this tutorial: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

To download the table on this page: http://www.knapsackfamily.com/LunchBox/top.php#res

Edit: That table appears after I click the button "List All" which is an input in a form with action=top.php#res.

I inspected the table, and it shows that its class is either sortable dl or sortable d1, so I tried both in my script:

"""
get knapsack food table and table at link "more"
follow: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup
"""

import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml.html as lh

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
url = "http://www.knapsackfamily.com/LunchBox/top.php#res"
#food_df = pd.read_csv(url)

#print(food_df)

page = requests.get(url).text
soup = BeautifulSoup(page, "lxml")
print(soup.prettify())

food_table = soup.find("table", attrs={"class": "sortable d1"})

food_table_data = food_table.tbody.find_all("tr")

headings=[]
# get all heading 
for th in food_table_data.find_all("th"):
    headings.append(th.b.text.replace('\n', ' ').strip())

print(headings)

but I get:

Traceback (most recent call last):
  File "get_knapsack_tables_to_csv.py", line 24, in <module>
    food_table_data = food_table.tbody.find_all("tr")
AttributeError: 'NoneType' object has no attribute 'tbody'

How can I fix this? I want to scrape the page rather than use a Pandas method, because I need to follow the link called "more" in the last column of that table, scrape the English-language table cells on the linked page, and add them as columns to the dataframe I'm trying to build.

Solution

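The table is only generated after the form is submitted, so a plain requests.get() returns a page without it. soup.find() then returns None, and accessing .tbody on None raises the AttributeError. (The #res fragment in the URL is never sent to the server, so it does not help.) To get the data from the server, use requests.post() with the correct form data. For example: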

import requests
from bs4 import BeautifulSoup

url = 'http://www.knapsackfamily.com/LunchBox/top.php'

# Form fields that are submitted when the "List All" button is pressed
data = {
    'mode': 'list3',
    'fword1': '',
    'mode1': ' List All'
}

# POST the form and parse the returned HTML
soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')

# The CSS selector matches any table that has the class d1
for row in soup.select('table.d1 tr'):
    tds = [td.get_text(strip=True) for td in row.select('td, th')]
    print(*tds)

Prints:

大分類 KNApSAcK 種名 学名(一般名) 詳細データ
植物  Abelmoschus esculentus Abelmoschus esculentus[okra、おくら、オクラ、秋葵、あめりかねり、アメリカネリ、おかれんこん、オカレンコン、陸蓮根] more
植物  Abelmoschus moschatus Abelmoschus moschatus、Hibiscus abelmoschus[Ambrette、Musk seed、Musk mallow、とろろあおいもどき、トロロアオイモドキ、においとろろあおい、ニオイトロロアオイ、じゃこうあおい、ジャコウアオイ、りゅうきゅうとろろあおい、リュウキュウトロロアオイ] more
動物(魚類)  Abudefduf sexfasciatus Abudefduf sexfasciatus[scissor-tail sergeant、six-barred sergeant-major、ろくせんすずめだい、ロクセンスズメダイ、六線雀鯛] more
動物(魚類)  Abudefduf vaigiensis Abudefduf vaigiensis[five banded damsefish、sergeant major、waigieu damoiselle、おやびっちゃ、オヤビッチャ] more
植物  Acacia farnesiana Acacia farnesiana[Prickly Moses、Cassie、きんごうかん、キンゴウカン、金合歓] more
動物(魚類)  Acanthocepola krusensternii Acanthocepola krusensternii[yellowspotted bandfish、あかたち、アカタチ、赤太刀] more
動物(魚類)  Acanthocepola limbata Acanthocepola limbata[blackspot bandfish、いってんあかたち、イッテンアカタチ、一点赤太刀] more
動物(魚類)  Acanthocybium solandri Acanthocybium solandri[wahoo、かますさわら、カマスサワラ、魳鰆] more
動物(魚類)  Acanthogobius flavimanus Acanthogobius flavimanus[common blackish goby、genuine goby、spiny goby、まはぜ、マハゼ、真鯊、真沙魚、ごり、ゴリ、かじか、カジカ] more

...and so on.
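Since the goal is a dataframe that also carries the "more" links, the same selection can be turned into a list of records. The following is only a sketch under stated assumptions: SAMPLE_HTML, its English headings, and the detail.php?id=... hrefs are hypothetical stand-ins for the HTML that the POST request returns, and table_to_records is an illustrative helper name, not part of any library. It also checks for a missing table explicitly, so the failure is a clear message rather than an AttributeError on None.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the HTML returned by the POST request.
# The headings and hrefs below are made up for illustration.
SAMPLE_HTML = """
<table class="sortable d1">
  <tr><th>Category</th><th>Species</th><th>Details</th></tr>
  <tr><td>Plant</td><td>Abelmoschus esculentus</td>
      <td><a href="detail.php?id=1">more</a></td></tr>
  <tr><td>Plant</td><td>Acacia farnesiana</td>
      <td><a href="detail.php?id=2">more</a></td></tr>
</table>
"""

BASE_URL = "http://www.knapsackfamily.com/LunchBox/top.php"


def table_to_records(html, base_url=BASE_URL):
    """Return one dict per data row, including the absolute URL of its link."""
    soup = BeautifulSoup(html, "html.parser")

    # select_one() returns None when nothing matches, so check before use.
    table = soup.select_one("table.d1")
    if table is None:
        raise ValueError("table.d1 not found; was the form submitted with POST?")

    rows = table.select("tr")
    if not rows:
        return []

    # The first row holds the column headings.
    headings = [th.get_text(strip=True) for th in rows[0].select("th")]

    records = []
    for row in rows[1:]:
        cells = row.select("td")
        if not cells:
            continue  # skip rows without data cells
        record = dict(zip(headings, (td.get_text(strip=True) for td in cells)))
        # Resolve the first link in the row against the page URL.
        link = row.select_one("a[href]")
        record["more_url"] = urljoin(base_url, link["href"]) if link else None
        records.append(record)
    return records


records = table_to_records(SAMPLE_HTML)
for record in records:
    print(record)
```

With the real page, the html argument would be requests.post(url, data=data).content from the answer above, and pd.DataFrame(records) would turn the result into a dataframe. Each more_url could then be fetched and parsed the same way to add further columns.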
