Beautiful Soup extracts only the header of a table

Question

I want to extract information from the table on the following website using Beautiful Soup in Python 3.5.

http://www.askapatient.com/viewrating.asp?drug=19839&name=ZOLOFT

I have to save the web page first, because my program needs to work offline.

I saved the web page on my computer and used the following code to extract the table information. The problem is that the code extracts only the table's heading row.

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Local copy of the page, saved for offline use
url = "file:///Users/MD/Desktop/ZoloftPage01.html"

home_page = urlopen(url)
soup = BeautifulSoup(home_page, "html.parser")
table = soup.find("table", attrs={"class": "ratingsTable"})
comments = [td.get_text() for td in table.find_all("td")]
print(comments)

And this is the output of the code:

['RATING', '\xa0 REASON', 'SIDE EFFECTS FOR ZOLOFT', 'COMMENTS', 'SEX', 'AGE', 'DURATION/DOSAGE', 'DATE ADDED ', '\xa0']

I need the information from all of the table's rows. Thanks for your help!

Recommended answer

This happens because the page's HTML is broken. You need to switch to a more lenient parser such as html5lib. Here is what works for me:

from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = "http://www.askapatient.com/viewrating.asp?drug=19839&name=ZOLOFT"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})

# HTML parsing part
soup = BeautifulSoup(response.content, "html5lib")
table = soup.find("table", attrs={"class":"ratingsTable"})
comments = [[td.get_text() for td in row.find_all("td")] 
            for row in table.find_all("tr")]
pprint(comments)
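
Since the question asks for an offline solution, the same parser switch can presumably be applied to the saved copy of the page instead of a live request. The following is a minimal sketch under that assumption and is not part of the original answer: the helper names extract_rows and extract_rows_from_file are hypothetical, and html5lib is a separate package that has to be installed alongside beautifulsoup4 (for example with pip install html5lib).

```python
from bs4 import BeautifulSoup


def extract_rows(html, parser="html5lib"):
    """Return the cell texts of the ratings table, one list per row.

    Hypothetical helper: `html` is the page markup (str or bytes) and
    `parser` is the name of any parser Beautiful Soup can use.
    """
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", attrs={"class": "ratingsTable"})
    if table is None:
        # The table was not found in this markup
        return []

    rows = []
    for tr in table.find_all("tr"):
        # Collect both data and header cells, trimming surrounding whitespace
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows


def extract_rows_from_file(path, parser="html5lib"):
    """Hypothetical helper: parse a page that was saved to disk."""
    # Read raw bytes so Beautiful Soup can detect the encoding itself
    with open(path, "rb") as fh:
        return extract_rows(fh.read(), parser)
```

For the file from the question, a call such as extract_rows_from_file("/Users/MD/Desktop/ZoloftPage01.html") should then return one list per table row, assuming the saved copy contains the same markup as the live page.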
