Python: Parsing HTML from a URL into pandas raises ValueError: No tables found
Problem description
I'm trying to parse the HTML below into a dataframe, and I keep getting an error even though I can clearly see a table defined in the HTML. Appreciate your help.
<table><tr><td><a
Error
ValueError: No tables found
My code
import pandas as pd
url='http://rssfeeds.s3.amazonaws.com/goldbox?'
#dfs = pd.read_html(requests.get(url).text)
dfs = pd.read_html(url)
dfs[0].head()
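The likely root cause: this URL serves an RSS/XML feed, and the `<table>` markup sits HTML-escaped inside each item's `<description>` (as the raw feed dump further down shows), so `pd.read_html` never sees an actual table element. A minimal sketch of unescaping such a payload first, using a hypothetical sample string shaped like the feed's descriptions:

```python
import html
from io import StringIO

import pandas as pd

# Hypothetical snippet shaped like an item's <description> in the feed:
# the <table> markup arrives HTML-escaped, so read_html(url) finds no tables.
escaped = ("&lt;table&gt;&lt;tr&gt;&lt;td&gt;Deal Price: $99.99&lt;/td&gt;&lt;/tr&gt;"
           "&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;")

# Unescape first, then hand the real HTML to pandas.
dfs = pd.read_html(StringIO(html.unescape(escaped)))
print(dfs[0])
```

This parses the one-column table into a dataframe; applied to the real feed you would unescape each item's description before calling `read_html`.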
Also tried with feedparser, with no luck; I don't get any data.
import feedparser
import pandas as pd

# Note: the original had rawrss = ('http://...'), which is just a string,
# so `for url in rawrss` iterated over its characters and no feed was parsed.
rawrss = ['http://rssfeeds.s3.amazonaws.com/goldbox']
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.dealUrl, post.discountPercentage))
df = pd.DataFrame(posts, columns=['title', 'dealUrl', 'discountPercentage'])
df.tail()
Recommended answer
There is so much data on this page that the request can time out (hence the long timeout below). Also, the content I got seems to differ from yours.
import pandas as pd
from simplified_scrapy import SimplifiedDoc, utils, req

html = req.get('http://rssfeeds.s3.amazonaws.com/goldbox', timeout=600)
posts = {'title': [], 'link': [], 'description': []}
doc = SimplifiedDoc(html)
items = doc.selects('item')
for item in items:
    posts['title'].append(item.title.text)
    posts['link'].append(item.link.text)
    posts['description'].append(item.description.text)
df = pd.DataFrame(posts)
df.tail()
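If you would rather avoid a third-party scraper, the same per-item extraction can be sketched with the standard library's `xml.etree.ElementTree`. The RSS string below is a hypothetical sample modeled on the goldbox feed, not the real payload:

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical RSS sample modeled on the goldbox feed.
rss = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <item><title>Deal A</title><link>http://example.com/a</link>
    <description>desc A</description></item>
  <item><title>Deal B</title><link>http://example.com/b</link>
    <description>desc B</description></item>
</channel></rss>"""

root = ET.fromstring(rss)
# One row per <item>, mirroring the title/link/description columns above.
posts = [{'title': item.findtext('title'),
          'link': item.findtext('link'),
          'description': item.findtext('description')}
         for item in root.iter('item')]
df = pd.DataFrame(posts)
print(df)
```

For the live feed you would replace the sample string with the response body fetched from the URL.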
Get the data from the description
posts = {'listPrice': [], 'dealPrice': [], 'expires': []}
doc = SimplifiedDoc(html)
descriptions = doc.selects('item').description  # Get all descriptions
for table in descriptions:
    d = SimplifiedDoc(table.unescape())  # Use the description to build a doc object
    img = d.img.src  # Get the image src
    listPrice = d.getElementByText('List Price:')
    if listPrice:
        listPrice = listPrice.strike.text
    else:
        listPrice = ''
    dealPrice = d.getElementByText('Deal Price: ')
    if dealPrice:
        dealPrice = dealPrice.text[len('Deal Price: '):]
    else:
        dealPrice = ''
    expires = d.getElementByText('Expires ')
    if expires:
        expires = expires.text[len('Expires '):]
    else:
        expires = ''
    posts['listPrice'].append(listPrice)
    posts['dealPrice'].append(dealPrice)
    posts['expires'].append(expires)
df = pd.DataFrame(posts)
df.tail()
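The same label-based field extraction can also be done with plain regular expressions on the unescaped description, with no HTML-parsing dependency at all. The payload below is a hypothetical sample shaped like the feed's descriptions:

```python
import html
import re

# Hypothetical description payload; in the real feed this arrives HTML-escaped.
desc = html.unescape(
    "&lt;tr&gt;&lt;td&gt;Deal Price: $59.95&lt;/td&gt;&lt;/tr&gt;"
    "&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;")

# Grab the text after each label, stopping at the next tag.
deal = re.search(r'Deal Price:\s*([^<]+)', desc)
expires = re.search(r'Expires\s+([^<]+)', desc)
print(deal.group(1), '|', expires.group(1))
```

This trades robustness for simplicity: it works only while the feed keeps these exact "Deal Price:" / "Expires" labels.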
The page data I get is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
<channel>
<title>Amazon.com Gold Box Deals</title>
<link>http://www.amazon.com/gp/goldbox</link>
<description>Amazon.com Gold Box Deals</description>
<pubDate>Thu, 28 Jun 2018 08:50:16 GMT</pubDate>
<dc:date>2018-06-28T08:50:16Z</dc:date>
<image>
<title>Amazon.com Gold Box Deals</title>
<url>http://images.amazon.com/images/G/01/rcm/logo2.gif</url>
<link>http://www.amazon.com/gp/goldbox</link>
</image>
<item>
<title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
<link>https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&tag=rssfeeds-20</link>
<description><table><tr><td><a href="https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&tag=rssfeeds-20" target="_blank"><img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg" alt="Product Image" style='border:0'/></a></td><td><tr><td>Withings Activit? Steel - Activity and Sleep Tracking Watch</td></tr><tr><td>Expires Jun 29, 2018</td></tr></td></tr></table></description>
<pubDate>Thu, 28 Jun 2018 07:00:10 GMT</pubDate>
<guid isPermaLink="false">http://promotions.amazon.com/gp/goldbox/</guid>
<dc:date>2018-06-28T07:00:10Z</dc:date>
</item>