Python: parsing HTML from a URL into pandas raises ValueError: No tables found


Problem description

I'm trying to parse the HTML below into a DataFrame, but I keep getting an error even though I can clearly see a table defined in the HTML. I'd appreciate your help.

<table><tr><td><a

Error

ValueError: No tables found

My code

import pandas as pd 
url='http://rssfeeds.s3.amazonaws.com/goldbox?'
#dfs = pd.read_html(requests.get(url).text)
dfs = pd.read_html(url)
dfs[0].head()

I also tried feedparser, with no luck. I don't get any data.

import feedparser
import pandas as pd
import time

rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
    
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.dealUrl, post.discountPercentage))
df = pd.DataFrame(posts, columns=['title', 'dealUrl', 'discountPercentage'])
df.tail()
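
A note on the loop above: in Python, parentheses around a single string do not create a tuple, so rawrss is a plain string and the for loop visits it one character at a time. That may be one reason no entries come back. A minimal sketch of the difference follows; the name as_tuple is introduced here for illustration only.

```python
# Parentheses alone do not create a tuple; a trailing comma does.
rawrss = ('http://rssfeeds.s3.amazonaws.com/goldbox')
as_tuple = ('http://rssfeeds.s3.amazonaws.com/goldbox',)  # illustrative name

# Iterating over a string yields single characters, not URLs.
first_from_string = next(iter(rawrss))
first_from_tuple = next(iter(as_tuple))

print(type(rawrss).__name__, repr(first_from_string))
print(type(as_tuple).__name__, repr(first_from_tuple))
```

With a list or a one-element tuple, the loop variable would be the full URL.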


Recommended answer

This page returns a very large amount of data, so the request can time out; the code below therefore sets a long timeout. Also, the content I get seems to be different from yours.

import pandas as pd
from simplified_scrapy import SimplifiedDoc, utils, req
html = req.get('http://rssfeeds.s3.amazonaws.com/goldbox',
               timeout=600)

posts = {'title': [], 'link': [], 'description': []}
doc = SimplifiedDoc(html)
items = doc.selects('item')
for item in items:
    posts['title'].append(item.title.text)
    posts['link'].append(item.link.text)
    posts['description'].append(item.description.text)

df = pd.DataFrame(posts)
df.tail()
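
If installing simplified_scrapy is not an option, the same title/link/description extraction can be sketched with the standard library's xml.etree.ElementTree. This is not the answer's method, only an alternative under stated assumptions: SAMPLE_RSS is a shortened, made-up stand-in for the real feed (example.com URLs are placeholders), and parse_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the feed; the URLs are placeholders, not real deals.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <item>
      <title>Deal of the Day: Example Watch</title>
      <link>https://www.example.com/dp/EXAMPLE?a=1&amp;b=2</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
    </item>
  </channel>
</rss>
"""


def parse_items(xml_text):
    """Collect title, link and description of every <item> into column lists."""
    root = ET.fromstring(xml_text)
    posts = {'title': [], 'link': [], 'description': []}
    for item in root.iter('item'):
        for field in posts:
            # findtext returns the default when the child element is missing.
            posts[field].append(item.findtext(field, default=''))
    return posts


posts = parse_items(SAMPLE_RSS)
print(posts['title'])
```

The resulting dict has the same shape as posts in the answer, so it can be passed to pd.DataFrame in the same way. Note that the XML parser decodes the escaped entities, so each description comes back as ordinary HTML text.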

Get data from the description

posts = {'listPrice': [], 'dealPrice': [], 'expires': []}
doc = SimplifiedDoc(html)
descriptions = doc.selects('item').description # Get all descriptions
for table in descriptions:
    d = SimplifiedDoc(table.unescape()) # Using description to build a doc object
    img = d.img.src # Get the image src
    listPrice = d.getElementByText('List Price:')
    if listPrice:
        listPrice=listPrice.strike.text
    else: listPrice = ''

    dealPrice = d.getElementByText('Deal Price: ')
    if dealPrice:
        dealPrice = dealPrice.text[len('Deal Price: '):]
    else: dealPrice = ''

    expires = d.getElementByText('Expires ')
    if expires:
        expires = expires.text[len('Expires '):]
    else: expires = ''

    posts['listPrice'].append(listPrice)
    posts['dealPrice'].append(dealPrice)
    posts['expires'].append(expires)
df = pd.DataFrame(posts)
df.tail()
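
The prefix-based extraction above can also be sketched with the standard library's html.parser, for readers who want to see the idea without a third-party package. Everything named here (TextChunks, value_after, DESCRIPTION) is hypothetical, and DESCRIPTION is a simplified, made-up fragment modeled on the feed's description field, not real data. The sketch only handles values that share a text node with their label, such as the "Expires " line.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect every non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def value_after(chunks, prefix):
    """Return the text following `prefix` in the first matching chunk, else ''."""
    for chunk in chunks:
        if chunk.startswith(prefix):
            return chunk[len(prefix):]
    return ''


# Simplified, made-up description fragment (already unescaped).
DESCRIPTION = (
    '<table><tr><td><a href="https://www.example.com/" target="_blank">'
    '<img src="https://www.example.com/image.jpg" alt="Product Image"/></a></td>'
    '<td><tr><td>Example Watch</td></tr>'
    '<tr><td>Expires Jun 29, 2018</td></tr></td></tr></table>'
)

parser = TextChunks()
parser.feed(DESCRIPTION)
parser.close()

expires = value_after(parser.chunks, 'Expires ')
deal_price = value_after(parser.chunks, 'Deal Price: ')  # absent in this sample
print(parser.chunks, repr(expires), repr(deal_price))
```

As in the answer's code, a missing field falls back to an empty string so that every column ends up with the same length.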

The page data I get is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Amazon.com Gold Box Deals</title>
    <link>http://www.amazon.com/gp/goldbox</link>
    <description>Amazon.com Gold Box Deals</description>
    <pubDate>Thu, 28 Jun 2018 08:50:16 GMT</pubDate>
    <dc:date>2018-06-28T08:50:16Z</dc:date>
    <image>
      <title>Amazon.com Gold Box Deals</title>
      <url>http://images.amazon.com/images/G/01/rcm/logo2.gif</url>
      <link>http://www.amazon.com/gp/goldbox</link>
    </image>
    <item>
      <title>Deal of the Day: Withings Activit? Steel - Activity and Sleep Tracking Watch</title>
      <link>https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20</link>
      <description>&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;a href="https://www.amazon.com/Withings-Activit%C3%83-Steel-Activity-Tracking/dp/B018SL790Q/ref=xs_gb_rss_ADSW6RT7OG27P/?ccmID=380205&amp;tag=rssfeeds-20" target="_blank"&gt;&lt;img src="https://images-na.ssl-images-amazon.com/images/I/41O4Qc3FCBL._SL160_.jpg" alt="Product Image" style='border:0'/&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tr&gt;&lt;td&gt;Withings Activit? Steel - Activity and Sleep Tracking Watch&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018&lt;/td&gt;&lt;/tr&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
      <pubDate>Thu, 28 Jun 2018 07:00:10 GMT</pubDate>
      <guid isPermaLink="false">http://promotions.amazon.com/gp/goldbox/</guid>
      <dc:date>2018-06-28T07:00:10Z</dc:date>
    </item>
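
In the feed excerpt above, the table markup inside each description element is entity-escaped (&lt;table&gt; rather than a literal table tag). That is the likely reason pd.read_html reports "No tables found" for this URL: the document contains the table only as escaped text. A minimal sketch of the difference, using a shortened description string as an assumption:

```python
import html

# Shortened description payload as it appears in the raw feed text.
raw_description = (
    '&lt;table&gt;&lt;tr&gt;&lt;td&gt;Expires Jun 29, 2018'
    '&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;'
)

# In the raw text there is no literal <table> tag to find.
has_table_before = '<table' in raw_description

# After decoding the entities, the markup becomes ordinary HTML.
decoded = html.unescape(raw_description)
has_table_after = '<table' in decoded

print(has_table_before, has_table_after)
print(decoded)
```

Once decoded, each fragment is regular HTML and can be handled by an HTML parser, which is what the answer does with table.unescape().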
