Get Feeds from FeedParser and Import to Pandas DataFrame

Problem description

I'm learning Python. As practice, I'm building an RSS scraper with feedparser, putting the output into a pandas dataframe, and trying to mine it with NLTK. As a first step, I'm getting a list of articles from multiple RSS feeds.

I used this post on how to pass multiple feeds and combined it with an answer I got previously to another question on how to get the results into a pandas dataframe.

The problem is that I want to see the data from all the feeds in my dataframe. Currently I'm only able to access the first item in the list of feeds.

FeedParser seems to be doing its job, but when I put the results into the pandas df, it only seems to grab the first RSS feed in the list.

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = []
for url in rawrss:
    feeds.append(feedparser.parse(url))

for feed in feeds:
    for post in feed.entries:
        print(post.title, post.link, post.summary)

df = pd.DataFrame(columns=['title', 'link', 'summary'])

for i, post in enumerate(feed.entries):
    df.loc[i] =  post.title, post.link, post.summary

df.shape

df

Recommended answer

Your code loops through each post and prints its data. The part of your code that adds the post data to the dataframe is not part of that loop (in Python, indentation is meaningful!), so it runs only once, after the loop has finished, and you only see the data from one feed in your dataframe.
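To make the indentation point concrete, here is a minimal sketch that uses plain lists as stand-ins for parsed feeds (the names `rows_outside` and `rows_inside` are illustrative, not from the question). It shows that code placed after a loop runs once, with the loop variable still bound to the last item, so in the question's code the dataframe would be filled from whichever feed was processed last:

```python
# Stand-ins for parsed feeds: each "feed" is just a list of post titles.
feeds = [["a1", "a2"], ["b1"], ["c1", "c2", "c3"]]

for feed in feeds:
    pass  # in the question's code, the printing happens here

# Not indented under the loop: runs once, after the loop has finished.
# `feed` still refers to the last element of `feeds`.
rows_outside = list(feed)

rows_inside = []
for feed in feeds:
    # Indented under the loop: runs once per feed.
    rows_inside.extend(feed)

print(rows_outside)  # ['c1', 'c2', 'c3']
print(rows_inside)   # ['a1', 'a2', 'b1', 'c1', 'c2', 'c3']
```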

You can build a list of posts as you loop through the feeds, and then create a dataframe at the end:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = [] # list of feed objects
for url in rawrss:
    feeds.append(feedparser.parse(url))

posts = [] # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary']) # pass data to init

You could optimize this a little bit by combining the two for loops:

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
