Split Twitter RSS string using Python


Question


I am trying to parse Twitter RSS feeds and put the information in a sqlite database, using Python. Here's an example:

MiamiPete: today's "Last Call" is now up http://bit.ly/MGDzu #stocks #stockmarket #finance #money

What I want to do is create one column for the main content (Miami Pete…now up), one column for the URL (http://bit.ly/MGDzu), and four separate columns for the hashtags (stocks, stockmarket, finance, money). I've been playing around with how to do this.

Any advice would be greatly appreciated!

P.S. Some code I've been playing around with is below--you can see I tried initially creating a variable called "tiny_url" and splitting it, which it does seem to do, but this feeble attempt is not anywhere close to solving the problem noted above. :)

def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stores them in the DB """
    # `c` (a sqlite3 cursor) and `strftime` (from time) are assumed to be set up elsewhere
    for entry in items:
        c.execute('SELECT entry_id from RSSEntries WHERE url=?', (entry.link,))
        tinyurl = entry.summary    ### I added this in
        print(tinyurl.split('http')) ### I added this in
        if len(c.fetchall()) == 0:
            # NOTE: `tiny` is never defined in this snippet
            c.execute('INSERT INTO RSSEntries (id, url, title, content, tinyurl, date, tiny) VALUES (?,?,?,?,?,?,?)', (id, entry.link, entry.title, entry.summary, tinyurl, strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed), tiny ))

Solution

Your data-driven design seems rather flawed. Unless every entry has a text part, a URL and up to 4 tags, it's not going to work.

You also need to separate saving to the db from parsing. Parsing can easily be done with a regex (or even string methods):

>>> s = your_string
>>> s.split()
['MiamiPete:', "today's", '"Last', 'Call"', 'is', 'now', 'up', 'http://bit.ly/MGDzu', '#stocks', '#stockmarket', '#finance', '#money']
>>> url = [i for i in s.split() if i.startswith('http://')]
>>> url
['http://bit.ly/MGDzu']
>>> tags = [i for i in s.split() if i.startswith('#')]
>>> tags
['#stocks', '#stockmarket', '#finance', '#money']
>>> ' '.join(i for i in s.split() if i not in url+tags)
'MiamiPete: today\'s "Last Call" is now up'
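
For the regex route mentioned above, here is a minimal sketch rather than a definitive parser. The names `URL_RE`, `TAG_RE` and `parse_tweet` are made up for illustration, and the patterns are deliberately simple: they assume URLs start with `http://` or `https://` and that a hashtag is a `#` at the start of a whitespace-separated token.

```python
import re

# Hypothetical names and deliberately simple patterns (assumptions, not a full URL/hashtag grammar).
URL_RE = re.compile(r'https?://\S+')
# '#' must start a token, so a '#fragment' inside a URL is not mistaken for a tag.
TAG_RE = re.compile(r'(?<!\S)#(\w+)')


def parse_tweet(summary):
    """Split a feed summary into (text, urls, tags)."""
    urls = URL_RE.findall(summary)
    tags = TAG_RE.findall(summary)  # the capture group drops the leading '#'
    # Remove URLs and hashtags, then collapse the leftover whitespace.
    text = TAG_RE.sub('', URL_RE.sub('', summary))
    text = ' '.join(text.split())
    return text, urls, tags


example = ('MiamiPete: today\'s "Last Call" is now up '
           'http://bit.ly/MGDzu #stocks #stockmarket #finance #money')
text, urls, tags = parse_tweet(example)
print(text)
print(urls)
print(tags)
```

Unlike the `split()` version, this returns the tags without the `#` prefix and works no matter how many URLs or tags an entry has.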

Single-table db design would probably have to go, though.
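
One possible way to move away from a single table, sketched under assumptions: the table names (`entries`, `tags`, `entry_tags`), their columns and the `save_entry` helper are all hypothetical, and the entry is assumed to have been parsed into text, URL and tags before it gets here. Tags live in their own table and are linked to entries through a join table, so an entry can have any number of them.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    entry_id INTEGER PRIMARY KEY,
    feed_id  INTEGER NOT NULL,
    link     TEXT UNIQUE,
    content  TEXT,
    url      TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    tag_id INTEGER PRIMARY KEY,
    name   TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS entry_tags (
    entry_id INTEGER NOT NULL REFERENCES entries(entry_id),
    tag_id   INTEGER NOT NULL REFERENCES tags(tag_id),
    PRIMARY KEY (entry_id, tag_id)
);
"""


def save_entry(conn, feed_id, link, content, url, tags):
    """Store one already-parsed entry and link it to its tags."""
    cur = conn.cursor()
    cur.execute('SELECT entry_id FROM entries WHERE link = ?', (link,))
    if cur.fetchone() is not None:
        return  # this entry is already stored
    cur.execute(
        'INSERT INTO entries (feed_id, link, content, url) VALUES (?, ?, ?, ?)',
        (feed_id, link, content, url))
    entry_id = cur.lastrowid
    for name in tags:
        # Reuse an existing tag row if there is one, otherwise create it.
        cur.execute('INSERT OR IGNORE INTO tags (name) VALUES (?)', (name,))
        cur.execute('SELECT tag_id FROM tags WHERE name = ?', (name,))
        tag_id = cur.fetchone()[0]
        cur.execute(
            'INSERT OR IGNORE INTO entry_tags (entry_id, tag_id) VALUES (?, ?)',
            (entry_id, tag_id))
    conn.commit()


conn = sqlite3.connect(':memory:')
conn.executescript(SCHEMA)
save_entry(conn, 1, 'http://example.com/status/1',
           'MiamiPete: today\'s "Last Call" is now up',
           'http://bit.ly/MGDzu',
           ['stocks', 'stockmarket', 'finance', 'money'])
```

Keeping `save_entry` free of any parsing means the storage code does not need to change when the parsing rules do.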
