Parsing a Twitter feed in Python into a table

Problem description

I have a set of tweets that have been saved to a .txt file.

I want to store certain attributes of each tweet in a SQLite table from Python. I successfully created the table.

import pandas
import sqlite3

conn = sqlite3.connect('twitter.db')
c = conn.cursor()

# Create the table that will hold the selected tweet attributes.
c.execute("""CREATE TABLE Tweet
(
   created_at VARCHAR2(25),
   id VARCHAR2(25),
   text VARCHAR2(25),
   source VARCHAR2(25),
   in_reply_to_user_ID VARCHAR2(25),
   retweet_Count VARCHAR2(25)
)""")

Before I even attempted to add the parsed data to the database, I tried to create a data frame from it just to view it.

tweets = pandas.read_table('file.txt', sep=',')

I get the error:

CParserError: Error tokenizing data. C error: Expected 63 fields in line 3, saw 69

My assumption is that commas not only separate the fields but also appear inside the strings.

Also, Twitter data comes in a format that I have not worked with before. Each field starts with the variable name in quotation marks, then a colon, then the data, which is often wrapped in quotation marks as well. For example:

"created_at":"Fri Oct 11 00:00:03 +0000 2013",

So how can I get this into a standard table format with the variable names at the top?

A full example of a tweet is this:

{"created_at":"Fri Oct 11 00:00:03 +0000 2013","id":388453908911095800,"id_str":"388453908911095809","text":"LAGI PUN VISITORS DATANG PUKUL 9 AH","source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":447800506,"id_str":"447800506","name":"§yazwina·","screen_name":"_SAireen","location":"SSP","url":"http://flavors.me/syazwinaaireen#","description":"Absence makes the heart grow fonder. Stay us x @_DFitri's","protected":false,"followers_count":806,"friends_count":702,"listed_count":2,"created_at":"Tue Dec 27 08:29:53 +0000 2011","favourites_count":7478,"utc_offset":28800,"time_zone":"Beijing","geo_enabled":true,"verified":false,"statuses_count":32558,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http://a0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_tile":true,"profile_image_url":"http://a0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_image_url_https":"https://si0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/447800506/1369969522","profile_link_color":"FA0096","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"it"}

Recommended answer

I imagine there is a Python library for this already, but I was able to get your tweet string to parse as a dictionary once I replaced these terms, which appear unquoted:

 false to False 
 true to True
 null to None

I just assigned the whole braced expression to a variable, which creates a dictionary. You can then go through it and print the keys as a header and each value as an entry.
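The replacement step described above could be sketched roughly as follows. This is only an illustration under assumptions: the shortened `raw` string is a hypothetical stand-in for one tweet, and using a regular expression plus `ast.literal_eval` (rather than pasting the text into a script by hand) is my choice, not something stated in the answer.

```python
import ast
import re

# Shortened, hypothetical stand-in for one tweet from the saved file.
raw = ('{"created_at":"Fri Oct 11 00:00:03 +0000 2013",'
       '"id_str":"388453908911095809",'
       '"text":"LAGI PUN VISITORS DATANG PUKUL 9 AH",'
       '"truncated":false,"in_reply_to_user_id":null,"retweet_count":0}')

# Map the three bare literals to their Python spellings.
# Caveat: this also rewrites these words if they occur inside tweet text.
LITERALS = {"false": "False", "true": "True", "null": "None"}
py_source = re.sub(r"\b(false|true|null)\b",
                   lambda m: LITERALS[m.group(1)], raw)

# literal_eval only accepts Python literals, so it is safer than eval().
tweet = ast.literal_eval(py_source)

# Keys become the header, values become the single row.
header = list(tweet.keys())
row = [tweet[key] for key in header]
print(header)
print(row)
```

The blind text substitution is the weak point of this approach: a tweet whose text contains the word "true", "false", or "null" would be altered too.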

Fixing or quoting those three values might also make the pandas parser happier, although I think a csv reader might cope better with all the embedded commas and the single and double quotes. The JSON parser still choked on the URL containing a colon, I think. You might try escaping them if you are going to try JSON.
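If you do try JSON, the standard-library `json` module may be worth testing directly on the saved lines: the sample tweet is a JSON object, and colons inside quoted strings are valid JSON, so a failure may have another cause, such as a record broken across two lines. The sketch below assumes one tweet object per line. The `lines` list, the `FIELDS` selection, and the in-memory database are illustrative choices of mine, not part of the original post.

```python
import json
import sqlite3

# Hypothetical stand-in for the saved file: one JSON tweet object per line.
lines = [
    r'{"created_at":"Fri Oct 11 00:00:03 +0000 2013",'
    r'"id":388453908911095800,"id_str":"388453908911095809",'
    r'"text":"LAGI PUN VISITORS DATANG PUKUL 9 AH",'
    r'"source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>",'
    r'"in_reply_to_user_id":null,"retweet_count":0}',
]

# Tweet keys to keep. "id_str" feeds the id column because the sample's
# "id" and "id_str" values differ in their last digits.
FIELDS = ["created_at", "id_str", "text", "source",
          "in_reply_to_user_id", "retweet_count"]

rows = []
for line in lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    tweet = json.loads(line)  # false/true/null become False/True/None
    rows.append(tuple(tweet.get(field) for field in FIELDS))

conn = sqlite3.connect(":memory:")  # swap in 'twitter.db' for a real file
conn.execute(
    "CREATE TABLE Tweet (created_at TEXT, id TEXT, text TEXT, "
    "source TEXT, in_reply_to_user_id TEXT, retweet_count INTEGER)"
)
# Placeholders let sqlite3 handle commas and quotes inside the values.
conn.executemany("INSERT INTO Tweet VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

cursor = conn.execute("SELECT * FROM Tweet")
print([col[0] for col in cursor.description])  # column names as the header
for record in cursor:
    print(record)
```

If a data frame is preferred for viewing, collecting the parsed dictionaries in a list and passing that list to `pandas.DataFrame` should give a frame with the keys as column names.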
