Loading Large Twitter JSON Data (7GB+) into Python


Problem Description

I've set up a public stream via AWS to collect tweets and now want to do some preliminary analysis. All my data is stored in an S3 bucket (in 5 MB files).

I downloaded everything and merged all the files into one. Each tweet is stored as a standard JSON object per the Twitter specification.

Basically, the consolidated file contains multiple JSON objects. I added opening and closing square brackets ([ ]) to make it look like a list of dictionaries for when it gets read into Python. So the structure is kinda like this (I'm not sure if I can just post Twitter data here):

[{"created_at":"Mon Sep 19 23:58:50 +000 2016", "id":<num>, "id_str":"<num>","text":"<tweet message>", etc.}, 
{same as above},
{same as above}]

After deleting the very first tweet, I put everything into www.jsonlint.com and confirmed that it is a valid JSON data structure.

Now, I'm trying to load this data into Python and hoping to do some basic counts of different terms in tweets (e.g. how many times is @HillaryClinton mentioned in the text of a tweet, etc.).

Previously, with smaller datasets, I was able to get away with code like this:

import json
import csv
import io
data_json = open('fulldata.txt', 'r', encoding='utf-8')
data_python = json.load(data_json)  # parse the whole file in one go

I then wrote the data for respective fields into a CSV file and performed my analyses that way. This worked for a 2GB file.

Now that I have a 7GB file, I am noticing that if I use this method, Python throws an error on the "json.load(data_json)" line saying "OSError: [Errno 22] Invalid argument".

I'm not sure why this is happening, but I suspect it's because it's trying to load the entire file into memory at once. Is this correct?

So I tried to use ijson, which apparently lets you parse through the JSON file incrementally. I tried to write the following code:

import ijson
f = open('fulldata.txt', 'r', encoding='utf-8')
content = ijson.items(f, 'item')  # stream each element of the top-level array
for item in content:
    pass  # <do stuff here>

With this implementation, I get an error on the "for item in content" line saying "ijson.backends.python.UnexpectedSymbol: unexpected symbol '\u201c' at 1".
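
For reference, \u201c is a left curly ("smart") double quote, so I suspect something upstream mangled the straight quotes in the file. A quick diagnostic sketch I could run (my own guess at a next step, nothing authoritative) to find where those characters sit:

# Scan the first chunk of the file and show some context around any
# curly quotes (U+201C / U+201D) so I can see what got mangled.
with open('fulldata.txt', 'r', encoding='utf-8') as f:
    head = f.read(10000)
for i, ch in enumerate(head):
    if ch in '\u201c\u201d':
        print(i, repr(head[max(0, i - 40):i + 40]))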

I also tried to go through the data file line by line and treat it as JSON Lines format. So, assuming each line was a JSON object, I wrote:

raw_tweets = []
with open('fulldata.txt', 'r', encoding='utf-8') as full_file:
    for line in full_file:
        raw_tweets.append(json.dumps(line))
print(len(raw_tweets))  # this worked. got like 2 million something as expected!

But here, each entry in the list was a string and not a dictionary, which made it really hard to parse out the data I needed. Is there a way to modify this last piece of code to make it work as I need? But even then, wouldn't loading that whole dataset into a list still make future analyses hard, given memory constraints?
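
For context, here's a minimal sketch of the streaming version I have in mind, assuming each line really were one self-contained JSON object; the trailing-comma handling and the mention count are just my guesses at the kind of analysis I described above:

import json

mention_count = 0
with open('fulldata.txt', 'r', encoding='utf-8') as full_file:
    for line in full_file:
        line = line.strip().rstrip(',')  # drop the separator comma from the merged-list format
        if line in ('', '[', ']'):
            continue  # skip blank lines and the brackets I added
        tweet = json.loads(line)  # loads (parse), not dumps (serialise)
        if '@HillaryClinton' in tweet.get('text', ''):
            mention_count += 1
print(mention_count)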

I'm a little stuck on the best way to proceed. I really want to do this in Python because I'm trying to learn how to use Python tools for these kinds of analyses.

Does anyone have any experience with this? Am I being really stupid or misunderstanding something really basic?

EDIT#1

So, I first went to www.jsonlint.com and pasted my entire dataset, and found that after removing the first tweet it was in valid JSON format. So for now I've just excluded that one tweet.

I basically have a dataset in the format mentioned above ([{json1}, {json2}]), where each entity in the {} represents a tweet.

Now that I've confirmed it's valid JSON, my goal is to get it into Python with each JSON object represented as a dictionary (so I can easily manipulate those fields). Can someone correct my thought process here if it's inefficient?

To do this, I did:

raw_tweets = []
with open('fulldata.txt', 'r', encoding='ISO-8859-1') as full_file:
    for line in full_file:
        raw_tweets.append(json.dumps(line))
# This successfully wrote each line of my file into a list.
# Confirmed by checking the length, as described previously.

# Now I want to write this out to a CSV file.
csv_out = io.open("parsed_data.csv", mode='w', encoding='ISO-8859-1')
fields = u'created_at,text,screen_name,followers,friends,rt,fav'
csv_out.write(fields)  # write the column headers out
csv_out.write(u'\n')

# Now, iterate through the list. Get each JSON object as a dictionary
# and pull out the relevant information.
for tweet in raw_tweets:
    # Each "tweet" is '{json#},\n'. Right now it's just a string in the {}
    # format, not a dictionary. If I convert it to a JSON object, I should
    # be able to make a dictionary form of the data, right?
    current_tweet = json.loads(tweet)
    row = [current_tweet.get('created_at'),
           '"' + current_tweet.get('text').replace('"', '""') + '"',
           current_tweet.get('user').get('screen_name')]
    # ...and I continue this for all relevant headers

The problem is that the last line, where I say current_tweet.get, isn't working, because it keeps saying that 'str' has no attribute 'get', so I'm not sure why json.loads() isn't giving me a dictionary...
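
For what it's worth, here is a tiny round-trip sketch of what I think dumps and loads are doing to my lines (my reading of the code above, not a confirmed diagnosis):

import json

line = '{"text": "hello"}\n'  # one raw line from the file
s = json.dumps(line)          # dumps serialises the *string itself*
print(s)                      # prints: "{\"text\": \"hello\"}\n"
back = json.loads(s)          # loads just undoes dumps -> the original str
print(type(back))             # <class 'str'>, not a dict
tweet = json.loads(line)      # parsing the raw line directly
print(type(tweet))            # <class 'dict'>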

EDIT#2

A user recommended I remove the [ and ] and also the trailing commas so that each line has valid JSON; that way I could just json.loads() each line. I removed the brackets as suggested. For the commas, I did this:

raw_tweets = []
with open('fulldata.txt', 'r', encoding='ISO-8859-1') as full_file:
    for line in full_file:
        no_comma = line[:-2]  # printed this to confirm that the final comma was removed
        raw_tweets.append(json.loads(no_comma))

This is giving an error saying ValueError: Expecting ':' delimiter: line 1 column 2305 (char 2304).

To debug this, I printed the first line (i.e., I just did print(no_comma)) and noticed that what Python printed actually had multiple tweets inside... When I open the file in an editor like UltraEdit, each tweet appears as a distinct line, so I assumed that each JSON object was separated by a newline character. But here, when I print the results after iterating by line, I see that it's pulling in multiple tweets at once.

Should I be iterating differently? Is my method of removing the commas appropriate, or should I pre-process the file separately?
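
One alternative I'm considering (a sketch only, not tested against my real file): use json.JSONDecoder.raw_decode to pull one complete object at a time out of a rolling buffer, so that line boundaries stop mattering. The chunk size is arbitrary:

import json

decoder = json.JSONDecoder()

def iter_tweets(path, encoding='ISO-8859-1'):
    # Yield one dict per tweet from a file of concatenated JSON objects.
    buf = ''
    with open(path, 'r', encoding=encoding) as f:
        for chunk in iter(lambda: f.read(65536), ''):
            buf += chunk
            while True:
                buf = buf.lstrip(' \t\r\n,[]')  # skip list separators/brackets
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:
                    break  # object is incomplete; read another chunk
                yield obj
                buf = buf[end:]

for tweet in iter_tweets('fulldata.txt'):
    pass  # e.g., inspect tweet.get('text') here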

I'm pretty sure that my JSON is formatted poorly, but I'm not sure why, or how to go about fixing it. Here is a sample of my JSON data. If this isn't allowed, I'll remove it...

https://ufile.io/47b1

Recommended Answer

I'm a VERY new user, but I might be able to offer a partial solution. I believe your formatting is off: you can't just import it as JSON without it being in JSON format. You should be able to fix this if you can get the tweets into a data frame (or separate data frames) and then use the DataFrame.to_json command. You WILL need Pandas if it's not already installed.

Pandas - http://pandas.pydata.org/pandas-docs/stable/10min.html

Dataframe - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
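
A rough sketch of what this might look like (my interpretation only; it assumes a reasonably recent pandas and that the file has first been cleaned into one-object-per-line form, here under the hypothetical name 'fulldata_clean.jsonl'):

import pandas as pd

# Read the pre-cleaned, JSON-Lines-formatted tweets in chunks so the
# 7GB file never has to fit in memory all at once.
mention_count = 0
for chunk in pd.read_json('fulldata_clean.jsonl', lines=True, chunksize=100000):
    mention_count += chunk['text'].str.contains('@HillaryClinton', regex=False, na=False).sum()
print(mention_count)

# DataFrame.to_json can then write any slice back out as valid JSON, e.g.:
# chunk.to_json('slice.json', orient='records')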
