Memory error when performing sentiment analysis on large data
Problem description
I am trying to perform sentiment analysis on a large set of data from a social network. This part of the code works well with small inputs.
Inputs smaller than 20 MB are processed without problems, but once the input exceeds 20 MB I get a memory error.
Environment: Windows 10, Anaconda 3.x with up-to-date packages.
Code:
import pandas as pd
# Assumed import: the original post does not show where SentimentIntensityAnalyzer comes from.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# `path` is a directory prefix defined elsewhere in the original script.

def captionsenti(F_name):
    print("reading from csv file")
    F1_name = "caption_senti.csv"
    df = pd.read_csv(path + F_name + ".csv")
    filename = path + F_name + "_" + F1_name
    df1 = df['tweetText']  # reading caption from data5 file
    df1 = df1.fillna("h")  # filling NaN values
    df2 = pd.DataFrame()
    sid = SentimentIntensityAnalyzer()
    print("calculating sentiment")
    for sentence in df1:
        # print(sentence)
        ss = sid.polarity_scores(sentence)  # calculating sentiments
        # print(ss)
        # One single-row DataFrame is appended per sentence.
        df2 = df2.append(pd.DataFrame({'tweetText': sentence, 'positive': ss['pos'],
                                       'negative': ss['neg'], 'neutral': ss['neu'],
                                       'compound': ss['compound']}, index=[0]))
    df2 = df2.join(df.set_index('tweetText'), on='tweetText')  # joining two data frames
    df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
    df2 = df2.dropna(how='any')
    df2 = df2[['userID', 'tweetSource', 'tweetText', 'positive', 'neutral',
               'negative', 'compound', 'latitude', 'longitude']]
    # print(df2)
    print("Storing in csv file")
    df2.to_csv(filename, encoding='utf-8', header=True, index=True, chunksize=100)
What do I need to add to avoid the memory error? Thanks in advance for the help.
Recommended answer
Some general tips that might help you:
1. pd.read_csv
Provide the usecols parameter to specify which columns you want to read:
df = pd.read_csv(path+F_name+".csv", usecols=['col1', 'col2'])
2. Delete unused variables
If you no longer need a variable, remove it with del variable_name.
3. Profile the memory with memory_profiler. Citing the example memory log from its documentation, you get a memory profile like the following:
Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4      5.97 MB    0.00 MB   def my_func():
     5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB -152.59 MB       del b
     8     13.61 MB    0.00 MB       return a
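As a rough illustration of how these tips could be applied to the code in the question, the sketch below reads only the needed columns, processes the file in chunks, and builds one DataFrame per chunk instead of appending one row at a time (each row-wise append copies the whole frame built so far). It is a minimal sketch under assumptions, not the asker's method: the function name score_captions, the parameters keep_cols and score_fn, and the chunk size are all hypothetical. score_fn stands for any callable that maps a string to a dict of scores, for example the polarity_scores method of a SentimentIntensityAnalyzer instance.

```python
import pandas as pd


def score_captions(csv_path, out_path, score_fn, text_col="tweetText",
                   keep_cols=("userID", "tweetText"), chunksize=10_000):
    """Score text_col chunk by chunk and write each scored chunk to out_path.

    score_fn is a hypothetical parameter: a callable taking a string and
    returning a dict of scores (e.g. VADER's polarity_scores).
    """
    first = True
    # usecols limits what is parsed; chunksize yields DataFrames piece by piece.
    reader = pd.read_csv(csv_path, usecols=list(keep_cols), chunksize=chunksize)
    for chunk in reader:
        chunk = chunk.dropna(subset=[text_col])
        if chunk.empty:
            continue
        # Collect the score dicts in a list and build ONE DataFrame per chunk,
        # rather than appending a single-row DataFrame for every sentence.
        scores = pd.DataFrame([score_fn(text) for text in chunk[text_col]],
                              index=chunk.index)
        result = pd.concat([chunk, scores], axis=1)
        # Overwrite on the first chunk, append afterwards; header only once.
        result.to_csv(out_path, mode="w" if first else "a", header=first,
                      index=False, encoding="utf-8")
        first = False
```

Because every chunk is written out before the next one is read, memory use should depend mainly on chunksize rather than on the size of the input file. Keeping the other columns alongside the text in each chunk also makes the later join on tweetText unnecessary in this sketch.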