MemoryError when performing sentiment analysis on a large data set

Problem description

I am trying to perform sentiment analysis on a large set of data from a social network. This part of the code works great with small amounts of data.

Input files smaller than 20 MB are processed without a problem, but if the size exceeds 20 MB I get a memory error.

Environment: Windows 10, Anaconda 3.x with up-to-date packages.

Code:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# `path` is the directory containing the CSV files and is defined elsewhere in the script.

def captionsenti(F_name):
    print("reading from csv file")
    F1_name = "caption_senti.csv"
    df = pd.read_csv(path + F_name + ".csv")
    filename = path + F_name + "_" + F1_name
    df1 = df['tweetText']   # caption/tweet text column from the input file
    df1 = df1.fillna("h")   # fill NaN values
    df2 = pd.DataFrame()
    sid = SentimentIntensityAnalyzer()
    print("calculating sentiment")
    for sentence in df1:
        ss = sid.polarity_scores(sentence)  # VADER sentiment scores for one tweet
        df2 = df2.append(pd.DataFrame({'tweetText': sentence, 'positive': ss['pos'],
                                       'negative': ss['neg'], 'neutral': ss['neu'],
                                       'compound': ss['compound']}, index=[0]))

    df2 = df2.join(df.set_index('tweetText'), on='tweetText')  # join the two data frames
    df2 = df2.drop_duplicates(subset=None, keep='first', inplace=False)
    df2 = df2.dropna(how='any')
    df2 = df2[['userID', 'tweetSource', 'tweetText', 'positive', 'neutral', 'negative',
               'compound', 'latitude', 'longitude']]
    print("Storing in csv file")
    df2.to_csv(filename, encoding='utf-8', header=True, index=True, chunksize=100)

What do I need to add to avoid the memory error? Thanks in advance for the help.

Recommended answer

Some general tips that might help you:

1. pd.read_csv provides a usecols parameter to specify which columns you want to read:

df = pd.read_csv(path+F_name+".csv", usecols=['col1', 'col2'])

2. Delete unused variables. If you no longer need a variable, free it with del variable_name, as in the small sketch below.
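A minimal illustration (the object name is made up; the gc.collect() call is optional and simply asks the garbage collector to reclaim the freed memory right away):

import gc

big_temp = [0] * (10 ** 7)   # some large intermediate object
# ... work with big_temp ...
del big_temp                 # drop the reference once it is no longer needed
gc.collect()                 # optionally force a collection immediately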

3. Profile the memory with memory_profiler. Citing the example memory log from its documentation, you get a memory profile like the following:

Line #    Mem usage  Increment   Line Contents
==============================================
     3                           @profile
     4      5.97 MB    0.00 MB   def my_func():
     5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB -152.59 MB       del b
     8     13.61 MB    0.00 MB       return a
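(To get such a profile for captionsenti itself, decorate the function with @profile and run the script with python -m memory_profiler.)

Putting the tips together for the function in the question, here is an assumption-laden sketch rather than a drop-in replacement: it reads only the columns the function actually uses (usecols), processes the CSV in chunks via read_csv's chunksize parameter (an extra technique beyond the tips above), collects the scores in a plain list instead of calling DataFrame.append on every iteration, and writes each processed chunk straight to disk. The column names and the path variable are taken from the question; the duplicate-dropping step of the original is omitted here.

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def captionsenti_chunked(F_name, path, chunk_rows=10000):
    """Sketch: chunked, column-limited variant of captionsenti."""
    sid = SentimentIntensityAnalyzer()
    out_file = path + F_name + "_caption_senti.csv"
    needed = ['userID', 'tweetSource', 'tweetText', 'latitude', 'longitude']
    first = True
    # Read the CSV a chunk at a time instead of loading everything at once.
    for chunk in pd.read_csv(path + F_name + ".csv", usecols=needed,
                             chunksize=chunk_rows):
        chunk['tweetText'] = chunk['tweetText'].fillna("h")
        # Accumulate scores in a list of dicts -- far cheaper than appending
        # to a DataFrame inside the loop.
        rows = []
        for sentence in chunk['tweetText']:
            ss = sid.polarity_scores(sentence)
            rows.append({'positive': ss['pos'], 'negative': ss['neg'],
                         'neutral': ss['neu'], 'compound': ss['compound']})
        scores = pd.DataFrame(rows, index=chunk.index)
        result = pd.concat([chunk, scores], axis=1)
        # Append each processed chunk to the output file so only one chunk
        # is ever held in memory.
        result.to_csv(out_file, mode='w' if first else 'a', header=first,
                      index=False, encoding='utf-8')
        first = False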
