Classification accuracy is too low (Word2Vec)


Problem Description



I'm working on a multi-label emotion classification problem to be solved with word2vec. This is my code, which I've pieced together from a couple of tutorials. The accuracy is now very low, about 0.02, which tells me something is wrong in my code. I tried this code with TF-IDF and BOW (obviously except for the word2vec part) and got much better accuracy scores, such as 0.28, but this one seems to be wrong somehow:

np.set_printoptions(threshold=sys.maxsize)
wv = gensim.models.KeyedVectors.load_word2vec_format("E:\\GoogleNews-vectors-negative300.bin", binary=True)
wv.init_sims(replace=True)

#Pre-Processor Function
pre_processor = TextPreProcessor(
    omit=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'url', 'date', 'number'],
    
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'url', 'date', 'number'],
     
    segmenter="twitter", 
    
    corrector="twitter", 
    
    unpack_hashtags=True,
    unpack_contractions=True,
    
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    
    dicts=[emoticons]
)

#Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
    all_words, mean = set(), []
    
    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list])

#Secondary Word-Averaging Method
def get_mean_vector(word2vec_model, words):
    # remove out-of-vocabulary words
    words = [word for word in words if word in word2vec_model.vocab]
    if len(words) >= 1:
        return np.mean(word2vec_model[words], axis=0)
    else:
        return []

#Loading data
raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = np.array(pandas.read_excel('E:\\train.xlsx').iloc[:,2:13]) #Loading corresponding train labels (11 emotions)

raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:,1] #Loading 300 test tweets
test_gold_labels = np.array(pandas.read_excel('E:\\test.xlsx').iloc[:,2:13]) #Loading corresponding test labels (11 emotions)
print("please wait")

#Pre-Processing
train_tweets=[]
test_tweets=[]
for tweets in raw_train_tweets:
    train_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_test_tweets:
    test_tweets.append(pre_processor.pre_process_doc(tweets))

#Vectorizing 
train_array = word_averaging_list(wv,train_tweets)
test_array = word_averaging_list(wv,test_tweets)

#Predicting and Evaluating    
clf = LabelPowerset(LogisticRegression(solver='lbfgs', C=1, class_weight=None))
clf.fit(train_array,train_labels)
predicted= clf.predict(test_array)
intersect=0
union=0
accuracy=[]
for i in range(0,3250): #i have 3250 test tweets.
    for j in range(0,11): #11 emotions
        if predicted[i,j]&test_gold_labels[i,j]==1:
            intersect+=1
        if predicted[i,j]|test_gold_labels[i,j]==1:
            union+=1
    
    accuracy.append(intersect/union if union != 0 else 0.0)
    intersect=0
    union=0
print(np.mean(accuracy))

The Result:

0.4674498168498169

And I printed the predicted variable (for tweets 0 to 10) to see what it looks like:

  (0, 0)    1
  (0, 2)    1
  (2, 0)    1
  (2, 2)    1
  (3, 4)    1
  (3, 6)    1
  (4, 0)    1
  (4, 2)    1
  (5, 0)    1
  (5, 2)    1
  (6, 0)    1
  (6, 2)    1
  (7, 0)    1
  (7, 2)    1
  (8, 4)    1
  (8, 6)    1
  (9, 3)    1
  (9, 8)    1

As you can see, it only shows 1's. For example, (6, 2) means that in tweet number 6, emotion number 2 is 1, and (9, 8) means that in tweet number 9, emotion number 8 is 1. The other emotions are considered 0. You can picture it like this to better understand what I've done in the accuracy method:

gold emotion for tweet 0:      [1 1 0 0 0 0 1 0 0 0 1]
predicted emotion for tweet 0: [1 0 1 0 0 0 0 0 0 0 0]

I've used union and intersection on the indexes one by one: 1 vs 1, 1 vs 0, 0 vs 1, and so on, up to gold emotion 11 vs predicted emotion 11. I did this for all tweets in the two for loops.
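Incidentally, the per-tweet intersection-over-union computed above is the sample-wise Jaccard score, so the whole loop can be cross-checked with scikit-learn. A minimal sketch, assuming a recent scikit-learn version that supports the zero_division parameter; note that predicted from scikit-multilearn is a sparse matrix, hence the .toarray():

from sklearn.metrics import jaccard_score

# Mean over tweets of |gold AND predicted| / |gold OR predicted|,
# scoring 0.0 for tweets where the union is empty.
print(jaccard_score(test_gold_labels, predicted.toarray(),
                    average='samples', zero_division=0))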

Creating Word2Vec vectors on my tweets:

Now I want to use gensim to create Word2Vec vectors from my own tweet dataset. I changed some parts of the code above, as below:

#Averaging Words Vectors to Create Sentence Embedding
def word_averaging(wv, words):
    all_words, mean = set(), []

    for word in words:
        if isinstance(word, np.ndarray):
            mean.append(word)
        elif word in wv.vocab:
            mean.append(wv.syn0norm[wv.vocab[word].index])
            all_words.add(wv.vocab[word].index)

    if not mean:
        logging.warning("cannot compute similarity with no input %s", words)
        # FIXME: remove these examples in pre-processing
        return np.zeros(wv.vector_size,)

    mean = gensim.matutils.unitvec(np.array(mean).mean(axis=0)).astype(np.float32)
    return mean

def word_averaging_list(wv, text_list):
    return np.vstack([word_averaging(wv, post) for post in text_list])

#Loading data
raw_aggregate_tweets = pandas.read_excel('E:\\aggregate.xlsx').iloc[:,0] #Loading all train tweets

raw_train_tweets = pandas.read_excel('E:\\train.xlsx').iloc[:,1] #Loading all train tweets
train_labels = np.array(pandas.read_excel('E:\\train.xlsx').iloc[:,2:13]) #Loading corresponding train labels (11 emotions)

raw_test_tweets = pandas.read_excel('E:\\test.xlsx').iloc[:,1] #Loading 300 test tweets
test_gold_labels = np.array(pandas.read_excel('E:\\test.xlsx').iloc[:,2:13]) #Loading corresponding test labels (11 emotions)
print("please wait")

#Pre-Processing
aggregate_tweets=[]
train_tweets=[]
test_tweets=[]
for tweets in raw_aggregate_tweets:
    aggregate_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_train_tweets:
    train_tweets.append(pre_processor.pre_process_doc(tweets))

for tweets in raw_test_tweets:
    test_tweets.append(pre_processor.pre_process_doc(tweets))
    
print(len(aggregate_tweets))
#Vectorizing 
w2v_model = gensim.models.Word2Vec(aggregate_tweets, min_count = 10, size = 300, window = 8)

print(w2v_model.wv.vectors.shape)

train_array = word_averaging_list(w2v_model.wv,train_tweets)
test_array = word_averaging_list(w2v_model.wv,test_tweets)

But I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-8a5fe4dbf144> in <module>
    110 print(w2v_model.wv.vectors.shape)
    111 
--> 112 train_array = word_averaging_list(w2v_model.wv,train_tweets)
    113 test_array = word_averaging_list(w2v_model.wv,test_tweets)
    114 

<ipython-input-1-8a5fe4dbf144> in word_averaging_list(wv, text_list)
     70 
     71 def  word_averaging_list(wv, text_list):
---> 72     return np.vstack([word_averaging(wv, post) for post in text_list ])
     73 
     74 #Averaging Words Vectors to Create Sentence Embedding

<ipython-input-1-8a5fe4dbf144> in <listcomp>(.0)
     70 
     71 def  word_averaging_list(wv, text_list):
---> 72     return np.vstack([word_averaging(wv, post) for post in text_list ])
     73 
     74 #Averaging Words Vectors to Create Sentence Embedding

<ipython-input-1-8a5fe4dbf144> in word_averaging(wv, words)
     58             mean.append(word)
     59         elif word in wv.vocab:
---> 60             mean.append(wv.syn0norm[wv.vocab[word].index])
     61             all_words.add(wv.vocab[word].index)
     62 

TypeError: 'NoneType' object is not subscriptable

Solution

It's not clear what your TextPreProcessor or SocialTokenizer classes might do. You should edit your question to either show their code, or show a few examples of the resulting texts to make sure it's doing what you expect. (For example: show the first few and last few entries of all_tweets.)

It's not likely that your line all_tweets = train_tweets.append(test_tweets) is doing what you expect. (It'll put the entire list test_tweets in as the final element of all_tweets, but then return None, which you assign to all_tweets. Your Word2Vec model might then be empty; you should enable INFO logging to watch its progress and review the output for anomalies, and add code post-training to print some details about the model that confirm useful training occurred.)
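A minimal sketch of that pitfall, using toy token lists in place of the real data:

# append() mutates the list in place and returns None
train_tweets = [['a', 'b'], ['c']]
test_tweets = [['d', 'e']]
all_tweets = train_tweets.append(test_tweets)
print(all_tweets)        # None
print(train_tweets[-1])  # [['d', 'e']] -- the whole test list nested as one element

# What was probably intended: a flat concatenation
train_tweets = [['a', 'b'], ['c']]       # reset after the mutation above
all_tweets = train_tweets + test_tweets  # [['a', 'b'], ['c'], ['d', 'e']]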

Are you sure train_tweets is the right format for your pipeline to .fit() against? (The texts sent to Word2Vec training seem to have been tokenized via a .split(), but the texts in the pandas.Series train_tweets may never have been tokenized.)
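A quick sanity check along those lines (illustrative, not from the original post):

print(type(train_tweets))     # expect list, not pandas.Series
print(type(train_tweets[0]))  # expect a list of token strings, not one raw string
print(train_tweets[0][:10])   # first few tokens of the first tweet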

Generally, it's a good idea to enable logging, and to add code after each step that confirms, by checking property values or printing excerpts of the longer collections, that the step had the intended effect.
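For gensim, enabling logging can be as simple as the standard recipe:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

One more observation on the specific TypeError above, a likely cause judging from the traceback rather than anything confirmed by the code shown: in gensim 3.x, wv.syn0norm stays None until init_sims() is called. The first script calls wv.init_sims(replace=True) on the pretrained vectors, but the second script never calls it on the freshly trained model, so w2v_model.wv.syn0norm would still be None, which matches 'NoneType' object is not subscriptable. If that is the cause, the fix is one line before the averaging:

w2v_model.wv.init_sims(replace=True)  # populates syn0norm, which word_averaging() indexes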
