Logistic regression: X has 667 features per sample; expecting 74869

Problem description

Using the IMDB movie review dataset, I ran a logistic regression to predict the sentiment of a review.

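from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# df, fill (the tokenizer) and model_performance are defined elsewhere in the asker's code
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
X = tfidf.fit_transform(df.review)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3, shuffle=False)
clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=1, n_jobs=-1, verbose=3, max_iter=300).fit(X_train, y_train)

yhat = clf.predict(X_test)

print("accuracy:")
print(clf.score(X_test, y_test))

model_performance(X_train, y_train, X_test, y_test, clf)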

Text preprocessing was applied before this step. model_performance is just a function that creates a confusion matrix. All of this works well and gives good accuracy.

I now scrape new IMDB reviews:

from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

# The IMDB review page for the movie "Joker"
url_link = 'https://www.imdb.com/title/tt7286456/reviews'
html = urlopen(url_link)

content_bs = BeautifulSoup(html, 'html.parser')

JokerReviews = []
# Every review sits in a div with class "text", as can be seen in the IMDB page source
for b in content_bs.find_all('div', class_='text'):
    JokerReviews.append(b)

df = pd.DataFrame.from_records(JokerReviews)
df['sentiment'] = "0"
jokerData = df[0]
# preprocessor is the same text-cleaning function used on the training data
jokerData = jokerData.apply(preprocessor)

Problem: I now want to use the same logistic regression model to predict the sentiment of these reviews:

tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None, tokenizer=fill, use_idf=True, norm='l2', smooth_idf=True)
y = df.sentiment.values
Xjoker = tfidf2.fit_transform(jokerData)

yhat = clf.predict(Xjoker)

But I get the error: ValueError: X has 667 features per sample; expecting 74869

I don't understand why it has to have the same number of features as X_test.

Solution

The problem is that your model was trained on data whose preprocessing identified 74869 unique words, whereas the preprocessing of your inference data identified only 667. The model expects input with the same number of columns it was trained on. Besides that, some of the 667 words found in the inference data may not be among the words the model was trained on at all.
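To make the mismatch concrete, here is a minimal pure-Python sketch of what a vectorizer does. The helpers build_vocabulary and vectorize and the toy documents are made up for illustration; they are not part of scikit-learn, and plain counts stand in for real TF-IDF weights.

```python
# Toy bag-of-words vectorizer; build_vocabulary and vectorize are hypothetical helpers.

def build_vocabulary(docs):
    """Map every distinct token in docs to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(doc, vocabulary):
    """Count the tokens of doc into a vector with one column per vocabulary entry."""
    row = [0] * len(vocabulary)
    for tok in doc.split():
        col = vocabulary.get(tok)
        if col is not None:  # tokens unseen during training have no column
            row[col] += 1
    return row


train_docs = ["great movie great cast", "boring plot", "great plot"]
new_docs = ["great joker", "boring joker"]

train_vocab = build_vocabulary(train_docs)  # columns: boring, cast, great, movie, plot
new_vocab = build_vocabulary(new_docs)      # columns: boring, great, joker

# Refitting on the new data gives a different width and different column meanings.
wrong = [vectorize(d, new_vocab) for d in new_docs]
# Reusing the training vocabulary keeps the layout the model was trained on.
right = [vectorize(d, train_vocab) for d in new_docs]

print(len(wrong[0]), len(right[0]))
```

The first number printed is the width produced by a vocabulary built from the new documents, the second is the width the trained model would expect. The word "joker" has no column in the training vocabulary, so it is simply dropped.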

To create a valid input for your model, you have to use an approach such as:

# Assumes X and Xjoker are pandas DataFrames with one column per vocabulary word.
# Find the columns the model expects that are missing from the inference DataFrame.
not_existing_cols = [c for c in X.columns.tolist() if c not in Xjoker]
# Add these columns to the DataFrame.
Xjoker = Xjoker.reindex(Xjoker.columns.tolist() + not_existing_cols, axis=1)
# The new columns have no values yet, so replace the nulls with 0.
Xjoker = Xjoker.fillna(0)
# Select the columns in the original X order; this also drops columns the model never saw.
Xjoker = Xjoker[X.columns.tolist()]

After these steps, you can call the predict() method.
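A more common way to get the right columns in scikit-learn is to reuse the vectorizer that was fitted on the training data and call transform() on the new text, instead of fitting a second vectorizer. The sketch below uses a made-up toy corpus and a plain LogisticRegression in place of the question's LogisticRegressionCV and custom tokenizer, so treat it as an illustration of the pattern rather than a drop-in replacement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the IMDB training set (assumption for illustration).
train_reviews = [
    "a great and moving film",
    "great acting and a great script",
    "boring plot and terrible acting",
    "a terrible and boring film",
]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer once, on the training text only.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_reviews)
clf = LogisticRegression().fit(X_train, train_labels)

# New, unseen reviews: use transform(), not fit_transform().
new_reviews = ["joker is a great film", "joker is boring"]
X_new = tfidf.transform(new_reviews)

# Same number of columns as the training matrix, so predict() accepts it.
print(X_train.shape[1], X_new.shape[1])
yhat = clf.predict(X_new)
print(yhat)
```

In the question's setting, this would correspond to replacing tfidf2.fit_transform(jokerData) with tfidf.transform(jokerData), where tfidf is the vectorizer fitted on df.review. Words that appear only in the new reviews are ignored, since the model has no coefficients for them.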
