Pipeline and GridSearch for Doc2Vec

Question

I currently have the following script, which helps to find the best parameters for a doc2vec model. It works like this: first it trains a few models based on the given parameters, then it tests each of them with a classifier. Finally, it outputs the best model and classifier (I hope).

Data

Example data (data.csv) can be downloaded here: https://pastebin.com/takYp6T8. Note that the data is structured so that an ideal classifier should reach an accuracy of 1.0.

Script

import pandas as pd

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline


dataset = pd.read_csv("data.csv")

class Doc2VecModel(BaseEstimator):

    def __init__(self, dm=1, size=1, window=1):
        self.d2v_model = None
        self.size = size
        self.window = window
        self.dm = dm

    def fit(self, raw_documents, y=None):
        # Initialize model (gensim >= 4.0 names: vector_size and epochs replace size and iter)
        self.d2v_model = Doc2Vec(vector_size=self.size, window=self.window, dm=self.dm, epochs=5, alpha=0.025, min_alpha=0.001)
        # Tag each document with a unique label built from its index
        tagged_documents = []
        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        for index, row in raw_documents.items():
            tag = '{}_{}'.format("type", index)
            tokens = row.split()
            tagged_documents.append(TaggedDocument(words=tokens, tags=[tag]))
        # Build vocabulary
        self.d2v_model.build_vocab(tagged_documents)
        # Train model
        self.d2v_model.train(tagged_documents, total_examples=len(tagged_documents), epochs=self.d2v_model.epochs)
        return self

    def transform(self, raw_documents):
        # Infer one vector per document; infer_vector expects a list of tokens, not a raw string
        X = []
        for index, row in raw_documents.items():
            X.append(self.d2v_model.infer_vector(row.split()))
        X = pd.DataFrame(X, index=raw_documents.index)
        return X

    def fit_transform(self, raw_documents, y=None):
        self.fit(raw_documents)
        return self.transform(raw_documents)


param_grid = {'doc2vec__window': [2, 3],
              'doc2vec__dm': [0,1],
              'doc2vec__size': [100,200],
              'log__C': [0.1, 1],  # the prefix must match the pipeline step name 'log'
}

pipe_log = Pipeline([('doc2vec', Doc2VecModel()), ('log', LogisticRegression())])

log_grid = GridSearchCV(pipe_log, 
                        param_grid=param_grid,
                        scoring="accuracy",
                        verbose=3,
                        n_jobs=1)

fitted = log_grid.fit(dataset["posts"], dataset["type"])

# Best parameters
print("Best Parameters: {}\n".format(log_grid.best_params_))
print("Best accuracy: {}\n".format(log_grid.best_score_))
print("Finished.")

I have the following questions about my script (I combine them here to avoid three posts with the same code snippet):

  1. What is the purpose of def __init__(self, dm=1, size=1, window=1):? Can I remove this part somehow (I tried without success)?
  2. How can I add a RandomForest classifier (or others) to the GridSearch workflow/pipeline?
  3. How could a train/test data split be added to the code above, as the current script only trains on the full dataset?

Recommended answer

1) __init__() lets you define the parameters you would like your class to take at initialization (the equivalent of a constructor in Java).

Please look at these questions for more details:

  • Python __init__ and self what do they do?
  • Python constructors and __init__
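
In the context of this script, the explicit keyword arguments also matter to scikit-learn: BaseEstimator inspects the __init__ signature to implement get_params() and set_params(), and GridSearchCV relies on those to apply grid entries such as doc2vec__size. That is a plausible reason why removing the method did not work. The sketch below illustrates the mechanism; Doc2VecParams is a hypothetical stand-in that keeps only the constructor of Doc2VecModel, not the asker's full class.

```python
from sklearn.base import BaseEstimator, clone


class Doc2VecParams(BaseEstimator):
    """Hypothetical stand-in that keeps only the constructor of Doc2VecModel."""

    def __init__(self, dm=1, size=1, window=1):
        # Store each argument under the same name; BaseEstimator reads them back by name
        self.dm = dm
        self.size = size
        self.window = window


est = Doc2VecParams()
print(est.get_params())  # parameter names come from the __init__ signature

# A grid entry such as 'doc2vec__size': 100 is applied through set_params
est.set_params(size=100, window=3)

# clone() builds a fresh, unfitted copy from get_params(); GridSearchCV does this per candidate
copy = clone(est)
print(copy.size, copy.window)
```

Without the named arguments, get_params() would have nothing to report and the grid search could not set dm, size, or window on the step.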

2) Why do you want to add the RandomForestClassifier, and what will its input be?

Looking at your other two questions, do you want to compare the output of RandomForestClassifier with that of LogisticRegression here? If so, you are on the right track in this question of yours.
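
If the goal is indeed to compare the two classifiers, one possible approach is to pass GridSearchCV a list of parameter grids in which the final pipeline step itself is a searchable parameter. The sketch below is an assumption-laden illustration rather than the asker's setup: the toy corpus is invented, the step names "vec" and "clf" are arbitrary, and TfidfVectorizer stands in for Doc2VecModel so the example runs without gensim.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only
posts = ["red apple pie", "green apple tart", "sweet apple jam", "ripe apple juice",
         "fast car race", "blue car door", "old car engine", "new car wheel"]
labels = ["food"] * 4 + ["auto"] * 4

# TfidfVectorizer stands in for Doc2VecModel (assumption, to avoid a gensim dependency)
pipe = Pipeline([("vec", TfidfVectorizer()), ("clf", LogisticRegression())])

# Each dict is searched separately; the "clf" key swaps the final step itself
param_grid = [
    {"clf": [LogisticRegression()], "clf__C": [0.1, 1]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [10, 50]},
]

grid = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=2)
grid.fit(posts, labels)

# best_params_["clf"] holds the winning classifier instance
print(type(grid.best_params_["clf"]).__name__, grid.best_score_)
```

With the original pipeline, the same idea would mean replacing the TfidfVectorizer step with Doc2VecModel() and adding the doc2vec__ entries to each dict.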

3) You have already imported train_test_split, so just use it.

X_train, X_test, y_train, y_test = train_test_split(dataset["posts"], dataset["type"])

fitted = log_grid.fit(X_train, y_train)
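
After fitting on the training portion, the held-out portion can be used to estimate how the selected pipeline generalizes. The following is a minimal sketch under stated assumptions: the toy data is invented, TfidfVectorizer again stands in for Doc2VecModel, and the test_size, stratify, and random_state values are illustrative choices that do not come from the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only
posts = ["red apple pie", "green apple tart", "sweet apple jam", "ripe apple juice",
         "fast car race", "blue car door", "old car engine", "new car wheel"]
labels = ["food"] * 4 + ["auto"] * 4

# Hold out 25% of the data; stratify keeps both classes in each part
X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.25, stratify=labels, random_state=0)

# TfidfVectorizer stands in for Doc2VecModel (assumption, to avoid a gensim dependency)
pipe = Pipeline([("vec", TfidfVectorizer()), ("log", LogisticRegression())])
grid = GridSearchCV(pipe, param_grid={"log__C": [0.1, 1]}, scoring="accuracy", cv=3)

# Cross-validation happens inside the training portion only
grid.fit(X_train, y_train)

# The refitted best pipeline is then scored on data it has never seen
test_accuracy = grid.score(X_test, y_test)
print("Held-out accuracy:", test_accuracy)
print(classification_report(y_test, grid.predict(X_test), zero_division=0))
```

The cross-validated best_score_ and the held-out accuracy answer different questions: the first guides parameter selection, the second checks the chosen model on unseen data.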
