How to turn rows of a dataframe into feature vectors?


Question

I have a dataframe in which each row represents a low-level user activity on a computer, associated with a higher-level business process activity. Each high-level business process activity is made up of a sequence of such low-level activities, one per row. The data frame looks like this:

So it is a sequence classification problem: each sequence is identified by a case ID, and each row is one data point in a sequence. I need to train a model to predict the business process activity that each sequence represents.

For this, I need to transform each row of the dataframe into a feature vector. The problem is that the columns hold different kinds of information from row to row; some of the data is numerical and some is textual (for example, the content of a Word document). I need to use all of the data for training. How do I convert these rows into feature vectors for training?

Recommended answer

For textual data, what you are looking for is called an embedding. If a text field takes only a few unique values, one-hot encoding is enough (for example, sklearn.preprocessing.OneHotEncoder). For more elaborate sequences (with more than 100k different values), you may want to look into sequence encoding. One approach is to use the pretrained embeddings provided by Google (see, for example, https://code.google.com/archive/p/word2vec/), which yield a vector for every word in a text sequence. The word vectors of each text sequence are then averaged, which gives a single representation of the entire sequence.
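As a rough illustration of the idea, here is a minimal sketch in plain Python that builds one feature vector per row by concatenating a numeric value, a one-hot encoded categorical field, and an averaged word embedding of a text field. Everything specific in it is an assumption: the column names (duration_s, application, content) are hypothetical because the actual dataframe is not shown, the hand-written one_hot helper stands in for sklearn.preprocessing.OneHotEncoder, and the tiny TOY_EMBEDDINGS table stands in for real pretrained word2vec vectors.

```python
from statistics import fmean

# Hypothetical toy embedding table (assumption). In practice these vectors
# would come from a pretrained model such as word2vec.
TOY_EMBEDDINGS = {
    "open": [1.0, 0.0, 0.0],
    "invoice": [0.0, 1.0, 0.0],
    "document": [0.0, 0.0, 1.0],
}
EMBEDDING_DIM = 3


def one_hot(value, categories):
    """One-hot encode value against a fixed category list.

    Values not in the list map to an all-zero vector.
    """
    return [1.0 if value == c else 0.0 for c in categories]


def average_embedding(text, embeddings=TOY_EMBEDDINGS, dim=EMBEDDING_DIM):
    """Average the vectors of all known words in text.

    Returns an all-zero vector if no word is in the embedding table.
    """
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every word vector together.
    return [fmean(component) for component in zip(*vectors)]


def row_to_features(row, app_categories):
    """Concatenate numeric, one-hot, and averaged-embedding parts of a row."""
    return (
        [float(row["duration_s"])]
        + one_hot(row["application"], app_categories)
        + average_embedding(row["content"])
    )


# Hypothetical rows; the column names are made up for illustration.
rows = [
    {"duration_s": 12, "application": "word", "content": "Open invoice document"},
    {"duration_s": 3, "application": "excel", "content": "invoice"},
]

# Fix the category order once so every row gets the same vector layout.
app_categories = sorted({r["application"] for r in rows})
features = [row_to_features(r, app_categories) for r in rows]

for vector in features:
    print(vector)
```

Every row ends up with the same length (here 1 + 2 + 3 = 6 values), so the rows belonging to one case ID can then be stacked into a sequence for a sequence classifier. Plain averaging ignores word order, which is a known limitation of this approach.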
