numpy/pandas: How to convert a series of strings of zeros and ones into a matrix
Question
I have data that arrives in this format:
[
(1, "000010101001010101011101010101110101", "aaa", ... ),
(0, "111101010100101010101110101010111010", "bb", ... ),
(0, "100010110100010101001010101011101010", "ccc", ... ),
(1, "000010101001010101011101010101110101", "ddd", ... ),
(1, "110100010101001010101011101010111101", "eeee", ... ),
...
]
In tuple format, it looks like this:
(Y, X, other_info, ... )
At the end of the day, I need to train a classifier (e.g. sklearn.linear_model.logistic.LogisticRegression) using Y and X.
What's the most straightforward way to turn the string of ones and zeros into something like a np.array, so that I can run it through the classifier? Seems like there should be an easy answer here, but I haven't been able to think of/google one.
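A few notes:
- I'm already using numpy/pandas/sklearn, so anything in those libraries is fair game.
- For a lot of what I'm doing, it's convenient to keep the other_info columns together in a DataFrame.
- The strings are pretty long (~20,000 columns), but the total data frame is not very tall (~500 rows).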
Recommended answer
Since you asked primarily for a way to convert a string of ones and zeros into a numpy array, I'll offer my solution as follows:
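import numpy as np
d = '0101010000' * 2000 # create a 20,000-character string of 1s and 0s
# Note: binary-mode np.fromstring is deprecated in newer NumPy; np.frombuffer(d.encode('ascii'), 'int8') is the equivalent there
d_array = np.fromstring(d, 'int8') - 48 # 48 is ASCII '0'; ASCII '1' is 49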
This compares favourably to @DSM's solution in terms of speed:
In [21]: timeit numpy.fromstring(d, dtype='int8') - 48
10000 loops, best of 3: 35.8 us per loop
In [22]: timeit numpy.fromiter(d, dtype='int', count=20000)
100 loops, best of 3: 8.57 ms per loop
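
To get from a single string to the full matrix the question asks about, the same byte-offset trick can be applied per row and the results stacked. The following is only a minimal sketch under a few assumptions: the `rows` sample data and the `bits_to_array` helper are hypothetical names made up for illustration, every bit string is assumed to have the same length, and `np.frombuffer` is used in place of `np.fromstring` because binary-mode `fromstring` is deprecated in newer NumPy releases.

```python
import numpy as np

# Hypothetical sample rows in the (Y, X, other_info) layout from the question.
rows = [
    (1, "00001010", "aaa"),
    (0, "11110101", "bb"),
    (0, "10001011", "ccc"),
]

def bits_to_array(bits):
    """Convert a string of '0'/'1' characters into a uint8 array of 0s and 1s."""
    # ord('0') == 48, so subtracting it maps byte '0' -> 0 and byte '1' -> 1.
    return np.frombuffer(bits.encode("ascii"), dtype=np.uint8) - ord("0")

# One matrix row per string; assumes all strings have the same length.
X = np.vstack([bits_to_array(x) for _, x, _ in rows])

# Labels as a 1-D array, in the same order as the rows of X.
y = np.array([label for label, _, _ in rows])
```

`X` and `y` should then be in a shape that a scikit-learn classifier's `fit(X, y)` accepts, and the remaining tuple fields can stay in a DataFrame that shares the same row order.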