How to create a Pandas DataFrame from a list of lists with different lengths?
Problem description
I have data in the following format:
data = [["a", "b", "c"],
["b", "c"],
["d", "e", "f", "c"]]
and I would like a DataFrame with every unique string as a column and binary values marking whether it occurs in each row, like this:
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1
I have working code that uses a list comprehension, but it is quite slow for large data.
# vocab_list contains all the unique keys, which is obtained when reading in data from file
df = pd.DataFrame([[1 if word in entry else 0 for word in vocab_list] for entry in data])
Is there any way to optimise this task? Thanks.
EDIT (a small sample of actual data):
[['a', 'about', 'absurd', 'again', 'an', 'associates', 'writes', 'wrote', 'x', 'york', 'you', 'your'], ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all', 'almost', 'alone', 'already', 'also', 'although']]
Recommended answer
For better performance, use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# fit_transform returns a 0/1 indicator matrix; classes_ holds the sorted unique labels
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1
data = [['a', 'about', 'absurd', 'again', 'an', 'associates',
         'writes', 'wrote', 'x', 'york', 'you', 'your'],
        ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all',
         'almost', 'alone', 'already', 'also', 'although']]
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)
a abiding about absurd again age aggravated aggressively all \
0 1 0 1 1 1 0 0 0 0
1 1 1 0 0 0 1 1 1 1
almost ... also although an associates writes wrote x york you \
0 0 ... 0 0 1 1 1 1 1 1 1
1 1 ... 1 1 0 0 0 0 0 0 0
your
0 1
1 0
[2 rows x 22 columns]
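If the vocabulary is very large, the dense 0/1 matrix itself can become the memory bottleneck. The following is a minimal sketch of one way around that, not something benchmarked here: it assumes that memory rather than raw speed is the concern, and uses `MultiLabelBinarizer(sparse_output=True)` together with `pd.DataFrame.sparse.from_spmatrix`. The names `indicator`, `sparse_df` and `dense_df` are just illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# sparse_output=True makes fit_transform return a SciPy sparse matrix
mlb = MultiLabelBinarizer(sparse_output=True)
indicator = mlb.fit_transform(data)

# Wrap the sparse matrix in a DataFrame without densifying it
sparse_df = pd.DataFrame.sparse.from_spmatrix(indicator, columns=list(mlb.classes_))

# Convert to an ordinary dense frame only if and when it is needed
dense_df = sparse_df.sparse.to_dense()
print(dense_df)
```

The sparse frame stores only the non-zero entries, so it tends to pay off when each row contains a small fraction of the vocabulary.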
A pure pandas solution is also possible, though I would expect it to be slower:
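# Run on the first (three-row) example.
# max(level=...) was removed in pandas 2.0, so group the duplicated column labels instead;
# sort=False keeps the columns in order of first appearance, dtype=int gives 0/1 rather than booleans.
df = (pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='', dtype=int)
        .T.groupby(level=0, sort=False).max().T)
print(df)
   a  b  d  c  e  f
0  1  1  0  1  0  0
1  0  1  0  1  0  0
2  0  0  1  1  1  1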
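# Same approach on the sample of actual data (max(level=...) was removed in pandas 2.0)
df = (pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='', dtype=int)
        .T.groupby(level=0, sort=False).max().T)
print(df)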
a abiding about absurd age again aggravated aggressively an all \
0 1 0 1 1 0 1 0 0 1 0
1 1 1 0 0 1 0 1 1 0 1
... writes alone wrote already x also york although you your
0 ... 1 0 1 0 1 0 1 0 1 1
1 ... 0 1 0 1 0 1 0 1 0 0
[2 rows x 22 columns]
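Another pure pandas variant, offered as a sketch rather than a tested recommendation, avoids padding the ragged lists into a rectangular frame. It explodes the lists into a long (row, token) table and cross-tabulates it. The column names `row` and `token` are made up for this example, and no claim is made here about how its speed compares with the approaches above.

```python
import pandas as pd

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# One line per (row, token) pair; the original list position becomes the "row" column
long = pd.Series(data).explode().reset_index()
long.columns = ["row", "token"]

# crosstab counts occurrences; clip(upper=1) turns counts into 0/1 indicators
df = pd.crosstab(long["row"], long["token"]).clip(upper=1)
print(df)
```

Note that a row whose list is empty would explode to a missing value and drop out of the cross-tabulation, so it would need to be reindexed back in if that case can occur.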