Plotting event density in Python with ggplot and pandas
Problem description
I am trying to visualize data of this form:
timestamp senderId
0 735217 106758968942084595234
1 735217 114647222927547413607
2 735217 106758968942084595234
3 735217 106758968942084595234
4 735217 114647222927547413607
5 etc...
geom_density works if I don't separate the senderIds:
import pandas as pd
from ggplot import *

df = pd.read_pickle('data.pkl')
df.columns = ['timestamp', 'senderId']
plot = ggplot(aes(x='timestamp'), data=df) + geom_density()
print(plot)
The result looks as expected:
However, if I want to show the senderIds separately, as is done in the doc, it fails:
> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.
Trying with a larger dataset (about 40K events):
> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix
Any ideas? There are some answers on SO for those errors, but none seem relevant.
This is the kind of graph I want (from ggplot's doc):
Recommended answer
With the smaller dataset:
> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
ValueError: `dataset` input should have multiple elements.
This was because some senderIds had only one row.
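One way to avoid this case is to drop senders that have fewer than two rows before plotting. The following is a minimal sketch using pandas only; the sample frame and its short sender labels are made up for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample shaped like the question's data (values are invented)
df = pd.DataFrame({
    "timestamp": [735217, 735217, 735218, 735219, 735220],
    "senderId": ["a", "b", "a", "a", "c"],
})

# Attach to every row the number of rows its sender has
rows_per_sender = df.groupby("senderId")["timestamp"].transform("size")

# Keep only senders with at least two rows, so each group has
# more than one element to estimate a density from
filtered = df[rows_per_sender >= 2]
print(filtered)
```

The filtered frame can then be passed as data= in place of df.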
With the larger dataset:
> plot = ggplot(aes(x='timestamp', color='senderId'), data=df) + geom_density()
numpy.linalg.linalg.LinAlgError: singular matrix
This was because for some senderIds I had multiple rows at the exact same timestamp. This is not supported by ggplot. I could solve it by using finer timestamps.
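To see which senders are affected before switching to finer timestamps, one option is to count distinct timestamps per sender. This sketch assumes that the problematic groups are those whose timestamps are all identical, so there is no spread to estimate a density from; that is an interpretation of the error, not something confirmed by ggplot's documentation. The sample data is invented.

```python
import pandas as pd

# Hypothetical sample: sender "a" has two rows at the same timestamp
df = pd.DataFrame({
    "timestamp": [735217, 735217, 735217, 735218, 735219],
    "senderId": ["a", "a", "b", "b", "b"],
})

# Number of distinct timestamp values per sender
distinct = df.groupby("senderId")["timestamp"].nunique()

# Senders whose timestamps are all identical (assumed to be the failing groups)
degenerate = distinct[distinct < 2].index.tolist()
print(degenerate)

# Rows that remain once those senders are excluded
plottable = df[~df["senderId"].isin(degenerate)]
```

Excluding these senders is an alternative to the finer timestamps used above; which one fits depends on whether the coarse timestamps hide real variation.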