Bivariate CDF/CCDF Distribution Python


Problem description

I am trying to plot a bivariate CCDF of a dataset that has both x and y values.

I can plot the univariate case without trouble; below are the input and the code for the univariate dataset.

Input: These are only the first 20 rows of the data points. The input has thousands of rows, of which col[1] and col[3] need to be plotted, as they capture a relationship between user frequency and keyword frequency.

tweetcricscore  34 #afgvssco   51
tweetcricscore  23 #afgvszim   46
tweetcricscore  24 #banvsire   12
tweetcricscore  456 #banvsned  46
tweetcricscore  653 #canvsnk   1
tweetcricscore  789 #cricket   178
tweetcricscore  625 #engvswi   46
tweetcricscore  86 #hkvssco    23
tweetcricscore  3 #indvsban    1
tweetcricscore  87 #sausvsvic  8
tweetcricscore  98 #wt20       56

Code: univariate dataset

import numpy as np
import matplotlib.pyplot as plt

# comments=None keeps the '#' in the hashtags from being read as a comment marker
data = np.genfromtxt('keyword.csv', delimiter=',', comments=None)

d0 = data[:, 1]                 # frequency column
X0 = np.sort(d0)                # values in ascending order
cdf0 = np.arange(len(X0))/float(len(X0))
ccdf0 = 1 - cdf0
plt.plot(X0, ccdf0, color='b', marker='.', label='Keywords')

plt.legend(loc='upper right')
plt.xlabel('Freq (x)')
plt.ylabel('ccdf(x)')
plt.gca().set_xscale("log")
#plt.gca().set_yscale("log")
plt.show()

I am looking for an option for bivariate data points. I referred to Seaborn Bivariate Distribution, but I am not able to put it in proper context with my dataset.

Any alternative suggestions within Python, matplotlib, or seaborn are welcome. Thanks in advance.

Solution

Bivariate distributions of the kind you're trying to describe are usually continuous, for instance the size of a house (input, x) and its price (output, y). In your case there is no meaningful relationship (I think) in the number of the keyword, as it's probably just an ID assigned to the keyword, right?

It seems to me that you have categories (keywords). Each category appears to have two numbers: a tweetcricscore and a keyword number.

Your code here:

cdf0 = np.arange(len(X0))/float(len(X0))

suggests to me that your x range is just their labels and not a meaningful value.
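It may help to see what the quoted line computes. The sketch below assumes the first numeric column of the sample rows as input; `X0`, `cdf0` and `ccdf0` follow the question's names. Each sorted value is paired with the fraction of observations that come before it in sorted order, so `cdf0` depends only on the number of points, while the data values themselves end up on the x axis through `X0`.

```python
import numpy as np

# First numeric column from the sample rows in the question.
d0 = np.array([34, 23, 24, 456, 653, 789, 625, 86, 3, 87, 98])

X0 = np.sort(d0)                            # data values, ascending
cdf0 = np.arange(len(X0)) / float(len(X0))  # 0/n, 1/n, ..., (n-1)/n
ccdf0 = 1 - cdf0                            # share of points >= each X0 value (no ties here)

for x, c in zip(X0, ccdf0):
    print(x, round(c, 3))
```

With tied values the pairing is only approximate, because equal observations receive different rank fractions.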

A better source for categorical plots can be found here.

To create a bivariate distribution plot, assuming that's still what you want after reading that, you could do the following, using your data from above as an example:

import numpy as np
import seaborn as sns

col_1 = np.array([34, 23, 24, 456, 653, 789, 625, 86, 3, 87, 98])
col_3 = np.array([51, 46, 12, 46, 1, 178, 46, 23, 1, 8, 56])

sns.jointplot(x=col_3, y=col_1)

This produces a rather nonsensical figure.

You'll have to add the x and y labels manually, because you're passing NumPy arrays instead of a pandas DataFrame. A DataFrame can be thought of as a dictionary in which each key is the title of a column and each value is a NumPy array.

Here is how it might look with random numbers, that is, with a more random, continuous, correlated dataset.
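As a sketch of the DataFrame route, the same two columns can be wrapped in a DataFrame built from a dictionary. The column names `user_freq` and `keyword_freq` are assumptions made up for illustration (the question only calls them col[1] and col[3]), and `plot_joint` is a hypothetical helper, not a seaborn function. With named columns, `sns.jointplot` takes the axis labels from the column names.

```python
import numpy as np
import pandas as pd

col_1 = np.array([34, 23, 24, 456, 653, 789, 625, 86, 3, 87, 98])
col_3 = np.array([51, 46, 12, 46, 1, 178, 46, 23, 1, 8, 56])

# A DataFrame built from a dict: keys become column titles, values the columns.
# The column names are illustrative assumptions, not taken from the original data.
df = pd.DataFrame({"user_freq": col_1, "keyword_freq": col_3})


def plot_joint(frame):
    """Hypothetical helper: joint plot whose axis labels come from the column names."""
    import seaborn as sns  # imported lazily so the data preparation runs on its own
    return sns.jointplot(x="keyword_freq", y="user_freq", data=frame)


print(df.head())
```

Calling `plot_joint(df)` draws the figure. When bare arrays are passed instead, the labels can be set afterwards with `set_axis_labels("col_3", "col_1")` on the grid object that `jointplot` returns.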

This is the example taken from the docs.

import numpy as np
import seaborn as sns
import pandas as pd

mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
sns.jointplot(x="x", y="y", data=df);

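This gives a scatter plot of x against y, with a histogram of each variable along the top and right edges.

The bar graphs along the edges of the chart can be thought of as univariate charts (which is what you have probably produced already), because each one describes the distribution of only one variable (x or y, that is, col_3 or col_1).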
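For the joint CCDF that the question asks about, one possible reading is the empirical survival function $P(X > x, Y > y)$ evaluated on a grid of thresholds. The following is a minimal NumPy sketch under that assumption. The function name `empirical_joint_ccdf` is hypothetical and not part of any library, the grid values are arbitrary, and the strict inequality is a choice (use `>=` for the other convention).

```python
import numpy as np


def empirical_joint_ccdf(x, y, x_grid, y_grid):
    """Fraction of samples with X > xg and Y > yg for every grid pair.

    Returns an array of shape (len(x_grid), len(y_grid)).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_grid = np.asarray(x_grid, dtype=float)
    y_grid = np.asarray(y_grid, dtype=float)
    # above_x[i, k] is True when sample k exceeds x_grid[i]; likewise for y.
    above_x = x[None, :] > x_grid[:, None]
    above_y = y[None, :] > y_grid[:, None]
    # Count the samples that exceed both thresholds, then normalise.
    counts = above_x.astype(int) @ above_y.astype(int).T
    return counts / x.size


col_1 = np.array([34, 23, 24, 456, 653, 789, 625, 86, 3, 87, 98])
col_3 = np.array([51, 46, 12, 46, 1, 178, 46, 23, 1, 8, 56])

x_grid = np.array([0, 50, 500])   # arbitrary thresholds for col_1
y_grid = np.array([0, 40])        # arbitrary thresholds for col_3
ccdf = empirical_joint_ccdf(col_1, col_3, x_grid, y_grid)
print(ccdf)
```

The resulting grid can be passed to `plt.pcolormesh(x_grid, y_grid, ccdf.T)` or `plt.contourf` to visualise it, possibly with logarithmic axes if the frequencies are heavy-tailed.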
