CDF Cumulative Distribution Function Error

Question

I am trying to plot a CDF for one column of a multi-column data file. When the data file contains only one column, it plots fine. When I try to grab a particular column from the data, I get an error. I also tried reading the column with a for loop, and it reads fine. If I put the plot statements outside the for loop, the plot shows only the last value of the column, and if I keep the plot statement inside the loop, it gives an error. The problem is not with reading the file or the particular column, and it is not an indentation problem either. How do I fix it?

Code with a for loop:

import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator

with open('input.txt', 'r') as f:
    for rows in f:
        cols = rows.split()
        data = cols[2]
        sorted_data = np.sort(data)
        cdf = np.arange(len(data))/float(len(data))
        plt.plot(sorted_data, cdf, '-bs')

plt.show()
#print data

Error:

Traceback (most recent call last):
  File "cdf_plot.py", line 13, in <module>
    plt.plot(sorted_data, cdf, '-bs')
  File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 2467, in plot
    ret = ax.plot(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 3893, in plot
    for line in self._get_lines(*args, **kwargs):
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 322, in _grab_next_args
    for seg in self._plot_args(remaining, kwargs):
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 300, in _plot_args
    x, y = self._xy_from_xy(x, y)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 240, in _xy_from_xy
    raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension

Code with no for loop:

import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator

data = np.loadtxt('input.txt')
data_one = [row[2] for row in data]
sorted_data = np.sort(data)
cdf = np.arange(len(data_one))/float(len(data_one))
#cumulative = np.cumsum(data)
#ccdf = 1 - cdf

#plt.plot(data, sorted_data, 'r-*')
plt.plot(sorted_data, cdf, '-bs')

#plt.xlim([0,0.5])
plt.gca().set_xscale("log")
plt.gca().set_yscale("log")
plt.show()

Error:

Traceback (most recent call last):
  File "cum_graph.py", line 7, in <module>
    data = np.loadtxt('e_p_USC_30_days.txt')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 804, in loadtxt
    X = np.array(X, dtype)
ValueError: setting an array element with a sequence.

Input file: I am interested in calculating the CDF of col[2], i.e. column 3 only.

4814  2464  27  0.000627707861971  117923.0
4211  736  2  4.64968786645  05  2576.0
2075  1339  30  0.000697453179968  499822.0
2441  2381  3  6.97453179968  05  1968.0
4694  1738  1  2.32484393323  05  5702.0
4406  3008  12  0.000278981271987  8483.0
3622  1396  3  6.97453179968  05  2564.0
5425  478  1  2.32484393323  05  428.0
4489  1715  6  0.000139490635994  19045.0
3695  3387  2  4.64968786645  05  16195.0

Answer

There really are a lot of mistakes here.

Look carefully at your data:

4814  2464  27  0.000627707861971  117923.0
4211  736  2  4.64968786645  05  2576.0
2075  1339  30  0.000697453179968  499822.0
2441  2381  3  6.97453179968  05  1968.0
4694  1738  1  2.32484393323  05  5702.0
4406  3008  12  0.000278981271987  8483.0
3622  1396  3  6.97453179968  05  2564.0
5425  478  1  2.32484393323  05  428.0
4489  1715  6  0.000139490635994  19045.0
3695  3387  2  4.64968786645  05  16195.0

Sometimes you have 6 columns, as in:

4211  736  2  4.64968786645  05  2576.0

And sometimes you have only 5:

4814  2464  27  0.000627707861971  117923.0

So the first thing is to learn how to write data correctly.

Imagine that all your data are in a 2D numpy array called data.

You can call:

numpy.savetxt("input.txt", data)

or, to get more control over formatting:

numpy.savetxt("input.txt", data, fmt="%d %d %d %.6f %d %.1f")

The fmt= parameter tells numpy how you want to save your data (%d means write it as an integer, %f means write it as a float, %.5f means write it as a float with only 5 decimals).
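As a minimal sketch of how such a multi-format string behaves, the snippet below saves a small array and reads the first line back. The two rows and the file name demo.txt are made up for illustration (the fifth column is a placeholder integer); only numpy.savetxt and its fmt argument come from the answer.

```python
import os
import tempfile

import numpy as np

# Hypothetical 2 x 6 array; the fifth column is a placeholder integer.
data = np.array([
    [4814, 2464, 27, 0.000627707861971, 0, 117923.0],
    [4211, 736, 2, 0.0000464968786645, 5, 2576.0],
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name
    # One format specifier per column; every row is written the same way.
    np.savetxt(path, data, fmt="%d %d %d %.6f %d %.1f")
    with open(path) as f:
        first_line = f.readline().strip()

print(first_line)
```

Because the format string is applied to every row, each written line has exactly six fields.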

If you want to write it yourself, you could do something like:

fmt = "%d %d %d %.6f %d %.1f"
with open("input.txt", "w") as f:
    for row in data:
        f.write(fmt % tuple(row) + "\n")  # a numpy row must be converted to a tuple for % formatting

If the lines with 5 columns instead of 6 are what you really want to write, then use another delimiter such as a comma. This way,

4814,2464,27,0.000627707861971,,117923.0

clearly contains 6 columns.

What I call valid data is consistent data: data that always contains the same number of columns.
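One way to check this before loading anything is to count the fields on each line. The helper below is a hypothetical sketch (column_counts is not a numpy function), run here on three lines taken from the question's input file.

```python
def column_counts(lines):
    """Return the set of distinct field counts among non-empty lines."""
    return {len(line.split()) for line in lines if line.strip()}


sample = [
    "4814  2464  27  0.000627707861971  117923.0",
    "4211  736  2  4.64968786645  05  2576.0",
    "2075  1339  30  0.000697453179968  499822.0",
]

counts = column_counts(sample)
# More than one distinct count means the rows are inconsistent.
is_consistent = len(counts) == 1
print(counts, is_consistent)
```

With a real file, the same function can be called on the open file object, since iterating over it yields lines.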

You should really use numpy.loadtxt or numpy.genfromtxt (the latter is the one to use if data are missing). Note that you can specify a delimiter for both of them using the delimiter argument.
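A small sketch of the missing-data case, assuming the file had been written with commas as suggested above. The two rows are adapted from the question's data and are read from an in-memory string instead of a file.

```python
from io import StringIO

import numpy as np

# Comma-delimited text; the first row has an empty fifth field.
text = (
    "4814,2464,27,0.000627707861971,,117923.0\n"
    "4211,736,2,4.64968786645,05,2576.0\n"
)

# genfromtxt fills the missing float field with nan by default.
data = np.genfromtxt(StringIO(text), delimiter=",")

print(data.shape)
print(data[:, 2])
```

Both rows now have six columns, and the third column can be sliced as usual.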

data = numpy.loadtxt("valid_input.txt")
col = data[:,2]

or, equivalently, you could use the usecols argument together with the unpack argument.

For your data, the usecols method works if you select only the third column (column 2 in Python terms), provided there is nothing else wrong before column 2 elsewhere in the file.
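A minimal sketch of the usecols and unpack combination. It deliberately uses three consistent, made-up rows read from a string, so it does not depend on how a given numpy version treats rows of differing length.

```python
from io import StringIO

import numpy as np

# Hypothetical, consistent input: every row has five fields.
text = (
    "4814 2464 27 0.000627707861971 117923.0\n"
    "4211 736 2 0.0000464968786645 2576.0\n"
    "2075 1339 30 0.000697453179968 499822.0\n"
)

# usecols=(2,) keeps only the third column; unpack=True transposes the result.
col = np.loadtxt(StringIO(text), usecols=(2,), unpack=True)

print(col)
```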

You could do it by hand, which brings us to another mistake:

There, you just replace the variable data with a single value (the one in cols[2]):

with open('input.txt', 'r') as f:
    for rows in f:
        cols = rows.split()
        data = cols[2]

There you try to sort a single value:

        sorted_data = np.sort(data)

There you try to get the length of a single value:

        cdf = np.arange(len(data))/float(len(data))
        plt.plot(sorted_data, cdf, '-bs')

plt.show()

I'm really surprised numpy does not complain.
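To make the mismatch visible, here is a small sketch for the first row of the file, where cols[2] is the string "27". np.atleast_1d is used here only as a stand-in to show how many x values there really are; it is not part of the question's code.

```python
import numpy as np

value = "27"  # what cols[2] holds for the first row: a string, not a number

# len() of a string counts characters, so this builds two y values.
y = np.arange(len(value)) / float(len(value))

# There is only one x value, however it is wrapped.
x = np.atleast_1d(value)

print(x.shape, y.shape)
```

One x value against two y values is exactly the situation that "x and y must have same first dimension" describes.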

You are getting one row at a time: you need to store these values somewhere (in a list, for instance) and then do your processing on them.
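A rough sketch of that idea, not a definitive implementation: the lines are held in a list here instead of being read from input.txt, and the CDF formula is the one from the question. Plotting is left as a comment so the snippet runs without a display.

```python
import numpy as np

# Three lines from the question's input; with a real file, iterate over f instead.
lines = [
    "4814  2464  27  0.000627707861971  117923.0",
    "4211  736  2  4.64968786645  05  2576.0",
    "2075  1339  30  0.000697453179968  499822.0",
]

values = []
for line in lines:
    cols = line.split()
    values.append(float(cols[2]))  # convert the text to a number and keep it

# Sort and build the CDF once, after the loop, on the whole column.
sorted_data = np.sort(values)
cdf = np.arange(len(sorted_data)) / float(len(sorted_data))

# plt.plot(sorted_data, cdf, '-bs') would go here, followed by plt.show()
print(sorted_data, cdf)
```

Since sorted_data and cdf are built from the same list, they always have the same length.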

numpy.loadtxt can't load your data (it tries to load everything by default) because it can't infer what you want to do with rows that have 6 columns in some places and 5 in others. So the only thing it can do is fail.

First, don't be offended: what I'm going to say is meant to help you improve. I'm not judging you in any way, just showing you how to react when faced with this kind of error, trivial or not.

  1. Read the errors.
  2. Try to understand what is going on.
  3. Look up these errors on the internet.
  4. Ask someone.

The problem is that you seem to have just copy-pasted the errors without actually looking at them, and so without trying to understand them (but I may be wrong, I'm not in your head :)).

But what is certain is that you have not copy-pasted them into your favorite search engine, because answers are plentiful. Again, I may be wrong. Maybe you did, but without seeing how those answers apply to your case. Still, the first Google result for

ValueError: x and y must have same first dimension

is pretty explicit. You don't even have to mention that this is matplotlib or Python. You would then have discovered that sorted_data is not the same length as cdf. With a little more work, you would have figured out what I said above about your implementations.

As you've seen, I have not given a "canonical answer", and I won't, since I consider that you have not done your part of the job. But you can still do it: I've given you all the tools you need to answer your own question. That doesn't mean you have to do it all alone on a remote island: I've almost given a complete answer (really), and the documentation and Google can help too :). All you have to do is search a tiny bit for it. Once you have something working, edit your question (or answer your own question).
