Best Fit Line on Log-Log Scales in Python 2.7


Problem description

This is a network IP frequency rank plot on log scales. With that part done, I am trying to plot the best-fit line on log-log scales using Python 2.7. I have to use matplotlib's "symlog" axis scale; otherwise some of the values are not displayed properly and some are hidden.

The X values of the data I am plotting are URLs, and the Y values are the corresponding frequencies of those URLs.

My data looks like this:

`http://www.bing.com/search?q=d2l&src=IE-TopResult&FORM=IETR02&conversationid=  123 0.00052210688591`
`http://library.uc.ca/  118 4.57782298326e-05`
`http://www.bing.com/search?q=d2l+uofc&src=IE-TopResult&FORM=IETR02&conversationid= 114 4.30271029472e-06`
`http://www.nature.com/scitable/topicpage/genetics-and-statistical-analysis-34592   109 1.9483268261e-06`

The data contains the URL in the first column, the corresponding frequency (the number of times the same URL appears) in the second, and the bytes transferred in the third. For now, I am using only the first and second columns for this analysis. There are 2,465 x values (unique URLs) in total.

Here is my code:

import os
import matplotlib.pyplot as plt
import numpy as np
import math
from numpy import *
import scipy
from scipy.interpolate import *
from scipy.stats import linregress
from scipy.optimize import curve_fit

file = open(filename1, 'r')
lines = file.readlines()

result = {}
x=[]
y=[]
for line in lines:
  course,count,size = line.lstrip().rstrip('\n').split('\t')
  if course not in result:
      result[course] = int(count)
  else:
      result[course] += int(count)
file.close()

frequency = sorted(result.items(), key = lambda i: i[1], reverse= True)
x=[]
y=[]
i=0
for element in frequency:
  x.append(element[0])
  y.append(element[1])


z=[]
fig=plt.figure()
ax = fig.add_subplot(111)
z=np.arange(len(x))
print z
logA = [x*np.log(x) if x>=1 else 1 for x in z]
logB = np.log(y)
plt.plot(z, y, color = 'r')
plt.plot(z, np.poly1d(np.polyfit(logA, logB, 1))(z))
ax.set_yscale('symlog')
ax.set_xscale('symlog')
slope, intercept = np.polyfit(logA, logB, 1)
plt.xlabel("Pre_referer")
plt.ylabel("Popularity")
ax.set_title('Pre Referral URL Popularity distribution')
plt.show()

You will see a lot of imported libraries because I have been experimenting with many of them, but none of my experiments has produced the expected result. The code above generates the rank plot correctly, which is the red line. The blue line, which is supposed to be the best-fit line, is visibly incorrect. This is the graph generated.

This is the graph I am expecting. The dotted lines in the second graph are what I am somehow plotting incorrectly.

Any ideas as to how I could solve this issue?

Recommended answer

Data that falls along a straight line on a log-log scale follows a power relationship of the form y = k*x^(m). Taking the logarithm of both sides gives the linear equation that you are fitting, where the intercept c is log(k):

log(y) = m*log(x) + c

Calling np.polyfit(log(x), log(y), 1) returns the values of m and c. You can then use these values to calculate the fitted values log_y_fit:

log_y_fit = m*log(x) + c

and the fitted values that you want to plot against your original data are:

y_fit = exp(log_y_fit) = exp(m*log(x) + c)
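
As a quick check of the algebra above, the sketch below fits noise-free power-law data and recovers both parameters. The values k_true and m_true and all variable names are made up for illustration and are not taken from the question's data. Note that the intercept returned by np.polyfit is the logarithm of the multiplicative constant, so it has to be exponentiated to get the constant back.

```python
import numpy as np

# Hypothetical parameters for y = k * x**m (illustrative only)
k_true, m_true = 500.0, -0.8

x = np.arange(1, 101)                      # ranks 1..100, so log(x) is always finite
y = k_true * x.astype(float) ** m_true     # exact power-law data, no noise

# Fit a straight line in log space: log(y) = m*log(x) + c
m, c = np.polyfit(np.log(x), np.log(y), 1)

# Evaluate the line at log(x), then undo the log to return to the original scale
y_fit = np.exp(m * np.log(x) + c)

print("m = %.3f, k = %.3f" % (m, np.exp(c)))
```

Evaluating the fitted line at x instead of log(x), or plotting m*log(x) + c without the final exp, reproduces the two mistakes described next.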

So, the two problems you are having are that:

  1. you are calculating the fitted values using the original x coordinates, not the log(x) coordinates

  2. you are plotting the logarithm of the fitted y values without transforming them back to the original scale

I've addressed both of these in the code below by replacing plt.plot(z, np.poly1d(np.polyfit(logA, logB, 1))(z)) with:

m, c = np.polyfit(logA, logB, 1) # fit log(y) = m*log(x) + c
y_fit = np.exp(m*logA + c) # calculate the fitted values of y 
plt.plot(z, y_fit, ':')

This could be placed on one line as: plt.plot(z, np.exp(np.poly1d(np.polyfit(logA, logB, 1))(logA))), but I find that makes it harder to debug.

A few other things that are different in the code below:

  • You are using a list comprehension when you calculate logA from z to filter out any values < 1, but z is a linear range and only its first value is < 1. It seems easier to create z starting at 1, which is how I've coded it.

  • I'm not sure why you have the term x*log(x) in your list comprehension for logA. This looked like an error to me, so I didn't include it in the answer.
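
To illustrate why the ranks should start at 1, here is a small sketch comparing the two ranges. The value of n is a hypothetical stand-in for the number of unique URLs.

```python
import numpy as np

n = 5  # hypothetical number of unique URLs

with np.errstate(divide='ignore'):          # silence the log(0) warning for this demo
    log_from_zero = np.log(np.arange(n))    # ranks 0..n-1: first entry is log(0) = -inf

log_from_one = np.log(np.arange(1, n + 1))  # ranks 1..n: every entry is finite

print(log_from_zero[0])
print(np.isfinite(log_from_one).all())
```

A non-finite value in the inputs would break the least-squares fit, which is why the range is shifted rather than filtered.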

This code should work correctly for you:

fig=plt.figure()
ax = fig.add_subplot(111)

z=np.arange(1, len(x)+1) #start at 1, to avoid error from log(0)

logA = np.log(z) #no need for list comprehension since all z values >= 1
logB = np.log(y)

m, c = np.polyfit(logA, logB, 1) # fit log(y) = m*log(x) + c
y_fit = np.exp(m*logA + c) # calculate the fitted values of y 

plt.plot(z, y, color = 'r')
plt.plot(z, y_fit, ':')

ax.set_yscale('symlog')
ax.set_xscale('symlog')
#slope, intercept = np.polyfit(logA, logB, 1)
plt.xlabel("Pre_referer")
plt.ylabel("Popularity")
ax.set_title('Pre Referral URL Popularity distribution')
plt.show()

When I run it on simulated data, I get the following graph:
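
The simulated data behind that graph is not included in the answer. A minimal sketch of how comparable rank-frequency data could be generated is shown here; the constants, the noise model, the seed, and the placeholder URL names are all assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)                       # fixed seed so the sketch is repeatable

n_urls = 2465                           # number of unique URLs mentioned in the question
ranks = np.arange(1, n_urls + 1)

# Assumed power law with multiplicative log-normal noise (illustrative only)
noise = np.exp(np.random.normal(0.0, 0.1, n_urls))
counts = 5000.0 * ranks.astype(float) ** -1.0 * noise

# Frequencies sorted in descending order, as in the question's rank plot
y = np.sort(counts)[::-1]
x = ["url_%d" % i for i in ranks]       # placeholder names; the fit only uses len(x)

# Same log-log fit as in the answer's code
m, c = np.polyfit(np.log(ranks), np.log(y), 1)
print("fitted slope: %.2f" % m)
```

The resulting x and y can be fed to the plotting code above in place of the values parsed from the file.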

Notes:

  • The 'kinks' at the left and right ends of the line are the result of using "symlog", which treats very small values linearly, as described in the answers to "What is the difference between 'log' and 'symlog'?". If this data were plotted on "log-log" axes, the fitted data would be a straight line.

  • You might also want to read this answer: https://stackoverflow.com/a/3433503/7517724, which explains how to use weighting to achieve a "better" fit for log-transformed data.
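
A rough sketch of the weighting idea follows. It assumes one common choice, passing w=np.sqrt(y) to np.polyfit, so that the least-squares fit in log space does not over-emphasize the small y values. The data is synthetic, and whether this particular weighting suits your data is a judgment call.

```python
import numpy as np

np.random.seed(1)
z = np.arange(1, 201)

# Synthetic power-law data with additive noise (illustrative only)
y = 1000.0 * z.astype(float) ** -0.7 + np.random.normal(0.0, 1.0, z.size)
y = np.clip(y, 1.0, None)               # keep y positive so log(y) is defined

logz, logy = np.log(z), np.log(y)

m_plain, c_plain = np.polyfit(logz, logy, 1)            # unweighted fit
m_w, c_w = np.polyfit(logz, logy, 1, w=np.sqrt(y))      # weights favor large y

# Transform both fits back to the original scale for plotting
y_fit_plain = np.exp(m_plain * logz + c_plain)
y_fit_w = np.exp(m_w * logz + c_w)

print("unweighted slope: %.3f, weighted slope: %.3f" % (m_plain, m_w))
```

Either y_fit_plain or y_fit_w can be passed to plt.plot(z, ..., ':') in the same way as y_fit in the code above.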
