Constructing Zipf Distribution with matplotlib, FITTED-LINE


Problem description

I have a list of paragraphs, and I want to plot a Zipf distribution for their combined text.

My code is below:

from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

# targeted_paragraphs is the list of paragraph strings (defined elsewhere)
combined_text = " ".join(targeted_paragraphs)

# Count tokens over the combined text in a single pass
frequency = Counter(combined_text.split())
tokens = list(frequency.keys())
counts = np.array(list(frequency.values()))

# Sort counts in descending order and pair them with ranks 1..N
ranks = np.arange(1, len(counts) + 1)
indices = np.argsort(-counts)
frequencies = counts[indices]

plt.loglog(ranks, frequencies, marker=".")
plt.title("Zipf plot for Combined Article Paragraphs")
plt.xlabel("Frequency Rank of Token")
plt.ylabel("Absolute Frequency of Token")
plt.grid(True)

# Label a log-spaced subset of points with their tokens (duplicates removed)
for n in sorted(set(np.logspace(-0.5, np.log10(len(counts) - 1), 20).astype(int))):
    plt.text(ranks[n], frequencies[n], " " + tokens[indices[n]],
             verticalalignment="bottom",
             horizontalalignment="left")
plt.show()

Purpose: I am trying to draw a fitted line on this graph and assign its value to a variable. However, I do not know how to add it. Any help with both of these issues would be much appreciated.

Solution

I know it has been a while since this question was asked. However, I came across a possible solution to this problem on the SciPy site, and I thought I would post it here in case anyone else needs it.

I didn't have the paragraph data, so here is a quickly made dict called frequency that has paragraph occurrence counts as its values.

We then get its values and convert them to a NumPy array, and define the Zipf distribution parameter, which has to be > 1.

Finally, we display the histogram of the samples, along with the probability density function.
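The requirement that the parameter be greater than 1 comes from the normalizing constant. As a brief reminder (this is the standard textbook definition, not something stated in the original answer), the Zipf probability mass function over ranks k is:

```latex
p(k) = \frac{k^{-a}}{\zeta(a)}, \qquad
\zeta(a) = \sum_{n=1}^{\infty} n^{-a}, \qquad
k = 1, 2, 3, \dots
```

The series for \zeta(a) converges only when a > 1, so smaller values of a do not give a valid distribution.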

Working Code:

import random

import matplotlib.pyplot as plt
import numpy as np
from scipy import special

# Generate a sample dict with random values to simulate paragraph data
frequency = {i: random.randint(1, 50) for i in range(50)}

# Convert the counts to a NumPy array
# (dict views must be wrapped in list() on Python 3)
s = np.array(list(frequency.values()))

# Zipf distribution parameter; has to be > 1
a = 2.0

# Display the histogram of the samples,
# along with the probability density function.
# density=True replaces the normed=True argument removed from newer matplotlib.
count, bins, ignored = plt.hist(s, 50, density=True)
x = np.arange(1.0, 50.0)
# Zipf curve x**(-a) / zeta(a); the constant cancels after dividing by max(y)
y = x ** (-a) / special.zeta(a, 1)
plt.plot(x, y / max(y), linewidth=2, color="r")
plt.title("Zipf plot for Combined Article Paragraphs")
plt.xlabel("Frequency Rank of Token")
plt.ylabel("Absolute Frequency of Token")
plt.show()

[Plot: histogram of the sample counts with the normalized Zipf curve overlaid in red]
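Note that the code above overlays a Zipf curve with a fixed parameter rather than estimating one from the data. For the original request (a fitted line whose values are kept in a variable), one simple option is an ordinary least-squares fit in log-log space with np.polyfit. The sketch below is only an illustration under that assumption; the helper names rank_frequencies, fit_loglog_line, fitted_values and plot_with_fit are hypothetical and do not come from the question or the answer.

```python
from collections import Counter

import numpy as np


def rank_frequencies(text):
    """Return (ranks, frequencies), with frequencies sorted in descending order."""
    counts = sorted(Counter(text.split()).values(), reverse=True)
    frequencies = np.array(counts, dtype=float)
    ranks = np.arange(1, len(frequencies) + 1, dtype=float)
    return ranks, frequencies


def fit_loglog_line(ranks, frequencies):
    """Least-squares fit of log10(f) = slope * log10(rank) + intercept."""
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)
    return slope, intercept


def fitted_values(ranks, slope, intercept):
    """Evaluate the fitted line on the original (non-log) scale."""
    return 10 ** intercept * ranks ** slope


def plot_with_fit(ranks, frequencies):
    """Draw the rank-frequency points and the fitted line; return the line values."""
    # Imported here so the fitting helpers can be used without a plotting backend
    import matplotlib.pyplot as plt

    slope, intercept = fit_loglog_line(ranks, frequencies)
    fitted = fitted_values(ranks, slope, intercept)
    plt.loglog(ranks, frequencies, marker=".", linestyle="none", label="observed")
    plt.loglog(ranks, fitted, color="r", label=f"fit: slope = {slope:.2f}")
    plt.xlabel("Frequency Rank of Token")
    plt.ylabel("Absolute Frequency of Token")
    plt.legend()
    plt.show()
    return fitted
```

With the question's data this could be called as ranks, freqs = rank_frequencies(" ".join(targeted_paragraphs)) followed by fitted = plot_with_fit(ranks, freqs), after which fitted holds the line's values. A least-squares fit on log-transformed data is a rough estimate of the exponent, so treat the slope as indicative rather than exact.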
