Create advanced frequency table with Python

Views: 289 | Published: 2017/3/26 2:30:38 | Tags: python pandas dataframe word-frequency

Question

I am trying to make a frequency table based on a dataframe with pandas and Python. In fact, it's exactly the same as a previous question of mine, which used R.

Let's say that I have a dataframe in pandas that looks like this (the actual dataframe is much larger, but I limited the rows for illustrative purposes):

node    |   precedingWord
-------------------------
A-bom       de
A-bom       die
A-bom       de
A-bom       een
A-bom       n
A-bom       de
acroniem    het
acroniem    t
acroniem    het
acroniem    n
acroniem    een
act         de
act         het
act         die
act         dat
act         t
act         n

I'd like to use these values to count the precedingWords per node, but grouped into subcategories: one column titled neuter, another titled non-neuter, and a last one titled rest. neuter would count all rows for which precedingWord is one of these values: t, het, dat. non-neuter would count de and die, and rest would count everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic, in other words if rest were derived as the inverse of whatever is used for neuter and non-neuter, or simply computed by subtracting the neuter and non-neuter counts from the number of rows with that node.)

Example output (in a new dataframe, let's say freqDf) would look like this:

node    |   neuter   | nonNeuter   | rest
-----------------------------------------
A-bom       0          4             2
acroniem    3          0             2
act         3          2             1

I found an answer to a similar question, but the use case isn't exactly the same. It seems to me that in that question all variables are independent. In my case, however, I have multiple rows with the same node, which should all be reduced to a single row of frequencies, as shown in the expected output above.

I thought of something like this (untested):

def specificFreq(d):  
    for uniqueWord in d['node']
        return pd.Series({'node': uniqueWord ,
            'neuter': sum(d['node' == uniqueWord] & d['precedingWord'] == 't|het|dat'),
            'nonNeuter':  sum(d['node' == uniqueWord] & d['precedingWord'] == 'de|die'),
            'rest': len(uniqueWord) - neuter - nonNeuter}) # Number of rows with the specific word, minus the neuter and nonNeuter counts above

df.groupby('node').apply(specificFreq)

But I highly doubt this is the correct way of doing something like this.

Recommended answer

As proposed in the R solution, you can first label each precedingWord with its category and then perform the cross tabulation:

import pandas as pd

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
# neuter + non_neuter is the concatenation of both lists; ~ negates the mask.
df.loc[~df.precedingWord.isin(neuter + non_neuter), "gender"] = "rest"

pd.crosstab(df.node, df.gender)
gender    neuter  non_neuter  rest
node                              
A-bom          0           4     2
acroniem       3           0     2
act            3           2     1
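
The same labelling step can also be written with a single lookup table, which makes rest fully dynamic: any word not listed falls into it automatically. This is only a minimal sketch, not part of the original answer. The `categories` dictionary and the variable names `gender` and `freq` are chosen here for illustration, and the sample rows are copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
rows = [
    ("A-bom", "de"), ("A-bom", "die"), ("A-bom", "de"),
    ("A-bom", "een"), ("A-bom", "n"), ("A-bom", "de"),
    ("acroniem", "het"), ("acroniem", "t"), ("acroniem", "het"),
    ("acroniem", "n"), ("acroniem", "een"),
    ("act", "de"), ("act", "het"), ("act", "die"),
    ("act", "dat"), ("act", "t"), ("act", "n"),
]
df = pd.DataFrame(rows, columns=["node", "precedingWord"])

# Hypothetical lookup table: word -> category.
categories = {
    "t": "neuter", "het": "neuter", "dat": "neuter",
    "de": "non_neuter", "die": "non_neuter",
}

# Words missing from the lookup map to NaN, which fillna turns into "rest".
gender = df["precedingWord"].map(categories).fillna("rest").rename("gender")

# One row per node, one column per category.
freq = pd.crosstab(df["node"], gender)
print(freq)
```

Adding a new category then only requires new entries in `categories`; the rest column adjusts without further changes.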

This one is better because if a word in neuter or non_neuter is not present in precedingWord, it won't raise a KeyError, unlike the earlier solution shown below.

Less clean alternative:

Given your dataframe, you can make a simple cross tabulation:

ct = pd.crosstab(df.node, df.precedingWord)

Which gives:

pW        dat  de  die  een  het  n  t
node                                  
A-bom       0   3    1    1    0  1  0
acroniem    0   0    0    1    2  1  1
act         1   1    1    0    1  1  1

Then, you just want to sum certain columns together:

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]
freqDf = pd.DataFrame()

# Sum the neuter columns, then drop them from the cross tabulation.
freqDf["neuter"] = ct[neuter].sum(axis=1)
ct.drop(neuter, axis=1, inplace=True)

freqDf["non_neuter"] = ct[non_neuter].sum(axis=1)
ct.drop(non_neuter, axis=1, inplace=True)

# Whatever columns remain belong to "rest".
freqDf["rest"] = ct.sum(axis=1)

Which gives you freqDf:

          neuter  non_neuter  rest
node                              
A-bom          0           4     2
acroniem       3           0     2
act            3           2     1
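
The KeyError risk of this column-summing approach can be illustrated with a smaller dataframe in which some of the listed words never occur. The sketch below is an illustration under assumptions rather than part of the original answer: the three-row dataframe is made up, and using `reindex` with `fill_value=0` is one possible workaround, relying on recent pandas versions raising a KeyError when any requested column label is missing.

```python
import pandas as pd

# Made-up data in which "t" and "dat" never occur.
df = pd.DataFrame({
    "node": ["A-bom", "A-bom", "act"],
    "precedingWord": ["de", "een", "het"],
})
neuter = ["t", "het", "dat"]

ct = pd.crosstab(df["node"], df["precedingWord"])

# Selecting columns that do not all exist raises a KeyError.
try:
    ct[neuter]
    raised = False
except KeyError:
    raised = True

# reindex adds the missing columns filled with zeros instead of raising.
neuter_counts = ct.reindex(columns=neuter, fill_value=0).sum(axis=1)
print(raised)
print(neuter_counts)
```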

HTH
