Append values to one dataframe based on their frequency in another dataframe


Question

I have two dataframes. df1 is the product of a groupby, df.groupby('keyword'):

df1

keyword     string

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

and df2, which is an empty dataframe. I also have a list of specific values:

keyword_list = ['string', 'test']

Basically, I would like to count how often each word in keyword_list appears in df1's string column for each keyword, and then write the most frequent word to a specific column of the new dataframe. So df2's 'A' row gets the word that occurs most often in the strings under df1's keyword 'A'.

Ideally, since 'string' is the most frequent word under keyword A in df1, A gets assigned 'string', and so on.

df2

keyword    High_freq_word

   A         "string"
   B         "test"
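
The counting described above can be sketched in plain Python, without pandas. This is only an illustration: the `groups` dict is a hand-written stand-in for df1, and `counts` and `high_freq_word` are names made up for this example. Note that plain substring matching also counts the "string" inside "strings".

```python
import re
from collections import Counter

keyword_list = ["string", "test"]

# Hand-written stand-in for df1: keyword -> list of strings (assumption).
groups = {
    "A": [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
    ],
    "B": [
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
}

# One alternation pattern for all words; re.escape guards special characters.
pattern = re.compile("|".join(map(re.escape, keyword_list)))

# Count every match per keyword across all of that keyword's strings.
counts = {
    keyword: Counter(word for text in texts for word in pattern.findall(text))
    for keyword, texts in groups.items()
}

# Keep only the most frequent word for each keyword.
high_freq_word = {kw: c.most_common(1)[0][0] for kw, c in counts.items()}

print(counts["A"])      # Counter({'string': 6, 'test': 3})
print(high_freq_word)   # {'A': 'string', 'B': 'test'}
```

If whole-word matching is wanted instead, wrapping the alternation in `\b(?:...)\b` would stop "strings" from counting as "string".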

Let me know if this makes sense or if you need any clarification!

Update:

@anky_91 provided some awesome code; however, the output is a little awkward:

df['matches'] = df.description.str.findall('|'.join(keyword_list))

df.groupby(odf.Type.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

which gives you

df1

keyword     string                                                     

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

However, it also adds a new column:

matches

['string','test']
['test', 'string', 'string']
[etc...]

I can figure out a way to convert it numerically and then assign that value to the column, but the bigger issue is appending this new column to the new dataframe.

Since it is a groupby, there are several duplicate values. I'm trying to find a pythonic way of mapping the "most frequent word" to just the keyword itself, instead of the entire mode based on the list of keywords.

Recommended answer

From what I understand, you can do something like:

from itertools import chain
from scipy.stats import mode

keyword_list = ['string', 'test']
df['matches'] = df.string.str.findall('|'.join(keyword_list))  # find all matches in each row
# forward-fill the blank keyword cells, then take the most common match per keyword
df.groupby(df.keyword.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

keyword
A    string
B      test
Name: matches, dtype: object
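
To get the result as the separate df2 asked for in the question, rather than as an extra column on df, one possible variant is sketched below. It is not the answerer's code: it swaps scipy.stats.mode for collections.Counter (recent SciPy releases reportedly no longer accept non-numeric input to mode), the helper name `most_frequent` is made up for this example, and the dataframe is a reconstruction of df1 with the blank keyword cells already filled in.

```python
from collections import Counter

import pandas as pd

keyword_list = ["string", "test"]

# Reconstruction of df1 from the question (assumption: keywords filled in).
df = pd.DataFrame({
    "keyword": ["A", "A", "A", "B", "B", "B"],
    "string": [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})

def most_frequent(matches):
    """Return the most common word across one group's lists of matches."""
    counts = Counter(word for row in matches for word in row)
    return counts.most_common(1)[0][0] if counts else None

# Find the matches per row, group them by keyword, and reduce each group
# to a single word. df itself is left without a 'matches' column.
df2 = (
    df["string"].str.findall("|".join(keyword_list))
    .groupby(df["keyword"])
    .apply(most_frequent)
    .rename("High_freq_word")
    .reset_index()
)

print(df2)
```

Because the matches are never assigned back to df, the original dataframe keeps its two columns, and df2 has one row per keyword with the columns keyword and High_freq_word.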
