python pandas dataframe words in context: get 3 words before and after

Problem description

I am working in a Jupyter notebook and have a pandas dataframe "data":

Question_ID | Customer_ID | Answer
1           | 234         | Data is very important to use because ...
2           | 234         | We value data since we need it ...

I want to go through the text in column "Answer" and get the three words before and after the word "data". So in this scenario I would get "is very important" for the first row, and "We value" and "since we need" for the second.

Is there a good way to do this within a pandas dataframe? So far I have only found solutions where "Answer" would be its own file run through Python code (without a pandas dataframe). While I realize that I need to use the NLTK library, I haven't used it before, so I don't know what the best approach would be. (This was a great example: Extracting a word and its prior 10 word context to a dataframe in Python)

Recommended answer

This might work:

import pandas as pd
import re

df = pd.read_csv('data.csv')

for value in df.Answer.values:
    # split the text on "data" (either capitalization), dropping the keyword itself
    non_data = re.split('Data|data', value)
    # skip empty segments, e.g. when the text starts with "Data"
    terms_list = [term for term in non_data if len(term) > 0]
    # keep the first three words of each segment (note: for a segment that
    # precedes "data" this is its first three words, not the three closest)
    substrs = [term.split()[0:3] for term in terms_list]
    # join the words back into strings
    result = [' '.join(term) for term in substrs]
    print(result)

Output:

['is very important']
['We value', 'since we need']
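
The split-based loop above takes the first three words of every segment, so the "before" context is only right when fewer than four words precede "data". A token-based variant that keeps the words closest to the keyword on both sides might look like the sketch below. The helper name context_words, the "Context" column, and the simple \w+ tokenizer are assumptions made for illustration (they are not part of the original answer), and the sample dataframe is built inline from the two rows in the question instead of being read from data.csv.

```python
import re

import pandas as pd


def context_words(text, keyword="data", n=3):
    """Return a list of (before, after) strings, one pair per keyword occurrence.

    Hypothetical helper: tokenizes with a plain \\w+ regex, which drops
    punctuation and splits contractions such as "don't" into two tokens.
    """
    tokens = re.findall(r"\w+", text)
    windows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # up to n words immediately before the keyword
            before = " ".join(tokens[max(0, i - n):i])
            # up to n words immediately after the keyword
            after = " ".join(tokens[i + 1:i + 1 + n])
            windows.append((before, after))
    return windows


# Sample rows copied from the question
df = pd.DataFrame({
    "Question_ID": [1, 2],
    "Customer_ID": [234, 234],
    "Answer": [
        "Data is very important to use because ...",
        "We value data since we need it ...",
    ],
})

# One list of (before, after) pairs per row
df["Context"] = df["Answer"].apply(context_words)
print(df["Context"].tolist())
# [[('', 'is very important')], [('We value', 'since we need')]]
```

Keeping the result as a list of pairs per row means an answer that mentions "data" several times yields one pair per mention, and an empty string marks a side with no words (as in the first row, where "Data" starts the sentence).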
