Convert dataframe column to list of lists and convert back to dataframe, while maintaining ID association

Problem description

I have a dataframe that consists of two columns: ID and TEXT. Sample data is below:

ID    TEXT
1     The farmer plants grain. The fisher catches tuna.
2     The sky is blue.
2     The sun is bright.
3     I own a phone. I own a book.

I am cleansing the TEXT column with nltk, so I need to convert the TEXT column to a list:

corpus = df['TEXT'].tolist()

After the cleansing (tokenization, removing special characters, and removing stopwords), the output is a "list of lists", with one list of token lists per original row, and looks like this:

[[['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']],
[['sky', 'blue']],
[['sun', 'bright']],
[['I', 'own', 'phone'], ['I', 'own', 'book']]]

I know how to get a list back into a pandas dataframe, but how do I get the list of lists back into a pandas dataframe with the ID column still assigned to the text? My desired output is:

ID    TEXT
1     'farmer', 'plants', 'grain'
1     'fisher', 'catches', 'tuna'
2     'sky', 'blue'
2     'sun', 'bright'
3     'I', 'own', 'phone'
3     'I', 'own', 'book'

I'm assuming it is something simple related to conversion between Python data structures, but I'm not sure where to start. The specific work product here is less important than the concept: dataframe --> native Python data structure --> do something to the native Python data structure --> dataframe with original attributes intact.

Any insight you can provide is greatly appreciated! Please let me know if I can improve my question at all!

Recommended answer

Pandas dataframes offer a lot of quick across-the-board operations, but it is indeed much easier to get your hands on your data if it's not stuffed in a dataframe, especially if you're just getting started. I certainly recommend working outside the dataframe if you'll be using nltk. To keep the text and IDs together, convert your dataframe into a list of tuples. If your dataframe really has only two meaningful columns, you can do it like this:

>>> data = list(zip(df["ID"], df["TEXT"]))
>>> from pprint import pprint
>>> pprint(data)
[(265, 'The farmer plants grain. The fisher catches tuna.'),
 (456, 'The sky is blue.'),
 (434, 'The sun is bright.'),
 (921, 'I own a phone. I own a book.')]

Now, if you want to work with your sentences without losing the ids, use a two-variable loop like this. (This creates the extra rows you were asking for.)

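import nltk  # sent_tokenize needs the "punkt" tokenizer data to be installed

sent_data = []
for id, text in data:
    # One (id, sentence) tuple per sentence, so each sentence keeps its id
    for sent in nltk.sent_tokenize(text):
        sent_data.append((id, sent))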
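
The same loop can be tried without installing nltk. The following is only a minimal sketch: `split_sentences` is a hypothetical regex-based stand-in for `nltk.sent_tokenize` (it will mis-split abbreviations and the like), and the sample tuples reuse the IDs from the question rather than the ones printed above.

```python
import re

# Same shape as `data` in the answer: a list of (id, text) tuples.
data = [
    (1, "The farmer plants grain. The fisher catches tuna."),
    (2, "The sky is blue."),
    (2, "The sun is bright."),
    (3, "I own a phone. I own a book."),
]


def split_sentences(text):
    """Hypothetical stand-in for nltk.sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sent_data = []
for id_, text in data:
    # Each sentence is stored together with the id of the row it came from.
    for sent in split_sentences(text):
        sent_data.append((id_, sent))

for row in sent_data:
    print(row)
```

Because the id travels inside each tuple, rows that share an id (here the two rows with id 2) stay distinct and no lookup back into the dataframe is needed.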

What you do next depends on your application; you'll probably create a new list of two-element tuples. If you're just applying a transformation, use a list comprehension. For example:

>>> datawords = [ (id, nltk.word_tokenize(t)) for id, t in data ]
>>> print(datawords[3])
(921, ['I', 'own', 'a', 'phone', '.', 'I', 'own', 'a', 'book', '.'])
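
The cleansing step from the question fits the same list-comprehension pattern. The sketch below is an assumption-laden illustration, not the asker's actual pipeline: `clean_tokens` and the three-word `STOPWORDS` set are hypothetical stand-ins for nltk's tokenizer and stopword corpus, and the input is a hand-written list of (id, sentence) tuples.

```python
import re

# Hypothetical stopword set for this sketch; a real stopword corpus is far larger.
STOPWORDS = {"the", "a", "is"}


def clean_tokens(sentence):
    """Keep alphabetic tokens only and drop stopwords (case-insensitive)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return [w for w in words if w.lower() not in STOPWORDS]


# (id, sentence) tuples, i.e. the shape produced by the sentence loop.
sent_data = [
    (1, "The farmer plants grain."),
    (1, "The fisher catches tuna."),
    (2, "The sky is blue."),
    (2, "The sun is bright."),
    (3, "I own a phone."),
    (3, "I own a book."),
]

# Apply the transformation while carrying the id along unchanged.
clean_data = [(id_, clean_tokens(sent)) for id_, sent in sent_data]

for row in clean_data:
    print(row)
```

With these inputs, each tuple pairs an id with the token list shown in the question's desired output.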

Turning a list of tuples back into a dataframe is as simple as it gets:

newdf = pd.DataFrame(datawords, columns=["INDEX", "WORDS"])
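
For the exact structure in the question (a nested list with one entry per original row), a possible round trip is sketched below. The `cleaned` list is copied from the question; the dataframe is rebuilt by hand, and the column names "ID" and "TEXT" follow the question rather than the line above. The second route uses `DataFrame.explode`, which is a pandas-native alternative and not part of the answer's approach.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "TEXT": [
        "The farmer plants grain. The fisher catches tuna.",
        "The sky is blue.",
        "The sun is bright.",
        "I own a phone. I own a book.",
    ],
})

# Cleaned corpus from the question: for each row, a list of token lists.
cleaned = [
    [["farmer", "plants", "grain"], ["fisher", "catches", "tuna"]],
    [["sky", "blue"]],
    [["sun", "bright"]],
    [["I", "own", "phone"], ["I", "own", "book"]],
]

# Route 1: pair each row's ID with each of its sentences, then rebuild.
rows = [(id_, tokens) for id_, sents in zip(df["ID"], cleaned) for tokens in sents]
newdf = pd.DataFrame(rows, columns=["ID", "TEXT"])

# Route 2: put the nested lists back in the column and let pandas expand them.
exploded = df.assign(TEXT=cleaned).explode("TEXT").reset_index(drop=True)

print(newdf)
```

Both routes rely on `cleaned` being in the same order as the rows of `df`, which holds as long as the list came from `df['TEXT'].tolist()` and was not reordered or filtered along the way.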
