Passing a pandas dataframe column to an NLTK tokenizer

Posted: 2021/6/7 20:43:03 | Tags: python string pandas nltk tokenize

Question

I have a pandas DataFrame raw_df with two columns, ID and sentences. I need to convert each sentence to a string. The code below produces no errors and reports the dtype of the sentences column as "object".

raw_df['sentences'] = raw_df.sentences.astype(str)
raw_df.sentences.dtypes

Out: dtype('O')

Then I try to tokenize the sentences and get a TypeError saying the method expects a string or bytes-like object. What am I doing wrong?

raw_sentences=tokenizer.tokenize(raw_df)

The same TypeError occurs with:

raw_sentences = nltk.word_tokenize(raw_df)

Recommended answer

I'm assuming this is an NLTK tokenizer. As far as I know, these take a single sentence (a string) as input and return a list of tokenised words as output.
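To make that input/output contract concrete, here is a minimal sketch on a single string. The question does not say which tokenizer `tokenizer` refers to, so `RegexpTokenizer` with the pattern `\w+` is an assumption, chosen because it needs no downloaded NLTK data. The sample sentence is made up.

```python
from nltk.tokenize import RegexpTokenizer

# Assumption: a simple regex-based tokenizer stands in for the
# unspecified `tokenizer` from the question.
tokenizer = RegexpTokenizer(r"\w+")

sentence = "The quick brown fox jumps."  # made-up sample text
tokens = tokenizer.tokenize(sentence)    # one str in, a list of str out
print(tokens)
```

Other NLTK tokenizers split text differently (for example, some keep punctuation as separate tokens), but the shape of the call is the same: one string in, one list out.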

What you're passing is raw_df, a pd.DataFrame object, not a str. The tokenizer will not apply itself row-wise unless you tell it to. pandas provides the apply method for exactly that:

raw_df['tokenized_sentences'] = raw_df['sentences'].apply(tokenizer.tokenize)

Assuming this works without any hitches, tokenized_sentences will be a column of lists, one list of tokens per sentence.
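The whole flow can be sketched end to end as follows. This is an illustration under assumptions, not the asker's actual setup: the two-row DataFrame is invented, and `RegexpTokenizer(r"\w+")` again stands in for the unspecified `tokenizer`. It first reproduces the TypeError by passing the DataFrame itself, then applies the tokenizer to the column cell by cell.

```python
import pandas as pd
from nltk.tokenize import RegexpTokenizer

# Assumption: stand-in tokenizer; the question does not name the real one.
tokenizer = RegexpTokenizer(r"\w+")

# Made-up data with the same column names as in the question.
raw_df = pd.DataFrame({
    "ID": [1, 2],
    "sentences": ["Hello world.", "Pandas meets NLTK!"],
})

# Passing the whole DataFrame fails: the tokenizer expects one string.
try:
    tokenizer.tokenize(raw_df)
except TypeError as exc:
    print("TypeError:", exc)

# Apply the tokenizer to each cell of the column instead.
raw_df["tokenized_sentences"] = raw_df["sentences"].apply(tokenizer.tokenize)
print(raw_df["tokenized_sentences"].tolist())
```

Passing the bound method `tokenizer.tokenize` to `apply` avoids writing a lambda; `apply` calls it once per cell and collects the returned lists into the new column.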

Since you're performing text processing on DataFrames, I'd recommend taking a look at another answer of mine: "Applying NLTK-based text pre-proccessing on a pandas dataframe".
