Importing a txt file with a dictionary script and applying it to a dataframe to replace words
I am trying to replace certain strings within a column in a dataframe using a txt file.
I have a dataframe that looks like the following (this is a very small version of a massive dataframe that I have).
coffee_directions_df
Utterance                            Frequency
Directions to Starbucks              1045
Directions to Tullys                 1034
Give me directions to Tullys         986
Directions to Seattles Best          875
Show me directions to Dunkin         812
Directions to Daily Dozen            789
Show me directions to Starbucks      754
Give me directions to Dunkin         612
Navigate me to Seattles Best         498
Display navigation to Starbucks      376
Direct me to Starbucks               201
The DF shows utterances made by people and the frequency of utterances.
I.e., "Directions to Starbucks" was uttered 1045 times.
I have another DataFrame in xlsx format, coffee_donut.xlsx, that I want to import and use to replace certain strings (similar to what the question "Replace words by checking from pandas dataframe" asked).
coffee_donut
Token          Synonyms
Starbucks      Coffee
Tullys         Coffee
Seattles Best  Coffee
Dunkin         Donut
Daily Dozen    Donut
And ultimately, I want the dataframe to look like this:
coffee_donut_df
Utterance                            Frequency
Directions to Coffee                 1045
Directions to Coffee                 1034
Give me directions to Coffee         986
Directions to Coffee                 875
Show me directions to Donut          812
Directions to Donut                  789
...
I followed the previous question's steps, but I got stuck at the last part:
import re
import pandas as pd
sdf = pd.read_excel(r'C:\coffee_donut.xlsx')  # raw string keeps the backslash literal
rep = dict(zip(sdf.Token, sdf.Synonyms)) #convert into dictionary
rep = dict((re.escape(k), v) for k, v in rep.items())  # items() replaces Python 2's iteritems()
pattern = re.compile("|".join(rep.keys()))
rep = pattern.sub(lambda m: rep[re.escape(m.group(0))], **coffee_directions_df**)
print(rep)
How do I apply the rep to the dataframe?? I'm so sorry if this is such a noob question. I really appreciate your help.
Thanks!!
You almost had it! Here's a solution that reuses the regex object and lambda function in your current code.
Instead of your last line (rep = pattern.sub(...)), run this:
# regex=True is needed on pandas 2.0+, where str.replace defaults to regex=False
coffee_directions_df['Utterance'] = \
    coffee_directions_df['Utterance'].str.replace(pattern, lambda m: rep[m.group(0)], regex=True)
# Confirm replacement
coffee_directions_df
   Utterance                        Frequency
0  Directions to Coffee             1045
1  Directions to Coffee             1034
2  Give me directions to Coffee     986
3  Directions to Seattles Best      875
...
This works because pd.Series.str.replace accepts a compiled regex object as the pattern and a callable as the replacement; see the pandas documentation for details.
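For anyone who wants to try this without the Excel file, here is a minimal end-to-end sketch. It makes a few assumptions that are not in the original post: both tables are built in memory instead of being read from coffee_donut.xlsx, the helper name replace_tokens is hypothetical, and a reasonably recent pandas is used (0.23 or later, so that the regex keyword exists). Unlike the code in the question, the dictionary keys are left unescaped and re.escape is applied only when building the pattern, so the lookup rep[m.group(0)] does not depend on how re.escape treats spaces in tokens such as "Seattles Best".

```python
import re

import pandas as pd


def replace_tokens(df, column, mapping):
    """Return a copy of df where each key of `mapping` found in `column`
    is replaced by its value. (Hypothetical helper, for illustration only.)"""
    # Try longer tokens first so a long token wins over a shorter one it contains.
    tokens = sorted(mapping, key=len, reverse=True)
    # Escape only for the pattern; the dictionary keeps the plain tokens as keys.
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    out = df.copy()
    out[column] = out[column].str.replace(
        pattern,
        lambda m: mapping[m.group(0)],  # look up the matched text as-is
        regex=True,  # must be explicit on pandas 2.0+, where the default is False
    )
    return out


# In-memory stand-ins for the two tables shown in the question.
coffee_directions_df = pd.DataFrame({
    "Utterance": [
        "Directions to Starbucks",
        "Directions to Tullys",
        "Give me directions to Tullys",
        "Directions to Seattles Best",
        "Show me directions to Dunkin",
        "Directions to Daily Dozen",
    ],
    "Frequency": [1045, 1034, 986, 875, 812, 789],
})
synonyms_df = pd.DataFrame({
    "Token": ["Starbucks", "Tullys", "Seattles Best", "Dunkin", "Daily Dozen"],
    "Synonyms": ["Coffee", "Coffee", "Coffee", "Donut", "Donut"],
})

# Same dict(zip(...)) step as in the question, minus the re.escape on the keys.
rep = dict(zip(synonyms_df["Token"], synonyms_df["Synonyms"]))

coffee_donut_df = replace_tokens(coffee_directions_df, "Utterance", rep)
print(coffee_donut_df)
```

The helper returns a copy rather than modifying the input, which makes it easy to compare before and after; assign the result back to coffee_directions_df if you want the in-place behavior of the answer above.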