Best way to change words into numbers using specific word list


Question


I have a text file that contains one tweet per line, and the tweets need to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) for a lot of my string handling, and I'm getting the hang of sed, grep and Python's re module. This next problem, however, has me stumped, and I'm wondering if anyone could help me with it. I have tried a few Google searches, but tbh no luck :(

I always start with pseudocode to make it easier on myself, and this is what I want: "Replace -token1- OR -token2- OR -token3- OR -token4- with integer '1', replace all other words/tokens with integer '0'."

Let's say the list of words/tokens that need to become '1' is the following:

  • :)
  • cool
  • happy
  • fun

and my tweets look like this:

  • this has been a fun day :)
  • i find python cool! it makes me happy

The output of the new program/function would be:

  • 0 0 0 0 1 0 1
  • 0 0 0 1 0 0 0 1

NOTE1: Notice that 'cool' has a '!' right after it; it should still count as a match. I could always remove all punctuation from the file first to make this easier.

NOTE2: All tweets will be lowercase; I already have a function that converts every line to lowercase.

Does anyone know how to do this using Unix regex tools (such as sed, grep, awk), or how to do it in Python? BTW this is NOT homework; I'm working on a sentiment analysis program and am experimenting a bit.

Thanks! :)

Solution

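from string import punctuation as pnc

# Words/emoticons that map to 1; everything else maps to 0.
tokens = {':)', 'cool', 'happy', 'fun'}
tweets = ['this has been a fun day :)', 'i find python cool! it makes me happy']

for tweet in tweets:
    # Test the raw word first (so ':)' matches), then the word with
    # surrounding punctuation stripped (so 'cool!' matches 'cool').
    s = [(word in tokens or word.strip(pnc) in tokens) for word in tweet.split()]
    print(' '.join('1' if t else '0' for t in s))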

Output:

0 0 0 0 1 0 1
0 0 0 1 0 0 0 1

The "or" in the list comprehension is there to handle :), as suggested by @EOL.
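
To see why both tests are needed, here is a small check (the variable names are only for illustration). Stripping punctuation rescues cool! but wipes out the smiley, which consists entirely of punctuation characters:

```python
from string import punctuation

tokens = {':)', 'cool', 'happy', 'fun'}

# Stripping punctuation turns 'cool!' into a word that is in the set...
stripped_cool = 'cool!'.strip(punctuation)
# ...but reduces ':)' to an empty string, so the raw word must be tested too.
stripped_smiley = ':)'.strip(punctuation)

print(repr(stripped_cool), stripped_cool in tokens)
print(repr(stripped_smiley), stripped_smiley in tokens, ':)' in tokens)
```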

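There are still cases that will not be handled correctly, such as "cool :), I like it". There the smiley is followed by a comma, so the word is ":),", which is not in the token set, and stripping punctuation from it leaves an empty string. The problem is inherent to the requirements.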
