How to properly loop through two files comparing strings in both files against each other


Problem description


I am having trouble doing a sentiment analysis of tweets (file 1, standard Twitter JSON response) against a list of words (file 2, tab-delimited, two columns) that each have a sentiment assigned to them (either positive or negative).

The problem is that the top loop only runs once and then the script ends. I loop through file 1 and, nested within that loop, I loop through file 2, trying to compare the two and keep a running sum of the combined sentiment for each tweet.

So I have:

def get_sentiments(tweet_file, sentiment_file):


    sent_score = 0
    for line in tweet_file:

        document = json.loads(line)
        tweets = document.get('text')

        if tweets != None:
            tweet = str(tweets.encode('utf-8'))

            #print tweet


            for z in sentiment_file:
                line = z.split('\t')
                word = line[0].strip()
                score = int(line[1].rstrip('\n').strip())

                #print score



                if word in tweet:
                    print "+++++++++++++++++++++++++++++++++++++++"
                    print word, tweet
                    sent_score += score



            print "====", sent_score, "====="

    #PROBLEM, IT'S ONLY DOING THIS FOR THE FIRST TWEET

file1 = open('tweetsfile.txt')
file2 = open('sentimentfile.txt')


get_sentiments(file1, file2)

I've spent the better part of a day trying to figure out why it prints out all the tweets when the nested for loop over file2 is removed, but with that loop in place it only processes the first tweet and then exits.

Solution

The reason it's only doing it once is that the inner for loop, the one over the sentiment file, has reached the end of that file, so it stops since there are no more lines to read.

In other words, the first time the inner loop runs, it steps through the entire sentiment file. A file object remembers its position, so on every later tweet there are no more lines to read (it has already reached the end of the file) and the inner loop body never runs again. The result is that only the first tweet is actually compared against the word list.
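The exhaustion behaviour is easy to reproduce in isolation. The sketch below is only an illustration: it uses io.StringIO as a stand-in for an opened text file (an assumption made so the example needs no files on disk), and the two sample lines are made up.

```python
import io

# io.StringIO stands in for an opened text file; it is iterated
# line by line and tracks its position the same way a file does.
sentiment_file = io.StringIO("good\t1\nbad\t-1\n")

first_pass = [line for line in sentiment_file]   # consumes both lines
second_pass = [line for line in sentiment_file]  # position is at the end: nothing left

print(len(first_pass), len(second_pass))
```

The second pass produces an empty list without raising any error, which is why the script appears to stop silently.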

So one way to solve this is to "rewind" the file before each pass. You can do that with the seek method of the file object: calling seek(0) moves the position back to the start of the file.
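A minimal sketch of the rewind approach, assuming the same two-column, tab-delimited word list as in the question. The sample data and the use of io.StringIO in place of real files are illustrative assumptions, and unlike the question's code this version resets the score for each tweet and returns the scores instead of printing them.

```python
import io
import json

def get_sentiments(tweet_file, sentiment_file):
    """Return one score per tweet, rewinding sentiment_file for every tweet."""
    results = []
    for line in tweet_file:
        text = json.loads(line).get('text')
        if text is None:
            continue
        sentiment_file.seek(0)  # rewind so the inner loop sees every line again
        sent_score = 0
        for entry in sentiment_file:
            word, score = entry.rstrip('\n').split('\t')
            if word.strip() in text:  # substring test, as in the question
                sent_score += int(score)
        results.append(sent_score)
    return results

# Made-up sample data standing in for tweetsfile.txt and sentimentfile.txt
tweets = io.StringIO('{"text": "good day"}\n{"text": "bad day"}\n')
sentiments = io.StringIO("good\t1\nbad\t-1\n")
per_tweet = get_sentiments(tweets, sentiments)
print(per_tweet)
```

This works, but it re-reads and re-parses the whole word list once per tweet, which is wasteful for anything beyond small files.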

If your files aren't big, another approach is to read them all into a list or similar structure and then loop through it.
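As a rough sketch of the list approach: the word list is parsed once into (word, score) pairs and the resulting list is then looped over for every tweet. The sample tweets and the io.StringIO stand-in for the sentiment file are assumptions for illustration only.

```python
import io

# Stand-in for open('sentimentfile.txt'); the contents are made up.
sentiment_file = io.StringIO("good\t1\nbad\t-1\n")

# Read the file once and keep the parsed rows in memory.
sentiment_rows = []
for entry in sentiment_file:
    word, score = entry.split('\t')
    sentiment_rows.append((word.strip(), int(score)))

# A list can be iterated as many times as needed, unlike the file object.
tweets = ["good day", "bad day", "good but bad"]
totals = [
    sum(score for word, score in sentiment_rows if word in tweet)
    for tweet in tweets
]
print(totals)
```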

However, since your sentiment score is a simple lookup, the best approach would be to build a dictionary of the sentiment scores, then look up each word of the tweet in that dictionary to calculate the tweet's overall sentiment:

import csv
import json

scores = {}  # maps each word to its sentiment score

with open('sentimentfile.txt') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        scores[row[0].strip()] = int(row[1].strip())


with open('tweetsfile.txt') as f:
    for line in f:
        tweet = json.loads(line)
        # Keep the text as a str so its words compare equal to the dictionary keys
        text = tweet.get('text', '')
        if text:
            total_sentiment = sum(scores.get(word, 0) for word in text.split())
            print("{}: {}".format(text, total_sentiment))

The with statement automatically closes the file when its block ends. I am using the csv module to read the sentiment file (it works for tab-delimited files as well).

This line does the calculation:

total_sentiment = sum(scores.get(word,0) for word in text.split())

It is a shorter way to write this loop:

tweet_score = []
for word in text.split():
    if word in scores:
        tweet_score.append(scores[word])  # collect the score of each known word

total_score = sum(tweet_score)

The get method of dictionaries takes an optional second argument, a custom value to return when the key cannot be found; if you omit this second argument, it returns None. In my loop I am using it to return 0 if the word has no score.
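For illustration, here is how get behaves on a small, made-up scores dictionary, and how the default of 0 lets unknown words drop out of the sum:

```python
scores = {'good': 1, 'bad': -1}  # made-up sample scores

known = scores.get('good')        # key present: its value is returned
missing = scores.get('meh')       # key absent, no default: None
defaulted = scores.get('meh', 0)  # key absent, default supplied: 0

# Unknown words contribute 0, so sum() never sees a None.
text = "good movie bad ending good cast"
total = sum(scores.get(word, 0) for word in text.split())
print(known, missing, defaulted, total)
```

Without the default, sum would fail with a TypeError as soon as it met a None for an unscored word.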
