How can I do a Python keyword search and word count across all CSV files in a directory and write the results to a single CSV?

Problem description

I'm new to Python and trying to understand certain libraries. I'm not sure how to upload a CSV to SO, but this script works with any CSV; just replace 'SwitchedProviders_TopicModel'.

My objective is to loop through all CSVs in a directory, C:\Users\jj\Desktop\autotranscribe, and write my Python script's output for each file to a single CSV.

For example, say I have these CSV files in the folder above:

'1003391793_1003391784_01bc7e411408166f7c5468f0.csv'
'1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv'
'1003478130_1003478103_8eef05b0820cf0ffe9a9882d.csv'

I want my Python app (below) to count words for each CSV in the directory and write the output to a DataFrame like this:

csvname                                            pre existing  exclusions  limitations  fourteen
1003391793_1003391784_01bc7e411408166f7c5468f0.csv    1           2           0            1

My script:

import pandas as pd
from collections import defaultdict

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    line_number = 0
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open("SwitchedProviders_TopicModel.csv", 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append((string_to_search, line_number, line.rstrip()))
 
    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)


matched_lines, count = search_multiple_strings_in_file('SwitchedProviders_TopicModel.csv', [ 'pre existing ', 'exclusions','limitations','fourteen'])
    
df = pd.DataFrame.from_dict(count, orient='index').reset_index()
df.columns = ['Word', 'Count']

print(df)

How can I do this? I only want counts for specific words, such as 'fourteen' in my script, not a count of every word.

Sample data from one of the CSVs (credit: user Umar H):

df = pd.read_csv('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv')
print(df.head(10).to_dict())
{'transcript': {0: 'hi thanks for calling ACCA  this is many speaking could have the pleasure speaking with ', 1: 'so ', 2: 'hi ', 3: 'I have the pleasure speaking with my name is B. as in boy E. V. D. N. ', 4: 'thanks yes and I think I have your account pulled up could you please verify your email ', 5: "sure is yeah it's on _ 00 ", 6: 'I T. O.com ', 7: 'thank you how can I help ', 8: 'all right I mean I do have an insurance with you guys I just want to cancel the insurance ', 9: 'sure I can help with that what was the reason for cancellation '}, 'confidence': {0: 0.73, 1: 0.18, 2: 0.88, 3: 0.72, 4: 0.83, 5: 0.76, 6: 0.83, 7: 0.98, 8: 0.89, 9: 0.95}, 'from': {0: 1.69, 1: 1.83, 2: 2.06, 3: 2.13, 4: 2.36, 5: 2.98, 6: 3.17, 7: 3.65, 8: 3.78, 9: 3.93}, 'to': {0: 1.83, 1: 2.06, 2: 2.13, 3: 2.36, 4: 2.98, 5: 3.17, 6: 3.65, 7: 3.78, 8: 3.93, 9: 4.14}, 'speaker': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'Negative': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.116, 9: 0.0}, 'Neutral': {0: 0.694, 1: 1.0, 2: 1.0, 3: 0.802, 4: 0.603, 5: 0.471, 6: 1.0, 7: 0.366, 8: 0.809, 9: 0.643}, 'Positive': {0: 0.306, 1: 0.0, 2: 0.0, 3: 0.198, 4: 0.397, 5: 0.529, 6: 0.0, 7: 0.634, 8: 0.075, 9: 0.357}, 'compound': {0: 0.765, 1: 0.0, 2: 0.0, 3: 0.5719, 4: 0.7845, 5: 0.5423, 6: 0.0, 7: 0.6369, 8: -0.1779, 9: 0.6124}}

Recommended answer

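Steps:

  1. Define the input path.
  2. Find all CSV files under it.
  3. Count the keywords in each file.
  4. Create one result dict, then add each file name with its dict of counts.
  5. Finally, convert the result dict to a DataFrame and transpose it (fill NaN values with 0 if needed).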
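Steps 4 and 5 are the least obvious part, so here is a minimal sketch of just that conversion. The file names and counts are made up for illustration; only the `pd.DataFrame(...).T.fillna(0).astype(int)` chain comes from the answer.

```python
import pandas as pd

# Hypothetical per-file keyword counts, shaped like the dict built in step 4.
result = {
    "a.csv": {"exclusions": 1, "limitations": 3},
    "b.csv": {"exclusions": 2, "fourteen": 1},
}

# A dict of dicts puts the outer keys (file names) in the columns,
# so transpose to get one row per file. Keywords missing from a file
# come out as NaN, which fillna(0) turns into a zero count.
df = pd.DataFrame(result).T.fillna(0).astype(int)
print(df)
```

Because a keyword that never matches in any file is absent from every inner dict, it gets no column at all. If a fixed set of columns is needed, `df.reindex(columns=keywords, fill_value=0)` is one way to force it, where `keywords` is the search list.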


from collections import defaultdict
from pathlib import Path

import pandas as pd

inp_dir = Path(r'C:/Users/jj/Desktop/Bulk_Wav_Completed')  # directory containing the CSV files

# Keywords to count. Every entry needs a trailing comma: adjacent string
# literals without one are silently concatenated ('bee' '4paws' -> 'bee4paws').
# Each keyword is listed once, because a duplicate entry would be counted twice.
keywords = ['nation', 'nation wide', 'trupanion', 'pet plan', 'best', 'embrace',
            'healthy paws', 'pet first', 'pet partners', 'lemon', 'AKC', 'akc',
            'kennel club', 'club', 'american kennel', 'american', 'lemonade',
            'kennel', 'figo', 'companion protect', 'true companion',
            'true panion', 'trusted pals', 'partners', 'partner', 'wagmo',
            'vagmo', 'bivvy', 'bivy', 'bee', '4paws', 'paws', 'pet best',
            'pets best']


def search_multiple_strings_in_file(file_name, list_of_strings):
    """Return the lines (with line numbers) that contain any of the strings, plus per-string counts"""
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line_number, line in enumerate(read_obj, start=1):
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    # str.count counts substring occurrences, not whole words
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append(
                        (string_to_search, line_number, line.rstrip()))

    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)


result = {}
# '**/*.csv' also matches CSV files in subdirectories
for csv_file in inp_dir.glob('**/*.csv'):
    print(csv_file)  # for debugging
    matched_lines, count = search_multiple_strings_in_file(csv_file, keywords)
    print(count)  # for debugging
    result[csv_file.name] = count

# One row per file, one column per keyword that was found at least once
df = pd.DataFrame(result).T.fillna(0).astype(int)
print(df)
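The question also asks for the result in a single CSV, which the code above stops short of. A minimal sketch using `DataFrame.to_csv` follows. The output name `keyword_counts.csv` and the temporary directory are assumptions for illustration, and the small `df` stands in for the one built above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the DataFrame of counts: one row per CSV file.
df = pd.DataFrame(
    {"exclusions": [1, 2], "limitations": [3, 0]},
    index=["a.csv", "b.csv"],
)

# Hypothetical output location; any writable path works.
out_path = Path(tempfile.mkdtemp()) / "keyword_counts.csv"

# index_label names the file-name column, matching the 'csvname'
# header in the table the question asks for.
df.to_csv(out_path, index_label="csvname")
print(out_path.read_text())
```

If the output file is written into the same directory that is being scanned, a later run would pick it up through `glob('**/*.csv')`, so it is probably safer to write it somewhere else.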

Output:

       exclusions  limitations  pre existing
1.csv           1            3             1
2.csv           1            3             1
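One caveat: `line.count(...)` counts substrings, so with overlapping keywords such as 'lemon' and 'lemonade', or 'partner' and 'partners', the shorter one is also counted inside the longer one. If whole-word counts are wanted, a regular expression with word boundaries is one option. This is a sketch of an alternative, not part of the original answer; `count_whole_words` is a hypothetical helper name.

```python
import re


def count_whole_words(text, keywords):
    """Count each keyword only where it appears as a whole word or phrase."""
    counts = {}
    for keyword in keywords:
        # re.escape keeps any punctuation in the keyword literal;
        # \b requires a word boundary on both sides of the match.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        matches = len(re.findall(pattern, text))
        if matches:
            counts[keyword] = matches
    return counts


line = "lemonade offered me a partner plan, but pet partners did not"
keywords = ["lemon", "lemonade", "partner", "partners"]

print(count_whole_words(line, keywords))
# Substring counting, for comparison: 'lemon' and 'partner' are over-counted.
print({k: line.count(k) for k in keywords})
```

Passing `flags=re.IGNORECASE` to `re.findall` would also make separate entries like 'AKC' and 'akc' unnecessary.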
