Creating an ARFF file from Python output

Views: 639  Published: 2017/11/3 19:10:19  Tags: python file classification weka arff


Problem description


    {'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {
    'dail': 1, 'focus': 1, 'actions': 1, 'trade': 2, 'protest': 1, 'identify': 1,
    'previous': 1, 'detectives': 1, 'republican': 1, 'group': 1, 'monitor': 1,
    'clashes': 1, 'civil': 1, 'charge': 1, 'breaches': 1, 'travelling': 1, 'main': 1,
    'disrupt': 1, 'real': 1, 'policing': 3, 'march': 6, 'finance': 1, 'drawn': 1,
    'assistant': 1, 'protesters': 1, 'emphasised': 1, 'department': 1, 'traffic': 2,
    'outbreak': 1, 'culprits': 1, 'proportionate': 1, 'instructions': 1, 'warned': 2,
    'commanders': 1, 'michael': 2, 'exploit': 1, 'culminating': 1, 'large': 2,
    'continue': 1, 'team': 1, 'hijack': 1, 'disorder': 1, 'square': 1, 'leaders': 1,
    'deal': 2, 'people': 3, 'streets': 1, 'demonstrations': 2, 'observed': 1,
    'street': 2, 'college': 1, 'organised': 1, 'operation': 1, 'special': 1,
    'shown': 1, 'attendance': 1, 'normal': 1, 'unions': 2, 'individuals': 1,
    'safety': 2, 'prosecuted': 1, 'ira': 1, 'ground': 1, 'public': 2, 'told': 1,
    'body': 1, 'stewards': 2, 'obey': 1, 'business': 1, 'gathered': 1, 'assemble': 1,
    'garda': 5, 'sinn': 1, 'broken': 1, 'fachtna': 1, 'management': 2,
    'possibility': 1, 'groups': 3, 'put': 1, 'affiliated': 1, 'strong': 2,
    'security': 1, 'stage': 1, 'behaviour': 1, 'involved': 1, 'route': 2,
    'violence': 1, 'dublin': 3, 'fein': 1, 'ensure': 2, 'stand': 1, 'act': 2,
    'contingency': 1, 'troublemakers': 2, 'facilitate': 2, 'road': 1, 'members': 1,
    'prepared': 1, 'presence': 1, 'sullivan': 2, 'reassure': 1, 'number': 3,
    'community': 1, 'strategic': 1, 'visible': 2, 'addressed': 1, 'notify': 1,
    'trained': 1, 'eirigi': 1, 'city': 4, 'gpo': 1, 'from': 3, 'crowd': 1,
    'visit': 1, 'wood': 1, 'editor': 1, 'peaceful': 4, 'expected': 2, 'today': 1,
    'commissioner': 4, 'quay': 1, 'ictu': 1, 'advance': 1, 'murphy': 2, 'gardai': 6,
    'aware': 1, 'closures': 1, 'courts': 1, 'branch': 1, 'deployed': 1, 'made': 1,
    'thousands': 1, 'socialist': 1, 'work': 1, 'supt': 2, 'feehan': 1, 'mr': 1,
    'briefing': 1, 'visited': 1, 'manner': 1, 'irish': 2, 'metropolitan': 1,
    'spotters': 1, 'organisers': 1, 'in': 13, 'dissident': 1, 'evidence': 1,
    'tom': 1, 'arrangements': 3, 'experience': 1, 'allowed': 1, 'sought': 1,
    'rally': 1, 'connell': 1, 'officers': 3, 'potential': 1, 'holding': 1,
    'units': 1, 'place': 2, 'events': 1, 'dignified': 1, 'planned': 1,
    'independent': 1, 'added': 2, 'plans': 1, 'congress': 1, 'centre': 3,
    'comprehensive': 1, 'measures': 1, 'yesterday': 2, 'alert': 1, 'important': 1,
    'moving': 1, 'plan': 2, 'highly': 1, 'law': 2, 'senior': 2, 'fair': 1,
    'recent': 1, 'refuse': 1, 'attempt': 1, 'brady': 1, 'liaising': 1,
    'conscious': 1, 'light': 1, 'clear': 1, 'headquarters': 1, 'wing': 1,
    'chief': 2, 'maintain': 1, 'harcourt': 1, 'order': 2, 'left': 1}}

I have a Python script that extracts words from text files and counts the number of times each word occurs in the file.

I want to add them to an ".ARFF" file to use for Weka classification. Above is an example of my Python script's output. How do I go about inserting them into an ARFF file, keeping each text file separate? Each file's entry is delimited by braces: {"with their words in here!!"}

Solution
The ARFF file format is well documented, and it's very simple to generate. For example, using a cut-down version of your Python dictionary, the following script:
    import re

    d = {
        'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {
            'dail': 1, 'focus': 1, 'actions': 1,
            'trade': 2, 'protest': 1, 'identify': 1,
        }
    }

    for original_filename in d.keys():
        # Only keys ending in ".html" are handled; the stem names the ARFF file.
        m = re.search(r'^(.*)\.html$', original_filename)
        if not m:
            print("Ignoring the file:", original_filename)
            continue
        output_filename = m.group(1) + '.arff'
        with open(output_filename, "w") as fp:
            # Header: a string attribute for the word, a numeric one for its count.
            fp.write('''@RELATION wordcounts

    @ATTRIBUTE word string
    @ATTRIBUTE count numeric

    @DATA
    ''')
            # Each (word, count) pair becomes one comma-separated data row.
            for word_and_count in d[original_filename].items():
                fp.write("%s,%d\n" % word_and_count)
Generates output of the form:
    @RELATION wordcounts

    @ATTRIBUTE word string
    @ATTRIBUTE count numeric

    @DATA
    dail,1
    focus,1
    actions,1
    trade,2
    protest,1
    identify,1
... in a file called gardai-plan-crackdown-on-troublemakers-at-protest-2438316.arff. If that's not exactly what you want, I'm sure you can easily alter it. (For example, if the "words" might have spaces or other punctuation in them, you probably want to quote them.)
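
The quoting mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper names `quote_arff_string` and `data_lines` are made up here, and the sketch assumes that wrapping a value in single quotes with backslash-escaped quotes and backslashes is acceptable to Weka's ARFF reader. Words containing newlines or other control characters are not handled.

```python
import re


def quote_arff_string(value):
    """Return value as an ARFF string token, quoting it when necessary.

    Hypothetical helper: assumes backslash escaping inside single quotes.
    """
    # Tokens made only of letters, digits, '_', '.' and '-' are written bare.
    if re.fullmatch(r"[A-Za-z0-9_.-]+", value):
        return value
    # Escape backslashes first, then single quotes, then wrap in quotes.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def data_lines(counts):
    """Build the @DATA rows for one document's word counts."""
    return ["%s,%d" % (quote_arff_string(word), count)
            for word, count in counts.items()]


if __name__ == "__main__":
    # Made-up sample words, chosen to show a space and an apostrophe.
    for line in data_lines({"dail": 1, "sinn fein": 2, "o'connell": 1}):
        print(line)
```

In the script above, the final `fp.write` call could then write each string returned by `data_lines(d[original_filename])` followed by a newline. Plain words such as `dail` come out unchanged, while values containing spaces or apostrophes are wrapped in single quotes.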
