How to remove unneeded elements from a dataset

Question

I have a dataset that looks like the following:

 {0: {"address": 0,
         "ctag": "TOP",
         "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
         "feats": "",
         "head": "",
         "lemma": "",
         "rel": "",
         "tag": "TOP",
         "word": ""},
     1: {"address": 1,
         "ctag": "Ne",
         "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
         "feats": "_",
         "head": 6,
         "lemma": "اشرف",
         "rel": "SBJ",
         "tag": "Ne",
         "word": "اشرف"},

I want to remove the "deps": ... entries from this dataset. I tried the code below, but it does not work, because the value of "deps" differs in each element of the dict.

import re
import simplejson as simplejson

with open("../data/cleaned.txt", 'r') as fp:
    lines = fp.readlines()
    k = str(lines)
    a = re.sub(r'\d:', '', k) # this is for removing numbers like `1:{..`
    json_data = simplejson.dumps(a)
    #print(json_data)
    n = eval(k.replace('defaultdict(<class "list">', 'list'))
    print(n)

Recommended answer

The correct way would be to fix the code that produced the text file. The defaultdict(<class "list">, {"ROOT": [6, 51]}) fragment is a hint that the file was written with a plain repr when a proper serialization format was required.
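As a rough illustration of what fixing the producer could look like, here is a minimal sketch that writes the data as JSON instead of a repr. The `nodes` variable and its contents are assumptions modeled on the sample in the question; the real producing code is not shown in the thread.

```python
import json
from collections import defaultdict

# Hypothetical in-memory data, shaped like the sample in the question.
nodes = {
    0: {"address": 0, "deps": defaultdict(list, {"ROOT": [6, 51]}), "word": ""},
    1: {"address": 1, "deps": defaultdict(list, {"NPOSTMOD": [2]}), "word": "اشرف"},
}

# json serializes a defaultdict like a plain dict and converts
# integer keys to strings, so the output is valid JSON.
text = json.dumps(nodes, ensure_ascii=False, indent=2)

# Reading it back then needs neither regexes nor eval.
loaded = json.loads(text)
for node in loaded.values():
    node.pop("deps", None)  # drop the unwanted field, if present
```

With a file written this way, removing "deps" is an ordinary dictionary operation rather than a text-processing problem.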

If the real fix is not possible, the following is just a poor man's workaround.

Getting rid of "deps": ... is easy: it is enough to read the file one line at a time and discard every line that starts with "deps" (ignoring leading whitespace). But that alone is not enough, because the file contains numeric keys, while JSON insists that keys be strings. So the numeric keys must be identified and quoted.

This should allow the file to be loaded:

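import re
import simplejson as json

with open("../data/cleaned.txt", 'r') as fp:
    # skip every line whose first non-blank text is "deps" and quote the
    # numbers on the remaining lines (note: this also quotes numeric
    # values such as "address": 0, not only the numeric keys)
    k = ''.join(re.sub(r'(?<!\w)(\d+)', r'"\1"', line)
                for line in fp if not line.strip().startswith('"deps"'))

# remove a possible trailing comma
k = re.sub(r',\s*$', '', k)

# uncomment if the file does not contain the last }
# k += '}'

js = json.loads(k)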
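A variation on the same workaround, offered only as a sketch: it quotes just the numeric keys at the start of a line, so values such as "head": 6 stay numbers. The `load_without_deps` helper, the `KEY_RE` pattern, and the inline `SAMPLE` text are hypothetical names introduced for illustration. It assumes each "deps" entry sits on its own line and is never the last key of its dict, as in the sample above.

```python
import json
import re

# Shortened copy of the sample from the question (assumed file layout).
SAMPLE = '''\
 {0: {"address": 0,
         "ctag": "TOP",
         "deps": defaultdict(<class "list">, {"ROOT": [6, 51]}),
         "word": ""},
     1: {"address": 1,
         "deps": defaultdict(<class "list">, {"NPOSTMOD": [2]}),
         "head": 6,
         "word": "اشرف"},
'''

# Digits at the start of a line (optionally after "{") followed by ":".
KEY_RE = re.compile(r'^(\s*\{?\s*)(\d+)(\s*:)')

def load_without_deps(text):
    kept = []
    for line in text.splitlines():
        if line.strip().startswith('"deps"'):
            continue  # drop the whole "deps" line
        kept.append(KEY_RE.sub(r'\1"\2"\3', line))  # quote numeric keys only
    body = '\n'.join(kept)
    body = re.sub(r',\s*$', '', body)  # remove a possible trailing comma
    if body.count('{') > body.count('}'):
        body += '}'  # close the outer dict if the file was truncated
    return json.loads(body)

data = load_without_deps(SAMPLE)
```

The brace count is a crude check that would be fooled by braces inside string values, so it is only suitable for data shaped like the sample.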
