How to extract elements from each line in a jsonline file?

Problem description

I have a jsonl file in which each line contains a sentence and the tokens found in that sentence. I want to extract the tokens from every line of the JSON Lines file, but my loop only returns the tokens from the last line.

This is the input:

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is the second sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"second","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}

I tried running the following code:

with jsonlines.open('path/to/file') as reader:
    for obj in reader:
        data = obj['tokens'] # just extract the tokens
        data = [(i['text'], i['id']) for i in data] # elements from the tokens

data

The actual result:

[('This', 0), ('is', 1), ('the', 2), ('first', 3), ('sentence', 4), ('.', 5)]

The result I want to get:

Some tokens contain a "label" instead of an "id". How could I incorporate that into the code? An example would be:

{"text":"This is the first sentence.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"the","id":2},
{"text":"first","id":3},
{"text":"sentence","id":4},
{"text":".","id":5}]}
{"text":"This is coded in python.","_input_hash":2083129218,"_task_hash":-536378640,"spans":[],"meta":{"score":0.5,"pattern":65},"answer":"accept","tokens":[
{"text":"This","id":0},
{"text":"is","id":1},
{"text":"coded","id":2},
{"text":"in","id":3},
{"text":"python","label":"Programming"},
{"text":".","id":5}]}

Recommended answer

Some issues and changes in the code:

  • You reassign the variable data on every iteration of the loop, so you only see the result for the last JSON line. Instead, you want to extend one list on each iteration.

  • You want to use enumerate on the reader iterator to get the line index, which becomes the first item of each tuple.
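The first point can be seen in isolation, without any file. The following is a minimal sketch; the rows list is a hypothetical in-memory stand-in for the objects the reader would yield, not data from the question.

```python
# Hypothetical stand-in for the parsed lines of the file (an assumption)
rows = [
    {"tokens": [{"text": "first", "id": 3}]},
    {"tokens": [{"text": "second", "id": 3}]},
]

# Reassigning: each iteration overwrites the value from the previous one
for obj in rows:
    overwritten = [(t["text"], t["id"]) for t in obj["tokens"]]

# Extending: each iteration adds to a single list created before the loop
collected = []
for obj in rows:
    collected.extend((t["text"], t["id"]) for t in obj["tokens"])

print(overwritten)  # [('second', 3)]
print(collected)    # [('first', 3), ('second', 3)]
```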

The code then changes to:

import jsonlines

data = []
# Open the JSON Lines file
with jsonlines.open('file.txt') as reader:
    # Iterate over each line of the reader; enumerate supplies the 0-based line index
    for idx, obj in enumerate(reader):
        # Extend the result with (line number, token text, token id + 1) tuples
        data.extend([(idx+1, i['text'], i['id']+1) for i in obj['tokens']])

print(data)

Or, more compactly, use a double for-loop in the list comprehension itself:

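import jsonlines

# Open the file, iterate over the lines and their tokens, and build the tuples;
# the with block closes the file once the list has been built
with jsonlines.open('file.txt') as reader:
    result = [(idx+1, i['text'], i['id']+1) for idx, obj in enumerate(reader) for i in obj['tokens']]

print(result)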

The output will be:

[
(1, 'This', 1), 
(1, 'is', 2), 
(1, 'the', 3), 
(1, 'first', 4), 
(1, 'sentence', 5), 
(1, '.', 6), 
(2, 'This', 1), 
(2, 'is', 2), 
(2, 'the', 3), 
(2, 'second', 4), 
(2, 'sentence', 5), 
(2, '.', 6)
]
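The code above raises a KeyError for tokens that carry a "label" instead of an "id". One possible way to handle the follow-up question is sketched below; it is not part of the original answer. It assumes each JSON object sits on a single physical line, uses the standard-library json module rather than jsonlines so that it runs without extra packages, and keeps the id unchanged instead of adding 1. The function name extract_tokens and the sample line are hypothetical.

```python
import json

def extract_tokens(lines):
    """Return (line_number, text, id_or_label) tuples for every token.

    `lines` is any iterable of strings, e.g. an open text file.
    """
    result = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines (they still count toward line_no)
        obj = json.loads(line)
        for tok in obj.get("tokens", []):
            # Prefer "id"; fall back to "label"; None if neither key exists
            value = tok["id"] if "id" in tok else tok.get("label")
            result.append((line_no, tok["text"], value))
    return result

# Hypothetical sample line, shortened from the second example in the question
sample = [
    '{"text": "This is coded in python.", "tokens": ['
    '{"text": "in", "id": 3}, {"text": "python", "label": "Programming"}]}',
]
print(extract_tokens(sample))  # [(1, 'in', 3), (1, 'python', 'Programming')]
```

With a real file, the same function could be called as extract_tokens(open('file.txt', encoding='utf-8')), again assuming one object per line.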
