Retrieving data from a yaml file based on a Python list
Question
I'm working in IPython. I have a YAML file and a list of [thomas] ids that correspond to records in that file ([thomas:] is the third line of each record). Below is just a small snippet of the file. The complete file can be found here: https://github.com/108michael/congress-legislators/blob/master/legislators-historical.yaml
- id:
    bioguide: C000858
    thomas: '00246'
    lis: S215
    govtrack: 300029
    opensecrets: N00002091
    votesmart: 53288
    icpsr: 14809
    fec:
    - S0ID00057
    wikipedia: Larry Craig
    house_history: 11530
  name:
    first: Larry
    middle: E.
    last: Craig
  bio:
    birthday: '1945-07-20'
    gender: M
    religion: Methodist
  terms:
  - type: rep
    start: '1981-01-05'
    end: '1983-01-03'
    state: ID
    district: 1
    party: Republican
  - type: rep
    start: '1983-01-03'
    end: '1985-01-03'
    state: ID
    district: 1
    party: Republican
I want to parse the file and, for every id in my list that matches a [thomas:] id in the file, retrieve the following: [fec:] (there may be more than one; I need all of them); [name:] [first:] [middle:] [last:]; [bio:] [birthday:]; and [terms:] (there is likely more than one term; I need all of them) [type:] [start:] [state:] [party:]. Finally, there may also be records where the fec data is not available.
1) How should I store the data? I am still relatively new to Python (my first programming language) and am not sure how to store the data. Intuitively, I would say a dictionary; what matters most, however, is ease of access and data retrieval. Previously, I stored similarly nested data as CSV, which felt a little bulky. Ideally, I would build a list (from the thomas ids that I have) of dictionaries (the data I am retrieving).
2) I'm not sure how to set up the for/while statements so that I only retrieve data corresponding to my list of thomas ids.
I started by writing what I expected the code for writing the info to CSV to look like:
import pandas as pd
import yaml
import glob
import csv

df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))
outputfile = open('sponsor_details', 'w', newline='')
outputwriter = csv.writer(outputfile)

df = df.drop_duplicates('sponsor_id')
sponsor_list = df['sponsor_id'].tolist()

with open('legislators-historical.yaml', 'r') as f:
    data = yaml.load(f)

for sponsor in sponsor_list:
    if sponsor == data[0]['thomas']:
        x = data[0]['thomas']
        a = data[0]['name']['first']
        b = data[0]['name']['middle']
        c = data[0]['name']['last']
        d = data[0]['bio']['gender']
        e = data[0]['bio']['religion']
        for fec in data[0]['id']:
            c = fec.get('fec')
        for terms in data[0]['id']:
            t = terms.get('type')
            s = terms.get('start')
            state = terms.get('state')
            p = terms.get('party')
        outputwriter.writerow([x, a, b, c, d, e, c, t, s, state, p])
        outputfile.flush()
I get the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-48-057d25de7e11> in <module>()
15
16 for sponsor in sponsor_list:
---> 17 if sponsor == data[0]['thomas']:
18 x = data[0]['thomas']
19 a = data[0]['name']['first']
KeyError: 'thomas'
Recommended answer
You could parse the YAML and load it into a data frame, normalizing the nested fields into columns:
import pandas as pd
from yaml import safe_load

with open('legislators-historical.yaml', 'r') as f:
    df = pd.json_normalize(safe_load(f))

print(df.head())
Output:
bio.birthday bio.gender bio.religion id.bioguide id.fec id.govtrack \
0 1943-12-02 M Protestant A000109 [S6CO00168] 300003
1 1745-04-02 M NaN B000226 NaN 401222
2 1742-03-21 M NaN B000546 NaN 401521
3 1743-06-16 M NaN B001086 NaN 402032
4 1730-07-22 M NaN C000187 NaN 402334
id.house_history id.icpsr id.lis id.opensecrets id.thomas id.votesmart \
0 8410 29108 S250 N00009082 00011 26783
1 NaN 507 NaN NaN NaN NaN
2 9479 786 NaN NaN NaN NaN
3 10177 1260 NaN NaN NaN NaN
4 10687 1538 NaN NaN NaN NaN
id.wikipedia name.first name.last name.middle \
0 Wayne Allard Wayne Allard A.
1 NaN Richard Bassett NaN
2 NaN Theodorick Bland NaN
3 Aedanus Burke Aedanus Burke NaN
4 Daniel Carroll Daniel Carroll NaN
terms
0 [{'party': 'Republican', 'type': 'rep', 'state...
1 [{'party': 'Anti-Administration', 'type': 'sen...
2 [{'end': '1791-03-03', 'district': 9, 'type': ...
3 [{'end': '1791-03-03', 'district': 2, 'type': ...
4 [{'end': '1791-03-03', 'district': 6, 'type': ...
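The normalized frame also shows why the original loop raised KeyError: 'thomas' is nested under 'id', so it appears here as the column id.thomas rather than as a top-level key. As a minimal sketch of selecting only the rows that match a list of thomas ids, the example below uses two hand-written records shaped like the YAML entries instead of the real file. The name `sponsor_list` and the id '99999' are placeholders, and it is an assumption that your ids are zero-padded strings like those in the file.

```python
import pandas as pd

# Two illustrative records shaped like entries in the YAML file.
records = [
    {
        "id": {"bioguide": "C000858", "thomas": "00246", "fec": ["S0ID00057"]},
        "name": {"first": "Larry", "middle": "E.", "last": "Craig"},
        "bio": {"birthday": "1945-07-20"},
        "terms": [{"type": "rep", "start": "1981-01-05", "state": "ID", "party": "Republican"}],
    },
    {
        # Older records may have neither a thomas nor a fec entry.
        "id": {"bioguide": "B000226"},
        "name": {"first": "Richard", "last": "Bassett"},
        "bio": {"birthday": "1745-04-02"},
        "terms": [{"type": "sen", "party": "Anti-Administration"}],
    },
]

df = pd.json_normalize(records)

# Placeholder list of thomas ids. If the ids were read from CSV as integers,
# something like str(i).zfill(5) may be needed first (assumption).
sponsor_list = ["00246", "99999"]

# Rows without a thomas id hold NaN in this column and never match.
selected = df[df["id.thomas"].isin(sponsor_list)]
print(selected[["id.thomas", "name.first", "name.last", "bio.birthday"]])
```

Filtering with isin avoids an explicit for/while loop over the id list. Ids in the list that do not occur in the file are simply ignored.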
Update:
The following version filters the input data so that only records containing both "thomas" and "fec" are processed:
import pandas as pd
from yaml import safe_load

def read_yaml(fn):
    with open(fn, 'r') as fi:
        return safe_load(fi)

def filter_data(data):
    result_data = []
    for x in data:
        if 'id' not in x: continue
        if 'fec' not in x['id']: continue
        if 'thomas' not in x['id']: continue
        result_data.append(x)
    return result_data

fn = 'aaa.yaml'
df = pd.json_normalize(filter_data(read_yaml(fn)), 'terms', [['id', 'fec'], ['id', 'thomas']])
print(df.head())
df.to_csv('out.csv')
Output:
class district end party start state type \
0 NaN 4 1993-01-03 Republican 1991-01-03 CO rep
1 NaN 4 1995-01-03 Republican 1993-01-05 CO rep
2 NaN 4 1997-01-03 Republican 1995-01-04 CO rep
3 2 NaN 2003-01-03 Republican 1997-01-07 CO sen
4 2 NaN 2009-01-03 Republican 2003-01-07 CO sen
url id.thomas id.fec
0 NaN 00011 S6CO00168
1 NaN 00011 S6CO00168
2 NaN 00011 S6CO00168
3 NaN 00011 S6CO00168
4 http://allard.senate.gov 00011 S6CO00168
PS: As you can see, this repeats the id.thomas and id.fec values on every term row so that the data can be shown as a flat data frame.
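If the repeated rows are unwanted, or records without fec data need to be kept, a plain dictionary keyed by thomas id is one alternative. It is closer to the "list of dictionaries" idea in the question. The sketch below is not part of the original answer. The helper name `index_by_thomas` and the second sample record (id '99999', "Jane Doe") are made up for illustration, and `data` stands in for the list returned by safe_load.

```python
def index_by_thomas(data, wanted_ids):
    """Return {thomas_id: selected fields} for records whose thomas id is wanted."""
    wanted = set(wanted_ids)
    result = {}
    for record in data:
        ids = record.get("id", {})
        thomas = ids.get("thomas")
        if thomas not in wanted:
            continue  # also skips records that have no thomas id at all
        name = record.get("name", {})
        result[thomas] = {
            "fec": ids.get("fec", []),  # empty list when fec is unavailable
            "first": name.get("first"),
            "middle": name.get("middle"),
            "last": name.get("last"),
            "birthday": record.get("bio", {}).get("birthday"),
            "terms": [
                {key: term.get(key) for key in ("type", "start", "state", "party")}
                for term in record.get("terms", [])
            ],
        }
    return result


# Illustrative stand-in for safe_load(f); the second record is hypothetical.
data = [
    {
        "id": {"thomas": "00246", "fec": ["S0ID00057"]},
        "name": {"first": "Larry", "middle": "E.", "last": "Craig"},
        "bio": {"birthday": "1945-07-20"},
        "terms": [
            {"type": "rep", "start": "1981-01-05", "state": "ID", "party": "Republican"},
            {"type": "rep", "start": "1983-01-03", "state": "ID", "party": "Republican"},
        ],
    },
    {
        "id": {"thomas": "99999"},  # hypothetical record with no fec entry
        "name": {"first": "Jane", "last": "Doe"},
        "bio": {},
        "terms": [],
    },
]

result = index_by_thomas(data, ["00246", "99999"])
print(result["00246"]["fec"], len(result["00246"]["terms"]))
```

Each legislator is then one lookup away (for example `result["00246"]["terms"]`), and a missing fec entry shows up as an empty list rather than dropping the record.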
Update 2
You may also want to convert the lists in 'id.fec' into columns, but I would do that in a separate data frame:
df_fec = df['id.fec'].apply(pd.Series)
print(df_fec.head())
Output:
0 1
0 S8AR00112 H2AR01022
1 S8AR00112 H2AR01022
2 S8AR00112 H2AR01022
3 S8AR00112 H2AR01022
4 S6CO00168 NaN
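If one row per fec id is easier to work with than one column per fec id, DataFrame.explode is another option. The sketch below is an alternative to the approach above, not part of the original answer. It builds a small frame by hand from the fec values shown in the output, and the `legislator` column with labels "A" and "B" is a placeholder.

```python
import pandas as pd

# Hand-built frame; the fec lists match the values shown in the output above.
df_small = pd.DataFrame(
    {
        "legislator": ["A", "B"],  # placeholder labels
        "id.fec": [["S8AR00112", "H2AR01022"], ["S6CO00168"]],
    }
)

# explode() turns each list element into its own row and repeats the other columns.
long_form = df_small.explode("id.fec").reset_index(drop=True)
print(long_form)
```

This keeps a fixed set of columns no matter how many fec ids a legislator has, at the cost of repeating the other columns.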