Creating a dict from CSV in Dataflow (Python)


Question

I am trying to build a dict from CSV data in Python. I do not want to use the traditional split(',') and then map each position to the heading I want, because I will be receiving different CSV files with different amounts of information, and I would not be able to consistently target the fields I want with that method.

The header names will be consistent; there may just be more headers in one file than in another.

Instead, I have been trying to build a list from the CSV file and zip the first row with each of the remaining rows to create dictionaries, so that I can extract exactly the contents I want.

I can create a list of lists, either by using csv.reader or with:

class Split(beam.DoFn):
    def process(self, element):
        rows = element.splitlines()
        data = []
        for row in rows:
            data.append([row])
        return data

This returns:

[u'FIRST_NAME,last_name,birthdate,voter_id,phone_number']
[u'hector,ABAD,6/15/1970,11*******,7*********']
[u'm,ABAL,6/16/1949,12********,']
[u'jorge,ABDALA,6/15/1962,21********,3********']
[u'karen,ABELLA,6/18/1988,33********,']

However, when I try to access the first row via:

rows = element.splitlines()
data = []
for row in rows:
    # f = pattern.findall(row)
    data.append([row])
return data[0]

it returns:

FIRST_NAME,last_name,birthdate,voter_id,phone_number
hector,ABAD,6/15/1970,11*******,7*********
m,ABAL,6/16/1949,109055849,
jorge,ABDALA,6/15/1962,21********,3********
karen,ABELLA,6/18/1988,33********,

I have also tried the beam_utils CSV reader, but after I fix the fileio bug it says that there is no module named 'sources'.

If someone knows a better way, or can point out what I'm doing wrong, that would be great. This is my pipeline:

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read' >> ReadFromText(known_args.input)
     | 'Split Values' >> beam.ParDo(Split())
     | 'WriteToText' >> beam.io.WriteToText(known_args.output)) 

I am only reading from my Google Cloud Storage bucket for now, but in the future the data will come from Pub/Sub.

I would like the content to look like:

{"FIRST_NAME": "hector", "last_name": "ABAD", "birthdate": "6/15/1970", "voter_id": 11*******, "phone_number": 7*********}
etc.
etc.
etc.

Recommended answer

Processing the header element of CSV files doesn't seem to be well supported by the Python Beam SDK (other than discarding it). Fortunately, someone has created this repo to deal with this use case: https://github.com/pabloem/beam_utils

It contains a CsvFileSource class that extends FileBasedSource (Beam's abstract class for creating custom file sources) and creates dicts from files with variable headers.
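As a rough picture of the per-row dicts a header-aware source yields, here is a standard-library sketch. It is not the beam_utils implementation: csv.DictReader pairs the first row with every later row, the helper name csv_text_to_dicts and the sample data are made up for illustration, and the sketch assumes the whole file is available as a single string.

```python
import csv
import io


def csv_text_to_dicts(text):
    """Return one dict per data row, keyed by the header row of `text`."""
    # DictReader consumes the first row as field names, so files with
    # extra columns simply produce dicts with extra keys.
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# Hypothetical sample data, shaped like the question's file.
sample = (
    "FIRST_NAME,last_name,birthdate\n"
    "hector,ABAD,6/15/1970\n"
    "m,ABAL,6/16/1949\n"
)
records = csv_text_to_dicts(sample)
# Pick out only the fields of interest by name, regardless of column order.
names = [(r["FIRST_NAME"], r["last_name"]) for r in records]
```

Because fields are looked up by header name, a file that carries additional columns needs no change to the extraction step.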

Install the package:

pip install beam_utils

Then import the source:

from beam_utils.sources import CsvFileSource

It can be used like this:

 p | 'ReadCsvFile' >> beam.io.Read(CsvFileSource(known_args.input))

This should produce the output you're looking for.

To make the package available to Dataflow workers, create a tarball of it and pass it to the job with the --extra_package flag, as described in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#local-or-nonpypi
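As for why returning data[0] in the question still printed every line: ReadFromText emits one element per line, so process runs once per line and data[0] is that single line rather than the header. The plain-Python simulation below mimics this without Beam. Here emit and split_process are hypothetical stand-ins, and the sketch assumes, as in Beam, that the iterable returned by process is flattened into output elements.

```python
def split_process(element):
    """Same logic as the question's Split.process, returning data[0]."""
    data = [[row] for row in element.splitlines()]
    # `element` is a single line, so data holds exactly one one-item list.
    return data[0]


def emit(lines, process):
    """Stand-in for ParDo: call process per element and flatten the results."""
    out = []
    for line in lines:
        out.extend(process(line))
    return out


# Hypothetical input: what a line-based reader would hand over, one line each.
lines = ["FIRST_NAME,last_name", "hector,ABAD", "m,ABAL"]
result = emit(lines, split_process)
```

In this simulation result equals lines: each call sees only its own line, so no call can reach the header row. That is why a source that reads the header itself is needed.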
