Creating a star schema from CSV files using Python


Question

I have 6 dimension tables, all stored as CSV files, and I have to build a star schema from them using Python. I'm not sure how to create the fact table in Python. In theory, the fact table has at least one column in common with a dimension table.

How can I create the fact table so that values coming from several dimension tables correspond correctly within each fact row?

I am not allowed to reveal the code or the exact data, but here is a small example. File 1 contains the columns student_id, student_name. File 2 contains student_id, department_id, department_name, sem_id. File 3 contains student_id, subject_code, subject_score. These 3 dimension tables are CSV files. I need the fact table to contain student_id, student_name, department_id, subject_code. How can I build a fact table in that form? Thank you for your help.

Recommended answer

From what I have read on a few blogs, handling this kind of case in memory in Python does not look like a good approach. Still, if the post quoted below makes sense for your situation, you can use it.
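For files as small as the ones in the question's example, an in-memory join may still be adequate. The following is a minimal sketch using only the standard library; the sample rows, the helper names `index_by` and `build_fact_rows`, and the assumption that each student_id appears exactly once in File 1 and File 2 are all illustrative and not part of the original question.

```python
import csv
import io

# Hypothetical contents standing in for the three CSV files in the question.
students = io.StringIO("student_id,student_name\nS1,Ann\nS2,Bob\n")
departments = io.StringIO(
    "student_id,department_id,department_name,sem_id\n"
    "S1,D1,Physics,1\nS2,D2,Maths,2\n"
)
scores = io.StringIO(
    "student_id,subject_code,subject_score\n"
    "S1,PH101,90\nS1,PH102,85\nS2,MA101,75\n"
)


def index_by(handle, key):
    """Read a CSV into a dict keyed by one column (assumes the key is unique)."""
    return {row[key]: row for row in csv.DictReader(handle)}


def build_fact_rows(students, departments, scores):
    """Join every score row to its student and department through student_id."""
    student_by_id = index_by(students, "student_id")
    department_by_student = index_by(departments, "student_id")
    fact_rows = []
    for score in csv.DictReader(scores):
        sid = score["student_id"]
        fact_rows.append({
            "student_id": sid,
            "student_name": student_by_id[sid]["student_name"],
            "department_id": department_by_student[sid]["department_id"],
            "subject_code": score["subject_code"],
        })
    return fact_rows


fact_rows = build_fact_rows(students, departments, scores)
for row in fact_rows:
    print(row)
```

The lookup raises `KeyError` when a score row refers to a student_id missing from the other files, which is one way to surface rows that do not correspond; with real files, `open(path, newline="")` would replace the `io.StringIO` objects.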


Actual loading

The first step in data warehouse (DW) loading is dimensional conformance. With a little cleverness, the above processing can all be done in parallel, hogging a lot of CPU time. To do this in parallel, each conformance algorithm forms part of a large OS-level pipeline. The source file must be reformatted to leave an empty column for each dimension's foreign key (FK) reference. Each conformance process reads the source file and writes out a file in the same format with one dimension FK filled in. If all of these conformance algorithms form a simple OS pipe, they all run in parallel. It looks something like this.
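A single conformance step of the kind described here might look like the sketch below. It is an illustration under stated assumptions, not the quoted post's own code: the function name `conform`, the `student_fk` column, and the in-memory `student_dim` lookup are hypothetical, and `io.StringIO` stands in for the streams a real pipeline stage would read and write.

```python
import csv
import io


def conform(source, sink, dimension, natural_key, fk_column):
    """Copy rows from source to sink, filling in one dimension's FK column.

    `dimension` maps a natural key value to a surrogate key.
    Rows whose natural key is unknown keep an empty FK.
    """
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[fk_column] = dimension.get(row[natural_key], "")
        writer.writerow(row)


# Hypothetical source data, already reformatted with an empty FK column.
source = io.StringIO(
    "student_id,subject_code,subject_score,student_fk\n"
    "S1,MATH,90,\n"
    "S2,PHYS,75,\n"
)
student_dim = {"S1": 1, "S2": 2}  # natural key -> surrogate key (assumed values)

sink = io.StringIO()
conform(source, sink, student_dim, "student_id", "student_fk")
print(sink.getvalue())
```

To use such a step as one stage of an OS pipe, the same function could be called with `sys.stdin` and `sys.stdout` in place of the `StringIO` objects, so that each stage passes the same file format on to the next.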

src2cvs source | conform1 | conform2 | conform3 | load

At the end, you use the RDBMS's bulk loader (or write your own in Python, it's easy) to pick the actual fact values and the dimension FKs out of the source records that are fully populated with all dimension FKs, and load these into the fact table.
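A hand-written loader for the final stage could be sketched as follows. It assumes an in-memory SQLite database and a hypothetical `fact_score` table with `student_fk` and `department_fk` columns; none of these names come from the quoted post, and a production load would more likely rely on the target RDBMS's own bulk loader.

```python
import csv
import io
import sqlite3


def load_facts(source, connection):
    """Insert the FK columns and the measure from conformed rows into the fact table."""
    reader = csv.DictReader(source)
    rows = [
        (
            int(r["student_fk"]),
            int(r["department_fk"]),
            r["subject_code"],
            float(r["subject_score"]),
        )
        for r in reader
    ]
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT INTO fact_score "
            "(student_fk, department_fk, subject_code, subject_score) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Hypothetical records with every dimension FK already filled in.
conformed = io.StringIO(
    "student_id,subject_code,subject_score,student_fk,department_fk\n"
    "S1,MATH,90,1,10\n"
    "S2,PHYS,75,2,20\n"
)

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE fact_score ("
    "student_fk INTEGER, department_fk INTEGER, "
    "subject_code TEXT, subject_score REAL)"
)
loaded = load_facts(conformed, connection)
print(loaded, "rows loaded")
```

Only the FKs and the fact values reach the table; the natural key `student_id` stays behind in the source records, which is the selection the paragraph above describes.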
