How to get started with Big Data Analysis


Question

I've been a longtime user of R and have recently started working with Python. I use conventional RDBMS systems for data warehousing and R/Python for number-crunching, and I now feel the need to get my hands dirty with Big Data Analysis.

I'd like to know how to get started with Big Data crunching.

  • How to start simple with Map/Reduce and the use of Hadoop
  • How can I leverage my skills in R and Python to get started with Big Data analysis? Using the Python Disco project for example.
  • Using the RHIPE package and finding toy datasets and problem areas.
  • Finding the right information to allow me to decide if I need to move to NoSQL from RDBMS type databases

All in all, I'd like to know how to start small and gradually build up my skills and know-how in Big Data Analysis.

Thank you for your suggestions and recommendations. I apologize for the generic nature of this query, but I'm looking to gain more perspective regarding this topic.

Harsh

Solution

"Using the Python Disco project for example."

Good. Play with that.

"Using the RHIPE package and finding toy datasets and problem areas."

Fine. Play with that, too.

Don't sweat finding "big" datasets. Even small datasets present very interesting problems. Indeed, any dataset is a starting-off point.
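To make "start small" concrete, here is a minimal sketch of the map/reduce pattern in plain Python on a toy dataset. It does not use Hadoop, Disco, or RHIPE; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are hypothetical and only illustrate the shape of the computation those frameworks distribute.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit one (word, 1) pair for every word in a single input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the list of values for one key into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def map_reduce(records):
    """Run map, shuffle, and reduce in one process over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical toy dataset: two short text records.
toy_records = ["big data starts small", "small data is still data"]
word_counts = map_reduce(toy_records)
```

The same three steps apply regardless of dataset size; a framework mainly changes where each step runs, not what it computes.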

I once built a small star schema to analyze the $60M budget of an organization. The source data was in spreadsheets and essentially incomprehensible, so I unloaded it into a star schema and wrote several analytical programs in Python to create simplified reports of the relevant numbers.
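A rough sketch of what that kind of exercise can look like follows. It is not the original budget code: the rows, column names, and helper functions are all hypothetical, and the star schema is held in plain Python structures (two dimension dictionaries and a list of facts) rather than in any particular database.

```python
from collections import defaultdict

# Hypothetical rows as they might come out of a budget spreadsheet.
spreadsheet_rows = [
    {"department": "Engineering", "quarter": "Q1", "amount": 1_200_000},
    {"department": "Engineering", "quarter": "Q2", "amount": 1_500_000},
    {"department": "Marketing", "quarter": "Q1", "amount": 800_000},
]


def surrogate_key(dimension, value):
    """Return the surrogate key for a value, adding it to the dimension if new."""
    if value not in dimension:
        dimension[value] = len(dimension) + 1
    return dimension[value]


def build_star_schema(rows):
    """Split flat rows into two dimensions and one fact list keyed by surrogate keys."""
    departments, quarters, facts = {}, {}, []
    for row in rows:
        facts.append({
            "department_key": surrogate_key(departments, row["department"]),
            "quarter_key": surrogate_key(quarters, row["quarter"]),
            "amount": row["amount"],
        })
    return departments, quarters, facts


def total_by_dimension(dimension, facts, key_field):
    """Sum the fact amounts grouped by the named dimension key."""
    names = {key: name for name, key in dimension.items()}
    totals = defaultdict(int)
    for fact in facts:
        totals[names[fact[key_field]]] += fact["amount"]
    return dict(totals)


departments, quarters, facts = build_star_schema(spreadsheet_rows)
department_totals = total_by_dimension(departments, facts, "department_key")
quarter_totals = total_by_dimension(quarters, facts, "quarter_key")
```

Each simplified report is then just a grouping of the fact list by one dimension.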

"Finding the right information to allow me to decide if I need to move to NoSQL from RDBMS type databases"

This is easy.

First, get a book on data warehousing, for example Ralph Kimball's The Data Warehouse Toolkit.

Second, study the "Star Schema" carefully, particularly all the variants and special cases that Kimball explains in depth.

Third, realize the following: SQL is for Updates and Transactions.

When doing "analytical" processing (big or small) there's almost no update of any kind. SQL (and the related normalization) no longer matters much.

Kimball's point (and others make it, too) is that most of your data warehouse is not in SQL; it's in simple flat files. A data mart (for ad-hoc, slice-and-dice analysis) may be in a relational database to permit easy, flexible processing with SQL.
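As an illustration of analysis done directly on a flat file with no SQL involved, here is a small sketch using only the Python standard library. The file contents, the column names, and the `units_by` helper are hypothetical; an in-memory `io.StringIO` stands in for a real file on disk.

```python
import csv
import io
from collections import Counter

# Hypothetical flat file; in practice this would be opened from disk.
flat_file = io.StringIO(
    "region,product,units\n"
    "north,widget,10\n"
    "south,widget,4\n"
    "north,gadget,7\n"
)


def units_by(column, lines):
    """Total the 'units' field grouped by the given column of a CSV source."""
    totals = Counter()
    for row in csv.DictReader(lines):
        totals[row[column]] += int(row["units"])
    return dict(totals)


units_by_region = units_by("region", flat_file)
flat_file.seek(0)  # rewind so the same source can be scanned again
units_by_product = units_by("product", flat_file)
```

Each analysis is a single read-only pass over the file, which is why updates and transactions never come into play.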

So the "decision" is trivial. If it's transactional ("OLTP") it must be in a Relational or OO DB. If it's analytical ("OLAP") it doesn't require SQL except for slice-and-dice analytics; and even then the DB is loaded from the official files as needed.
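The last point, loading a database from the official files only when slice-and-dice queries are needed, might be sketched like this. SQLite is used purely as a convenient stand-in for a relational data mart; the table name, columns, sample data, and the `load_data_mart` helper are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical "official" flat file; the file remains the source of record.
official_file = io.StringIO(
    "region,product,units\n"
    "north,widget,10\n"
    "south,widget,4\n"
    "north,gadget,7\n"
)


def load_data_mart(lines):
    """Build a throwaway in-memory SQL data mart from a CSV flat file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, units INTEGER)")
    rows = [
        (row["region"], row["product"], int(row["units"]))
        for row in csv.DictReader(lines)
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


mart = load_data_mart(official_file)

# One ad-hoc slice: units per product within a single region.
north_slice = mart.execute(
    "SELECT product, SUM(units) FROM sales "
    "WHERE region = ? GROUP BY product ORDER BY product",
    ("north",),
).fetchall()
```

Because the mart is rebuilt from the file whenever it is needed, it can be discarded after the analysis without losing anything.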
