Hadoop and Django, is it possible?


Problem Description


From what I understand, Hadoop is a distributed storage system. What I don't really get is: can we replace a normal RDBMS (MySQL, PostgreSQL, Oracle) with Hadoop? Or is Hadoop just another type of filesystem, on which we CAN run an RDBMS?

Also, can Django be integrated with Hadoop? In general, how do web frameworks (ASP.NET, PHP, Java (JSP, JSF, etc.)) integrate with Hadoop?

I am a bit confused about Hadoop vs. RDBMS and would appreciate any explanation. (Sorry, I have read the documentation many times, but maybe due to my limited English I find it a bit confusing most of the time.)

Solution

What is Hadoop?

Imagine the following challenge: you have a lot of data, and by "a lot" I mean at least terabytes. You want to transform this data, or extract some information from it, and process it into a format that is indexed, compressed, or "digested" in a way that lets you work with it.

Hadoop is able to parallelize such a processing job and, here comes the best part, takes care of things like redundant storage of the files, distribution of the tasks over the different machines of the cluster, and so on. (Yes, you need a cluster; otherwise Hadoop cannot compensate for the framework's overhead.)

If you take a first look at the Hadoop ecosystem, you will find three big terms: HDFS (the Hadoop Distributed File System), Hadoop itself (with MapReduce), and HBase (the "database", sometimes called a column store, although neither label fits exactly).

HDFS is the filesystem used by both Hadoop and HBase. It is an extra layer on top of the regular filesystem on your hosts. HDFS slices uploaded files into chunks (usually 64 MB), keeps them available across the cluster, and takes care of their replication.
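To get a feel for the numbers, here is a tiny back-of-the-envelope sketch in Python. The 64 MB block size and 3x replication factor are common defaults, not guarantees for any particular cluster:

    # Back-of-the-envelope HDFS math (illustrative only; the 64 MB
    # block size and 3x replication are common defaults, not fixed).
    BLOCK_SIZE = 64 * 1024 ** 2    # 64 MB per HDFS block
    REPLICATION = 3                # copies kept of every block

    file_size = 1 * 1024 ** 4      # a 1 TB input file

    blocks = -(-file_size // BLOCK_SIZE)        # ceiling division
    print(blocks)                               # 16384 blocks
    print(blocks * REPLICATION)                 # 49152 block copies stored
    print(file_size * REPLICATION / 1024 ** 4)  # 3.0 TB of raw disk used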

When Hadoop gets a job to execute, it is given the path of the input files on HDFS, the desired output path, and a Mapper and a Reducer class. The Mapper and Reducer are usually Java classes passed in a JAR file (but with Hadoop Streaming you can use any command-line tool you want). The mapper is called on every entry of the input files (usually line by line, e.g. "emit 1 if the line contains a bad F* word"), and its output is passed to the reducer, which merges the individual outputs into the desired format (e.g. by adding up the numbers). This is an easy way to build a "bad word" counter.
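To make the streaming idea concrete, here is a minimal sketch of such a mapper/reducer pair in Python. The word list is a made-up placeholder; Hadoop Streaming feeds the scripts through stdin/stdout:

    #!/usr/bin/env python
    # mapper.py -- minimal Hadoop Streaming mapper (illustrative sketch).
    # Hadoop pipes the input file line by line into stdin; we emit
    # tab-separated "key<TAB>value" pairs on stdout.
    import sys

    BAD_WORDS = {"badword1", "badword2"}  # hypothetical word list

    for line in sys.stdin:
        if any(word in line.lower() for word in BAD_WORDS):
            print("bad_line\t1")

    #!/usr/bin/env python
    # reducer.py -- sums the 1s emitted by the mapper. Hadoop sorts the
    # mapper output by key before the reducer sees it, so all "bad_line"
    # records arrive together.
    import sys

    total = 0
    for line in sys.stdin:
        key, _, value = line.partition("\t")
        total += int(value)
    print("bad_line\t%d" % total)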

The cool thing: the map computation is done on the node that holds the data. You process the chunks locally and move only the semi-digested (usually smaller) data over the network to the reducers.

And if one of the nodes dies, there is another one holding the same data.

HBase takes advantage of the distributed storage of the files and stores its tables split up into chunks across the cluster. Contrary to Hadoop's batch processing, HBase gives you random access to the data.
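For a feel of what that random access looks like from Python, here is a minimal sketch using the third-party happybase package, which talks to HBase through its Thrift gateway. The host, table, and column names are hypothetical, and the Thrift server must be running:

    # Random access to HBase from Python via the Thrift gateway,
    # using the third-party happybase package (pip install happybase).
    # The table 'users' and column family 'info' are hypothetical and
    # must already exist; the Thrift server listens on port 9090 by default.
    import happybase

    connection = happybase.Connection('localhost')
    table = connection.table('users')

    # Write one cell, then read it back by row key -- no MapReduce job involved.
    table.put(b'user42', {b'info:name': b'Alice'})
    row = table.row(b'user42')
    print(row[b'info:name'])  # b'Alice'

    connection.close()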

As you can see, HBase and Hadoop are quite different from an RDBMS. HBase also lacks a lot of RDBMS concepts: modeling data with triggers, prepared statements, foreign keys, etc. is not what HBase was designed for (I'm not 100% sure about this, so correct me ;-) ).

Can Django be integrated with Hadoop?

For Java it's easy: Hadoop is written in Java, and all the APIs are there, ready to use.

For Python/Django I don't know (yet), but I'm sure you can do something with Hadoop Streaming, or with Jython as a last resort. I've found the following: Hadoopy, and Python in Mappers and Reducers.
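As a rough sketch of how a Python (or Django) application could kick off such a streaming job, one option is simply shelling out to the hadoop CLI. The streaming-jar location and the HDFS paths below are assumptions that vary per installation:

    # Launching a Hadoop Streaming job from Python, e.g. from a Django
    # management command. This just shells out to the hadoop CLI; the
    # jar location and HDFS paths are assumptions -- adjust for your cluster.
    import subprocess

    STREAMING_JAR = "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar"  # varies by distribution

    subprocess.check_call([
        "hadoop", "jar", STREAMING_JAR,
        "-input", "/data/logs",             # HDFS input path (hypothetical)
        "-output", "/data/bad-word-count",  # HDFS output path; must not exist yet
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",               # ship the scripts to the cluster
        "-file", "reducer.py",
    ])

Since such a job can run for minutes or hours, you would normally trigger it from a background worker or management command rather than directly inside a web request.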
