How to save a file in hadoop with python


Problem description

Question:

I am starting to learn Hadoop, but I need to save a lot of files into it using Python. I can't seem to figure out what I am doing wrong. Can anyone help me with this?

Below is my code. I think the HDFS_PATH is correct, since I didn't change it in the settings while installing. pythonfile.txt is on my desktop (as is the Python code being run from the command line).

Code:

import hadoopy
import os
hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt',open('pythonfile.txt').read())])

main()

Output

When I run the above code, all I get is a directory in python itself:

iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 Brian supergroup        236 2014-10-28 11:30 /python
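
That listing is in fact consistent with what hadoopy.writetb does: it serializes the (key, value) pairs it is given into a single SequenceFile at the target path, so /python shows up as one small file rather than a directory of plain-text files. A minimal sketch of reading it back, assuming hadoopy's readtb counterpart to writetb:

import hadoopy

hdfs_path = 'hdfs://localhost:9000/python'

# readtb yields the (key, value) pairs that writetb stored in the SequenceFile
for name, contents in hadoopy.readtb(hdfs_path):
    print(name)        # 'pythonfile.txt'
    print(contents)    # the original file contents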

Recommended answer

This is a pretty typical task for the subprocess module. The solution looks like this:

from subprocess import PIPE, Popen

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()
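
Popen(...).communicate() won't raise if the copy fails, so it can be worth checking the exit status. A minimal sketch using subprocess.call, with pythonfile.txt from the question above and an example HDFS target path standing in for the placeholders:

from subprocess import call

# exit status of `hadoop fs -put`; 0 means the copy succeeded
ret = call(["hadoop", "fs", "-put", "pythonfile.txt", "/python/pythonfile.txt"])
if ret != 0:
    raise RuntimeError("hadoop fs -put failed with exit code %d" % ret)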

Full example

Let's assume you're on a server and have an authenticated connection to HDFS (e.g. you have already run kinit with a .keytab).
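
If that step still needs to happen, it is typically a kinit against the keytab; a minimal sketch, where the keytab path and principal are placeholders you would replace with your own:

from subprocess import call

# obtain a Kerberos ticket before talking to a kerberized HDFS
# (keytab path and principal below are placeholders, not values from this example)
call(["kinit", "-kt", "/path/to/your.keytab", "your-principal@YOUR.REALM"])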

You have just created a csv from a pandas.DataFrame and want to put it into HDFS.

You can then upload the file to HDFS as follows:

import os
import pandas as pd
from subprocess import PIPE, Popen


# define path to saved file
file_name = "saved_file.csv"

# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)

# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)

# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)

# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

The csv file will then exist at /user/<your-user-name>/saved_file.csv.

Note - If you created this file from a Python script called in Hadoop, the intermediate csv file may be stored on an arbitrary node. Since that file is (presumably) no longer needed, it's best practice to remove it so as not to pollute the nodes every time the script is called. You can simply add os.remove(file_name) as the last line of the above script to solve this issue.
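
A minimal sketch of that cleanup, reusing file_name and hdfs_path from the example above and wrapping the upload so the local csv is removed even if the put fails:

try:
    put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()
finally:
    # delete the intermediate csv so it doesn't accumulate on the node
    os.remove(file_name)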

