How to pass parameters to a Python streaming script in Hive?


Question

A Hive user can stream a table through a script to transform its data:

ADD FILE replace-nan-with-zeros.py;

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py'
  AS (...)
FROM some_table;

I have a simple Python script:

#!/usr/bin/env python
import sys


kFirstColumns = 7

def main(argv):

    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')

        # replace NaNs with zeros in every column after the first kFirstColumns
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1

        # print() with a single argument works in both Python 2 and Python 3
        print('\t'.join(outputs))

if __name__ == "__main__":
    main(sys.argv[1:])

How can I make kFirstColumns a command-line argument, or some other kind of parameter, to this Python script?

Thank you!

Solution

The solution is really trivial: append the value to the command string in the USING clause. Use

ADD FILE replace-nan-with-zeros.py;

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py 7'
  AS (...)
FROM some_table;

instead of just

  ...
  USING 'python replace-nan-with-zeros.py'
  ...

It works fine for me.

The Python script should then read the value from its first command-line argument instead of hard-coding it:

kFirstColumns = int(sys.argv[1])
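
For reference, here is a minimal sketch of what the whole script might look like with that change applied. The helper name replace_nans and the fallback constant DEFAULT_FIRST_COLUMNS are hypothetical and not part of the original script. The sketch also assumes it is acceptable to strip only the trailing newline (rather than all surrounding whitespace), so that empty leading or trailing columns are preserved.

```python
#!/usr/bin/env python
import sys

# Hypothetical fallback, matching the value hard-coded in the question.
DEFAULT_FIRST_COLUMNS = 7


def replace_nans(line, first_columns):
    """Replace 'NaN' with '0.0' in every column after the first `first_columns`."""
    # Assumption: strip only the newline so empty edge columns are kept.
    fields = line.rstrip('\n').split('\t')
    return '\t'.join(
        value if index <= first_columns else value.replace('NaN', '0.0')
        for index, value in enumerate(fields, start=1)
    )


def main(argv):
    # argv[0] is the first argument after the script name, e.g. the "7" in
    # USING 'python replace-nan-with-zeros.py 7'
    first_columns = int(argv[0]) if argv else DEFAULT_FIRST_COLUMNS
    for line in sys.stdin:
        print(replace_nans(line, first_columns))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the per-line logic in its own function makes it easy to check locally, without Hive, by piping a few tab-separated lines into the script.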
