Delimiting on caret-A (^A) in Python


Problem description


I have data in the following form:

37101000ssd48800^A1420asd938987^A2011-09-10^A18:47:50.000^A99.00^A1^A0^A
37101000sd48801^A44557asd03082^A2011-09-06^A13:24:58.000^A42.01^A1^A0^A

So first I took it literally and tried:

line = line.split("^A")

and also

line = line.split("\\u001")

So, the issue is:

The first approach works on my local machine if I do this:

cat input.txt | python mapper.py 

It runs fine locally (input.txt is the above data), but fails on a Hadoop streaming cluster.

Someone told me that I should use "\\u001" as the delimiter, but this does not work either, on my local machine or on the cluster.

For Hadoop folks:

If I debug it on local using:

cat input.txt | python mapper.py | sort | python reducer.py

This runs just fine locally if I use "^A" as the delimiter, but I get errors when running on the cluster, and the error code is not very helpful either.

Any suggestions on how I can debug this?
Thanks

Solution

If the original data uses a control-A as a delimiter, and it's just being printed as ^A in whatever you're using to list the data, you have two choices:

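  1. Pipe whatever you use to list the data into a Python script that uses split('^A').

  2. Just use split('\x01') to split on actual control-A values. ('\001' is the same character. Note that '\u001' is not a valid escape, because \u needs exactly four hex digits, as in '\u0001'.)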

The latter is almost always going to be what you really want. The reason this didn't work for you is that you wrote split('\\u001'), escaping the backslash, so you're splitting on the literal string \u001 rather than on control-A.
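To make the difference concrete, here is a small sketch comparing the three candidate delimiters. It assumes the lines are handled as text strings; the sample record is hypothetical, shortened from the data in the question and built with real control-A separators.

```python
# Three candidate delimiters; only the first is the single control-A character.
CTRL_A = "\x01"      # one character, code point 1 (same as "\001")
ESCAPED = "\\u001"   # five characters: backslash, u, 0, 0, 1
CARET_A = "^A"       # two characters: a caret followed by a capital A

# Hypothetical record that uses real control-A bytes as separators.
record = "37101000ssd48800\x011420asd938987\x012011-09-10\x01"

fields_ctrl = record.split(CTRL_A)        # splits into the actual fields
fields_escaped = record.split(ESCAPED)    # delimiter never occurs: one big field
fields_caret = record.split(CARET_A)      # delimiter never occurs: one big field

print(len(CTRL_A), len(ESCAPED), len(CARET_A))
print(fields_ctrl)
print(len(fields_escaped), len(fields_caret))
```

If the last line reports one field for a delimiter, that delimiter does not occur in the data at all, which is a quick way to tell which of the cases above you are in.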

If the original data actually has ^A (a caret followed by an A) as the delimiter, just use split('^A').
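For debugging, it may help to let the mapper report what it actually receives. The following is only a sketch under assumptions, not the asker's mapper.py: the helper names detect_delimiter and split_record are made up for illustration, and emitting tab-separated fields is an arbitrary choice. It writes the repr() of the first input line to stderr, which makes an invisible control-A show up as '\x01', and then splits on whichever delimiter is really present.

```python
import sys

CTRL_A = "\x01"   # the real control-A character
CARET_A = "^A"    # the two-character text a viewer might display instead


def detect_delimiter(line):
    """Return whichever delimiter actually occurs in the line, or None."""
    if CTRL_A in line:
        return CTRL_A
    if CARET_A in line:
        return CARET_A
    return None


def split_record(line):
    """Split one input line into fields, dropping the trailing newline."""
    line = line.rstrip("\n")
    delimiter = detect_delimiter(line)
    if delimiter is None:
        return [line]
    return line.split(delimiter)


def main():
    for number, line in enumerate(sys.stdin, start=1):
        if number == 1:
            # repr() makes invisible characters visible: '\x01' versus '^A'.
            sys.stderr.write("first line: %r\n" % line)
        # Assumption: downstream steps accept tab-separated fields.
        print("\t".join(split_record(line)))


if __name__ == "__main__":
    main()
```

Running it the same way as in the question (cat input.txt | python mapper.py) and comparing the stderr line locally with what the cluster job logs should show whether the two environments are seeing the same delimiter.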
