hadoop streaming - how to inner join two diff files using python


Question

I want to find out the top website page visits for the user age group between 18 and 25. I have two files: one contains username and age, and the other contains username and website name. Examples:

users.txt

John, 22

pages.txt

John, google.com

I have written the following in Python, and it works as I expected outside of Hadoop.

import os
os.chdir("/home/pythonlab")

#Top sites visited by users aged 18 to 25

#read the users file
lines = open("users.txt")
users = [ line.split(",") for line in lines]      #user name, age (eg - john, 22)
userlist = [ (u[0],int(u[1])) for u in users]     #split the user name and age

#read the page visit file
pages = open("pages.txt")
page = [p.split(",") for p in pages]              #user name, website visited (eg - john,google.com)
pagelist  = [ (p[0],p[1]) for p in page]

#map user and page visits & filter age group between 18 and 25
usrpage = [[p[1],u[0]] for u in userlist for p in pagelist  if (u[0] == p[0] and u[1]>=18 and u[1]<=25) ]

for z in usrpage:
    print(z[0].strip('\r\n')+",1")     #print website name, 1

Sample output:

yahoo.com,1
google.com,1

Now I want to solve this using Hadoop streaming.

My question is: how do I process these two named files (users.txt, pages.txt) in my mapper? We normally pass only an input directory to Hadoop streaming.
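
For illustration, here is a minimal sketch of one common pattern, a reduce-side join with Hadoop streaming: both files go into the job's input directory, the mapper tags every record with the file it came from, and the reducer joins the tagged records, which arrive grouped by username. It assumes Hadoop streaming exports the current input file name as the mapreduce_map_input_file (or, on older releases, map_input_file) environment variable; the script names mapper.py/reducer.py and the U/P tags are just placeholders, not the accepted answer below.

mapper.py:

#!/usr/bin/env python
# mapper.py - tag each input record with the file it came from.
# Assumes Hadoop streaming exports the current split's file name as
# mapreduce_map_input_file (newer releases) or map_input_file (older ones).
import os
import sys

input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))

for line in sys.stdin:
    fields = [f.strip() for f in line.split(',')]
    if len(fields) != 2:
        continue                                        # skip malformed lines
    if 'users' in input_file:
        print('%s\tU\t%s' % (fields[0], fields[1]))     # username, tag, age
    else:
        print('%s\tP\t%s' % (fields[0], fields[1]))     # username, tag, website

reducer.py:

#!/usr/bin/env python
# reducer.py - records arrive sorted by username, so buffer one user's
# age and page visits, then emit "website,1" for the 18-25 age group.
import sys

def flush(age, sites):
    if age is not None and 18 <= age <= 25:
        for site in sites:
            print('%s,1' % site)

current_user, age, sites = None, None, []

for line in sys.stdin:
    user, tag, value = line.rstrip('\r\n').split('\t')
    if user != current_user:
        flush(age, sites)
        current_user, age, sites = user, None, []
    if tag == 'U':
        age = int(value)
    else:
        sites.append(value)

flush(age, sites)

A job along these lines would be launched with the streaming jar, passing the directory containing both files via -input and the two scripts via -mapper, -reducer and -file; the exact jar path depends on the installation.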

Solution

You would need to look into using Hive. It would allow you to join multiple source files into one, just as you need: you can join two data sources almost as you would in SQL, and then push the result into your mapper and reducer.
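
As a rough sketch of that approach, assuming the two text files have already been loaded into Hive tables, here called users(username, age) and page_visits(username, website), which are illustrative names only: the inner join is expressed in HiveQL, and its result is written to a directory that a plain streaming job can then read as its input.

#!/usr/bin/env python
# hive_join.py - a sketch of the Hive-based join suggested above.
# Assumptions: users.txt and pages.txt have already been loaded into Hive
# tables named users(username, age) and page_visits(username, website);
# these names and the output path are illustrative only.
import subprocess

query = """
INSERT OVERWRITE DIRECTORY '/tmp/joined_pages'
SELECT p.website
FROM users u
JOIN page_visits p ON (u.username = p.username)
WHERE u.age BETWEEN 18 AND 25;
"""

# 'hive -e' runs a query passed on the command line; the joined rows land
# in /tmp/joined_pages and can then be fed to a simple streaming job that
# emits "website,1" and counts the visits.
subprocess.check_call(['hive', '-e', query])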
