Hive count tuples?


Problem description

I am pretty new with HiveQL and I am kinda stuck :S

I have a table with the following schema: one column named res, and a partition column named field that has three partitions.

create table results( res string) PARTITIONED BY (field STRING); 

I then imported data in this table

insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

I am trying to count the unique tuples across the three partitions.

For example, this command counts the number of occurrences of each title in the dataset.

 SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

and it outputs something like

 title                                count        
 Hit me Baby More time                   9

How can I extend this to tuples ( title, album, artist)? If I want to have an output like :

title                            album                 artist       count

Baby one more time    hit me baby one more time    britney spears    9

My whole code:

CREATE EXTERNAL TABLE IF NOT EXISTS hivetesttable  (
xmldata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/sdasd/hivetestdata/';

create view xmlout(line) as  select * from hivetesttable;  

CREATE VIEW TITLES(line) as select xpath(line,'/MC/SC/*/@ttl')  from xmlout;
CREATE VIEW ARTIST(line) as select  xpath(line,'/MC/SC/*/@art')  from xmlout;
CREATE VIEW ALBUMS(line) as select xpath(line,'/MC/SC/*/@alb') from xmlout;



create table results( res string) PARTITIONED BY (field STRING); 
insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

A row of the XML data looks like this:

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="Whatcha Say"/><S uid="38" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="In My Head"/><S uid="39" gen="" yr="2011" art="Alexandra Stan" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Alexandra Stan/Mr_ Saxobeat - Single" alb="Mr. Saxobeat - Single" ttl="Mr. Saxobeat (Extended Version)"/><S uid="40" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Wie ein Löwe"/><S uid="41" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Verreckt"/><S uid="42" gen="" yr="2011" art="Lucenzo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Lucenzo/Danza Kuduro (feat_ Don Omar) [From _Fast &amp; Furious 5_] - Single" alb="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;] - Single" ttl="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;]"/><S uid="121" gen="" yr="701" art="Michael Jackson" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/external_sd/Music/Michael Jackson/Bad [Bonus Tracks]" alb="Bad [Bonus Tracks]" ttl="Voice-Over Intro/Quincy Jones Interview #1 [*]"/></SC><PC/></MC>

Solution

Based on the information you've provided, the output you want is not possible. Right now you have a table that looks like this:

res                           field
---                           -----
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
the distance                  title
the distance                  title
open book                     title
daria                         title
fashion nuggets               album
fashion nuggets               album
fashion nuggets               album
fashion nuggets               album
cake                          artist
cake                          artist
cake                          artist
cake                          artist

Because you partitioned it, Hive happens to store it in three different folders but this doesn't affect the results of the query. I added some extra tracks and I imagine with the extra tracks you would want the output to be (correct me if I'm wrong):

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
the distance           fashion nuggets             cake                2
open book              fashion nuggets             cake                1
daria                  fashion nuggets             cake                1

But there is no way to tell that "open book" has anything to do with "fashion nuggets" or "cake", just like there is no way to tell that "baby one more time" is associated with "britney spears". You could try to match on the counts but then you would end up with something like this

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
null                   fashion nuggets             cake                4
the distance           null                        null                2
open book,daria        null                        null                1

I think you wanted a table with columns like this

title                  album                         artist
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
the distance           fashion nuggets               cake
the distance           fashion nuggets               cake
open book              fashion nuggets               cake
daria                  fashion nuggets               cake

but possibly still partitioned on artist and/or album. With or without the partitioning, you can write the query as if the table were not partitioned (partitioning affects only performance, not results, as long as the data isn't corrupted). It will affect how you create and populate the table, however. Let me know if this is what you wanted and I'll edit this answer to answer that question instead.


THE EDIT AS PROMISED:

Okay, creating the table without any partitions is straightforward:

CREATE TABLE results (title string, album string, artist string)

Creating the table with partitions is almost as straightforward; you just need to decide first what to partition on. If you partition on artist, you can run queries on a single artist or a set of artists without having to process information for other artists. If you partition by artist and album, you can do the same thing with albums as well. This does come at the cost of breaking a large file into smaller files, and generally MapReduce (and therefore Hive) works better with large files. I wouldn't worry about partitioning at all unless you're dealing with at least tens of GBs and feel like you have a handle on how partitioning works and on HiveQL in general. But for completeness, partitioning by artist:

CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);

and partitioned by artist then by album. Partitioning by (artist string, album string) vs (album string, artist string) won't change your results, but you should put the logical top of the hierarchy first.

CREATE TABLE results (title string) PARTITIONED BY (artist string, album string);

Populating this table won't be easy if the only information we have access to comes from the tables titles, artists, and albums, since we have a huge list of titles, artists, and albums but no way to tell, for example, which title goes with which album. I hope you have some data where these relationships are still intact, or that your original data set is still intact. Without knowing the form of this hypothetical data, I can't provide an answer for how to populate your tables. But if you have partitioned tables, this answer might be useful to you if you don't want to manually specify every artist and album (since every artist gets their own partition, and within that partition every album gets its own partition).
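As a rough illustration of not specifying every partition by hand, here is a minimal sketch using Hive's dynamic partition inserts. It assumes a hypothetical unpartitioned staging table called results_flat that already holds correct (title, album, artist) rows, and that the target is the version of results partitioned by artist alone. Whether the two SET lines are needed depends on your cluster's defaults.

```sql
-- Assumption: results_flat(title STRING, album STRING, artist STRING) is a
-- hypothetical unpartitioned staging table with the relationships intact.
-- Target: results(title STRING, album STRING) PARTITIONED BY (artist STRING).

-- Allow Hive to create partitions from the data instead of naming each one.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The dynamic partition column (artist) must come last in the SELECT list.
INSERT OVERWRITE TABLE results PARTITION (artist)
SELECT title, album, artist
FROM results_flat;
```

With one partition per artist this can create a very large number of small directories, which is the trade-off described above.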

EDIT: The asker has XML files which have the title, album, artist relationships intact. More information on this in the comments.
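Since the XML keeps each track's attributes together in one <S .../> element, one possible way to fill the unpartitioned results (title, album, artist) table is to split each XML row into per-track fragments and pull the three attributes out of the same fragment. This is only a sketch under assumptions drawn from the sample row: every track is an <S .../> element, attribute values are double-quoted, and each fragment carries a ttl attribute. The alias names t and s are made up for the example.

```sql
-- Sketch: one output row per <S .../> element, so title, album and artist
-- always come from the same track.
INSERT OVERWRITE TABLE results
SELECT
  regexp_extract(s, ' ttl="([^"]*)"', 1) AS title,
  regexp_extract(s, ' alb="([^"]*)"', 1) AS album,
  regexp_extract(s, ' art="([^"]*)"', 1) AS artist
FROM hivetesttable
-- Split the raw XML on the start of each <S element and emit one row per piece.
LATERAL VIEW explode(split(xmldata, '<S ')) t AS s
-- Drop the leading piece (XML header and <MC><SC>), which has no ttl attribute.
WHERE s LIKE '% ttl="%';
```

Note that values extracted this way stay XML-escaped (for example &amp; and &quot;), unlike the output of xpath, so they may need unescaping if that matters for your counts.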

Now the meat of the question is counting unique tuples. This will be the same regardless of how the data was partitioned, if at all. We do this using the GROUP BY clause. When you specify one column (or partition, which can be thought of as a column with special properties), you break the data down into groups which have distinct values for that column. If you specify several columns, you break the data down into groups which have distinct values for their combination of columns. This is what we take advantage of to count distinct tuples:

SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist

and here we are:

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
the distance           fashion nuggets             cake                2
open book              fashion nuggets             cake                1
daria                  fashion nuggets             cake                1
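If you also want the sorting from the query in the question, one possible variant aliases the count and orders by it. The alias counttotal is taken from the question; sorting in descending order is just a choice made for this example.

```sql
-- Count each distinct (title, album, artist) tuple, most frequent first.
SELECT title, album, artist, COUNT(*) AS counttotal
FROM results
GROUP BY title, album, artist
ORDER BY counttotal DESC;
```

Keep in mind that ORDER BY in Hive produces a total ordering through a single reducer, so on large result sets it can be the slow step.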
