大数据开发——Hive实战案例
•
大数据
文章目录
- 1. 创建表结构
-
- 1.1 视频表结构
- 1.2 用户表结构
- 2. 准备工作
-
- 2.1 创建临时表
- 2.2 创建最终使用表
- 2.3 对创建表进行解读
- 3. 业务分析
1. 创建表结构
1.1 视频表结构

1.2 用户表结构

2. 准备工作
2.1 创建临时表
- 由于使用的是orc方式进行存储,所以我们需要建立一个临时表,通过查询插入的方式将数据插入到最终表中。
创建临时视频表
create table gulivideo_ori( videoId string, uploader string, age int, category array, length int, views int, rate float, ratings int, comments int, relatedId array ) row format delimited fields terminated by "\t" collection items terminated by "&" stored as textfile;
创建临时用户表
create table gulivideo_user_ori( uploader string, videos int, friends int) row format delimited fields terminated by "\t" stored as textfile;
加载原数据到临时表
load data local inpath "/opt/module/data/video" into table gulivideo_ori; load data local inpath "/opt/module/user" into table gulivideo_user_ori;
2.2 创建最终使用表
创建视频表
create table gulivideo_orc( videoId string, uploader string, age int, category array, length int, views int, rate float, ratings int, comments int, relatedId array ) stored as orc tblproperties("orc.compress"="SNAPPY");
创建用户表
create table gulivideo_user_orc(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as orc
tblproperties("orc.compress"="SNAPPY");
2.3 对创建表进行解读
由于初始表和最终表他们两者中的结构都是一样的,所以对其中一种进行解读
-
对于用户表
- 三个字段都是基本数据类型,行格式字段分割以‘\t’结束,以orc的方式存储,并且设置压缩属性为snappy
-
视频表
- 视频表中视频的类别和视频的相关视频两个字段是数组的方式进行存储的 ,字段之间分隔符为‘\t’,数组中元素的分隔符为‘&’
3. 业务分析
需求Ⅰ:统计视频观看数 Top10
- 分析
- 第一步: 本需求比较简单,直接对视频的观看数进行排序,使用LIMIT关键字,限定前十条数据
代码:
select videoId from gulivideo_orc order by views limit 10;

需求Ⅱ:统计视频类别热度 Top10
- 分析
- 第一步:获取视频类别,使用炸裂函数explode,由于不需要表中的其他字段,所以可以不考虑使用侧偏移
- 第二部:按照视频的类型进行分组,并计算每一个类别的视频总数,按照总数获取前十的视频类别
步骤代码:
--步骤一使用下列任意一种都可以,第一种使用侧偏移,第二种直接炸裂 select videoId, category_name from gulivideo_orc lateral view explode(category) tmp_category as category_name;t1 select explode(category) categoru_name from gulivideo_orc;t1 --第二步代码: select category_name, count(*) ct from t1 group by category_name order by ct desc limit 10;
最终代码:
select category_name, count(*) ct from (select explode(category) category_name from gulivideo_orc)t1 group by category_name order by ct desc limit 10;

需求Ⅲ:统计出视频观看数最高的 20 个视频的所属类别以及每一个类别包含 Top20 视频的个数
- 分析
- 第一步:统计观看数最高的二十个视频
- 第二步:获取视频的类别
- 第三步:按照类别进行分组,获取每一组的视频数(即为每一个类别中包含Top20视频的个数)
select category from gulivideo_orc order by views desc limit 20;t1 select explode(category) category_name from t1;t2 select category_name, count(*) video_sum from t2 group by category_name;
最终代码:
select category_name, count(*) video_sum from (select explode(category) category_name from (select category from gulivideo_orc order by views desc limit 20)t1)t2 group by category_name;

需求Ⅳ: 统计视频观看数 Top50 所关联视频的所属类别排序
- 分析
- 第一步:获取观看数Top50的视频
- 第二步:获取视频的关联的视频Id
- 第三步:和原表做内连接,查询到该id的类别
- 第四步:将类别进行炸裂
- 第五步:按照类别进行分组,并且统计该分组下的视频总数,按照总数进行排序
分步代码:
select relatedId from gulivideo_orc order by views desc limit 50;t1 select explode(relatedId) relatedId_id from t1;t2 select category from t2 join gulivideo_orc t3 on t2.relatedId_id = t3.videoId;t4 select explode(category) category_name from t4;t5 select category_name, count(*) ct from t5 group by category_name order by ct desc;
代码:
select category_name, count(*) ct from (select explode(category) category_name from (select category from (select explode(relatedId) relatedId_id from (select relatedId from gulivideo_orc order by views desc limit 50)t1)t2 join gulivideo_orc t3 on t2.relatedId_id = t3.videoId)t4)t5 group by category_name order by ct desc;

需求Ⅴ:统计每个类别中的视频热度 Top10,以 Music 为例
- 分析
- 第一步:将视频的类别炸裂开
- 第二步:获取类别是Music的数据,并且按照观看数进行排序
分布代码:
select videoId, category_name from gulivideo_orc lateral view explode(category) tmp_category as category_name;t1 select videoId, views, category_name from t1 where category_name = "Music" order by views desc limit 10;
代码:
select videoId, views, category_name from (select videoId, views, category_name from gulivideo_orc lateral view explode(category) tmp_category as category_name)t1 where category_name = "Music" order by views desc limit 10;

需求Ⅵ:统计每个类别视频观看数 Top10
- 分析
- 由于要求组内进行排序,所以使用开窗函数over()
- 第一步:使用explode()函数,将类别分开
- 第二步:使用开窗函数,按照类别分区,并且使用观看数进行排序
- 第三步:获取前十数据
分步代码:
select videoId, views, category_name from gulivideo_orc lateral view explode(category) tmp_category as category_name;t1 select videoId, views, category_name rank() over(partition by t1.category_name order by t1.views desc) rk from t1;t2 select videoId, views, category_name, rk from t2 where t2.rk <= 10;
代码:
select videoId, views, category_name rk from ( select videoId, views, category_name, rank() over(partition by t1.category_name order by t1.views desc) rk from (select videoId, views, category_name from gulivideo_orc lateral view explode(category) tmp_category as category_name)t1)t2 where t2.rk <= 10;


需求Ⅶ:统计上传视频最多的用户 Top10 以及他们上传的视频观看次数在前 20 的视频
- 分析
- 第一步:统计上传视频最多的用户 Top10
- 第二步:用户表和视频表做内连接,获取Top10用户上传的所有视频Id
- 第三步:按照用户进行分区,并且获取每一个用户排名前20的视频(按照视频观看书排序)
分布代码:
select uploader from gulivideo_user_orc order by videos desc limit 10;t1 select videoId, views, t2.uploader from t1 join gulivideo_orc g on t1.uploader = g.uploader;t2 select videoId, views, uploader, rank()over(partition by uploader order by views desc) rk from t2;t3 select videoId, views, uploader from t3 where rk<= 20;
代码:
select videoId, views, uploader, rank()over(partition by uploader order by views desc) rk from (select videoId, views, t1.uploader from (select uploader from gulivideo_user_orc order by videos desc limit 10)t1 join gulivideo_orc g on t1.uploader = g.uploader) t2;
- 由于数据不全,所有没有查询到数据

本文来自网络,不代表协通编程立场,如若转载,请注明出处:https://net2asp.com/9bb2293ce5.html
