> 技术文档 > HIVE实战处理(二十四)留存用户数

HIVE实战处理(二十四)留存用户数

留存概念:次X日活跃留存,次X日新增留存,也就是看今天的新增或活跃用户在后续几天的留存情况

一、留存表的生成逻辑

因为用户活跃日期和留存的日期无法对齐所以搞了2级分区(dt,static_day)

1)首先获得计算日D、根据要出的次X日留存,推算出前面的DT ,整体从活跃表里根据这些日期生成临时活跃表tmp1
2)分别把计算DT和前X日的DT进行匹配,按相差的天数进行匹配,如果匹配一直分别得到对应的次X日留存标识。
3)需要使用1个新的字段存储留存指标的的日期,比如20250701号的留存keep1_num只能等20250702号过完才能计算,那对应也是7.1号算留存日期,是指在DT=20250702的留存时间。

所以根据dt往前推算的日期都是留存日期,不能写到dt这个字段里,因为除了留存指标外还要计算统计日的指标。
如果留存日期=统计日期的,出的当日活跃。留存日期< 统计日期的话,出的是次X日留存指标。

--活跃临时表create table tmp1 as select ,t1.uuid ,t1.dt as statis_day ,case when t1.dt=\'${DT}\' then \'Y\' else \'N\' endas keep_0d_active_flag,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-1), \'-\', \'\') then \'Y\' else \'N\' endas keep_1d_active_flag ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-2), \'-\', \'\') then \'Y\' else \'N\' endas keep_2d_active_flag ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-3), \'-\', \'\') then \'Y\' else \'N\' endas keep_3d_active_flag ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-4), \'-\', \'\') then \'Y\' else \'N\' endas keep_4d_active_flag ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-5), \'-\', \'\') then \'Y\' else \'N\' endas keep_5d_active_flag ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-6), \'-\', \'\') then \'Y\' else \'N\' endas keep_6d_active_flag ,case when t1.dt=regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-7), \'-\', \'\') then \'Y\' else \'N\' endas keep_7d_active_flag  from 活跃表 t1where t1.dt in ( ${DT},regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-1), \'-\', \'\'),regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-2), \'-\', \'\'),regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-3), \'-\', \'\'),regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-4), \'-\', \'\'),regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-5), \'-\', \'\'),regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-6), \'-\', \'\') ,regexp_replace(date_add(from_unixtime(to_unix_timestamp(\'${DT}\', \'yyyyMMdd\')),-7), \'-\', \'\'));--当日活跃以及留存指标insert overwrite table 留存表 partition(dt=\'${DT}\')select group_id ,statis_day ,channel ,version ,sum(case when keep_0d_active_flag=\'Y\' then 1 else 0 end) as av ,sum(case when keep_1d_active_flag=\'Y\' then 1 else 0 end) as keep_1d_av ,sum(case when keep_2d_active_flag=\'Y\' then 1 else 0 end) as keep_2d_av ,sum(case when keep_3d_active_flag=\'Y\' then 1 else 0 end) as keep_3d_av ,sum(case when keep_4d_active_flag=\'Y\' then 1 else 0 end) as keep_4d_av ,sum(case when keep_5d_active_flag=\'Y\' then 1 else 0 end) as keep_5d_av ,sum(case when keep_6d_active_flag=\'Y\' then 1 else 0 end) as keep_6d_av ,sum(case when keep_7d_active_flag=\'Y\' then 1 else 0 end) as keep_7d_avfrom( select  cast(grouping__id as bigint)& 7 & 3 as group_id ,channel,uuid ,statis_day ,max(keep_1d_active_flag) as keep_1d_active_flag ,max(keep_2d_active_flag) as keep_2d_active_flag ,max(keep_3d_active_flag) as keep_3d_active_flag ,max(keep_4d_active_flag) as keep_4d_active_flag ,max(keep_5d_active_flag) as keep_5d_active_flag ,max(keep_6d_active_flag) as keep_6d_active_flag ,max(keep_7d_active_flag) as keep_7d_active_flag from tmp1 group by  ,channel --1,version--2,uuid -- 4,statis_day --8  grouping sets( (channel,uuid,statis_day) ,(version,uuid,statis_day),(uuid,statis_day))) tagroup by group_id ,statis_day ,channel ,version

二、对于留存的表的查询处理

1)非留存指标的话,直接使用where dt between ‘20250701’ and ‘20250707’
2)对于留存指标要取static_day,这个static_day是代表留存日期在dt的不同留存指标。

select
dt
,sum(active_num)
,sum(keep1_num)
,sum(keep2_num)
,sum(keep3_num)
,sum(keep4_num)
from
(select
dt,
,active_num
,0 as keep1_num
,0 as keep2_num
,0 as keep3_num
,0 as keep4_num
from 留存表 where dt between ‘20250701’ and ‘20250704’
union all

select
static_day dt,
,0 as active_num
,keep1_num
,keep2_num
,keep3_num
,keep4_num
from 留存表 where static_day between ‘20250701’ and ‘20250704’
) t group by dt