본문 바로가기
Language/SQL

『SQL로 시작하는 데이터 분석』 Chapter 3. 시계열 분석

by Night Fury 2024. 11. 8.
반응형
캐시 타니무라, 『SQL로 시작하는 데이터 분석』, 박상근, 한빛미디어-OREILLY(2022), p95-166.

 

표준 시간대(standard time zone)

  • GMT(Greenwich Mean Time): 그린위치 평균시
  • UTC(Coordinated Universal Time): 세계 표준시
    • 보통 UTC를 기준으로 사용

 

타임스탬프 시간이 제대로 동기화되었는지 주의

  • e.g. 특정 액션 로그 데이터가 스마트폰 시간을 기록되는지 혹은 서버 DB에 도착하는 시간으로 기록되는지 확인

 

계절성이란?

 

데이터 인덱싱: 시계열 베이스 구간을 기준으로 데이터의 변화량을 이해하는 방법

  • 많은 인사이트를 얻을 수 있는 분석 기법
  • e.g. first_value(sales) over(order by year asc)
SELECT sales_year, sales
,first_value(sales) over (order by sales_year) as index_sales
FROM
(
    SELECT date_part('year',sales_month) as sales_year
    ,sum(sales) as sales
    FROM retail_sales
    WHERE kind_of_business = 'Women''s clothing stores'
    GROUP BY 1
) a

 

 

시간 윈도우 롤링

  • 중요한 요소:
    • 윈도우 사이즈 (e.g. DAU, WAU, MAU)
    • 집계 함수
    • 데이터 분할 or 그룹화
  • 방법: self-join or window function (일부 DB에서 지원하지 않음)
    • window function: Frame 절 윈도우 함수 옵션 사용
    • Frame 절: {RANGE | ROWS | GROUPS} BETWEEN frame_start AND frame_end
      • frame_start, frame end 값 종류
        • UNBOUNDED PRECEDING: 해당 윈도우에서 현재 행을 기준으로 위에 있는 모든 레코드
        • offset PRECEDING: 현재 이전의 가져올 레코드 (offset은 숫자로 표현)
        • CURRENT ROW: 현재 행
        • offset FOLLOWING: 현재 이후의 가져올 레코드 (offset은 숫자로 표현)
        • UNBOUNDED FOLLOWING: 해당 윈도우에서 현재 행을 기준으로 아래에 있는 모든 레코드

SELECT a.sales_month
,a.sales
,avg(b.sales) as moving_avg
,count(b.sales) as records_count
FROM retail_sales a
JOIN retail_sales b on a.kind_of_business = b.kind_of_business 
 and b.sales_month between a.sales_month - interval '11 months' 
 and a.sales_month
 and b.kind_of_business = 'Women''s clothing stores'
WHERE a.kind_of_business = 'Women''s clothing stores'
and a.sales_month >= '1993-01-01'
GROUP BY 1,2
ORDER BY 1
SELECT sales_month
,avg(sales) over (order by sales_month rows between 11 preceding and current row) as moving_avg
,count(sales) over (order by sales_month rows between 11 preceding and current row) as records_count
FROM retail_sales
WHERE kind_of_business = 'Women''s clothing stores'
;

 

 

누적값 (e. g. YTD, QTD, MTD)

SELECT sales_month
,sales
,sum(sales) over (partition by date_part('year',sales_month) order by sales_month) as sales_ytd
FROM retail_sales
WHERE kind_of_business = 'Women''s clothing stores'
;
SELECT a.sales_month, a.sales
,sum(b.sales) as sales_ytd
FROM retail_sales a
JOIN retail_sales b on date_part('year',a.sales_month) = date_part('year',b.sales_month)
 and b.sales_month <= a.sales_month
 and b.kind_of_business = 'Women''s clothing stores'
WHERE a.kind_of_business = 'Women''s clothing stores'
GROUP BY 1,2
;

 

계절성 분석

  • 구간 비교 (period-over-period)
    • YoY(전년 대비 증감률), MoM(전월 대비 증감률), DoD(전일 대비 증감률)
      • 계절성의 변화를 이해하는 데는 유용하지만 데이터셋의 계절성을 분석하기에는 적절하지 않음
      • window function: LAG(offset만큼 이전 행의 값을 반환), LEAD(offset만큼 다음 행의 값을 반환)
SELECT sales_year, yearly_sales
,lag(yearly_sales) over (order by sales_year) as prev_year_sales
,(yearly_sales / lag(yearly_sales) over (order by sales_year) -1) * 100 as pct_growth_from_previous
FROM
(
    SELECT date_part('year',sales_month) as sales_year
    ,sum(sales) as yearly_sales
    FROM retail_sales
    WHERE kind_of_business = 'Book stores'
    GROUP BY 1
) a
;

 

  •  작년 vs 올해
    • date_part 함수 활용
SELECT sales_month, sales
,sales - lag(sales) over (partition by date_part('month',sales_month) order by sales_month) as absolute_diff
,(sales / lag(sales) over (partition by date_part('month',sales_month) order by sales_month) - 1) * 100 as pct_diff
FROM retail_sales
WHERE kind_of_business = 'Book stores'
;
SELECT date_part('month',sales_month) as month_number
,to_char(sales_month,'Month') as month_name
,max(case when date_part('year',sales_month) = 1992 then sales end) as sales_1992
,max(case when date_part('year',sales_month) = 1993 then sales end) as sales_1993
,max(case when date_part('year',sales_month) = 1994 then sales end) as sales_1994
FROM retail_sales
WHERE kind_of_business = 'Book stores' and sales_month between '1992-01-01' and '1994-12-01'
GROUP BY 1,2
;

 

  • 다중 구간 비교
SELECT sales_month, sales
,sales / ((prev_sales_1 + prev_sales_2 + prev_sales_3) / 3) * 100
as pct_of_3_prev
FROM
(
    SELECT sales_month, sales
    ,lag(sales,1) over (partition by date_part('month',sales_month)
                        order by sales_month
                        ) as prev_sales_1
    ,lag(sales,2) over (partition by date_part('month',sales_month)
                        order by sales_month
                        ) as prev_sales_2
    ,lag(sales,3) over (partition by date_part('month',sales_month)
                        order by sales_month
                        ) as prev_sales_3
    FROM retail_sales
    WHERE kind_of_business = 'Book stores'
) a
;
SELECT sales_month, sales
,sales / avg(sales) over (partition by date_part('month',sales_month)
                          order by sales_month
                          rows between 3 preceding and 1 preceding
                          ) * 100 as pct_of_prev_3
FROM retail_sales
WHERE kind_of_business = 'Book stores'
;
반응형

댓글