E-commerce ETL Data Warehouse Project

Building a data pipeline that transforms raw e-commerce data into business-ready data for business insights

Tools: Python, SQL, Power BI Desktop, DAX, PostgreSQL 18, pgAdmin 4, VS Code, Windows Command Prompt/PowerShell

Skills: star schema design, chunked data loading, foreign key validation, referential integrity enforcement, comprehensive data quality checks, SQL aggregate functions, error handling and recovery, performance monitoring, business logic validation

These days, businesses need to analyze very large amounts of data. On top of that, transforming raw, unorganized data into processed data that fits their needs takes considerable time and effort.

In this project, I built a data pipeline that transforms raw data into business-ready data that can then be analyzed for insights. The processed data is displayed on a Power BI dashboard, from which executives can easily gather insights through charts and scorecards.

Source of the data

The data for this project comes from a Brazilian e-commerce public dataset published by Olist; you can find the dataset here. The dataset contains 8 CSV files, of which we will use 6:

  • olist_customers_dataset

  • olist_order_items_dataset

  • olist_order_payments_dataset

  • olist_order_reviews_dataset

  • olist_orders_dataset

  • olist_products_dataset

Defining business questions

As we aim to build a Power BI dashboard that gives business insights to executives, we first need to define the business questions this project should answer. To build a meaningful dashboard, we define our business questions as follows:

  1. "How can we improve customer retention and identify high-value customer segments?"

  2. "What are our key revenue drivers and how can we optimize operational performance?"

Project setup and planning

PostgreSQL 18 was used for this project. Before laying the groundwork for the pipeline, that is, setting up the extract layer, a database was created using pgAdmin 4 to store the processed data once the pipeline is finished and the data is loaded.

pgAdmin 4 interface

Apart from PostgreSQL 18, VS Code will be used for writing and running the data pipeline code.

Data Extraction

Now we will start laying down the foundation layer of this data pipeline - the extraction layer.

Apart from the dataset itself, we need a Python script to act as a logger, plus six more Python scripts to extract the data from the six chosen CSV files.

Scripts created

  • logger.py

  • Six extraction scripts, one per CSV file
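The extraction scripts themselves are not shown above, but the pattern is straightforward. Here is a hedged sketch of what one such script might look like; the file name, logger setup, and chunk size are illustrative assumptions, not the project's actual code:

```python
# Sketch of one extraction script: chunked CSV read with progress logging.
# Paths, chunk size, and logger configuration are assumptions for illustration.
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")


def extract_csv(path: Path, chunksize: int = 50_000) -> pd.DataFrame:
    """Read a CSV in chunks, logging progress, and return the full frame."""
    chunks = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        log.info("chunk %d: %d rows", i, len(chunk))
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)
    log.info("%s: %d rows | %d cols", path.name, len(df), df.shape[1])
    return df


# Tiny demo file so the sketch runs end to end without the real dataset
sample = Path("olist_customers_sample.csv")
sample.write_text("customer_id,customer_city\nc1,sao paulo\nc2,rio de janeiro\n")
df = extract_csv(sample, chunksize=1)
```

Reading in chunks keeps memory use bounded even on the larger files such as olist_order_items_dataset.csv.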

Data extracted

  • olist_customers_dataset.csv 99,441 rows | 5 cols

  • olist_orders_dataset.csv 99,441 rows | 8 cols

  • olist_order_items_dataset.csv 112,650 rows | 7 cols

  • olist_order_payments_dataset.csv 103,886 rows | 5 cols

  • olist_order_reviews_dataset.csv 99,224 rows | 7 cols

  • olist_products_dataset.csv 32,951 rows | 9 cols

Achievements

  • Checked file sizes and record counts

  • Identified null values and data types
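The checks above (file sizes, record counts, nulls, data types) can be sketched as a small profiling helper. This is an illustrative sketch, not the project's actual extraction code:

```python
# Profiling sketch: file size, row count, null counts, and dtypes per CSV.
from pathlib import Path

import pandas as pd


def profile(path: Path) -> dict:
    """Summarize one CSV file for the extraction report."""
    df = pd.read_csv(path)
    return {
        "file_kb": round(path.stat().st_size / 1024, 1),
        "rows": len(df),
        "nulls": df.isnull().sum().to_dict(),    # null count per column
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    }


# Demo file so the sketch runs without the real dataset
p = Path("demo.csv")
p.write_text("order_id,price\no1,10.5\no2,\n")
report = profile(p)
```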

Data Transformation

Next, we build the transformation layer, which transforms the extracted raw data into data that suits our business needs.

Star schema

Star schema for this project

To suit our business needs, we will transform our raw data into two fact tables and four dimension tables. Apart from "fact_orders", we will also create a "fact_cohort_retention" table to identify customers who placed more than one order.

Data transformed

  • dim_date.csv 791 rows | 16 columns

  • dim_products.csv 32,951 rows | 13 columns

  • dim_customers.csv 99,441 rows | 18 columns

  • dim_payment_type.csv 5 rows | 4 columns

  • fact_orders.csv 99,992 rows | 25 columns

  • fact_cohort_retention.csv 25 rows | 7 columns

Dimension table creation

dim_date - Date Dimension

  • Generated complete date range (2016-09-01 to 2018-10-31)

  • Attributes: date_key, full_date, year, quarter, month, month_name, week, day_of_week, day_name, is_weekend, is_holiday, fiscal_year, fiscal_quarter, created_at, date_str

  • Surrogate key format: YYYYMMDD (e.g., 20170115)
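The date dimension above can be generated entirely in pandas. The sketch below shows the core of that idea with a subset of the listed attributes; it is an illustration, not the project's actual script:

```python
# Generate dim_date for the full project date range with a YYYYMMDD surrogate key.
import pandas as pd

dates = pd.date_range("2016-09-01", "2018-10-31", freq="D")
dim_date = pd.DataFrame({
    "date_key": dates.strftime("%Y%m%d").astype(int),  # surrogate key, e.g. 20170115
    "full_date": dates,
    "year": dates.year,
    "quarter": dates.quarter,
    "month": dates.month,
    "month_name": dates.strftime("%B"),
    "day_of_week": dates.dayofweek,
    "day_name": dates.strftime("%A"),
    "is_weekend": dates.dayofweek >= 5,  # Saturday (5) and Sunday (6)
})
```

The range 2016-09-01 through 2018-10-31 yields exactly the 791 rows reported above.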

dim_products - Product Dimension

  • Extracted unique products from raw data

  • Attributes: product_key, product_id, product_category_segment

  • Surrogate key: Sequential integer starting at 1

  • Handled missing categories
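Deduplication, missing-category handling, and sequential surrogate keys can be sketched as follows; the sample rows and the "unknown" placeholder are assumptions for illustration:

```python
# dim_products sketch: dedupe by product_id, fill missing categories,
# and assign a sequential surrogate key starting at 1.
import pandas as pd

raw = pd.DataFrame({
    "product_id": ["p1", "p2", "p2", "p3"],
    "product_category_name": ["moveis_decoracao", None, None, "esporte_lazer"],
})

dim_products = (
    raw.drop_duplicates("product_id")
       .assign(product_category_name=lambda d: d["product_category_name"].fillna("unknown"))
       .reset_index(drop=True)
)
dim_products.insert(0, "product_key", range(1, len(dim_products) + 1))  # surrogate key
```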

dim_customers - Customer Dimension

  • Extracted unique customers with location data

  • Calculated lifetime value (CLV) per customer

  • Created customer segmentation (High/Medium/Low based on CLV)

  • Mapped states to regions (North, Northeast, Southeast, South, Central-West)

  • Attributes: customer_key, customer_id, customer_unique_id, customer_city, customer_state, customer_region, customer_segment, first_order_date, last_order_date, total_orders, lifetime_value

  • Surrogate key: Sequential integer (customer_key)
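The CLV calculation and percentile-based segmentation can be sketched like this. The 80th/50th percentile thresholds and segment labels are assumptions chosen for illustration; the project's actual cutoffs may differ:

```python
# CLV and High/Medium/Low segmentation sketch using percentile thresholds.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3"],
    "order_value": [500.0, 300.0, 120.0, 30.0],
})

# Lifetime value = total spend per customer; total_orders = order count
clv = orders.groupby("customer_id")["order_value"].agg(
    lifetime_value="sum", total_orders="count"
).reset_index()

# Assumed tiers: top 20% High, next 30% Medium, remainder Low
hi = clv["lifetime_value"].quantile(0.80)
mid = clv["lifetime_value"].quantile(0.50)
clv["customer_segment"] = clv["lifetime_value"].apply(
    lambda v: "High" if v >= hi else ("Medium" if v >= mid else "Low")
)
```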

dim_payment_type - Payment Type Dimension

  • Created lookup table for payment methods

  • Types: credit_card, boleto, voucher, debit_card, not_defined

  • Attributes: payment_type_key, payment_type

  • Surrogate key: Sequential integer (1-5)

Fact table creation

fact_orders - Order Transactions

  • Joined with the other four dimension tables

  • Metrics calculated:

    • order_total_value = sum of item prices + freight

    • delivery_days = days between purchase and delivery

    • delivery_delay_days = actual delivery - estimated delivery

    • is_late_delivery = TRUE if delay > 0

    • is_completed_order = TRUE if status = 'delivered'

  • Aggregated to order-level (one row per order)
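The delivery metrics listed above reduce to datetime arithmetic on the order columns. A minimal sketch, using the Olist column names and two made-up orders:

```python
# fact_orders metric sketch: delivery days, delay vs. estimate, and flags.
import pandas as pd

orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_status": ["delivered", "shipped"],
    "order_purchase_timestamp": pd.to_datetime(["2017-01-01", "2017-02-01"]),
    "order_delivered_customer_date": pd.to_datetime(["2017-01-12", pd.NaT]),
    "order_estimated_delivery_date": pd.to_datetime(["2017-01-10", "2017-02-15"]),
})

fact = orders.assign(
    # Days between purchase and actual delivery (NaN while undelivered)
    delivery_days=lambda d: (
        d["order_delivered_customer_date"] - d["order_purchase_timestamp"]
    ).dt.days,
    # Positive means delivered after the estimated date
    delivery_delay_days=lambda d: (
        d["order_delivered_customer_date"] - d["order_estimated_delivery_date"]
    ).dt.days,
)
fact["is_late_delivery"] = fact["delivery_delay_days"] > 0      # NaN delay -> False
fact["is_completed_order"] = fact["order_status"] == "delivered"
```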

fact_cohort_retention - Customer Retention

  • Grouped customers by first purchase month (cohort)

  • Calculated months since first purchase (0-12 months)

  • Computed retention rate: (active customers / cohort size) * 100

  • Attributes: cohort_month, months_since_first_purchase, retention_rate, retained_customers
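The cohort logic (first purchase month, months since first purchase, retention rate) can be sketched with a groupby pipeline. The sample orders below are made up for illustration:

```python
# Cohort retention sketch: group by first-purchase month, compute
# retention_rate = active customers / cohort size * 100.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3"],
    "order_month": pd.PeriodIndex(["2017-01", "2017-03", "2017-01", "2017-02"], freq="M"),
})

# Each customer's cohort is the month of their first order
first = orders.groupby("customer_id")["order_month"].min().rename("cohort_month")
orders = orders.join(first, on="customer_id")
orders["months_since_first_purchase"] = (
    orders["order_month"] - orders["cohort_month"]
).apply(lambda offset: offset.n)

# Cohort size = unique customers whose first order fell in that month
orders["cohort_size"] = orders.groupby("cohort_month")["customer_id"].transform("nunique")

retention = (
    orders.groupby(["cohort_month", "months_since_first_purchase"])
          .agg(retained_customers=("customer_id", "nunique"),
               cohort_size=("cohort_size", "first"))
          .reset_index()
)
retention["retention_rate"] = retention["retained_customers"] / retention["cohort_size"] * 100
```

Here the 2017-01 cohort has two customers, one of whom (c1) returns two months later, giving a 50% retention rate at month 2.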

transform_all.py

  • Master Transform Orchestrator

  • Runs all transformations in correct dependency order

  • Demonstrates production-level ETL orchestration
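The orchestration idea reduces to running each transform in dependency order and stopping when one fails, since downstream tables depend on upstream ones. A minimal sketch; the step functions here are stand-in lambdas, not the project's real transforms:

```python
# Orchestration sketch: run transforms in dependency order, halt on first failure.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("transform_all")


def run_pipeline(steps):
    """Run (name, fn) steps in order; stop at the first failure."""
    results = []
    for name, fn in steps:
        log.info("running %s", name)
        try:
            fn()
            results.append((name, "ok"))
        except Exception as exc:
            log.error("%s failed: %s", name, exc)
            results.append((name, "failed"))
            break  # downstream steps depend on this one
    return results


steps = [
    ("dim_date", lambda: None),
    ("dim_products", lambda: None),
    ("dim_customers", lambda: None),
    ("dim_payment_type", lambda: None),
    ("fact_orders", lambda: None),             # joins all four dimensions
    ("fact_cohort_retention", lambda: None),   # derived from order history
]
status = run_pipeline(steps)
```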

Business logic - Customer segmentation

Business logic - Cohort Retention

Skills Applied

  • Dimensional modeling (star schema design)

  • Data transformation and cleansing

  • Surrogate key generation

  • Business logic implementation

  • Cohort analysis methodology

  • Pandas advanced operations (groupby, merge, pivot)

  • Date/time manipulation

  • Statistical calculations (percentiles, aggregations)

Loading the data

After transforming the data, we load it into our PostgreSQL database.

create_schema.py

load_dimensions.py

  • Load transformed dimensions into database

load_facts.py

  • Load the two fact tables into the database
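The chunked-loading pattern used by these scripts can be sketched as below. The real scripts target PostgreSQL; an in-memory SQLite connection stands in here so the sketch runs without a database server, and the table contents are illustrative:

```python
# Chunked load sketch: insert a DataFrame into a warehouse table in batches.
# sqlite3 stands in for PostgreSQL so the example is self-contained.
import sqlite3

import pandas as pd

dim = pd.DataFrame({
    "payment_type_key": [1, 2, 3],
    "payment_type": ["credit_card", "boleto", "voucher"],
})

conn = sqlite3.connect(":memory:")

chunksize = 2  # small on purpose; real loads would use thousands of rows per batch
for start in range(0, len(dim), chunksize):
    dim.iloc[start:start + chunksize].to_sql(
        "dim_payment_type", conn, if_exists="append", index=False
    )

loaded = conn.execute("SELECT COUNT(*) FROM dim_payment_type").fetchone()[0]
```

Batching keeps memory stable and makes partial-failure recovery easier: a failed batch can be retried without reloading the whole table.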

check_data_quality.py

  • Automated comprehensive quality checks

  • Checked referential integrity

  • Validated null values

  • Tested business logic
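A referential-integrity check boils down to an anti-join: count fact rows whose foreign key has no matching dimension row. A sketch of that query, again using in-memory SQLite in place of PostgreSQL, with made-up rows:

```python
# Referential integrity sketch: find fact rows with orphaned foreign keys.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER PRIMARY KEY);
CREATE TABLE fact_orders (order_id TEXT, product_key INTEGER);
INSERT INTO dim_products VALUES (1), (2);
INSERT INTO fact_orders VALUES ('o1', 1), ('o2', 99);  -- 99 has no dimension row
""")

# LEFT JOIN + IS NULL isolates fact rows that reference a missing product_key
orphans = conn.execute("""
    SELECT COUNT(*) FROM fact_orders f
    LEFT JOIN dim_products d ON f.product_key = d.product_key
    WHERE d.product_key IS NULL
""").fetchone()[0]
```

A quality check passes when this count is zero for every fact-to-dimension relationship.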

Skills applied

  • Chunked data loading

  • Foreign key validation

  • Referential integrity enforcement

  • Comprehensive data quality checks

  • SQL aggregate functions

  • Error handling and recovery

  • Performance monitoring

  • Business logic validation

Power BI dashboard creation

  • Configured star schema relationships in data model

  • Created DAX calculated measures

  • Built 4 interactive dashboard pages with visualizations

  • Implemented slicers and cross-filtering

  • Applied professional formatting and theme

  • Exported dashboard for portfolio

With the ETL process finished and the data loaded into the database, it is time to create the Power BI dashboard. This dashboard has four pages: the first is an executive-level summary, the second covers sales performance, the third covers customer analytics, and the fourth covers product analysis.

Page 1 - Executive Summary

Page 2 - Sales Performance

Page 3 - Customer Analytics

Page 4 - Product Analytics

DAX measures used for this dashboard

We have also created a measure table using DAX.

Created Measures Table

Revenue Measures

Order Measures

Customer Measures

Delivery Measures

Product Measures

Conclusion - Answering business questions

At the beginning of this project, we defined two business questions to be investigated using our Power BI dashboard:

  1. "How can we improve customer retention and identify high-value customer segments?"

  2. "What are our key revenue drivers and how can we optimize operational performance?"

First question

Here are the key figures related to answering the first business question:

  • Total unique customers in database: 99,441

  • Average lifetime value per customer: R$58.15K

  • Percentage of customers who made multiple purchases: none; every customer made only one purchase

  • First-time purchasers: 99,441

Customer Segmentation & Value Distribution:

  • VIP customers (99.2% of the base) generate nearly 100% of total revenue, with an average CLV of around R$58,000.

  • Top 10 customers alone have CLVs ranging from R$1M to R$4M

As VIP customers form the dominant majority of the customer base, this serves as strong evidence of the business's performance in retaining its customers.

Additionally, as indicated in the map visuals on Executive Summary page, most customers are from the southeast region of Brazil.

Second question

The e-commerce platform generated R$ 15.42M in total revenue over a 3-year period (2016-2018), with 96K orders from 99K unique customers. The average order value of R$ 159.33 and 97.02% order completion rate indicate healthy operational performance.

The platform experienced tremendous growth from 2016 to 2017, followed by stabilization in 2018. This pattern suggests market saturation and indicates a need for customer acquisition strategies.

The top 3 categories by sales are, in order: 1. home & furniture, 2. fashion & beauty, and 3. sports & leisure, which together account for 63.63% of total revenue. Digging deeper into these categories, we discovered that home & furniture earns the least per order on average (R$151.58) among the three, but sells the most units. Sports & leisure sells fewer units than home & furniture (in fact, the fewest of the three) at a similar average revenue (R$152.34). Fashion & beauty, although it does not sell as many units as home & furniture, sells at a much higher average revenue (R$172.54).

The business operates well in general, with a 97% order completion rate, an average delivery time of around 11 days, and a 93.2% on-time delivery rate, suggesting a near-optimized operation. Delivery time may need some attention, however, as the on-time delivery rate decreased slightly year over year, from 98.9% in 2016 to 94.4% in 2017 and 92.3% in 2018. This may have been caused by an expanded scope of operations or a change in delivery method.
