E-commerce ETL Data Warehouse Project
Building a data pipeline that transforms raw e-commerce data into business-ready data for business insights
Tools: Python, SQL, Power BI Desktop, DAX, PostgreSQL 16, pgAdmin 4, VS Code, Windows Command Prompt/PowerShell, star schema
Skills: chunked data loading, foreign key validation, referential integrity enforcement, comprehensive data quality checks, SQL aggregate functions, error handling and recovery, performance monitoring, business logic validation

These days, businesses need to analyze very large amounts of data, and transforming raw, unorganized data into processed data that actually suits their needs takes considerable time and effort.
In this project, I built a data pipeline that transforms raw data into business-ready data from which insights can be gathered. The processed data is displayed on a Power BI dashboard, where executives can easily gather insights from its charts and scorecards.
Source of the data
The data for this project come from a Brazilian e-commerce public dataset provided by Olist. You can find the dataset here. The dataset contains 8 CSV files, of which we will use 6:
olist_customers_dataset
olist_order_items_dataset
olist_order_payments_dataset
olist_order_reviews_dataset
olist_orders_dataset
olist_products_dataset
Defining business questions
As we aim to build a Power BI dashboard that gives business insights to executives, we first need to define the business questions this project should answer. To build a meaningful dashboard, we define our business questions as follows:
"How can we improve customer retention and identify high-value customer segments?"
"What are our key revenue drivers and how can we optimize operational performance?"
Project setup and planning
PostgreSQL 16 was used for this project. Before laying down the groundwork of the pipeline, starting with the extract layer, a database was created using pgAdmin 4 so that the processed data can be stored once the pipeline is built and loading is complete.
pgAdmin 4 interface
Apart from PostgreSQL, VS Code is used for writing and running the pipeline code.
Data Extraction
Now we will start laying down the foundation layer of this data pipeline - the extraction layer.
Apart from the dataset itself, we also need a Python script to act as a logger, plus six more Python scripts to extract the data from the six chosen CSV files.
Scripts created
logger.py
6 Python scripts for each csv file
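A minimal sketch of what one of these extraction scripts can look like. The function name and the exact checks are illustrative assumptions here; the project's logger.py and per-file scripts may differ in detail:

```python
# Illustrative extraction sketch: read a raw CSV, log its shape and
# null counts. The real project splits this across logger.py plus one
# script per CSV file.
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")


def extract_csv(path: str) -> pd.DataFrame:
    """Read one raw CSV and log row/column counts and null values."""
    df = pd.read_csv(path)
    log.info("%s: %d rows | %d cols", path, len(df), df.shape[1])
    nulls = df.isna().sum()
    for col, n in nulls[nulls > 0].items():
        log.info("  null values in %s: %d", col, n)
    return df
```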
Data extracted
olist_customers_dataset.csv 99,441 rows | 5 cols
olist_orders_dataset.csv 99,441 rows | 8 cols
olist_order_items_dataset.csv 112,650 rows | 7 cols
olist_order_payments_dataset.csv 103,886 rows | 5 cols
olist_order_reviews_dataset.csv 99,224 rows | 7 cols
olist_products_dataset.csv 32,951 rows | 9 cols
Achievements
Checked file sizes and record counts
Identified null values and data types
Data Transformation
Next, we build the transformation layer, which transforms the extracted raw data into data that suits our business needs.
Star schema

Star schema for this project
To suit our business needs, we transform the raw data into two fact tables and four dimension tables. Apart from fact_orders, we also create fact_cohort_retention for the purpose of identifying customers who placed more than one order.
Data transformed
dim_date.csv 791 rows | 16 columns
dim_products.csv 32,951 rows | 13 columns
dim_customers.csv 99,441 rows | 18 columns
dim_payment_type.csv 5 rows | 4 columns
fact_orders.csv 99,992 rows | 25 columns
fact_cohort_retention.csv 25 rows | 7 columns
Dimension table creation
dim_date - Date Dimension
Generated complete date range (2016-09-01 to 2018-10-31)
Attributes: date_key, full_date, year, quarter, month, month_name, week, day_of_week, day_name, is_weekend, is_holiday, fiscal_year, fiscal_quarter, created_at, date_str
Surrogate key format: YYYYMMDD (e.g., 20170115)
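As a hedged sketch, the date dimension above can be generated with pandas roughly as follows. The holiday and fiscal columns are omitted for brevity, and the function name is an assumption, not the project's actual code:

```python
# Illustrative dim_date builder: one row per calendar day, with the
# YYYYMMDD integer surrogate key described above.
import pandas as pd


def build_dim_date(start: str = "2016-09-01",
                   end: str = "2018-10-31") -> pd.DataFrame:
    dates = pd.date_range(start, end, freq="D")
    df = pd.DataFrame({"full_date": dates})
    # Surrogate key in YYYYMMDD form, e.g. 20170115
    df["date_key"] = df["full_date"].dt.strftime("%Y%m%d").astype(int)
    df["year"] = df["full_date"].dt.year
    df["quarter"] = df["full_date"].dt.quarter
    df["month"] = df["full_date"].dt.month
    df["month_name"] = df["full_date"].dt.month_name()
    df["week"] = df["full_date"].dt.isocalendar().week.astype(int)
    df["day_of_week"] = df["full_date"].dt.dayofweek  # Monday = 0
    df["day_name"] = df["full_date"].dt.day_name()
    df["is_weekend"] = df["day_of_week"] >= 5  # Saturday / Sunday
    return df
```

The 2016-09-01 to 2018-10-31 range yields the 791 rows reported for dim_date above.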
dim_products - Product Dimension
Extracted unique products from raw data
Attributes: product_key, product_id, product_category_segment
Surrogate key: Sequential integer starting at 1
Handled missing categories
dim_customers - Customer Dimension
Extracted unique customers with location data
Calculated lifetime value (CLV) per customer
Created customer segmentation (High/Medium/Low based on CLV)
Mapped states to regions (North, Northeast, Southeast, South, Central-West)
Attributes: customer_key, customer_id, customer_unique_id, customer_city, customer_state, customer_region, customer_segment, first_order_date, last_order_date, total_orders, lifetime_value
Surrogate key: Sequential integer (customer_key)
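The CLV aggregation and segmentation can be sketched like this. The percentile cut-offs and column names are assumptions for illustration; the project's actual thresholds may differ:

```python
# Hedged sketch of CLV-based customer segmentation: aggregate order
# totals per customer, then bucket into High/Medium/Low. The 80th/40th
# percentile cut-offs are assumed, not taken from the project.
import numpy as np
import pandas as pd


def segment_customers(orders: pd.DataFrame) -> pd.DataFrame:
    """orders: one row per order, with customer_unique_id and order_total_value."""
    clv = (orders.groupby("customer_unique_id")
                 .agg(lifetime_value=("order_total_value", "sum"),
                      total_orders=("order_total_value", "size"))
                 .reset_index())
    hi = clv["lifetime_value"].quantile(0.8)  # assumed cut-off
    lo = clv["lifetime_value"].quantile(0.4)  # assumed cut-off
    clv["customer_segment"] = np.select(
        [clv["lifetime_value"] >= hi, clv["lifetime_value"] >= lo],
        ["High", "Medium"], default="Low")
    clv["customer_key"] = range(1, len(clv) + 1)  # sequential surrogate key
    return clv
```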
dim_payment_type - Payment Type Dimension
Created lookup table for payment methods
Types: credit_card, boleto, voucher, debit_card, not_defined
Attributes: payment_type_key, payment_type
Surrogate key: Sequential integer (1-5)
Fact table creation
fact_orders - Order Transactions
Joined with the four dimension tables
Metrics calculated:
order_total_value = sum of item prices + freight
delivery_days = days between purchase and delivery
delivery_delay_days = actual delivery - estimated delivery
is_late_delivery = TRUE if delay > 0
is_completed_order = TRUE if status = 'delivered'
Aggregated to order-level (one row per order)
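The delivery and completion metrics listed above can be computed in pandas roughly as follows. Column names mirror the Olist schema; the project's exact cleaning and aggregation steps may differ:

```python
# Sketch of the order-level metric calculations for fact_orders.
import pandas as pd


def add_order_metrics(orders: pd.DataFrame) -> pd.DataFrame:
    df = orders.copy()
    for col in ["order_purchase_timestamp",
                "order_delivered_customer_date",
                "order_estimated_delivery_date"]:
        df[col] = pd.to_datetime(df[col])
    # Days between purchase and actual delivery
    df["delivery_days"] = (df["order_delivered_customer_date"]
                           - df["order_purchase_timestamp"]).dt.days
    # Positive = delivered after the estimated date
    df["delivery_delay_days"] = (df["order_delivered_customer_date"]
                                 - df["order_estimated_delivery_date"]).dt.days
    df["is_late_delivery"] = df["delivery_delay_days"] > 0
    df["is_completed_order"] = df["order_status"] == "delivered"
    return df
```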
fact_cohort_retention - Customer Retention
Grouped customers by first purchase month (cohort)
Calculated months since first purchase (0-12 months)
Computed retention rate: (active customers / cohort size) * 100
Attributes: cohort_month, months_since_first_purchase, retention_rate, retained_customers
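The cohort computation above can be sketched as follows: assign each customer to the month of their first purchase, then count how many distinct customers from each cohort are still active N months later. Function and column names are illustrative:

```python
# Hedged sketch of the fact_cohort_retention computation.
import pandas as pd


def cohort_retention(orders: pd.DataFrame) -> pd.DataFrame:
    """orders: customer_unique_id + order_purchase_timestamp (datetime)."""
    df = orders.copy()
    df["order_month"] = df["order_purchase_timestamp"].dt.to_period("M")
    # Cohort = month of the customer's first purchase
    df["cohort_month"] = (df.groupby("customer_unique_id")["order_month"]
                            .transform("min"))
    df["months_since_first_purchase"] = (
        (df["order_month"] - df["cohort_month"]).apply(lambda p: p.n))
    active = (df.groupby(["cohort_month", "months_since_first_purchase"])
                ["customer_unique_id"].nunique()
                .rename("retained_customers").reset_index())
    # Cohort size = distinct customers at month 0
    size = (active.loc[active["months_since_first_purchase"] == 0,
                       ["cohort_month", "retained_customers"]]
                  .rename(columns={"retained_customers": "cohort_size"}))
    out = active.merge(size, on="cohort_month")
    out["retention_rate"] = out["retained_customers"] / out["cohort_size"] * 100
    return out
```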
transform_all.py
Master Transform Orchestrator
Runs all transformations in correct dependency order
Demonstrates production-level ETL orchestration
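The orchestration idea can be sketched minimally: run each transform step in dependency order (dimensions before facts) and abort with a logged error if one fails. The step names here are illustrative, not the project's actual functions:

```python
# Minimal sketch of a transform orchestrator like transform_all.py.
import logging

log = logging.getLogger("transform_all")


def run_pipeline(steps):
    """steps: ordered list of (name, callable) pairs; dims come before facts."""
    results = {}
    for name, fn in steps:
        try:
            results[name] = fn()
            log.info("completed %s", name)
        except Exception:
            log.exception("step %s failed; aborting downstream steps", name)
            raise
    return results
```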
Business logic - Customer segmentation
Business logic - Cohort Retention
Skills Applied
Dimensional modeling (star schema design)
Data transformation and cleansing
Surrogate key generation
Business logic implementation
Cohort analysis methodology
Pandas advanced operations (groupby, merge, pivot)
Date/time manipulation
Statistical calculations (percentiles, aggregations)
Loading the data
After the transformation is finished, we load the transformed data into our PostgreSQL database.
create_schema.py
load_dimensions.py
Load transformed dimensions into database
load_facts.py
Load the two fact tables into the database
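The chunked loading idea can be sketched with pandas' `to_sql` and its `chunksize` parameter, which issues inserts in batches instead of one huge statement. The project targets PostgreSQL through a database connection; an in-memory SQLite connection is used here only so the snippet is self-contained:

```python
# Illustrative chunked load: write a DataFrame in batches, then verify
# the row count. SQLite stands in for PostgreSQL here.
import sqlite3

import pandas as pd


def load_table(df: pd.DataFrame, table: str, conn,
               chunk_rows: int = 5000) -> int:
    df.to_sql(table, conn, if_exists="replace", index=False,
              chunksize=chunk_rows)
    loaded = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", conn)
    return int(loaded["n"].iloc[0])
```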
check_data_quality.py
Automated comprehensive quality checks
Checked referential integrity
Validated null values
Tested business logic
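The three kinds of checks above can be sketched as follows. The project runs them against the database in SQL; they are shown here on DataFrames so the example stands alone, and the business rule (no negative order totals) is an assumed illustration:

```python
# Hedged sketch of the data quality checks: orphaned foreign keys
# (referential integrity), null key validation, and one business rule.
import pandas as pd


def check_quality(facts: pd.DataFrame, dim: pd.DataFrame,
                  fk: str, pk: str) -> dict:
    orphans = ~facts[fk].isin(dim[pk])            # fact rows with no dim match
    null_keys = facts[fk].isna()                  # null foreign keys
    bad_totals = facts["order_total_value"] < 0   # assumed business rule
    return {"orphan_rows": int(orphans.sum()),
            "null_keys": int(null_keys.sum()),
            "negative_totals": int(bad_totals.sum())}
```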
Skills applied
Chunked data loading
Foreign key validation
Referential integrity enforcement
Comprehensive data quality checks
SQL aggregate functions
Error handling and recovery
Performance monitoring
Business logic validation
Power BI dashboard creation
Configured star schema relationships in data model
Created DAX calculated measures
Built 4 interactive dashboard pages with visualizations
Implemented slicers and cross-filtering
Applied professional formatting and theme
Exported dashboard for portfolio
With the ETL process finally finished and the data loaded into the database, it is time to create the Power BI dashboard. This dashboard has four pages: an executive-level summary, a sales performance page, a customer analytics page, and a product analysis page.

Page 1 - Executive Summary

Page 2 - Sales Performance

Page 3 - Customer Analytics

Page 4 - Product Analytics
DAX measures used for this dashboard
We also created a measure table using DAX.
Created Measures Table
Revenue Measures
Order Measures
Customer Measures
Delivery Measures
Product Measures
Conclusion - Answering business questions
At the beginning of this project, we defined two business questions to be investigated using our Power BI dashboard:
"How can we improve customer retention and identify high-value customer segments?"
"What are our key revenue drivers and how can we optimize operational performance?"
First question
Here are the key figures related to answering the first business question:
Total unique customers in database: 99,441
Average lifetime value per customer: R$58.15K
Percentage of customers who made multiple purchases: none; every customer made only one purchase
First-time purchasers: 99,441
Customer Segmentation & Value Distribution:
VIP customers (99.2% of the base) generate nearly 100% of total revenue, with an average CLV of around R$58,000.
The top 10 customers alone have CLVs ranging from R$1M to R$4M.
As VIP customers form the dominant majority of the customer base, this serves as strong evidence of how well the business performed in retaining its customers.
Additionally, as the map visuals on the Executive Summary page indicate, most customers are from the southeast region of Brazil.
Second question
The e-commerce platform generated R$ 15.42M in total revenue over a 3-year period (2016-2018), with 96K orders from 99K unique customers. The average order value of R$ 159.33 and 97.02% order completion rate indicate healthy operational performance.
The platform experienced tremendous growth from 2016 to 2017, followed by stabilization in 2018. This pattern suggests market saturation and indicates a need for customer acquisition strategies.
The top 3 categories by sales are, in order, 1. home & furniture, 2. fashion & beauty, and 3. sports & leisure, which together account for 63.63% of total revenue. A deeper dive into these categories shows that home & furniture earns the least per sale on average (R$151.58) among the three, but sells the largest number of products. Sports & leisure sold the fewest products, at a similar average revenue (R$152.34). Fashion & beauty, although it did not sell as many products as home & furniture, sells at a much higher average revenue (R$172.54).
The business operates well in general, with a 97% order completion rate, an average delivery time of around 11 days, and a 93.2% on-time delivery rate, suggesting a near-optimized operation. Delivery time may still need some attention, as the on-time delivery rate decreased slightly year over year, from 98.9% in 2016 to 94.4% in 2017 and 92.3% in 2018. This may have been caused by an expanded scope of operations or a change in delivery method.