E-commerce ETL Data Warehouse Project

Building a data pipeline that transforms raw e-commerce data into business-ready data for business insights

Tools: Python, SQL, Power BI Desktop, DAX, PostgreSQL 18, pgAdmin 4, VS Code, Windows Command Prompt/PowerShell

Skills: star schema design, chunked data loading, foreign key validation, referential integrity enforcement, comprehensive data quality checks, SQL aggregate functions, error handling and recovery, performance monitoring, business logic validation

These days, businesses need to analyze very large amounts of data. On top of that, transforming raw, unorganized data into processed data that fits their needs takes considerable time and effort.

In this project, I built a data pipeline that transforms raw data into business-ready data that can then be analyzed for insights. The processed data is displayed on a Power BI dashboard, from which executives can easily gather insights through charts and scorecards.

Source of the data

The data for this project comes from a Brazilian e-commerce public dataset published by Olist; you can find the dataset here. The dataset contains 8 CSV files, of which we will use 6:

  • olist_customers_dataset

  • olist_order_items_dataset

  • olist_order_payments_dataset

  • olist_order_reviews_dataset

  • olist_orders_dataset

  • olist_products_dataset

Defining business questions

As we aim to build a Power BI dashboard that gives business insights to executives, we first need to define the business questions this project should answer. To build a meaningful dashboard, we define our business questions as follows:

  1. "How can we improve customer retention and identify high-value customer segments?"

  2. "What are our key revenue drivers and how can we optimize operational performance?"

Project setup and planning

PostgreSQL 18 was used for this project. Before laying the groundwork for the pipeline, that is, setting up the extract layer, a database was created using pgAdmin 4 to store the processed data once the pipeline is finished and the data is loaded.

pgAdmin 4 interface

Apart from PostgreSQL 18, VS Code will be used for writing and running the data pipeline code.

Data Extraction

Now we will start laying down the foundation layer of this data pipeline - the extraction layer.

Apart from the dataset itself, we need a Python script to act as a logger, plus six more Python scripts to extract the data from the six chosen CSV files.

Scripts created

  • logger.py

  • Six extraction scripts, one per CSV file
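The extraction scripts themselves are not shown above, but the pattern is straightforward. Here is a hedged sketch of what one such script might look like; the file name, logger setup, and chunk size are illustrative assumptions, not the project's actual code:

```python
# Sketch of one extraction script: chunked CSV read with progress logging.
# Paths, chunk size, and logger configuration are assumptions for illustration.
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("extract")


def extract_csv(path: Path, chunksize: int = 50_000) -> pd.DataFrame:
    """Read a CSV in chunks, logging progress, and return the full frame."""
    chunks = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        log.info("chunk %d: %d rows", i, len(chunk))
        chunks.append(chunk)
    df = pd.concat(chunks, ignore_index=True)
    log.info("%s: %d rows | %d cols", path.name, len(df), df.shape[1])
    return df


# Tiny demo file so the sketch runs end to end without the real dataset
sample = Path("olist_customers_sample.csv")
sample.write_text("customer_id,customer_city\nc1,sao paulo\nc2,rio de janeiro\n")
df = extract_csv(sample, chunksize=1)
```

Reading in chunks keeps memory use bounded even on the larger files such as olist_order_items_dataset.csv.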

Data extracted

  • olist_customers_dataset.csv 99,441 rows | 5 cols

  • olist_orders_dataset.csv 99,441 rows | 8 cols

  • olist_order_items_dataset.csv 112,650 rows | 7 cols

  • olist_order_payments_dataset.csv 103,886 rows | 5 cols

  • olist_order_reviews_dataset.csv 99,224 rows | 7 cols

  • olist_products_dataset.csv 32,951 rows | 9 cols

Achievements

  • Checked file sizes and record counts

  • Identified null values and data types
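The checks above (file sizes, record counts, nulls, data types) can be sketched as a small profiling helper. This is an illustrative sketch, not the project's actual extraction code:

```python
# Profiling sketch: file size, row count, null counts, and dtypes per CSV.
from pathlib import Path

import pandas as pd


def profile(path: Path) -> dict:
    """Summarize one CSV file for the extraction report."""
    df = pd.read_csv(path)
    return {
        "file_kb": round(path.stat().st_size / 1024, 1),
        "rows": len(df),
        "nulls": df.isnull().sum().to_dict(),    # null count per column
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
    }


# Demo file so the sketch runs without the real dataset
p = Path("demo.csv")
p.write_text("order_id,price\no1,10.5\no2,\n")
report = profile(p)
```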

Data Transformation

Next, we build the transformation layer, which transforms the extracted raw data into data that suits our business needs.

Star schema

Star schema for this project

To suit our business needs, we will transform our raw data into two fact tables and four dimension tables. Apart from "fact_orders", we will also create a "fact_cohort_retention" table to identify customers who placed more than one order.

Data transformed

  • dim_date.csv 791 rows | 16 columns

  • dim_products.csv 32,951 rows | 13 columns

  • dim_customers.csv 99,441 rows | 18 columns

  • dim_payment_type.csv 5 rows | 4 columns

  • fact_orders.csv 99,992 rows | 25 columns

  • fact_cohort_retention.csv 25 rows | 7 columns

Dimension table creation

dim_date - Date Dimension

  • Generated complete date range (2016-09-01 to 2018-10-31)

  • Attributes: date_key, full_date, year, quarter, month, month_name, week, day_of_week, day_name, is_weekend, is_holiday, fiscal_year, fiscal_quarter, created_at, date_str

  • Surrogate key format: YYYYMMDD (e.g., 20170115)
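The date dimension above can be generated entirely in pandas. The sketch below shows the core of that idea with a subset of the listed attributes; it is an illustration, not the project's actual script:

```python
# Generate dim_date for the full project date range with a YYYYMMDD surrogate key.
import pandas as pd

dates = pd.date_range("2016-09-01", "2018-10-31", freq="D")
dim_date = pd.DataFrame({
    "date_key": dates.strftime("%Y%m%d").astype(int),  # surrogate key, e.g. 20170115
    "full_date": dates,
    "year": dates.year,
    "quarter": dates.quarter,
    "month": dates.month,
    "month_name": dates.strftime("%B"),
    "day_of_week": dates.dayofweek,
    "day_name": dates.strftime("%A"),
    "is_weekend": dates.dayofweek >= 5,  # Saturday (5) and Sunday (6)
})
```

The range 2016-09-01 through 2018-10-31 yields exactly the 791 rows reported above.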

dim_products - Product Dimension

  • Extracted unique products from raw data

  • Attributes: product_key, product_id, product_category_segment

  • Surrogate key: Sequential integer starting at 1

  • Handled missing categories
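Deduplication, missing-category handling, and sequential surrogate keys can be sketched as follows; the sample rows and the "unknown" placeholder are assumptions for illustration:

```python
# dim_products sketch: dedupe by product_id, fill missing categories,
# and assign a sequential surrogate key starting at 1.
import pandas as pd

raw = pd.DataFrame({
    "product_id": ["p1", "p2", "p2", "p3"],
    "product_category_name": ["moveis_decoracao", None, None, "esporte_lazer"],
})

dim_products = (
    raw.drop_duplicates("product_id")
       .assign(product_category_name=lambda d: d["product_category_name"].fillna("unknown"))
       .reset_index(drop=True)
)
dim_products.insert(0, "product_key", range(1, len(dim_products) + 1))  # surrogate key
```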

dim_customers - Customer Dimension

  • Extracted unique customers with location data

  • Calculated lifetime value (CLV) per customer

  • Created customer segmentation (High/Medium/Low based on CLV)

  • Mapped states to regions (North, Northeast, Southeast, South, Central-West)

  • Attributes: customer_key, customer_id, customer_unique_id, customer_city, customer_state, customer_region, customer_segment, first_order_date, last_order_date, total_orders, lifetime_value

  • Surrogate key: Sequential integer (customer_key)
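The CLV calculation and percentile-based segmentation can be sketched like this. The 80th/50th percentile thresholds and segment labels are assumptions chosen for illustration; the project's actual cutoffs may differ:

```python
# CLV and High/Medium/Low segmentation sketch using percentile thresholds.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3"],
    "order_value": [500.0, 300.0, 120.0, 30.0],
})

# Lifetime value = total spend per customer; total_orders = order count
clv = orders.groupby("customer_id")["order_value"].agg(
    lifetime_value="sum", total_orders="count"
).reset_index()

# Assumed tiers: top 20% High, next 30% Medium, remainder Low
hi = clv["lifetime_value"].quantile(0.80)
mid = clv["lifetime_value"].quantile(0.50)
clv["customer_segment"] = clv["lifetime_value"].apply(
    lambda v: "High" if v >= hi else ("Medium" if v >= mid else "Low")
)
```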

dim_payment_type - Payment Type Dimension

  • Created lookup table for payment methods

  • Types: credit_card, boleto, voucher, debit_card, not_defined

  • Attributes: payment_type_key, payment_type

  • Surrogate key: Sequential integer (1-5)

Fact table creation

fact_orders - Order Transactions

  • Joined with the other four dimension tables

  • Metrics calculated:

    • order_total_value = sum of item prices + freight

    • delivery_days = days between purchase and delivery

    • delivery_delay_days = actual delivery - estimated delivery

    • is_late_delivery = TRUE if delay > 0

    • is_completed_order = TRUE if status = 'delivered'

  • Aggregated to order-level (one row per order)
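The delivery metrics listed above reduce to datetime arithmetic on the order columns. A minimal sketch, using the Olist column names and two made-up orders:

```python
# fact_orders metric sketch: delivery days, delay vs. estimate, and flags.
import pandas as pd

orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "order_status": ["delivered", "shipped"],
    "order_purchase_timestamp": pd.to_datetime(["2017-01-01", "2017-02-01"]),
    "order_delivered_customer_date": pd.to_datetime(["2017-01-12", pd.NaT]),
    "order_estimated_delivery_date": pd.to_datetime(["2017-01-10", "2017-02-15"]),
})

fact = orders.assign(
    # Days between purchase and actual delivery (NaN while undelivered)
    delivery_days=lambda d: (
        d["order_delivered_customer_date"] - d["order_purchase_timestamp"]
    ).dt.days,
    # Positive means delivered after the estimated date
    delivery_delay_days=lambda d: (
        d["order_delivered_customer_date"] - d["order_estimated_delivery_date"]
    ).dt.days,
)
fact["is_late_delivery"] = fact["delivery_delay_days"] > 0      # NaN delay -> False
fact["is_completed_order"] = fact["order_status"] == "delivered"
```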

fact_cohort_retention - Customer Retention

  • Grouped customers by first purchase month (cohort)

  • Calculated months since first purchase (0-12 months)

  • Computed retention rate: (active customers / cohort size) * 100

  • Attributes: cohort_month, months_since_first_purchase, retention_rate, retained_customers
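The cohort logic (first purchase month, months since first purchase, retention rate) can be sketched with a groupby pipeline. The sample orders below are made up for illustration:

```python
# Cohort retention sketch: group by first-purchase month, compute
# retention_rate = active customers / cohort size * 100.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3"],
    "order_month": pd.PeriodIndex(["2017-01", "2017-03", "2017-01", "2017-02"], freq="M"),
})

# Each customer's cohort is the month of their first order
first = orders.groupby("customer_id")["order_month"].min().rename("cohort_month")
orders = orders.join(first, on="customer_id")
orders["months_since_first_purchase"] = (
    orders["order_month"] - orders["cohort_month"]
).apply(lambda offset: offset.n)

# Cohort size = unique customers whose first order fell in that month
orders["cohort_size"] = orders.groupby("cohort_month")["customer_id"].transform("nunique")

retention = (
    orders.groupby(["cohort_month", "months_since_first_purchase"])
          .agg(retained_customers=("customer_id", "nunique"),
               cohort_size=("cohort_size", "first"))
          .reset_index()
)
retention["retention_rate"] = retention["retained_customers"] / retention["cohort_size"] * 100
```

Here the 2017-01 cohort has two customers, one of whom (c1) returns two months later, giving a 50% retention rate at month 2.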

transform_all.py

  • Master Transform Orchestrator

  • Runs all transformations in correct dependency order

  • Demonstrates production-level ETL orchestration
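The orchestration idea reduces to running each transform in dependency order and stopping when one fails, since downstream tables depend on upstream ones. A minimal sketch; the step functions here are stand-in lambdas, not the project's real transforms:

```python
# Orchestration sketch: run transforms in dependency order, halt on first failure.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("transform_all")


def run_pipeline(steps):
    """Run (name, fn) steps in order; stop at the first failure."""
    results = []
    for name, fn in steps:
        log.info("running %s", name)
        try:
            fn()
            results.append((name, "ok"))
        except Exception as exc:
            log.error("%s failed: %s", name, exc)
            results.append((name, "failed"))
            break  # downstream steps depend on this one
    return results


steps = [
    ("dim_date", lambda: None),
    ("dim_products", lambda: None),
    ("dim_customers", lambda: None),
    ("dim_payment_type", lambda: None),
    ("fact_orders", lambda: None),             # joins all four dimensions
    ("fact_cohort_retention", lambda: None),   # derived from order history
]
status = run_pipeline(steps)
```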

Business logic - Customer segmentation

Business logic - Cohort Retention

Skills Applied

  • Dimensional modeling (star schema design)

  • Data transformation and cleansing

  • Surrogate key generation

  • Business logic implementation

  • Cohort analysis methodology

  • Pandas advanced operations (groupby, merge, pivot)

  • Date/time manipulation

  • Statistical calculations (percentiles, aggregations)

Loading the data

After transforming the data, we load it into our PostgreSQL database.

create_schema.py

load_dimensions.py

  • Load transformed dimensions into database

load_facts.py

  • Load the two fact tables into the database
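The chunked-loading pattern used by these scripts can be sketched as below. The real scripts target PostgreSQL; an in-memory SQLite connection stands in here so the sketch runs without a database server, and the table contents are illustrative:

```python
# Chunked load sketch: insert a DataFrame into a warehouse table in batches.
# sqlite3 stands in for PostgreSQL so the example is self-contained.
import sqlite3

import pandas as pd

dim = pd.DataFrame({
    "payment_type_key": [1, 2, 3],
    "payment_type": ["credit_card", "boleto", "voucher"],
})

conn = sqlite3.connect(":memory:")

chunksize = 2  # small on purpose; real loads would use thousands of rows per batch
for start in range(0, len(dim), chunksize):
    dim.iloc[start:start + chunksize].to_sql(
        "dim_payment_type", conn, if_exists="append", index=False
    )

loaded = conn.execute("SELECT COUNT(*) FROM dim_payment_type").fetchone()[0]
```

Batching keeps memory stable and makes partial-failure recovery easier: a failed batch can be retried without reloading the whole table.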

check_data_quality.py

  • Automated comprehensive quality checks

  • Checked referential integrity

  • Validated null values

  • Tested business logic
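A referential-integrity check boils down to an anti-join: count fact rows whose foreign key has no matching dimension row. A sketch of that query, again using in-memory SQLite in place of PostgreSQL, with made-up rows:

```python
# Referential integrity sketch: find fact rows with orphaned foreign keys.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_products (product_key INTEGER PRIMARY KEY);
CREATE TABLE fact_orders (order_id TEXT, product_key INTEGER);
INSERT INTO dim_products VALUES (1), (2);
INSERT INTO fact_orders VALUES ('o1', 1), ('o2', 99);  -- 99 has no dimension row
""")

# LEFT JOIN + IS NULL isolates fact rows that reference a missing product_key
orphans = conn.execute("""
    SELECT COUNT(*) FROM fact_orders f
    LEFT JOIN dim_products d ON f.product_key = d.product_key
    WHERE d.product_key IS NULL
""").fetchone()[0]
```

A quality check passes when this count is zero for every fact-to-dimension relationship.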

Skills applied

  • Chunked data loading

  • Foreign key validation

  • Referential integrity enforcement

  • Comprehensive data quality checks

  • SQL aggregate functions

  • Error handling and recovery

  • Performance monitoring

  • Business logic validation

Power BI dashboard creation

  • Configured star schema relationships in data model

  • Created DAX calculated measures

  • Built 4 interactive dashboard pages with visualizations

  • Implemented slicers and cross-filtering

  • Applied professional formatting and theme

  • Exported dashboard for portfolio

With the ETL process finished and the data loaded into the database, it is time to create the Power BI dashboard. This dashboard has four pages: the first is an executive-level summary, the second covers sales performance, the third covers customer analytics, and the fourth covers product analysis.

Page 1 - Executive Summary

Page 2 - Sales Performance

Page 3 - Customer Analytics

Page 4 - Product Analytics

DAX measures used for this dashboard

We have also created a measure table using DAX.

Created Measures Table

Revenue Measures

Order Measures

Customer Measures

Delivery Measures

Product Measures

Conclusion - Answering business questions

At the beginning of this project, we defined two business questions to be investigated using our Power BI dashboard:

  1. "How can we improve customer retention and identify high-value customer segments?"

  2. "What are our key revenue drivers and how can we optimize operational performance?"

First question

Here are the key figures related to answering the first business question:

  • Total unique customers in database: 99,441

  • Average lifetime value per customer: R$58.15K

  • Percentage of customers who made multiple purchases: none; every customer made only one purchase

  • First-time purchasers: 99,441

Customer Segmentation & Value Distribution:

  • VIP customers (99.2% of the base) generate nearly 100% of total revenue, with an average CLV of around R$58,000.

  • Top 10 customers alone have CLVs ranging from R$1M to R$4M

As VIP customers form the dominant majority of the customer base, this serves as strong evidence of the business's performance in retaining its customers.

Additionally, as indicated in the map visuals on Executive Summary page, most customers are from the southeast region of Brazil.

Second question

The e-commerce platform generated R$ 15.42M in total revenue over a 3-year period (2016-2018), with 96K orders from 99K unique customers. The average order value of R$ 159.33 and 97.02% order completion rate indicate healthy operational performance.

The platform experienced tremendous growth from 2016 to 2017, followed by stabilization in 2018. This pattern suggests market saturation and indicates a need for customer acquisition strategies.

The top 3 categories by sales are, in order: 1. home & furniture, 2. fashion & beauty, and 3. sports & leisure, which together account for 63.63% of total revenue. Digging deeper into these categories, we discovered that home & furniture earns the least per order on average (R$151.58) among the three, but sells the most units. Sports & leisure sells fewer units than home & furniture (in fact, the fewest of the three) at a similar average revenue (R$152.34). Fashion & beauty, although it does not sell as many units as home & furniture, sells at a much higher average revenue (R$172.54).

The business operates well in general, with a 97% order completion rate, an average delivery time of around 11 days, and a 93.2% on-time delivery rate, suggesting a near-optimized operation. Delivery time may need some attention, however, as the on-time delivery rate decreased slightly year over year, from 98.9% in 2016 to 94.4% in 2017 and 92.3% in 2018. This may have been caused by an expanded scope of operations or a change in delivery method.
