Steam Games Store video games analysis
Using a Steam Games Store dataset to look into which genre generates the most revenue, and whether discounts actually drive up sales.
Tools: Python (pandas, matplotlib, seaborn, numpy), Google BigQuery, SQL, Google Looker Studio, XGBoost, VS Code
Skills: Machine learning, model evaluation (SHAP, model latency), data visualizations, EDA, data cleaning, data standardization

The video game industry has changed enormously over the past 30-plus years, from the days when people played either at street arcades or on their own personal computers. Gaming consoles then entered the market, and the industry has since evolved so much that it barely resembles the era when arcades and PCs were the main ways to play video games. Apart from consoles commanding a far larger market share than when they first launched (the first PlayStation came out back in 1994, and the first Xbox in 2001), another significant shift is that major game developers are no longer the dominant force behind community engagement and high revenue. Gone are the days when big AAA titles were the sole anticipation for gamers and the main revenue driver for big-name developers; today, lesser-known indie developers dominate engagement in online gaming communities and attract more viral attention, while the big titles seemingly generate less excitement and anticipation among gamers.
This project explores video games sold on Steam from 1997 to 2024 (a total of 42,497 games in the dataset) to see which genres have been the main drivers of revenue over that period. Apart from showing the top-earning genre in a dashboard, we will also use an XGBoost machine learning model to predict how genres will perform in generating future revenue.
Business questions
This project aims to give insights into two business questions:
For games sold on Steam from 1997 to 2024, which genre generates the most revenue?
Going forward, will discounts work in driving up video game revenue?
Exploring the dataset and cleaning the data
We begin by exploring the dataset using Python.

Before we can make any sense of the dataset, we need to clean and standardize the data so that it can be explored using Python.
Turn all price columns into numeric values, with no currency signs and no thousands separators. All null values were replaced with 0.
Create a column called "final_price"; this column shows the price after discount in the cleaned dataset.
A game may carry multiple genres in the dataset. We create a new column in the cleaned dataset that stores the first genre recorded in the original dataset as the game's "main genre".
Standardize column names.
After cleaning the data, we can start exploring the dataset and drawing some initial insights.
The dataset contains more action games than any other genre (over 17,500).
Games that are 10 years old or older generate more engagement within the gaming community, as measured by the median number of reviews for games of that age.
By median review count, massively multiplayer games rank highest.
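The median-based engagement comparison above can be recomputed with a simple group-by. The sample rows below are made up purely for demonstration:

```python
from statistics import median

# Illustrative recomputation of the engagement insight: group games by a
# category and compare the median review count per group.
games = [
    {"main_genre": "Massively Multiplayer", "reviews": 5200},
    {"main_genre": "Massively Multiplayer", "reviews": 800},
    {"main_genre": "Action", "reviews": 120},
    {"main_genre": "Action", "reviews": 60},
    {"main_genre": "Action", "reviews": 300},
]

def median_reviews_by(rows, key):
    """Group rows on `key` and return the median review count per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row["reviews"])
    return {k: median(v) for k, v in groups.items()}
# Medians here: Massively Multiplayer -> 3000, Action -> 120
```

Using the median rather than the mean keeps a handful of outlier hits from dominating the engagement picture.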
Deploying into Google Cloud
Once the initial EDA is completed, the next step is to load the cleaned dataset into Google BigQuery and create a dashboard to visualize the findings in an interactive, intuitive way. Putting the cleaned dataset into Google BigQuery mimics the real-world business setting of extracting insights from a cloud database. In fact, one of the main goals of this project is to produce a production-ready, end-to-end project that solves real-world business needs. In this section and the ones that follow, we will fine-tune our methods so that they fit the industry standards of real-world data analytics.

Google BigQuery
After loading the cleaned dataset into Google BigQuery, we can now answer the business questions using SQL.
The first SQL query finds out which genre generates the highest revenue. The query also adds a new column with a revenue proxy (= overall number of reviews × final price) to estimate a genre's future revenue; this proxy will be used again when we train the machine learning models later. After running the first query, we discovered that the action genre is estimated to be one of the best-performing genres going forward, thanks to its sheer number of games (18,170 action games in the dataset). RPG games are another genre with huge revenue potential, backed by dedicated, loyal communities fostered through a high number of reviews and a higher average revenue per game. The adventure genre also shows high estimated revenue potential due to its relatively large number of games.
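The revenue-proxy aggregation from the first query can be sketched in plain Python (the actual project runs it as SQL in BigQuery). The sample numbers are invented for illustration:

```python
# Sketch of the first query's logic: compute the revenue proxy
# (reviews * final_price) per game, then total it by main genre,
# highest-earning genre first.
games = [
    {"main_genre": "Action", "reviews": 1000, "final_price": 19.99},
    {"main_genre": "Action", "reviews": 250,  "final_price": 9.99},
    {"main_genre": "RPG",    "reviews": 4000, "final_price": 29.99},
]

def revenue_proxy_by_genre(rows):
    """Total the revenue proxy per genre, sorted from highest to lowest."""
    totals = {}
    for row in rows:
        proxy = row["reviews"] * row["final_price"]  # revenue proxy per game
        totals[row["main_genre"]] = totals.get(row["main_genre"], 0.0) + proxy
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

Review count stands in for units sold here, which is why the proxy is an estimate rather than actual revenue.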
The second SQL query reuses the revenue proxy from the first query to examine whether discounts will drive up future revenue. We separated games into two categories - games still at full price and games after discount - and compared how well each category is projected to perform. To evaluate how reliable the estimates are, we also added a column that records the confidence level of each estimate, measured by how many discounted and full-priced games of that genre are in the sample pool. After running this query, we discovered that RPG games show the biggest revenue impact after being discounted (+572.38%), although the confidence level is only moderate. Indie and racing games reported minimal revenue impact after discounting, with varying confidence levels, and all other genres reported a negative revenue impact after discount.
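The second query's discounted-vs-full-price comparison can be sketched as below. The sample-size thresholds for the confidence labels and the field names are assumptions for illustration, not the query's real cutoffs:

```python
# Hedged sketch of the second query: per genre, compare the average revenue
# proxy of discounted games against full-price games, and attach a rough
# confidence label based on how many games of that genre are in the pool.
def discount_impact(rows):
    buckets = {}  # genre -> {"full": [proxies], "disc": [proxies]}
    for r in rows:
        proxy = r["reviews"] * r["final_price"]
        key = "disc" if r["discount_pct"] > 0 else "full"
        buckets.setdefault(r["genre"], {"full": [], "disc": []})[key].append(proxy)

    report = {}
    for genre, b in buckets.items():
        if not b["full"] or not b["disc"]:
            continue  # cannot compare without both groups
        full_avg = sum(b["full"]) / len(b["full"])
        disc_avg = sum(b["disc"]) / len(b["disc"])
        impact_pct = (disc_avg - full_avg) / full_avg * 100
        n = len(b["full"]) + len(b["disc"])  # assumed thresholds below
        confidence = "high" if n >= 100 else "moderate" if n >= 30 else "low"
        report[genre] = (round(impact_pct, 2), confidence)
    return report
```

A positive `impact_pct` means discounted games outperform full-price games in the same genre under the proxy.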
Google Looker Studio
To provide an overview of the cleaned dataset, a dashboard was created in Looker Studio that reports real-time data from the cloud.

Page 1 - Executive Overview

Page 2 - Pricing & Revenue Analysis

Page 3 - Engagement Drivers
This dashboard was created to simulate a real-world business setting where marketing executives need to monitor sales data in real time to make timely marketing and game sales decisions.
Machine learning model training
Having finished extracting useful insights from the cleaned dataset, we now proceed to train a machine learning model for future revenue predictions.
For this project, we trained three machine learning models and compared their performance to see which performs best:

Illustrative diagram of how an XGBoost model works
XGBoost: Trees are added sequentially, with each new tree correcting the errors of the ensemble so far via gradient boosting. Even though this model usually requires tuning more hyperparameters, it is an ideal pick given the number of features we feed into the model and the fact that we prioritize prediction accuracy.
Random Forest: An ensemble of many decision trees, with the final prediction being the average over all trees. It might be problematic here, as it offers fewer tuning opportunities for our long list of features.
LightGBM Regressor: Ideal for very large datasets with many rows and columns, thanks to much faster training. Not necessary for our project, as the dataset fits into BigQuery with ease and initial insights can be extracted without complex analysis; using it on our moderately sized dataset could also easily lead to overfitting.
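The "trees added sequentially" idea behind XGBoost can be illustrated with a toy boosting loop: each round fits a tiny model (here a one-split stump) to the residuals of the ensemble so far, then adds it with a learning rate. This is a teaching sketch of the general gradient boosting principle, not the library's actual implementation:

```python
# Minimal gradient boosting on 1-D data: stumps fitted to residuals.
def fit_stump(x, residuals):
    """Find the single threshold split on x that best reduces squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def boost(x, y, rounds=20, lr=0.3):
    """Add stumps sequentially, each fitted to the current residuals."""
    base = sum(y) / len(y)  # start from the mean prediction
    stumps = []
    for _ in range(rounds):
        preds = [base + lr * sum(s(xi) for s in stumps) for xi in x]
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stumps.append(fit_stump(x, residuals))
    return lambda xi: base + lr * sum(s(xi) for s in stumps)
```

With each round the residuals shrink by a constant factor, which is why predictions converge toward the targets; XGBoost adds regularization and second-order gradients on top of this basic loop.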
Model comparison and evaluation
The three models were trained and compared using these evaluation methods:
R² score (for both the training and test datasets)
Train-test gap
RMSE on the test dataset
RMSE per genre
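The evaluation metrics listed above have simple closed forms, sketched here in plain Python (the project itself would typically use scikit-learn's implementations):

```python
from math import sqrt

# Plain-Python versions of the metrics used to compare the models.
def r2_score(y_true, y_pred):
    """Proportion of variance explained: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def train_test_gap(r2_train, r2_test):
    """A large gap suggests the model overfits the training set."""
    return r2_train - r2_test
```

Per-genre RMSE is just `rmse` applied to the subset of predictions belonging to each genre.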
We then used a separate Python script to compare the three models' R² and RMSE scores, ranked the models on each metric, and compiled an overall performance ranking from those per-metric results.

The models were compared on the different evaluation metrics and ranked by their performance.
After ranking, XGBoost emerged as the best-performing model of the three.
SHAP Analysis
In the Python script for the XGBoost model training, a SHAP analysis was also performed to gauge how much each feature impacts the final prediction, and how much different genres' revenue would be affected by discounts.
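A common convention for SHAP-based feature rankings is to order features by their mean absolute SHAP value across all predictions. Assuming that convention, the ranking step can be sketched as follows; the feature names and values below are invented for illustration:

```python
# Illustrative SHAP ranking: each feature has one SHAP value per prediction,
# and features are ordered by the mean of the absolute values.
shap_values = {
    "discount_pct": [0.8, -1.2, 0.5],
    "final_price":  [2.1, -1.9, 2.4],
    "reviews":      [0.1, 0.0, -0.2],
}

def rank_by_mean_abs_shap(values):
    """Return feature names ordered from most to least impactful."""
    scores = {f: sum(abs(v) for v in vals) / len(vals)
              for f, vals in values.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In the project, the `shap` library computes these values from the trained XGBoost model; the ranking logic itself is just this aggregation.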

The top 15 of 22 features, ranked by their SHAP analysis score. The higher the score, the larger the feature's impact on the final prediction.

Discounts themselves (not to be confused with the discounted price column) bring a moderately large impact to the final prediction.

The SHAP analysis score of different genres.
Application Programming Interface (API)
To simulate a production-level end-to-end data analysis project that actually provides revenue predictions to marketing staff at Steam, we also need to write an API script that lets business users check any individual game's or genre's revenue prediction through their preferred applications (e.g. dashboards, regularly scheduled emails, etc.).
As a Google Colab notebook cannot keep an API running once the notebook shuts down, we use Visual Studio Code (VS Code) to write the script for the API.
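The core of such an endpoint is a function that parses a JSON request, runs the model, and returns a JSON response. The sketch below shows only that function so it can run without a web server; in the project it is served through a Flask route, and the model stub and field names here are assumptions for illustration:

```python
import json

def predict_stub(features):
    """Stand-in for the real model's predict(): a naive revenue proxy."""
    return features["reviews"] * features["final_price"]

def handle_predict(request_body, model=predict_stub):
    """Parse a JSON request, run the model, return (JSON body, status code)."""
    try:
        features = json.loads(request_body)
        prediction = model(features)
        return json.dumps({"predicted_revenue": round(prediction, 2)}), 200
    except (KeyError, json.JSONDecodeError) as exc:
        return json.dumps({"error": str(exc)}), 400
```

Keeping the handler separate from the web framework also makes the latency testing in the next section easier, since the logic can be timed without network overhead.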
Testing latency of the API script

Python script for the API, written in VS Code.
In addition to completing the API Python script, we also tested the latency users may experience when calling the API.
We divided the games into 3 batches of 10 games each.

Batch latency summary

This shows how much latency each game will experience.
The test showed that each game requires just under 3 milliseconds to process; by industry standards, the user experience is effectively instantaneous.
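The batch latency test above can be sketched with `time.perf_counter`. The dummy `predict` function stands in for the real API call, and the summary fields are illustrative:

```python
import time

def predict(game):
    """Stand-in for the model call being timed."""
    return game["reviews"] * game["final_price"]

def measure_batches(games, batch_size=10):
    """Time each batch of predictions and summarize per-batch and per-game latency."""
    summaries = []
    for i in range(0, len(games), batch_size):
        batch = games[i:i + batch_size]
        start = time.perf_counter()
        for game in batch:
            predict(game)
        elapsed_ms = (time.perf_counter() - start) * 1000
        summaries.append({"batch": i // batch_size + 1,
                          "total_ms": elapsed_ms,
                          "per_game_ms": elapsed_ms / len(batch)})
    return summaries
```

Averaging over batches of 10 smooths out one-off timing noise that a single call would pick up.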
Conclusion

This project demonstrates that with a single, well‑chosen dataset and a modern analytics stack, it is possible to go from raw Steam Games Store data to a full end‑to‑end revenue prediction workflow: EDA and cleaning, SQL analytics in BigQuery, interactive dashboards in Looker Studio, machine learning with XGBoost, model explainability via SHAP, and finally an API served from VS Code that behaves much more like a real production service than a notebook experiment. Building all of this forced a shift in mindset from “offline analysis” to “how would a real company actually use this?”, and the result is a project that can support concrete business questions around genre performance, discount strategy, and revenue potential in a way that feels closer to an actual, real-world business analytics product.
There are several directions to improve this work further. One is to successfully convert the trained XGBoost model into ONNX format and compare ONNX Runtime inference latency against the current pure‑Python XGBoost implementation, both in single‑prediction and true batch‑prediction scenarios. This would not only validate the theoretical latency gains but also bring the implementation closer to what many production-level ML systems use today. Beyond ONNX, future iterations could include more robust hyperparameter tuning pipelines, automated retraining on new Steam data, more nuanced revenue proxies (for example, combining reviews with playtime or ownership estimates), and richer model monitoring around prediction drift and segment‑level performance.

From a business perspective, the components built here can be integrated into a real‑world setting in several ways. The BigQuery tables and Looker Studio dashboards can act as a live “Steam portfolio cockpit” for product and marketing teams, giving them an at‑a‑glance view of which genres, price bands, and discount levels are driving revenue and engagement. The Flask API can be wired into internal pricing tools, scenario‑planning dashboards, or scheduled reporting jobs so that stakeholders no longer need to run notebooks to get answers: they can change a game’s price or discount in a UI, trigger a call to the model behind the scenes, and immediately see the updated revenue prediction. With modest engineering work around authentication, logging, and deployment, this project could evolve from a portfolio piece into the core of an internal decision‑support service for a game publisher or platform operator.
For the files used for this project, please visit the GitHub repository.