Jake Burleson

Walmart Retail Sales Regression Analysis

The question

If you’re planning a new retail location, or allocating promotional budget across an existing portfolio of 45 stores, which variables actually predict weekly sales — and how confidently? Intuition says store size matters, holidays matter, and macroeconomic conditions matter. Statistical modeling lets you put numbers on each.

The data

Three raw datasets from a Walmart case study:

- Stores: 46 rows of store attributes, including size
- Sales: 421,571 rows of weekly sales records
- Features: 8,191 rows of contextual variables (temperature, markdowns, CPI, unemployment)

Data quality came first

Before any modeling, I ran a data quality audit using COUNTA, COUNTIF, and range checks across all three datasets. The audit surfaced issues that would have corrupted the model if ignored.

[Table: data quality audit — Stores: 46 rows, 0 issues; Sales: 421,571 rows, 1,285 issues; Features: 8,191 rows, 24,034 issues]
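The audit itself was done in spreadsheets (COUNTA, COUNTIF, range checks). For readers who work in code, here is a minimal pandas sketch of the same checks; the column name `Weekly_Sales` follows the public Walmart dataset convention, and the sample values are made up:

```python
# Hedged sketch of the spreadsheet audit in pandas. Column names and
# sample data are assumptions, not the actual case-study files.
import pandas as pd

def audit(df, required_cols=(), nonnegative_cols=()):
    """Count missing values (the COUNTA gap) and out-of-range
    values (the range checks) per column."""
    issues = {}
    for col in required_cols:
        issues[f"{col}_missing"] = int(df[col].isna().sum())
    for col in nonnegative_cols:
        issues[f"{col}_negative"] = int((df[col] < 0).sum())
    return issues

# Tiny hypothetical sample, not the real 421,571-row Sales file
sales = pd.DataFrame({"Weekly_Sales": [24924.50, -863.00, 19403.54, None]})
print(audit(sales, required_cols=["Weekly_Sales"], nonnegative_cols=["Weekly_Sales"]))
```

The same two issue families (missing values and out-of-range values) cover everything the audit table reports.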

Breaking down the Sales and Features issues:

- Sales: negative weekly sales values
- Features: missing markdown values, plus missing CPI and unemployment observations

The cleaning approach varied by issue. Negative sales were excluded as non-comparable to regular weekly revenue. Missing markdown values were imputed to zero (reasonable, since an absent markdown functionally equals no promotion). Missing CPI and unemployment were carried forward from the most recent available observation.

After cleaning and aggregating sales to the store-week level, the analytical dataset held 430 store-weeks with complete predictors — enough power for regression while keeping the model focused on a clean subset of the data.
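The three cleaning rules plus the store-week aggregation can be sketched in pandas. Column names (`MarkDown1`, `CPI`, `Unemployment`, `Weekly_Sales`) and the sample rows below are assumptions for illustration, not the actual data:

```python
# Hedged sketch of the cleaning rules described above.
import pandas as pd

feat = pd.DataFrame({
    "Store": [1, 1, 1],
    "MarkDown1": [None, 500.0, None],
    "CPI": [211.1, None, None],
    "Unemployment": [8.1, None, 8.2],
})

# 1) Missing markdowns -> 0: an absent markdown means no promotion ran
feat["MarkDown1"] = feat["MarkDown1"].fillna(0)
# 2) Missing CPI / unemployment -> carry forward the last observation
feat[["CPI", "Unemployment"]] = feat[["CPI", "Unemployment"]].ffill()

sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Week": ["2012-01-06", "2012-01-06", "2012-01-06"],
    "Weekly_Sales": [1500.0, -25.0, 900.0],
})

# 3) Drop negative sales as non-comparable to regular weekly revenue
sales = sales[sales["Weekly_Sales"] >= 0]
# 4) Roll the remaining rows up to one row per store per week
store_week = sales.groupby(["Store", "Week"], as_index=False)["Weekly_Sales"].sum()
```

Forward-filling CPI and unemployment is defensible here because both series move slowly week to week; zero-filling markdowns encodes the "no promotion" interpretation directly.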

Simple regression: store size alone

The first model used just store size as a predictor:

[Table: simple regression output — R² = 0.6917, adjusted R² = 0.6909, standard error = $367,655]

The practical read: every additional 1,000 square feet predicts roughly $8,860 in incremental weekly sales. Useful for sizing new stores, but with 31% of variation unexplained, there’s clearly more to the story.
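The mechanics behind that coefficient are plain ordinary least squares. The sketch below uses synthetic (size, sales) points rather than the actual 430 store-weeks, so its fitted slope and R² are illustrative only:

```python
# Illustrative OLS mechanics on made-up data, not the study's fit.
import numpy as np

size = np.array([40.0, 120.0, 150.0, 200.0, 210.0])     # 1,000s of sq ft
weekly = np.array([0.4e6, 1.1e6, 1.5e6, 1.8e6, 2.0e6])  # weekly sales, $

X = np.column_stack([np.ones_like(size), size])          # intercept + size
beta, *_ = np.linalg.lstsq(X, weekly, rcond=None)        # beta = [b0, b1]

pred = X @ beta
ss_res = np.sum((weekly - pred) ** 2)
ss_tot = np.sum((weekly - weekly.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
# b1 plays the role of the "$ per extra 1,000 sq ft" coefficient
```

With size measured in thousands of square feet, the slope `b1` reads directly as incremental weekly dollars per 1,000 sq ft, which is what makes the $8,860 figure usable for sizing decisions.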

Multiple regression: adding context

The second model added temperature, total markdowns, unemployment, and a holiday indicator:

[Table: simple vs. multiple regression — R²: 0.6917 vs. 0.7502; adjusted R²: 0.6909 vs. 0.7473; standard error: $367,655 vs. $332,456; R² improvement: 0.0586]

Adding four variables lifted R² by 5.86 percentage points and cut the standard error by roughly $35,200. The improvement isn't dramatic, but it's meaningful: each additional source of variation the model captures tightens forecasts for downstream decisions like inventory planning and staffing.
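As a quick sanity check, the adjusted-R² figures are consistent with the standard formula, assuming n = 430 store-weeks and k = 1 predictor for the simple model vs. k = 5 for the multiple model:

```python
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 430
simple = adj_r2(0.6917, n, 1)    # close to the reported 0.6909
                                 # (small gap is R^2 rounding)
multiple = adj_r2(0.7502, n, 5)  # rounds to 0.7473, matching the report
```

The adjustment penalizes each added predictor, so the fact that adjusted R² rose nearly as much as raw R² suggests the four new variables earn their place rather than overfitting.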

What the model says

The findings with the clearest business implications:

What the model can’t tell you

Any regression analyst’s job includes knowing the model’s limits. Things this one doesn’t address:

Reflection

The real lesson wasn't in the final R² number. It was in how much the data quality audit shaped what was possible. A dataset with tens of thousands of quality issues, if modeled naively, produces regression coefficients that look statistically significant but mean nothing. Spending the first phase on cleaning, and documenting how each issue was resolved, is what makes the rest defensible.

