This project explores the Global Superstore dataset to uncover actionable insights about sales, profit, customers, and regional performance.
The goal is to perform end-to-end data analysis using Python — from cleaning and feature engineering to exploratory data analysis (EDA), statistical testing, and key insights generation.
- Understand sales and profit distribution across categories, regions, and segments.
- Identify key drivers of profitability and loss-making products.
- Perform time-series trend analysis and customer behavior study.
- Use statistical tests to validate observed patterns.
- Derive data-backed recommendations for strategic business decisions.
Source: Global Superstore Dataset (Kaggle)
Size: ~51,000 rows
Key Columns:
Order ID,Order Date,Ship Date,Ship ModeCustomer ID,Segment,RegionCategory,Sub-CategorySales,Profit,Quantity,Discount
- Python 3.x
- Jupyter Notebook
pandas,numpymatplotlib,seaborn,plotlyscipy.stats(for statistical testing)datetimepycountry
Retail-Sales-Analysis/
│
├─ data/
│ ├─ global_superstore.csv
│ ├─ global_superstore_clean.csv
│ └─ global_superstore_capped.csv
│
├─ notebook/
│ └─ retail_sales_eda.ipynb
│
├─ visuals/
│ ├─ univariate/
│ │ ├─ numerical_plots/
│ │ │ ├─ delivery_days_plot.png
│ │ │ ├─ discount_plot.png
│ │ │ ├─ profit_margin_plot.png
│ │ │ ├─ profit_plot.png
│ │ │ ├─ quantity_plot.png
│ │ │ ├─ sales_plot.png
│ │ │ └─ shipping_cost_plot.png
│ │ │
│ │ └─ categorical_plots/
│ │ ├─ category_countplot.png
│ │ ├─ market_countplot.png
│ │ ├─ region_countplot.png
│ │ ├─ segment_countplot.png
│ │ ├─ ship_mode_countplot.png
│ │ └─ sub_category_countplot.png
│ │
│ ├─ bivariate/
│ │ ├─ category_vs_profit.png
│ │ ├─ category_vs_sales.png
│ │ ├─ correlation_heatmap.png
│ │ ├─ discount_vs_profit.png
│ │ ├─ region_vs_category.png
│ │ ├─ region_vs_profit.png
│ │ ├─ segment_vs_profit.png
│ │ ├─ segment_vs_ship_mode.png
│ │ ├─ segment_vs_sales.png
│ │ ├─ region_vs_profit.png
│ │ └─ shipmode_vs_deliverydays.png
│ │
│ ├─ multivariate/
│ │ ├─ profitabilityByRegionCategory.png
│ │ ├─ region-category-subcategory.png
│ │ ├─ SalesProfitBySegmentCategory.png
│ │ ├─ top10customers.png
│ │ └─ top10ProductsByLoss.png
│ │
│ ├─ trend_analysis/
│ │ ├─ category-sales-trend.png
│ │ ├─ monthlytrend.png
│ │ ├─ orderstrend.png
│ │ ├─ yearlytrend.png
│ │ └─ year-month-trend.png
│ │
│ └─ advance_analysis/
│ ├─ AOVyearly.png
│ ├─ cohort.png
│ ├─ pareto.png
│ ├─ profitability.png
│ └─ unique_customers.png
│
├─ LICENSE
├─ requirements.txt
└─ README.md
Import all required libraries and define configurations.
Load dataset and perform initial exploration (shape, dtypes, missing values).
- Convert date columns and remove extra spaces
- Handle missing values & duplicates
- Engineer features like
Delivery_Days,Month_Year, etc. - Detect and cap outliers for
Sales,Profit, andShipping_Cost.
- Univariate: Distribution of sales, profit, discounts, categories.
- Bivariate: Relationship between discount, profit, and segment.
- Multivariate: Combined effect of region, category, and segment.
- Trend Analysis: Monthly and yearly sales & profit patterns.
- Customer Cohort: Retention and AOV per year.
- RFM & Pareto Analysis: Identify high-value customers and top contributors.
- Geographical Insights: Regional sales visualization.
To validate key findings:
- Shapiro–Wilk Test: Check normality of sales & profit.
- Correlation Test: Numeric relationships.
- Kruskal-Wallis Test: Profit differences by region, category, segment.
- Chi-Square Test: Association between categorical variables.
- Mann–Whitney U Test: Binary group comparisons (e.g., high vs low discount).
Summarized actionable findings based on analysis and statistics.
Wrap-up of project outcomes and business takeaways.
- Highest sales observed in Central region, but profitability varies across sub-categories.
- Furniture often shows high sales but low profit margins due to high shipping and discounts.
- Technology drives most of the profit, especially under Corporate and Home Office segments.
- Sales peak in Q4 each year, suggesting strong seasonal patterns (likely due to festive or year-end promotions).
- Discounts beyond 20% drastically reduce profit margins.
- Top 20% customers contribute ~80% of total sales (Pareto principle).
- No significant profit difference across regions (statistical validation supports this).
- Optimize discount strategy — cap discounts near profitable thresholds (~15–20%).
- Focus marketing on top-performing segments (Technology & Office Supplies).
- Reduce shipping costs for bulky items (especially Furniture) via logistics partnerships.
- Leverage Q4 trend — plan stock and campaigns ahead of the sales surge.
- Customer retention programs for high-value customers identified in RFM analysis.
- Periodic data review — continuous tracking of profitability drivers.
This project demonstrates a complete retail data analytics pipeline using Python.
It combines cleaning, feature engineering, visualization, statistical inference, and business recommendations — bridging the gap between data and decision-making.
-
Clone the repository
git clone https://github.com/ajx7/Retail-Sales-Analysis.git
-
Navigate to the project directory
cd Retail-Sales-Analysis -
Install dependencies
pip install -r requirements.txt
-
Run the Jupyter Notebook
jupyter notebook notebooks/retail_sales_eda.ipynb
Global Superstore Dataset (publicly available on Kaggle): Global Superstore Dataset – Kaggle
This project is licensed under the MIT License — free to use with attribution.