Olympic Medal Data Analysis Project
An R script that pulls the TidyTuesday Olympic medals dataset from GitHub and enriches it with 2016 population and GDP per capita data from the World Bank API. It then produces exploratory and analytical outputs: maps, distribution plots, scatter plots, a regression model, and k-means clustering.
Table of Contents
- Project Overview
- Features & Outputs
- Prerequisites
- Installation & Setup
- Usage
- Script Breakdown
- Interpreting the Results
- Extending & Customizing
- Data Sources & Citations
- License
Project Overview
This project provides a single R script, Olympic_Medal_Analysis.R, that:
- Automatically fetches the TidyTuesday Olympic medals CSV from GitHub
- Aggregates all‐time medal counts by National Olympic Committee (NOC)
- Augments each NOC with 2016 population and GDP per capita from the World Bank API
- Computes two “efficiency” metrics: medals per million population and medals per USD 1,000 GDP per capita
- Generates a suite of visualizations and analyses (bar charts, choropleths, scatter plots, heatmaps, regression diagnostics, and k‐means clustering)
All plots render in sequence when you source the script; regression summaries and model diagnostics print to the console.
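The aggregation and efficiency steps listed above can be sketched roughly as follows. This is a minimal base R illustration on made-up toy rows, not the script's actual code: the column names (noc, medal, population, gdp_per_capita) and every value are assumptions chosen for the example.

```r
# Toy medal records: one row per medal won (illustrative values only).
medals <- data.frame(
  noc   = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC"),
  medal = c("Gold", "Silver", "Gold", "Bronze", "Gold", "Silver")
)

# Toy country indicators (hypothetical numbers, not World Bank data).
indicators <- data.frame(
  noc            = c("AAA", "BBB", "CCC"),
  population     = c(3e6, 10e6, 0.5e6),
  gdp_per_capita = c(20000, 5000, 40000)
)

# Count medals per NOC and medal type: one row per NOC, one column per type.
medal_type <- factor(medals$medal, levels = c("Gold", "Silver", "Bronze"))
counts <- as.data.frame.matrix(table(medals$noc, medal_type))
counts$noc   <- rownames(counts)
counts$total <- counts$Gold + counts$Silver + counts$Bronze

# Attach the indicators and compute the two efficiency metrics.
noc_stats <- merge(counts, indicators, by = "noc")
noc_stats$medals_per_million <- noc_stats$total / (noc_stats$population / 1e6)
noc_stats$medals_per_1k_gdp  <- noc_stats$total / (noc_stats$gdp_per_capita / 1000)

# Rank NOCs by medals per million inhabitants.
print(noc_stats[order(-noc_stats$medals_per_million), ])
```

In this toy table the smallest country (CCC) ranks first on medals per million despite having the fewest medals, which is the kind of contrast the efficiency metrics are meant to surface.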
Features & Outputs
- Top NOCs by Medals
  - Bar chart of the top 10 NOCs by raw medal counts
- Efficiency Rankings
  - Bar chart of the top 10 NOCs by medals per million inhabitants
- Global Choropleth
  - World map shaded by total medals
- Medal‐Type Distribution
  - Pie chart of Gold, Silver, and Bronze proportions
- Medal Distributions
  - Histogram of total medals across all NOCs
- Pairwise Relationships
  - Scatter plots:
    - Gold vs. Silver
    - Total vs. Population
    - Medals per Million vs. GDP per Capita
- Correlation Heatmap
  - Heatmap of correlations among raw and efficiency variables
- Regression Analysis
  - Linear model predicting total medals from population, GDP per capita, and efficiency
  - Summary statistics and residuals‐versus‐fitted diagnostic plot
- K-Means Clustering
  - Four‐cluster segmentation of NOCs by medal profiles and efficiency
  - PCA-based cluster visualization
Prerequisites
- R (≥ 4.0)
- Internet access (to fetch both the CSV and World Bank data)
R Packages
The script will auto-install any missing packages. It relies on:
- tidyverse (ggplot2, dplyr, tidyr, readr, forcats)
- countrycode (country ↔ ISO code conversion)
- WDI (World Bank API interface)
- maps (world map boundaries)
- viridis (color scales)
- corrplot (correlation heatmaps)
- factoextra (cluster visualization)
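One common way to implement this kind of auto-install step is sketched below. It is an assumption about the general pattern, not a copy of the script's setup code; the helper names missing_packages() and ensure_packages() are hypothetical.

```r
# Packages listed above.
pkgs <- c("tidyverse", "countrycode", "WDI", "maps",
          "viridis", "corrplot", "factoextra")

# Return the subset of `pkgs` whose namespace cannot be loaded (not installed).
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing, then attach every package in `pkgs`.
ensure_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# ensure_packages(pkgs)  # uncomment to install and attach everything
```

The final call is left commented out so that running the sketch does not install anything without your consent.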
Installation & Setup
- Clone or download this repository.
- Ensure R ≥ 4.0 is installed.
- From an R console or RStudio, set your working directory to the project folder.
No additional build steps are required.
Usage
In R or RStudio:
# Source the analysis script
source("Olympic_Medal_Analysis.R")
Plots will appear one after another, and model summaries will print to the console. To save plots, add your own ggsave() calls after the plotting calls or modify the script accordingly.
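As a rough illustration of saving a plot with ggsave(), the sketch below builds a small ggplot2 bar chart from made-up data and writes it to a temporary PDF. The data frame, the object name p, the file name, and the dimensions are all assumptions for the example; the script's own plot objects may be named and built differently.

```r
library(ggplot2)

# Toy data standing in for a top-NOC table (values are made up).
top_nocs <- data.frame(
  noc   = c("AAA", "BBB", "CCC"),
  total = c(30, 20, 10)
)

# Keep the plot in a variable so it can be passed to ggsave() explicitly.
p <- ggplot(top_nocs, aes(x = reorder(noc, total), y = total)) +
  geom_col() +
  coord_flip() +
  labs(x = "NOC", y = "Total medals")

# Write the plot to disk; width and height are in inches by default.
out_file <- file.path(tempdir(), "top_nocs.pdf")
ggsave(out_file, plot = p, width = 8, height = 5)
```

If the plot argument is omitted, ggsave() saves the most recently displayed ggplot, so a bare ggsave() call placed right after a plotting call also works for ggplot2 output.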
Script Breakdown
- Setup
  - Defines the package list, installs any packages that are missing, and loads them.
- Data Fetch
  - Reads the Olympic medals CSV from the TidyTuesday GitHub repository.
  - Queries the World Bank API for 2016 population and GDP per capita.
- Data Preparation
  - Aggregates medal counts by NOC.
  - Merges with population/GDP; computes efficiency metrics.
- Visualizations
  - Bar charts (raw counts & efficiency)
  - Choropleth (world map)
  - Pie chart & histogram (distributional)
  - Scatter plots (pairwise relationships)
  - Correlation heatmap
- Statistical Modeling
  - Linear regression predicting total medals.
  - Diagnostic plots.
- Clustering
  - k-means on scaled medal & efficiency measures.
  - PCA cluster plot.
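The clustering step can be approximated in base R as shown below. This is a sketch on randomly generated data, not the script's code: the column names are assumptions, and the plot uses base graphics rather than factoextra (whose fviz_cluster() function produces a comparable PCA-based view).

```r
set.seed(42)  # k-means uses random starts; fix the seed for reproducibility

# Synthetic stand-in for the per-NOC table (random values, not real data).
n <- 40
noc_stats <- data.frame(
  gold               = rpois(n, 20),
  silver             = rpois(n, 18),
  bronze             = rpois(n, 22),
  medals_per_million = rexp(n, rate = 0.5)
)

# Standardize so no single variable dominates the Euclidean distances.
scaled <- scale(noc_stats)

# Four clusters; nstart reruns the algorithm from several random starts
# and keeps the solution with the lowest within-cluster sum of squares.
km <- kmeans(scaled, centers = 4, nstart = 25)

# Project onto the first two principal components for plotting.
pca <- prcomp(scaled)
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters in PCA space")

# Cluster sizes.
print(table(km$cluster))
```

Scaling before k-means matters here because raw medal counts and per-capita rates live on very different scales.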
Interpreting the Results
- Top‐10 Charts: Highlight traditional powerhouses vs. smaller but highly efficient NOCs.
- Choropleth: Shows geographic concentration of medal success.
- Efficiency Metrics: Reveal which countries “punch above their weight” relative to population or wealth.
- Correlations: Indicate how medal types co‐vary and how efficiency relates to GDP/population.
- Regression: Quantifies the contributions of population, GDP, and efficiency to raw medal counts.
- Clustering: Groups countries into distinct profiles (e.g., large teams vs. niche specialists).
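To make the regression item concrete, the sketch below fits a linear model of the same shape (total medals on population, GDP per capita, and an efficiency term) and draws the residuals-versus-fitted plot. The data are simulated and the column names are assumptions, so the coefficients say nothing about real Olympic results.

```r
set.seed(1)

# Simulated per-NOC table (random values, not real data).
n <- 60
noc_stats <- data.frame(
  population     = runif(n, 1e6, 2e8),
  gdp_per_capita = runif(n, 1e3, 6e4)
)

# Simulated medal totals: a linear signal plus noise, floored at zero.
signal <- 5 + 4e-7 * noc_stats$population + 1e-3 * noc_stats$gdp_per_capita
noc_stats$total <- pmax(0, round(signal + rnorm(n, sd = 10)))

# Efficiency metric derived from the totals, as in the project description.
noc_stats$medals_per_million <- noc_stats$total / (noc_stats$population / 1e6)

# Linear model with the three predictors named in the README.
fit <- lm(total ~ population + gdp_per_capita + medals_per_million,
          data = noc_stats)
print(summary(fit))

# Residuals-versus-fitted diagnostic (first of the standard lm plots).
plot(fit, which = 1)
```

In the summary, each coefficient is the expected change in total medals for a one-unit change in that predictor with the others held fixed; a visible pattern or funnel shape in the diagnostic plot suggests the linear, constant-variance assumptions are strained.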
Extending & Customizing
- Time‐Series: Pull medals by year to study trends.
- Additional Predictors: Include variables like host‐nation status, GDP growth, or sports‐sector investment.
- Alternative Models: Try Poisson or negative‐binomial regression for count data.
- Interactive Maps: Use leaflet or plotly for web‐friendly exploration.
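For the Alternative Models item above, a Poisson regression can be fit with base R's glm(). The sketch uses simulated counts and assumed predictor names (log_pop, log_gdp); it is one possible starting point rather than part of the script.

```r
set.seed(7)

# Simulated predictors on the log scale (random values, not real data).
n <- 80
log_pop <- rnorm(n, mean = 16, sd = 1.5)
log_gdp <- rnorm(n, mean = 9, sd = 1)

# Simulated medal counts whose log-mean is linear in the predictors.
mu    <- exp(-8 + 0.4 * log_pop + 0.5 * log_gdp)
total <- rpois(n, lambda = mu)

# Poisson GLM with the default log link.
fit <- glm(total ~ log_pop + log_gdp, family = poisson(link = "log"))
print(summary(fit))

# Rough overdispersion check: values well above 1 suggest the Poisson
# variance assumption is too tight for the data.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

If the dispersion statistic is well above 1 on real data, a negative-binomial model such as MASS::glm.nb() with the same formula is the usual next step.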
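Data Sources & Citations
- Olympic medals data: TidyTuesday project (CSV hosted on GitHub)
- Population and GDP per capita (2016): World Bank, retrieved through the World Bank API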
License
This project is released under the MIT License. See LICENSE for details.