NAC Breda Player Valuation System

Machine learning system for market value prediction to support recruitment

📅 November 2022 - January 2023 (Year 1 - Block B)
⚽ NAC Breda Collaboration
🔧 RandomForest, XGBoost
👤 Solo project (8 weeks)

Project Overview

A machine learning system for NAC Breda that predicts player market value from performance statistics. The dataset covers 14,445 professional football players from 41 European leagues, with 115 features drawn from Opta professional match data. A Random Forest model with position-specific feature engineering achieved 76% accuracy for goalkeepers. Six ML algorithms were compared systematically (Linear Regression, Logistic Regression, Decision Tree, Random Forest, XGBoost, SVM) to select the best model.

Challenges

  • Position-specific models: different positions have different value drivers (goalkeepers: saves; attackers: goals), so each position needed its own model.
  • Feature selection: RFECV was used to isolate the position-relevant statistics among the 115 features.
  • Imbalanced dataset: most players have a market value below €1M.
  • Subjective target: market value is influenced by factors beyond performance, such as hype, nationality, and contract length.
  • Similar positions: closely related roles had to be separated (centre-backs vs. full-backs, box-to-box vs. attacking midfielders).
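To make the imbalance concrete, here is a minimal sketch of how continuous market values could be binned into six value categories and the class shares inspected. The bin edges, labels, and the synthetic log-normal values are assumptions for illustration only; they are not the project's actual thresholds or data.

```python
import numpy as np
import pandas as pd

# Synthetic market values in euros (NOT the NAC Breda data). A log-normal
# distribution reproduces the kind of skew described: most players under 1M.
rng = np.random.default_rng(42)
market_value_eur = pd.Series(rng.lognormal(mean=12.5, sigma=1.3, size=1000))

# Hypothetical bin edges and labels for six value categories.
bin_edges = [0, 250_000, 500_000, 1_000_000, 5_000_000, 20_000_000, np.inf]
bin_labels = ["<250k", "250k-500k", "500k-1M", "1M-5M", "5M-20M", ">20M"]

# Turn the continuous target into an ordered categorical target.
value_category = pd.cut(market_value_eur, bins=bin_edges, labels=bin_labels)

# Share of players per category, in bin order; shows the imbalance up front.
class_share = value_category.value_counts(normalize=True).sort_index()
print(class_share.round(3))
```

Checking the class shares before training helps decide whether stratified splits or class weighting are worth considering.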

Results

The Random Forest model achieved 76% accuracy for goalkeepers, the best of all tested models and a strong result given how subjective football transfer values are. Position-specific feature selection identified the most important statistics per position (goalkeepers: saves and clean sheets; attackers: goals and conversion %; defenders: defensive actions). Delivered: a complete Jupyter notebook and a 12-page professional report with recruitment recommendations. The project was carried out in collaboration with NAC Breda and included a stadium tour and a kickoff presentation.

  • Accuracy (Goalkeepers): 76%
  • Players Analyzed: 14,445
  • Leagues (Europe): 41
  • Value Categories: 6

Data & Feature Selection

📊 Opta Professional Football Data

The dataset was provided by NAC Breda and contains extensive performance statistics for 14,445 players from 41 European leagues. Opta data is collected live through a combination of human annotation, computer vision, and AI, and is widely regarded as an industry standard for professional football analytics.

Dataset Features

Data Preparation

Position-specific Feature Selection

RFECV (Recursive Feature Elimination with Cross-Validation) was used with a Decision Tree estimator to select the most relevant features per position.
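A minimal sketch of RFECV with a Decision Tree estimator, using scikit-learn. The data is synthetic (`make_classification`), and the tree depth, fold count, and minimum number of features are assumed values, not the settings used in the project. In practice the same call would be repeated on each position's subset of players.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one position's feature table (not Opta data).
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_redundant=2,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Recursively drop the weakest feature and keep the feature count
# with the best cross-validated accuracy.
selector = RFECV(
    estimator=DecisionTreeClassifier(max_depth=5, random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=3,  # assumed lower bound
)
selector.fit(X, y)

selected_idx = [i for i, keep in enumerate(selector.support_) if keep]
print("features kept:", selector.n_features_)
print("selected column indices:", selected_idx)
```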

Exploratory Data Analysis: Key Insights

Player Demographics & Market Dynamics

Correlation Analysis: Defensive Statistics

Strong relationships in goalkeeper statistics:

  • Goals conceded per 90 ↔ Shots against per 90 ↔ xG against per 90: Teams that face more shots are more likely to concede goals
  • Save % negatively correlated with goals conceded: Goalkeepers with a higher save percentage concede fewer goals
  • Age vs minutes played: Younger players often get more playing time

Significance: These relationships helped in choosing features for goalkeeper-specific models and confirmed logical patterns in the data.
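The kind of correlation check described above can be sketched with pandas. The column names are hypothetical and the values are synthetic, generated so that goals conceded rise with xG against and fall with save percentage; nothing here reproduces the project's actual numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic goalkeeper statistics (hypothetical columns, not Opta data).
shots_against = rng.normal(12, 3, n).clip(min=1)
xg_against = 0.1 * shots_against + rng.normal(0, 0.15, n)
save_pct = rng.normal(70, 6, n).clip(40, 95)
# Assumed relationship: more xG against means more goals conceded,
# a higher save percentage means fewer.
goals_conceded = xg_against - 0.03 * (save_pct - 70) + rng.normal(0, 0.1, n)

goalkeepers = pd.DataFrame(
    {
        "shots_against_per90": shots_against,
        "xg_against_per90": xg_against,
        "goals_conceded_per90": goals_conceded,
        "save_pct": save_pct,
    }
)

# Pearson correlation matrix; strongly correlated pairs are candidates
# for keeping only one of the two features.
corr = goalkeepers.corr()
print(corr.round(2))
```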

Machine Learning: Model Development

Comparing Multiple Models

Six ML algorithms were tested to find the most suitable model for player valuation:

📉 Linear Regression

R² = 0.113 – too simple for complex relationships

📊 Logistic Regression

11% accuracy – performed poorly on the multi-category value target

🌳 Decision Tree

62% accuracy – interpretable, but prone to overfitting

🌲 Random Forest

76% accuracy (goalkeepers) – best and most stable result

⚡ XGBoost

72% accuracy – good, but requires more computational power

🔷 SVM

75% accuracy – strong competitor, but less interpretable
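A comparison like the one above can be sketched as a loop over candidate classifiers scored with cross-validation. This is an illustration on synthetic data with assumed hyperparameters, so the printed accuracies will not match the figures reported here. XGBoost is left out to keep the sketch dependent on scikit-learn only, and Linear Regression is omitted because it is scored with R² rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for value categories.
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=6, n_classes=3, random_state=1
)

# Scale-sensitive models get a StandardScaler in front of them.
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold accuracy per model.
mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

for name, acc in sorted(mean_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.3f}")
```

Scoring every model with the same folds and the same metric keeps the comparison fair.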

Why Random Forest?

🏆 Best Model Choice

  • Highest accuracy: 76% goalkeepers, 62% overall – better than alternatives
  • Stability: Averaging over many trees reduces overfitting compared with a single decision tree
  • Feature importance: Insight into most predictive statistics per position
  • Complex relationships: Captures non-linear relationships between performance and market value
  • Mixed data: Suitable for numerical and categorical features
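As an illustration of the feature-importance point, the sketch below fits a Random Forest on synthetic data and ranks the features by impurity-based importance. The feature names are generic placeholders and the ranking has no football meaning; the number of trees is an assumed value.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names; the data is synthetic, not Opta statistics.
feature_names = [f"stat_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=2,
)

forest = RandomForestClassifier(n_estimators=200, random_state=2).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.round(3))
```

Fitting one forest per position and comparing the rankings is one way to see which statistics drive predictions for each role.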

Results & Performance

  • Best accuracy (Goalkeepers): 76%
  • Overall accuracy: 62%
  • Train/Val/Test split: 70/20/10
  • Models tested: 6
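A 70/20/10 train/validation/test split can be produced with two calls to scikit-learn's `train_test_split`. The sketch below uses synthetic data and assumes stratification by class, which the write-up does not confirm; absolute sample counts are passed so the proportions come out exact.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the player table.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_classes=3, random_state=3
)

n_test = round(0.10 * len(X))  # 10% held out for the final test
n_val = round(0.20 * len(X))   # 20% used for validation

# First cut: separate the test set from everything else.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, stratify=y, random_state=3
)
# Second cut: take the validation set out of the remaining 90%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=3
)

print(len(X_train), len(X_val), len(X_test))
```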

Performance per Position

Model Evaluation Statistics

  • Accuracy: Share of all predictions that are correct – main metric
  • Precision: Share of predicted positives that are correct – penalizes false alarms
  • Recall: Share of actual positives that are detected – penalizes missed cases
  • F1 Score: Harmonic mean of precision and recall – balanced view
  • K-Fold cross-validation: 5-fold cross-validation for reliable performance estimates
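The metrics listed above can be computed from 5-fold cross-validated predictions, as in this minimal sketch. The data is synthetic, and macro averaging across classes is an assumption; the write-up does not state which averaging was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic multi-class data standing in for value categories.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6, n_classes=3, random_state=4
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=4)
model = RandomForestClassifier(n_estimators=100, random_state=4)

# Each sample is predicted by a model that never saw it during training.
y_pred = cross_val_predict(model, X, y, cv=cv)

accuracy = accuracy_score(y, y_pred)
# Macro average: compute each metric per class, then take the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y, y_pred, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With imbalanced classes, macro-averaged scores give small categories the same weight as large ones, which accuracy alone does not.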

Technologies Used

🐍 Python
🐼 Pandas
🔢 NumPy
📊 Matplotlib
🎨 Seaborn
🤖 Scikit-Learn
⚡ XGBoost
📈 Missingno

Complete ML Process

Academic Collaboration

The project was carried out in collaboration with NAC Breda and included a stadium tour and a kickoff presentation. It demonstrates how data science can support professional football organizations through objective, data-driven recruitment tools.

Challenges & Lessons Learned

Error Analysis: Problems between Positions

Ethical Considerations
