Machine learning system for market value prediction to support recruitment
A machine learning system built for NAC Breda that predicts player market value from performance statistics. The dataset covers 14,445 professional football players from 41 European leagues, with 115 features drawn from Opta professional match data. A Random Forest model with position-specific feature engineering achieved 76% accuracy for goalkeepers. Six ML algorithms (Linear Regression, Logistic Regression, Decision Tree, Random Forest, XGBoost, SVM) were systematically compared to select the best model.
Different positions have different value drivers (goalkeepers: saves; attackers: goals), so position-specific models were built. RFECV was used to narrow the 115 features down to position-relevant statistics. The dataset is imbalanced: most players have a market value below €1M. Market value is also subjective and influenced by factors beyond performance, such as hype, nationality, and contract length. Similar positions had to be separated (centre-backs vs. full-backs, box-to-box vs. attacking midfielders).
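A minimal sketch of what position-specific training on an imbalanced target could look like. It is not the project's notebook code: the data is synthetic, the three value classes and five statistics are hypothetical, and the use of stratified splits and class_weight="balanced" is an assumption about one reasonable way to handle the imbalance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data, NOT the NAC Breda dataset.
n_players = 600
positions = rng.choice(["goalkeeper", "defender", "attacker"], size=n_players)
X = rng.normal(size=(n_players, 5))  # five hypothetical performance statistics
# Hypothetical value classes, skewed so most players fall in the lowest band (0).
y = rng.choice([0, 1, 2], size=n_players, p=[0.7, 0.2, 0.1])


def train_per_position(X, y, positions):
    """Fit one classifier per position, using only that position's players."""
    models, scores = {}, {}
    for pos in np.unique(positions):
        mask = positions == pos
        # Stratify so the rare high-value classes appear in both splits.
        X_train, X_test, y_train, y_test = train_test_split(
            X[mask], y[mask], test_size=0.25, stratify=y[mask], random_state=0
        )
        # class_weight="balanced" weights classes inversely to their frequency
        # (an assumed choice, not confirmed by the project description).
        model = RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0
        )
        model.fit(X_train, y_train)
        models[pos] = model
        scores[pos] = model.score(X_test, y_test)
    return models, scores


models, scores = train_per_position(X, y, positions)
for pos, acc in scores.items():
    print(f"{pos}: test accuracy {acc:.2f}")
```

With random synthetic features the accuracies carry no meaning; the point is only the structure of one model per position, each evaluated on held-out players of that same position.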
The Random Forest model achieved 76% accuracy for goalkeepers, the best performance of all tested models and a strong result given the subjectivity of football transfer valuations. Position-specific feature selection identified the most important statistics per position (goalkeepers: saves and clean sheets; attackers: goals and conversion rate; defenders: defensive actions). Delivered: a complete Jupyter notebook and a 12-page professional report with recruitment recommendations. Project executed in collaboration with NAC Breda, including stadium tour and kickoff presentation.
The dataset, provided by NAC Breda, contains extensive performance statistics for 14,445 players from 41 European leagues. Opta data is generated live through a combination of human annotation, computer vision, and AI, and is considered the industry standard for professional football analytics.
RFECV (Recursive Feature Elimination with Cross-Validation) was used with a Decision Tree estimator to select the most relevant features per position.
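As an illustration of this step, the following sketch runs scikit-learn's RFECV with a Decision Tree on synthetic data standing in for a single position. The feature names, tree depth, fold count, and scoring metric are assumptions for the example, not settings taken from the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one position's players (not the real Opta features).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
feature_names = np.array([f"stat_{i}" for i in range(X.shape[1])])  # hypothetical

# RFECV repeatedly drops the least important feature (step=1) and uses
# cross-validation to choose how many features to keep.
selector = RFECV(
    estimator=DecisionTreeClassifier(max_depth=5, random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

selected = feature_names[selector.support_]
print(f"Kept {selector.n_features_} of {X.shape[1]} features:")
print(list(selected))
```

Running the same selection once per position group would yield a separate feature subset for each, which matches the position-specific approach described here.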
Strong relationships in goalkeeper statistics:
Significance: These relationships helped in choosing features for goalkeeper-specific models and confirmed logical patterns in the data.
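One simple way to surface such relationships is a pairwise correlation check. The sketch below uses synthetic numbers and hypothetical column names (shots_faced, saves, clean_sheets); the relationship between them is constructed for the example and does not reproduce the project's findings. The 0.7 threshold is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300

# Synthetic goalkeeper statistics (hypothetical, for illustration only).
shots_faced = rng.poisson(100, size=n).astype(float)
saves = 0.7 * shots_faced + rng.normal(0, 3, size=n)  # built to track shots faced
clean_sheets = rng.poisson(8, size=n).astype(float)   # generated independently

names = ["shots_faced", "saves", "clean_sheets"]
data = np.column_stack([shots_faced, saves, clean_sheets])


def strong_pairs(data, names, threshold=0.7):
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = strong_pairs(data, names)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

Strongly correlated pairs like these can either confirm that the data behaves logically or flag redundant features that a selection step may reduce to one.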
Six ML algorithms were tested to select the most suitable model for player valuation:
Linear Regression: R² = 0.113 – too simple for complex relationships
Logistic Regression: 11% accuracy – not suitable for multiple categories
Decision Tree: 62% accuracy – interpretable, but prone to overfitting
Random Forest: 76% accuracy (goalkeepers) – best and most stable result
XGBoost: 72% accuracy – good, but requires more computational power
SVM: 75% accuracy – strong competitor, but less interpretable
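A comparison like this can be sketched with cross-validated accuracy over a dictionary of candidate models. The example below is a hedged illustration on synthetic three-class data, not the project's evaluation code. Two deliberate deviations: scikit-learn's GradientBoostingClassifier is used as a stand-in for XGBoost to keep the example dependency-free, and Linear Regression is left out because it is a regressor scored by R² rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for binned market values.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting (stand-in for XGBoost)": GradientBoostingClassifier(
        random_state=0
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Score every candidate with the same 5-fold cross-validation.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean CV accuracy {acc:.3f}")
```

Using identical folds and a single metric for every candidate keeps the comparison fair; the ranking on synthetic data says nothing about the ranking on the real player data.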
The project demonstrates how data science can contribute to professional football organizations by developing objective, data-driven recruitment tools.