Data Sources

We collected player and team statistics from FBref.com, a widely used football statistics site. FBref publishes data on players, teams, matches, and tournaments, which made it a convenient single source for the performance metrics used in this project.

The data spans four football seasons, from 2020-2021 to 2023-2024, covering major leagues and tournaments such as La Liga, the Premier League, and the UEFA Champions League. Our primary focus was to extract performance metrics that could reveal insights into player efficiency, team strategies, and overall trends in football performance.

Each season's data was scraped separately, because player form, team composition, and tactical approach change from year to year. The data was then organized into two datasets: Player_stats.csv (player statistics) and Team_Stats.csv (team statistics).

The collected data serves as the foundation for our analysis, allowing us to explore various aspects of football performance, such as the correlation between player metrics and match outcomes, the impact of team formations on success rates, and trends in disciplinary actions over time.

Using FBref as the single source for every season let us combine the season-specific statistics into unified datasets for evaluation and visualization.

Data Collection Process

1. Collecting Player Stats

Player statistics were scraped using Selenium and BeautifulSoup. Selenium was used to handle dynamic content on the FBref website. For example, the following code was used to scrape data for the 2020-2021 La Liga season:

from io import StringIO

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd

# URL for La Liga 2020-2021 player stats
url = 'https://fbref.com/en/comps/12/2020-2021/stats/2020-2021-La-Liga-Stats'

# Setup Selenium with headless Chrome
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Open the URL, extract the page source, then close the browser
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Find the player stats table by its ID
table = soup.find('table', {'id': 'stats_standard'})
# Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]  # Parse the table into a DataFrame

# Save the data to a CSV file
df.to_csv('la_liga_2020_2021_player_stats.csv', index=False)

This process was repeated for the 2020-2021, 2021-2022, 2022-2023, and 2023-2024 seasons, resulting in four separate files. The files were later combined into a single dataset: Player_stats.csv.
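The combining step is not shown above. One way it could look is sketched below, assuming each season's file has the same columns. The `combine_seasons` helper and the `Season` column are hypothetical names introduced for illustration, and the toy frames stand in for the scraped CSV files.

```python
import pandas as pd

def combine_seasons(season_frames):
    """Stack per-season DataFrames and tag each row with its season."""
    tagged = [
        frame.assign(Season=season)  # "Season" is a hypothetical column name
        for season, frame in season_frames.items()
    ]
    return pd.concat(tagged, ignore_index=True)

# Toy stand-ins for the per-season files; in practice each frame would come
# from pd.read_csv on that season's CSV.
season_frames = {
    "2020-2021": pd.DataFrame({"Player": ["Player A", "Player B"], "Gls": [10, 7]}),
    "2021-2022": pd.DataFrame({"Player": ["Player A", "Player C"], "Gls": [12, 5]}),
}
player_stats = combine_seasons(season_frames)

# Writing the combined file would then be:
# player_stats.to_csv("Player_stats.csv", index=False)
```

Tagging each row with its season before concatenating keeps the origin of every record recoverable once the files are merged.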

2. Collecting Team Stats

Team statistics were collected using BeautifulSoup and requests. We fetched URLs for each team, scraped relevant tables, and combined the data. Here’s an example of extracting team stats for the 2020-2021 La Liga season:

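from io import StringIO
import time

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Fetch team URLs (a set removes squad links that appear more than once on the page)
response = requests.get('https://fbref.com/en/comps/12/2020-2021/2020-2021-La-Liga-Stats')
soup = BeautifulSoup(response.text, 'html.parser')
team_links = sorted({f"https://fbref.com{a['href']}" for a in soup.find_all('a', href=True) if '/squads/' in a['href']})

# Scrape the "Scores & Fixtures" table for each team (the "Shooting" table is not shown here)
for team_url in team_links:
    team_page = requests.get(team_url)
    team_soup = BeautifulSoup(team_page.text, 'html.parser')
    # Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings
    scores_fixtures = pd.read_html(StringIO(str(team_soup)), match="Scores & Fixtures")[0]
    time.sleep(5)  # Pause between requests to avoid overloading the site
    # Process and save the data...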

This process was also repeated for four seasons and the resulting files were merged into Team_Stats.csv.
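Repeating the scrape for each season mostly means changing the season label in the URL. The sketch below builds the four league URLs from the pattern of the 2020-2021 URL shown above. It assumes the other seasons follow the same pattern, which we have not verified here, and the helper names are hypothetical.

```python
def season_labels(first_start_year, n_seasons):
    """Return labels such as '2020-2021' for consecutive seasons."""
    return [
        f"{year}-{year + 1}"
        for year in range(first_start_year, first_start_year + n_seasons)
    ]

def la_liga_urls(first_start_year=2020, n_seasons=4):
    """Map each season label to its league page URL (pattern assumed from 2020-2021)."""
    base = "https://fbref.com/en/comps/12"
    return {
        season: f"{base}/{season}/{season}-La-Liga-Stats"
        for season in season_labels(first_start_year, n_seasons)
    }

urls = la_liga_urls()
for season, url in urls.items():
    print(season, url)
```

Each URL could then be passed through the team-link extraction above, with the per-season results saved before merging.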

3. Collecting Additional Team Data

In addition to the initial Team_Stats.csv dataset, another dataset named Team Dataset 2 (originally cleaned_dataset_no_index.csv) was created. This dataset aggregates and enhances the initial team statistics with additional cleaned and processed metrics. It combines player and team-level information for a more comprehensive analysis.

import pandas as pd

# Load the cleaned dataset
cleaned_data = pd.read_csv('cleaned_dataset_no_index.csv')

# Display the first few rows of the dataset
print(cleaned_data.head())

Team Dataset 2 was used alongside the original Team_Stats.csv for team-level analysis that also draws on player contributions.

Data Cleaning Steps

Data cleaning is a crucial part of the process to ensure accuracy and usability. Here are the steps we followed to clean our dataset:

        
# Rename columns
df = df.rename(columns={"Gls": "Goals", "Ast": "Assists"})

# Remove missing values (after renaming, so that "Goals" and "Assists" exist)
df = df.dropna(subset=["Goals", "Assists"])

# Filter active players
df = df[df["Minutes Played"] > 0]
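Tables parsed from HTML can need one more pass before steps like these apply, for instance when the table has a two-level header or repeats its header row inside the body. The sketch below shows one way to handle both cases on a toy frame. The `tidy_scraped_table` helper is hypothetical, and the column names `Player`, `Min`, `Gls`, and `Ast` are assumptions about the scraped layout.

```python
import pandas as pd

def tidy_scraped_table(df):
    """Flatten a two-level header and drop header rows repeated in the body."""
    df = df.copy()
    if isinstance(df.columns, pd.MultiIndex):
        # Keep only the lowest header level, e.g. ("Performance", "Gls") -> "Gls"
        df.columns = df.columns.get_level_values(-1)
    # A repeated header row carries the column name as its value
    df = df[df["Player"] != "Player"].reset_index(drop=True)
    # Values arrive as strings when header rows are mixed into the body
    for col in ("Min", "Gls", "Ast"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

# Toy frame imitating a scraped table with a repeated header row
raw = pd.DataFrame(
    [
        ["Player A", "900", "5", "3"],
        ["Player", "Min", "Gls", "Ast"],
        ["Player B", "0", None, None],
    ],
    columns=pd.MultiIndex.from_tuples([
        ("Unnamed", "Player"),
        ("Playing Time", "Min"),
        ("Performance", "Gls"),
        ("Performance", "Ast"),
    ]),
)
tidy = tidy_scraped_table(raw)
print(tidy)
```

If the lower header level repeats a name across groups, joining both levels into one label would be safer than keeping only the last level.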
        
            

Data Visualizations

Data visualizations are essential for interpreting complex datasets and conveying insights effectively. By representing data visually, we can identify patterns, trends, and outliers that may not be apparent in raw data. Below are some of the visualizations we created to illustrate key findings from our analysis:

Football Pitch Lineup Visualization

This model generates unique football lineups for the 4-3-3 formation, visualized on a football pitch for each season. Players are assigned positions based on their performance metrics, primarily "Goals + Assists," ensuring the strongest lineup for each season.
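The selection logic can be sketched as a ranking by goals plus assists within each position group. This is a minimal illustration and not the model itself. The column names `Player`, `Pos`, `Goals`, and `Assists`, the position codes, and the `pick_lineup` helper are all assumptions, and the toy data is synthetic.

```python
import pandas as pd

# Slots per position group in a 4-3-3 (position codes are assumed)
FORMATION_433 = {"GK": 1, "DF": 4, "MF": 3, "FW": 3}

def pick_lineup(players, formation=FORMATION_433):
    """Pick the top players in each position group by goals plus assists."""
    ranked = players.assign(GA=players["Goals"] + players["Assists"])
    # Keep each player's best row only, so nobody fills two slots
    ranked = ranked.sort_values("GA", ascending=False).drop_duplicates("Player")
    return {
        pos: ranked[ranked["Pos"] == pos].head(slots)["Player"].tolist()
        for pos, slots in formation.items()
    }

# Synthetic players: 2 goalkeepers, 5 defenders, 3 midfielders, 4 forwards
toy = pd.DataFrame({
    "Player": [f"P{i}" for i in range(14)],
    "Pos": ["GK"] * 2 + ["DF"] * 5 + ["MF"] * 3 + ["FW"] * 4,
    "Goals": [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "Assists": [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
})
lineup = pick_lineup(toy)
print(lineup)
```

Ranking within position groups depends on the position labels in the data, so a player recorded under an unexpected position would be placed in that group's slots.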

2020-2021 Lineup
Football Pitch Lineup for 2020-2021

Forward Line: Iago Aspas (Right Wing), Antoine Griezmann (Striker), Yannick Carrasco (Left Wing).
Midfield: Gerard Moreno, Lionel Messi, Marcos Llorente.
Defense: José Luis Gayà (Left Back), Jordi Alba (Center Back), Fidel (Center Back), Marko Dmitrović (Right Back).
Goalkeeper: Marko Dmitrović.

2021-2022 Lineup
Football Pitch Lineup for 2021-2022

Forward Line: Iago Aspas (Right Wing), Karim Benzema (Striker), Vinícius Júnior (Left Wing).
Midfield: Ángel Correa, José Luis Morales, Gonçalo Guedes.
Defense: Johan Mojica (Left Back), Jordi Alba (Center Back), Alfonso Pedraza (Center Back), Álex Moreno (Right Back).
Goalkeeper: Yassine Bounou.

2022-2023 Lineup
Football Pitch Lineup for 2022-2023

Forward Line: Vinícius Júnior (Right Wing), Karim Benzema (Striker), Yannick Carrasco (Left Wing).
Midfield: Gabriel Veiga, Antoine Griezmann, Rodrygo.
Defense: Alejandro Balde (Left Back), Arnau Martínez (Center Back), Brian Oliván (Center Back), Álvaro Aceves (Right Back).
Goalkeeper: Álvaro Aceves.

2023-2024 Lineup
Football Pitch Lineup for 2023-2024

Forward Line: Robert Lewandowski (Right Wing), Alexander Sørloth (Striker), Sávio (Left Wing).
Midfield: Jude Bellingham, Antoine Griezmann, Marcos Llorente.
Defense: Miguel Gutiérrez (Left Back), Lucas Vázquez (Center Back), Yan Couto (Center Back), Artem Dovbyk (Right Back).
Goalkeeper: Julen Agirrezabala.