Multimodal Emotion Analysis

Project Overview

Complete NLP pipeline developed for a Content Intelligence Agency that automatically analyzes emotions in YouTube videos. The system receives a video URL as input and generates a CSV with timestamps, transcriptions, translated sentences, and emotion labels based on the Ekman model (happiness, sadness, anger, surprise, fear, disgust) plus a neutral class. The pipeline consists of four steps: audio download (PyTubeFix), speech-to-text conversion (Whisper Large-v3-Turbo), emotion classification (RobBERT), and Dutch → English translation (MarianMT). Eight different models were tested for emotion classification.

Challenges

Little labeled Dutch emotion data was available, so the entire class collectively labeled 4,000 sentences for training. The data was also strongly imbalanced: neutral and positive emotions were very common, while fear, disgust, and anger were rare. Data augmentation via synonym replacement (Dutch WordNet and SpaCy) expanded the rarest classes (fear, anger, disgust) 6× and sadness and surprise 3×. An important choice concerned the model: RobBERT on the original Dutch text (F1-score 85%) versus BERT on translated data (F1-score 92%), a trade-off between accuracy and system complexity.

Results

The RobBERT model achieved an 85% F1-score on Dutch emotion classification, well above the traditional models (LSTM ~65%, SVM ~58%). Although BERT on English translations reached 92%, we chose direct Dutch processing to avoid nuance loss and translation errors. The pipeline is fully operational and processes YouTube URLs into a structured CSV file. Systematic model comparison showed that Transformers score roughly 20-30 percentage points higher in F1 than the traditional NLP methods for emotion classification.

4000+

Manually Labeled Sentences

40 min

Test Video

8

Models Tested

85%

F1-Score

Pipeline Architecture

A fully automated system that processes a YouTube URL into a structured CSV file with timestamps, transcriptions, translations, and emotion labels.
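The output format described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the column names (`start`, `end`, `dutch`, `english`, `emotion`), the helper names, and the example end time are assumptions for illustration, not the project's actual schema.

```python
import csv
import io

# Assumed column names; the real CSV header is not given on this page.
FIELDNAMES = ["start", "end", "dutch", "english", "emotion"]


def format_timestamp(seconds: float) -> str:
    """Convert a number of seconds into HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def rows_to_csv(rows) -> str:
    """Serialize an iterable of row dicts into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "start": format_timestamp(row["start"]),
            "end": format_timestamp(row["end"]),
            "dutch": row["dutch"],
            "english": row["english"],
            "emotion": row["emotion"],
        })
    return buffer.getvalue()


# Illustrative input: one segment starting at 225 seconds (00:03:45).
example = [{
    "start": 225.0,
    "end": 229.5,  # made-up end time
    "dutch": "Wat geweldig om je hier te zien!",
    "english": "How wonderful to see you here!",
    "emotion": "happiness",
}]
csv_text = rows_to_csv(example)
print(csv_text)
```

Python's `csv` module quotes fields that contain commas or quotation marks automatically, so transcriptions with punctuation stay in a single column.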

1

Audio Download

Automatic audio download from YouTube

PyTubeFix

2

Speech-to-Text

Transcription of Dutch speech with timestamps

Whisper Large-v3-Turbo

3

Emotion Classification

RobBERT predicts emotion per sentence

RobBERT

4

Translation

Automatic translation Dutch → English

MarianMT

Step Details

Step 1 - Audio Download: The PyTubeFix library downloads the audio track of the YouTube video in the highest available quality. Output: an .mp3 file for further processing.
Step 2 - Transcription: The Whisper Large-v3-Turbo model (OpenAI) transcribes Dutch speech to text with timestamps per segment. The spoken language is detected automatically.
Step 3 - Emotion Classification: RobBERT, fine-tuned on Dutch emotion data, labels each sentence with one of the 7 emotion categories (six Ekman emotions plus neutral).
Step 4 - Translation: The MarianMT neural machine translation model translates the Dutch text to English for BERT processing. Timestamps are preserved.
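The way the four steps chain together can be sketched as below. This is not the project's code: `run_pipeline`, `Row`, and the `fake_*` stand-in stages are hypothetical names, and each real stage (PyTubeFix, Whisper, RobBERT, MarianMT) is assumed to be wrapped in a plain function so the flow can be shown without downloading any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# (start seconds, end seconds, Dutch text) for one transcribed segment
Segment = Tuple[float, float, str]


@dataclass
class Row:
    start: float
    end: float
    dutch: str
    english: str
    emotion: str


def run_pipeline(
    url: str,
    download_audio: Callable[[str], str],
    transcribe: Callable[[str], Iterable[Segment]],
    classify: Callable[[str], str],
    translate: Callable[[str], str],
) -> List[Row]:
    """Chain the four steps; each stage is passed in as a plain function."""
    audio_path = download_audio(url)                   # Step 1
    rows = []
    for start, end, dutch in transcribe(audio_path):   # Step 2
        emotion = classify(dutch)                      # Step 3: runs on the Dutch text
        english = translate(dutch)                     # Step 4: timestamps carried over unchanged
        rows.append(Row(start, end, dutch, english, emotion))
    return rows


# Stand-in stages so the sketch runs without any model or network access.
def fake_download(url: str) -> str:
    return "audio.mp3"


def fake_transcribe(path: str) -> List[Segment]:
    return [
        (0.0, 2.5, "Dit is onacceptabel!"),
        (2.5, 6.0, "De vergadering begint om drie uur."),
    ]


def fake_classify(text: str) -> str:
    return "anger" if "onacceptabel" in text else "neutral"


def fake_translate(text: str) -> str:
    lookup = {
        "Dit is onacceptabel!": "This is unacceptable!",
        "De vergadering begint om drie uur.": "The meeting starts at three o'clock.",
    }
    return lookup[text]


rows = run_pipeline(
    "https://example.com/video",  # placeholder URL
    fake_download,
    fake_transcribe,
    fake_classify,
    fake_translate,
)
for row in rows:
    print(row)
```

Passing the stages in as functions keeps each model replaceable, which is convenient when comparing classifiers on the same transcript.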

Model Comparison

Systematically tested and evaluated 8 different models on our 4000-sentence dataset.

Performance Overview

Model	Type	F1-Score	Status
BERT (English)	Transformer on translated data	92%	Highest Score
RobBERT (NL)	Dutch Transformer	85%	Used
LSTM	Recurrent Network	~65%	Baseline
RNN	Basic Recurrent	~62%	Baseline
SVM	Support Vector Machine	~58%	Baseline
Logistic Regression	Linear	~55%	Baseline
Naive Bayes	Probabilistic	~53%	Baseline
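For readers unfamiliar with the metric in the table, the sketch below computes an F1-score per class and averages it. The page does not state which averaging was used, so macro-averaging is an assumption here, and the label lists are invented toy data rather than project results.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN) for each label."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (every class counts equally)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["neutral", "neutral", "happiness", "anger", "anger", "fear"]
y_pred = ["neutral", "happiness", "happiness", "anger", "neutral", "fear"]

print(f1_per_class(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))
```

With macro-averaging, a rare class such as fear weighs as much as neutral, which is one reason class imbalance matters for the reported score.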

🎯 Choice for RobBERT

Although English BERT scored highest (92%), we chose RobBERT (85%) because:

Direct Dutch processing: Understands Dutch nuances better than translated text
No translation errors: Avoids errors from the translation step
Better generalization: Worked more reliably on new content
Simpler system: One less model = fewer failure points

Real Examples from Test Video

Below are concrete examples of how the pipeline works on a 40-minute test video:

⏱️ 00:03:45 Happiness

Dutch:
"Wat geweldig om je hier te zien! Ik ben zo blij dat je er bent."

English:
"How wonderful to see you here! I'm so happy you're here."

⏱️ 00:12:18 Sadness

Dutch:
"Het is zo moeilijk om afscheid te nemen, dit doet pijn."

English:
"It's so hard to say goodbye, this hurts."

⏱️ 00:18:32 Anger

Dutch:
"Dit is onacceptabel! Hoe durf je dit te doen?"

English:
"This is unacceptable! How dare you do this?"

⏱️ 00:25:09 Surprise

Dutch:
"Wow, dat had ik echt niet verwacht! Wat een verrassing!"

English:
"Wow, I really didn't expect that! What a surprise!"

⏱️ 00:31:47 Neutral

Dutch:
"De vergadering begint om drie uur in de conferentiezaal."

English:
"The meeting starts at three o'clock in the conference room."

Technology Stack

🎭 RobBERT

🎤 Whisper v3

🔄 MarianMT

🤗 Transformers

🐍 Python

🐼 Pandas

📊 Scikit-Learn

🔥 PyTorch

📝 SpaCy

📚 NLTK

📹 PyTubeFix

☁️ Azure Cloud

Complete NLP Toolkit

Speech-to-Text: Whisper Large-v3-Turbo for state-of-the-art Dutch speech recognition
Machine Translation: MarianMT Helsinki-NLP model for Dutch → English translation
Emotion Classification: RobBERT fine-tuned on 4000 manually labeled Dutch sentences
Data Augmentation: Dutch WordNet and SpaCy for contextually appropriate synonyms
Model Training: HuggingFace Transformers library for BERT/RobBERT fine-tuning
Baseline Models: Scikit-Learn for SVM, Logistic Regression, Naive Bayes comparison
Deep Learning: PyTorch for LSTM and RNN implementations
Deployment: Azure cloud for model hosting (temporary for course)

Challenges & Solutions

🎯 Lack of Labeled Dutch Data

Few public datasets with Dutch sentences labeled with Ekman emotions. Existing datasets are often English or use different emotion categories.

Solution:

The entire class labeled 4000 sentences together (25 sentences each). We collected transcript data from Dutch TV shows and labeled it systematically according to the 7 Ekman categories.

⚖️ Unequal Emotion Distribution

Natural speech contains mainly neutral and positive emotions. Fear, disgust, and anger occurred much less frequently, leading to poor performance on these categories.

Solution:

Data augmentation via synonym replacement. Fear, anger, and disgust were expanded 6×, sadness and surprise 3×. Dutch WordNet and SpaCy were used to find contextually appropriate synonyms.
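A minimal sketch of synonym-replacement augmentation with per-class factors follows. It makes several simplifying assumptions: a small hand-written `SYNONYMS` table stands in for the Dutch WordNet lookup, `str.split` stands in for SpaCy tokenization, and duplicate variants are not filtered out. The function names are hypothetical.

```python
import random

# Toy synonym table standing in for a Dutch WordNet lookup (entries are illustrative).
SYNONYMS = {
    "bang": ["angstig", "bevreesd"],
    "boos": ["kwaad", "woedend"],
    "vies": ["smerig", "walgelijk"],
}

# Target size multipliers per class, taken from the text above.
FACTORS = {"fear": 6, "anger": 6, "disgust": 6, "sadness": 3, "surprise": 3}


def synonym_variant(sentence, rng):
    """Replace every word that has a known synonym with a randomly chosen one."""
    words = sentence.split()
    return " ".join(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words)


def augment(dataset, seed=0):
    """Return the dataset plus synthetic variants so each class reaches its factor.

    Classes without an entry in FACTORS are left unchanged. Variants may repeat
    when a sentence has few replaceable words.
    """
    rng = random.Random(seed)
    augmented = list(dataset)
    for sentence, label in dataset:
        extra = FACTORS.get(label, 1) - 1
        for _ in range(extra):
            augmented.append((synonym_variant(sentence, rng), label))
    return augmented


data = [
    ("ik ben bang", "fear"),
    ("ik ben boos", "anger"),
    ("het is drie uur", "neutral"),
]
augmented = augment(data)
print(len(augmented))
```

A sentence with no replaceable words is simply copied, which amounts to plain oversampling; a richer synonym source reduces that effect.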

🤔 Traditional Models Insufficient

LSTM, RNN, SVM, and the other traditional models achieved F1-scores of roughly 65% or lower. They missed the contextual nuances needed for accurate emotion detection.

Solution:

Switch to Transformers. RobBERT's pretrained language understanding raised the F1-score to 85%. The attention mechanism captures context much better.

Results & Model Performance

85%

RobBERT F1-Score

Chosen model - native Dutch

92%

BERT F1-Score

Highest but requires translation

+20%

Transformers Boost

vs traditional models

7

Ekman Categories

Incl. neutral

Per-Emotion Performance

RobBERT achieved good results for all emotion categories:

Happiness: Highest precision - clear positive signal words ("wonderful", "happy", "fantastic")
Neutral: Most common, good recall - baseline for all classifications
Sadness: Good balance precision/recall - contextual markers like "difficult", "pain", "sad"
Anger: Significantly improved after augmentation - benefits from synonym expansion
Surprise: Hardest to classify - often confused with happiness for positive surprises
Fear & Disgust: Rare in the data, but performance became workable after 6× augmentation
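Observations such as "surprise is often confused with happiness" can be checked by counting which label pairs the classifier mixes up. The sketch below shows one way to do that; the helper name `confusion_pairs` is hypothetical and the labels are invented toy data, not the project's predictions.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred):
    """Count (true label, predicted label) pairs for misclassified sentences only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Toy labels, invented for illustration only.
y_true = ["surprise", "surprise", "surprise", "happiness", "sadness", "neutral"]
y_pred = ["happiness", "happiness", "surprise", "happiness", "neutral", "neutral"]

pairs = confusion_pairs(y_true, y_pred)
for (true_label, predicted_label), count in pairs.most_common():
    print(f"{true_label} -> {predicted_label}: {count}")
```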

Lessons Learned

Complete NLP pipeline development: From raw video to structured data - understanding every step in a production system
Systematic model comparison: Evaluating 8+ models taught objective selection based on metrics AND practical considerations
Data annotation is crucial: Manual labeling gave insight into importance of data quality for model performance
Pretrained models are powerful: Transformers score roughly 20-30 percentage points higher in F1 than traditional approaches
Native language processing wins: Direct Dutch processing worked more reliably on new content than translating first, even though English BERT scored higher on F1
Class imbalance requires action: Data augmentation essential for good performance on rare emotions

What I Would Do Differently

Invest more time in data collection for a larger, more diverse dataset
Ensemble methods: Combine multiple models (RobBERT + BERT + LSTM) for more robust predictions
A/B testing for RobBERT vs English BERT in production context with real users
Fine-tuning experiments with different learning rates and batch sizes
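The ensemble idea in the list above could, for instance, take the form of a simple majority vote. This is only a sketch of one possible design that was not part of the project: the function name, the model keys, and the rule of falling back to a preferred model on ties are all assumptions.

```python
from collections import Counter


def majority_vote(predictions, tie_breaker):
    """Pick the label most models agree on.

    predictions maps a model name to its predicted label. When several labels
    share the top count, the label of the tie_breaker model is returned.
    """
    counts = Counter(predictions.values())
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    if len(winners) == 1:
        return winners[0]
    return predictions[tie_breaker]


# Two of three hypothetical models agree, so their label wins.
print(majority_vote({"robbert": "anger", "bert": "anger", "lstm": "neutral"}, "robbert"))

# All three disagree, so the tie-breaker model decides.
print(majority_vote({"robbert": "fear", "bert": "surprise", "lstm": "neutral"}, "robbert"))
```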
