Multimodal Emotion Analysis

Complete NLP system for automatic emotion detection in Dutch video content

📅 February - April 2025 (Year 2 - Block C)
🏢 NLP Course
👥 Team of 3 students (8 weeks)
🎯 Content Intelligence Agency

Project Overview

We developed a complete NLP pipeline for a Content Intelligence Agency that automatically analyzes emotions in Dutch YouTube videos. The system receives a video URL as input and generates a CSV with timestamps, transcriptions, translated sentences, and emotion labels. The labels are Ekman's six basic emotions (happiness, sadness, anger, surprise, fear, disgust) plus a neutral class. The pipeline consists of four steps: audio download (PyTubeFix), speech-to-text conversion (Whisper Large-v3-Turbo), Dutch → English translation (MarianMT), and emotion classification (RobBERT). Eight different models were tested for emotion classification.

Challenges

Little labeled Dutch emotion data was available, so the entire class collectively labeled 4,000 sentences for training. The data was also strongly imbalanced: neutral and positive emotions were very common, while fear, disgust, and anger were rare. Data augmentation through synonym replacement, using Dutch WordNet and SpaCy, expanded the rare emotion classes by up to 6×. A key decision was the choice of model: RobBERT (native Dutch, F1-score 85%) versus BERT on translated data (F1-score 92%), a trade-off between accuracy and system complexity.
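The synonym-replacement idea can be illustrated with a minimal sketch. It is not the project's actual code: the `SYNONYMS` table is a hypothetical toy stand-in for the Dutch WordNet and SpaCy lookup, and the function names `augment` and `expand_class` are invented for this example.

```python
import random

# Hypothetical toy synonym table. The project used Dutch WordNet and SpaCy
# to find contextually appropriate synonyms; this dictionary only stands in
# for that lookup.
SYNONYMS = {
    "bang": ["angstig", "bevreesd"],
    "boos": ["kwaad", "woedend"],
    "vies": ["smerig", "goor"],
}


def augment(sentence, n_variants, rng):
    """Return up to n_variants new sentences, each with one word swapped."""
    tokens = sentence.split()  # naive whitespace tokenization
    # Positions of tokens that have at least one known synonym.
    slots = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    if not slots:
        return []
    variants = set()
    for _ in range(n_variants * 10):  # bounded number of attempts
        if len(variants) >= n_variants:
            break
        new_tokens = list(tokens)
        i = rng.choice(slots)
        new_tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
        candidate = " ".join(new_tokens)
        if candidate != sentence:
            variants.add(candidate)
    return sorted(variants)


def expand_class(sentences, factor, rng):
    """Grow a class towards factor x its size (fewer if synonyms run out)."""
    out = list(sentences)
    for s in sentences:
        out.extend(augment(s, factor - 1, rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    for line in expand_class(["Ik ben bang", "Hij is boos"], 3, rng):
        print(line)
```

With such a small table, a class cannot actually reach 6× its size. A real synonym source is needed for that, and augmented sentences should stay in the training split only, so that near-duplicates do not leak into evaluation.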

Results

The RobBERT model achieved an F1-score of 85% on Dutch emotion classification, well above the traditional models (LSTM ~65%, SVM ~58%). Although BERT on English translations reached 92%, we chose direct Dutch processing to avoid losing nuance and introducing translation errors. The pipeline is fully operational and turns a YouTube URL into a structured CSV file. The systematic model comparison showed that Transformers score roughly 20-30 percentage points higher in F1 than traditional NLP methods for emotion classification.

  • 4000+ manually labeled sentences
  • 40 min test video
  • 8 models tested
  • 85% F1-score

Pipeline Architecture

A fully automated system that processes a YouTube URL into a structured CSV file with timestamps, transcriptions, translations, and emotion labels.

1. Audio Download (PyTubeFix): Automatic audio download from YouTube
2. Speech-to-Text (Whisper Large-v3-Turbo): Transcription of Dutch speech with timestamps
3. Emotion Classification (RobBERT): RobBERT predicts emotion per sentence
4. Translation (MarianMT): Automatic translation Dutch → English
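How the four stages might be wired together is sketched below. This is an illustration under stated assumptions, not the project's implementation: the stage functions `download`, `transcribe`, `classify`, and `translate` are hypothetical callables standing in for wrappers around PyTubeFix, Whisper, RobBERT, and MarianMT, and the CSV column names are invented for the example.

```python
import csv
import io
from typing import Callable, Iterable, NamedTuple, TextIO


class Segment(NamedTuple):
    start: float  # seconds from the start of the video
    text: str     # Dutch transcription of one sentence


def format_timestamp(seconds: float) -> str:
    """Format a second offset as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def run_pipeline(
    url: str,
    download: Callable[[str], str],                 # URL -> local audio path
    transcribe: Callable[[str], Iterable[Segment]],  # audio path -> segments
    classify: Callable[[str], str],                  # Dutch text -> emotion
    translate: Callable[[str], str],                 # Dutch text -> English
    out: TextIO,
) -> int:
    """Run all stages and write one CSV row per sentence; return row count."""
    audio_path = download(url)
    writer = csv.writer(out)
    writer.writerow(["timestamp", "dutch", "english", "emotion"])
    count = 0
    for seg in transcribe(audio_path):
        writer.writerow([
            format_timestamp(seg.start),
            seg.text,
            translate(seg.text),
            classify(seg.text),  # classification runs on the Dutch text
        ])
        count += 1
    return count


if __name__ == "__main__":
    # Stub stages so the sketch runs without any model downloads.
    buf = io.StringIO()
    run_pipeline(
        "https://example.invalid/video",
        download=lambda url: "audio.mp3",
        transcribe=lambda path: [Segment(225.0, "Ik ben zo blij.")],
        classify=lambda text: "happiness",
        translate=lambda text: "I am so happy.",
        out=buf,
    )
    print(buf.getvalue())
```

Passing the stages in as callables keeps each model replaceable and lets the CSV logic be tested with stubs. Classification and translation both read the Dutch transcript, so their relative order does not affect the output.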

Model Comparison

We systematically tested and evaluated 8 different models on our 4,000-sentence dataset.

Performance Overview

Model | Type | F1-Score | Status
BERT (English) | Transformer on translated data | 92% | Highest score
RobBERT (NL) | Dutch Transformer | 85% | Used
LSTM | Recurrent network | ~65% | Baseline
RNN | Basic recurrent | ~62% | Baseline
SVM | Support Vector Machine | ~58% | Baseline
Logistic Regression | Linear | ~55% | Baseline
Naive Bayes | Probabilistic | ~53% | Baseline
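For readers unfamiliar with the metric, the sketch below shows one common way to compute a multi-class F1-score. The page does not say which averaging was used, so macro averaging (the unweighted mean of per-class F1) is an assumption here. scikit-learn's `f1_score` with `average="macro"` should give the same quantity.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label got a false positive
            fn[truth] += 1  # true label got a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Tiny made-up example, not project data.
    truth = ["anger", "anger", "neutral", "neutral"]
    preds = ["anger", "neutral", "neutral", "neutral"]
    print(per_class_f1(truth, preds))
    print(macro_f1(truth, preds))
```

Macro averaging is a natural fit for imbalanced emotion data because a model that ignores fear or disgust is penalized as much as one that ignores neutral.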

🎯 Choice for RobBERT

Although English BERT scored highest (92%), we chose RobBERT (85%) because:

  • Direct Dutch processing: Understands Dutch nuances better than translated text
  • No translation errors: Avoids errors from the translation step
  • Better generalization: Worked more reliably on new content
  • Simpler system: One less model = fewer failure points

Real Examples from Test Video

Below are concrete examples of how the pipeline works on a 40-minute test video:

⏱️ 00:03:45 Happiness
Dutch: "Wat geweldig om je hier te zien! Ik ben zo blij dat je er bent."
English: "How wonderful to see you here! I'm so happy you're here."

⏱️ 00:12:18 Sadness
Dutch: "Het is zo moeilijk om afscheid te nemen, dit doet pijn."
English: "It's so hard to say goodbye, this hurts."

⏱️ 00:18:32 Anger
Dutch: "Dit is onacceptabel! Hoe durf je dit te doen?"
English: "This is unacceptable! How dare you do this?"

⏱️ 00:25:09 Surprise
Dutch: "Wow, dat had ik echt niet verwacht! Wat een verrassing!"
English: "Wow, I really didn't expect that! What a surprise!"

⏱️ 00:31:47 Neutral
Dutch: "De vergadering begint om drie uur in de conferentiezaal."
English: "The meeting starts at three o'clock in the conference room."

Technology Stack

🎭 RobBERT
🎤 Whisper v3
🔄 MarianMT
🤗 Transformers
🐍 Python
🐼 Pandas
📊 Scikit-Learn
🔥 PyTorch
📝 SpaCy
📚 NLTK
📹 PyTubeFix
☁️ Azure Cloud

Complete NLP Toolkit

Challenges & Solutions

🎯 Lack of Labeled Dutch Data
There are few public datasets of Dutch sentences labeled with Ekman emotions. Existing datasets are mostly English or use different emotion categories.
Solution: The entire class labeled 4,000 sentences together (25 sentences each). We collected transcripts from Dutch TV shows and labeled them systematically according to the 7 emotion categories (Ekman's six plus neutral).

⚖️ Unequal Emotion Distribution
Natural speech contains mainly neutral and positive emotions. Fear, disgust, and anger occurred much less frequently, which led to poor performance on these categories.
Solution: Data augmentation via synonym replacement. Fear, anger, and disgust were expanded 6×, sadness and surprise 3×. We used Dutch WordNet and SpaCy to find contextually appropriate synonyms.

🤔 Traditional Models Insufficient
LSTM, RNN, SVM, and the other traditional models reached F1-scores of about 65% at most. They missed the contextual nuances needed for accurate emotion detection.
Solution: Switch to Transformers. RobBERT's pretrained language understanding raised the F1-score to 85%, and its attention mechanism captures context much better.
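The effect of the augmentation factors on the class distribution can be sketched as follows. The factors come from the text above, but the class counts are hypothetical, chosen only to add up to 4,000; the real distribution is not given here.

```python
# Expansion factors as stated above; classes not listed are left unchanged.
FACTORS = {"fear": 6, "anger": 6, "disgust": 6, "sadness": 3, "surprise": 3}

# HYPOTHETICAL counts for illustration only (they sum to 4,000).
EXAMPLE_COUNTS = {
    "neutral": 1800,
    "happiness": 1200,
    "sadness": 300,
    "surprise": 300,
    "anger": 200,
    "fear": 100,
    "disgust": 100,
}


def augmented_counts(counts, factors=FACTORS):
    """Multiply each class count by its expansion factor (default 1)."""
    return {label: n * factors.get(label, 1) for label, n in counts.items()}


def shares(counts):
    """Return each class's fraction of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    before = shares(EXAMPLE_COUNTS)
    after = shares(augmented_counts(EXAMPLE_COUNTS))
    for label in EXAMPLE_COUNTS:
        print(f"{label:10s} {before[label]:6.1%} -> {after[label]:6.1%}")
```

Under these assumed counts the rare classes move from a few percent each to a share comparable with the common ones, which is the intent of class-specific factors.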

Results & Model Performance

  • 85% RobBERT F1-score: chosen model, native Dutch
  • 92% BERT F1-score: highest, but requires translation
  • +20 percentage points Transformers boost: vs traditional models
  • 7 emotion categories: Ekman's six plus neutral

Per-Emotion Performance

RobBERT achieved good results for all emotion categories.

Lessons Learned

  • Complete NLP pipeline development: From raw video to structured data, understanding every step in a production system
  • Systematic model comparison: Evaluating 8 models taught objective selection based on metrics and practical considerations
  • Data annotation is crucial: Manual labeling showed how much data quality matters for model performance
  • Pretrained models are powerful: Transformers scored 20-30 percentage points higher in F1 than traditional approaches
  • Native language processing wins: Direct Dutch processing worked more reliably on new content than translating first, even though BERT on translated text scored higher in F1
  • Class imbalance requires action: Data augmentation was essential for good performance on rare emotions

What I Would Do Differently
