VibeTune: Multi-Modal Emotion-Based Music Recommender
Production-ready emotion detection with real-time Spotify integration
Overview
VibeTune is a production-ready web application that analyzes your emotional state through three modalities (face, voice, and text) and provides personalized music recommendations from Spotify to match your current vibe.
Deployed to Render with one click, featuring lazy-loaded ML models, an automated CI/CD pipeline, Docker containerization, and comprehensive monitoring with Sentry error tracking and Prometheus metrics.
Key Features
Face Analysis
ResNet50 fine-tuned on the RAF-DB dataset, achieving 74% validation accuracy. Detects 7 emotions from webcam capture or uploaded images.
Text Analysis
DistilRoBERTa transformer model from Hugging Face for contextual emotion classification of user-provided text, with support for 6 emotion classes.
Voice Analysis
Wav2Vec2-based speech emotion recognition supporting both live recording and audio file uploads with multi-emotion detection.
Spotify Integration
Real-time playlist generation with emotion-aligned tracks, album art, and 30-second audio previews across 8 emotion categories.
Production Features
Lazy-loaded ML models, automated CI/CD pipeline with GitHub Actions, Docker containerization with health checks, and comprehensive error tracking.
Monitoring & Metrics
Sentry error tracking, Prometheus metrics exposure, Grafana dashboards, and in-app memory management controls.
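The Spotify integration above can be sketched as a mapping from detected emotions to recommendation parameters. `seed_genres`, `target_valence`, `target_energy`, and `limit` are real query parameters of Spotify's `GET /v1/recommendations` endpoint; the emotion names, genre seeds, and tuning values below are illustrative assumptions, not VibeTune's actual configuration:

```python
# Hypothetical emotion-to-parameter table for Spotify's recommendations
# endpoint. Valence (musical positiveness) and energy are 0.0-1.0 audio
# features; the specific values here are illustrative tuning choices.
EMOTION_TO_SPOTIFY = {
    "happy":     {"seed_genres": "pop,dance",      "target_valence": 0.9, "target_energy": 0.8},
    "sad":       {"seed_genres": "acoustic,piano", "target_valence": 0.2, "target_energy": 0.3},
    "angry":     {"seed_genres": "metal,rock",     "target_valence": 0.3, "target_energy": 0.9},
    "surprised": {"seed_genres": "electronic",     "target_valence": 0.7, "target_energy": 0.7},
    "fearful":   {"seed_genres": "ambient",        "target_valence": 0.3, "target_energy": 0.2},
    "disgusted": {"seed_genres": "punk",           "target_valence": 0.3, "target_energy": 0.7},
    "calm":      {"seed_genres": "chill,jazz",     "target_valence": 0.6, "target_energy": 0.3},
    "neutral":   {"seed_genres": "indie",          "target_valence": 0.5, "target_energy": 0.5},
}

def recommendation_params(emotion: str, limit: int = 20) -> dict:
    """Build query params for GET /v1/recommendations, falling back to
    the neutral profile for any emotion label we don't recognize."""
    params = dict(EMOTION_TO_SPOTIFY.get(emotion.lower(), EMOTION_TO_SPOTIFY["neutral"]))
    params["limit"] = limit
    return params
```

The resulting dict can be passed directly as query parameters to the Spotify Web API via any HTTP client, with the track objects in the response supplying album art and 30-second `preview_url`s.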
Model Architecture
System Architecture Diagram
End-to-end multi-modal emotion detection pipeline with Spotify integration

Face Emotion Detection
Model: ResNet50 fine-tuned on RAF-DB dataset
Performance: 74% accuracy on the RAF-DB validation set
Emotions: Happy, Sad, Angry, Surprised, Fearful, Disgusted, Neutral
Input: Webcam capture or image upload
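As a sketch of the face pipeline's input stage: pretrained ResNet50 backbones conventionally take 224x224 RGB inputs normalized with the ImageNet mean and standard deviation. The NumPy-only version below uses a nearest-neighbour resize for self-containedness (the real app may well use OpenCV or torchvision transforms instead):

```python
import numpy as np

# ImageNet normalization constants used by standard pretrained ResNet50s.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_face(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize an HxWx3 uint8 RGB image (nearest neighbour) and normalize
    it into the 1x3xHxW float layout a ResNet50 expects."""
    h, w, _ = image.shape
    rows = np.arange(size) * h // size        # source row per output row
    cols = np.arange(size) * w // size        # source col per output col
    resized = image[rows[:, None], cols[None, :]]   # (size, size, 3)
    scaled = resized.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)[None, ...]  # (1, 3, size, size)
```

The batched array can then be fed to the fine-tuned model, with an argmax over the 7-way output giving the predicted emotion.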
Text Emotion Classification
Model: j-hartmann/emotion-english-distilroberta-base
Architecture: DistilRoBERTa transformer (pre-trained)
Emotions: 6-class emotion classification
Input: User-provided text (truncated to the model's 512-token limit)
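The text path maps almost directly onto the Hugging Face `pipeline` API. A minimal sketch, assuming a recent Transformers release where `top_k=None` returns scores for every label; `build_classifier` and `top_emotion` are illustrative helper names, not VibeTune's actual code:

```python
def build_classifier():
    """Lazily construct the Hugging Face text-classification pipeline
    (downloads the model weights on first call)."""
    from transformers import pipeline
    return pipeline(
        "text-classification",
        model="j-hartmann/emotion-english-distilroberta-base",
        top_k=None,       # return a score for every emotion label
        truncation=True,  # respect the 512-token input limit
    )

def top_emotion(scores: list) -> tuple:
    """Pick the highest-scoring label from pipeline output shaped like
    [{"label": "joy", "score": 0.93}, ...]."""
    best = max(scores, key=lambda s: s["score"])
    return best["label"], best["score"]
```

Usage would look like `label, score = top_emotion(build_classifier()(text)[0])`, with the label then routed into the shared emotion-to-playlist step.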
Voice Emotion Recognition
Model: ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition
Architecture: Wav2Vec2 trained on speech emotion datasets
Input: Live audio recording or file upload
Processing: Librosa for audio feature extraction
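A minimal sketch of the input-shaping step for the voice path, assuming the audio has already been decoded and resampled to 16 kHz mono upstream (e.g. via `librosa.load(path, sr=16000)`, matching the sample rate Wav2Vec2 checkpoints are trained on); batched inference needs fixed-length inputs, so the waveform is peak-normalized and padded or trimmed:

```python
import numpy as np

TARGET_SR = 16_000  # Wav2Vec2 models expect 16 kHz mono audio

def prepare_waveform(samples: np.ndarray, seconds: float = 4.0) -> np.ndarray:
    """Peak-normalize a mono float waveform and pad or trim it to a fixed
    duration so every clip in a batch has the same shape."""
    target_len = int(TARGET_SR * seconds)
    peak = np.max(np.abs(samples))
    if peak > 0:
        samples = samples / peak        # scale into [-1, 1]
    if len(samples) >= target_len:
        return samples[:target_len]     # trim long recordings
    return np.pad(samples, (0, target_len - len(samples)))  # zero-pad short ones
```

The 4-second window is an illustrative choice; the fixed array then goes through the model's feature extractor before classification.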
Tech Stack
ML Frameworks
Audio & Vision Processing
Web & API
DevOps & Monitoring
Implementation Highlights
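One highlight worth sketching is the lazy-loading pattern: each heavy model is loaded on first request rather than at startup and cached afterwards, which keeps boot time and idle memory low on small instances. The registry below is a hypothetical illustration of the pattern, not VibeTune's actual code:

```python
from functools import lru_cache

_LOADERS: dict = {}

def register_loader(name: str):
    """Decorator that registers a model-loading function under a name."""
    def wrap(fn):
        _LOADERS[name] = fn
        return fn
    return wrap

@lru_cache(maxsize=None)
def get_model(name: str):
    """Load the named model on first request and cache it, so heavy
    weights never load at startup and load at most once overall."""
    return _LOADERS[name]()
```

A loader would then be registered per modality, e.g. `@register_loader("face")` on a function that loads the fine-tuned ResNet50 checkpoint; only the first request for that modality pays the load cost.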
Deployment & Monitoring
Deployment Stack
Docker runtime with automatic builds
Health checks and auto-deploy on push
Pre-configured environment variables
Prometheus metrics on port 9100
Render one-click deployment
Monitoring & Observability
Sentry for error tracking & tracing
Prometheus metrics exposure at /metrics
Grafana dashboards (optional local stack)
Environment-based configuration
In-app memory management controls
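The /metrics endpoint above serves the Prometheus text exposition format. In practice this is usually produced by the official `prometheus_client` library; the sketch below hand-renders the same wire format for counters only, to show what Prometheus actually scrapes (the metric name in the usage example is hypothetical):

```python
def render_metrics(metrics: dict) -> str:
    """Render counters in the Prometheus text exposition format:
    a # HELP line, a # TYPE line, then the sample itself."""
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Serving this string with content type `text/plain` from /metrics (or port 9100) is all a Prometheus scrape job needs.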