Multimodal real-time emotion recognition system based on deep convolutional neural networks
DOI: 10.31673/2412-9070.2026.026805
Abstract
The scientific paper substantiates a conceptual framework for developing a multimodal intelligent system for real-time human emotion recognition, leveraging dynamic three-dimensional convolutional neural networks (3D-CNN) and advanced unsupervised machine learning techniques. The primary scientific novelty lies in a methodological approach to autonomous spatiotemporal feature extraction, which substantially reduces reliance on pre-labeled datasets while enhancing the system's adaptability to the nuances of non-verbal expression. A further contribution is the technical refinement of the MediaPipe framework through the integration of a specialized, modified algorithm for anthropometric marker detection, tailored to capturing the intricate patterns of pediatric facial expressions.
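To illustrate the spatiotemporal feature extraction that a 3D-CNN performs, the following is a minimal sketch of a single 3D convolution over a video volume (time × height × width). The shapes and the averaging kernel are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal illustration of a 3D convolution over a video volume
# (time x height x width). Shapes and the averaging kernel are
# illustrative assumptions, not the paper's actual architecture.

def conv3d(volume, kernel):
    """Valid (no-padding) 3D convolution of `volume` with `kernel`."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(T - t + 1):
        plane = []
        for j in range(H - h + 1):
            row = []
            for k in range(W - w + 1):
                acc = 0.0
                for di in range(t):
                    for dj in range(h):
                        for dk in range(w):
                            acc += volume[i + di][j + dj][k + dk] * kernel[di][dj][dk]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Toy 4-frame, 4x4 "video" and a 2x2x2 averaging kernel.
video = [[[float(f + y + x) for x in range(4)] for y in range(4)] for f in range(4)]
kernel = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
features = conv3d(video, kernel)  # output shape: 3 x 3 x 3
```

The key point is that the kernel spans the time axis as well as both spatial axes, so each output value summarizes a short motion segment rather than a single frame.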
This method for identifying smiles in children rests on a rigorous analysis of the dynamic curvature of the nasolabial folds and the elevation of the malar points, ensuring that positive emotional states are correctly identified rather than misinterpreted as background noise or neutral expressions. The methodological foundation is a dual-channel architecture that analyzes facial micro-expressions and body kinematics in parallel, mitigating the risks associated with partial facial occlusion or suboptimal camera angles. By employing 3D-CNNs, the system processes video data as cohesive spatiotemporal structures, while automated pseudo-labeling within unsupervised latent-space clustering enables it to autonomously structure basic emotional categories.
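The smile cue described above depends on measuring how strongly the nasolabial fold bends. A minimal geometric sketch, using hypothetical 2D landmark coordinates (not MediaPipe's real landmark indices), could estimate that bend as the Menger curvature of three points sampled along the fold:

```python
import math

# Menger curvature of three 2D points: kappa = 4 * area / (|AB| * |BC| * |CA|).
# The coordinates below are hypothetical stand-ins for points sampled along a
# nasolabial fold; they do not correspond to MediaPipe's real landmark indices.

def menger_curvature(a, b, c):
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Twice the triangle area via the cross product.
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    area = abs(cross) / 2.0
    if ab * bc * ca == 0.0:
        return 0.0  # degenerate (coincident points)
    return 4.0 * area / (ab * bc * ca)

# A fold sampled while neutral (nearly collinear) vs. smiling (strongly bent).
neutral = [(0.0, 0.0), (1.0, 0.05), (2.0, 0.0)]
smiling = [(0.0, 0.0), (1.0, 0.6), (2.0, 0.0)]

k_neutral = menger_curvature(*neutral)
k_smiling = menger_curvature(*smiling)
# A simple threshold on the curvature change separates the two states.
is_smile = k_smiling > 2.0 * k_neutral
```

Tracking this curvature frame to frame (together with malar-point elevation) is what lets a dynamic cue distinguish a genuine smile from a static, near-neutral configuration.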
Experimental validation using a Late Fusion modality integration strategy confirmed the model's robustness against noisy signals and inconsistent lighting. Results demonstrate that the proposed model achieves the high-speed processing required for real-time operation, making it suitable for integration into intelligent educational platforms, pediatric diagnostics, and security systems.
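The Late Fusion strategy above means each modality produces its own class-probability vector and the decision combines them afterwards. A minimal sketch, with illustrative weights (the paper does not specify its fusion coefficients):

```python
# Late fusion: each modality (facial channel, body-kinematics channel) outputs
# its own class-probability vector; the final decision is a weighted average.
# The weights and emotion classes below are illustrative assumptions.

EMOTIONS = ["happy", "neutral", "sad"]

def late_fusion(face_probs, body_probs, w_face=0.6, w_body=0.4):
    """Weighted average of per-modality probabilities, renormalized."""
    fused = [w_face * f + w_body * b for f, b in zip(face_probs, body_probs)]
    total = sum(fused)
    return [p / total for p in fused]

# Facial channel is confident; body channel is noisier but agrees.
face = [0.7, 0.2, 0.1]
body = [0.5, 0.3, 0.2]
probs = late_fusion(face, body)
label = EMOTIONS[probs.index(max(probs))]  # -> "happy"
```

Because fusion happens at the decision level, a degraded modality (occluded face, poor camera angle) can simply be down-weighted without retraining the other channel, which is what gives the scheme its robustness to noisy input.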
Keywords: emotion recognition, multimodal system, 3D-CNN, unsupervised learning, late fusion, real-time, computer vision, non-verbal behavior, feature clustering, facial expressions.