ANALYSIS OF SPATIO-TEMPORAL CONVOLUTIONAL NEURAL NETWORKS FOR THE ACTION DETECTION TASKS
DOI:
https://doi.org/10.26577/jpcsit2024-v2-i4-a3
Keywords:
action detection, convolutional neural networks, spatio-temporal convolutional neural networks, YOWO
Abstract
This study investigates the effectiveness of spatio-temporal convolutional neural networks (ST-CNNs) for action detection through a comparison of state-of-the-art models including You Only Watch Once (YOWO), YOWOv2, YOWO-Frame, and YOWO-Plus. In experiments on benchmark datasets such as UCF-101, HMDB-51, and AVA, we evaluate these architectures on metrics including frame-level mean Average Precision (frame-mAP), video-mAP, computational efficiency measured in frames per second (FPS), and scalability. We also test the YOWO family in real time on an IP camera stream delivered over RTSP to assess practical applicability. The results show that YOWO-Plus is the most accurate at capturing complex spatio-temporal dynamics, at the cost of processing speed, while YOWO-Frame offers the efficiency needed for live applications. The analysis underscores the trade-off between speed and accuracy inherent in single-stage ST-CNN architectures, and the comparison provides a foundation for developing real-time action detection systems that operate efficiently and reliably.
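To make the frame-mAP metric named in the abstract concrete, the following is a minimal sketch of how a frame-level average precision could be computed for one action class. It is an illustration under stated assumptions, not the evaluation code used in the study: the function names (`iou`, `frame_ap`, `frame_map`), the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, and the non-interpolated form of AP are all assumptions chosen for brevity. Benchmark toolkits typically use an interpolated variant.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def frame_ap(detections, ground_truth, iou_thr=0.5):
    """Non-interpolated average precision for a single action class.

    detections:   list of (frame_id, score, box) tuples (hypothetical format)
    ground_truth: dict mapping frame_id -> list of ground-truth boxes
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    # Each ground-truth box may be matched by at most one detection.
    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    tp, fp, ap = 0, 0, 0.0
    # Visit detections from highest to lowest confidence.
    for frame_id, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_j, best_iou = -1, 0.0
        for j, gt_box in enumerate(ground_truth.get(frame_id, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thr and not matched[frame_id][best_j]:
            matched[frame_id][best_j] = True
            tp += 1
            # Each true positive raises recall by 1/n_gt; add precision at that point.
            ap += (tp / (tp + fp)) / n_gt
        else:
            fp += 1
    return ap


def frame_map(per_class):
    """Mean of frame_ap over classes; per_class maps name -> (detections, ground_truth)."""
    aps = [frame_ap(dets, gts) for dets, gts in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    gts = {0: [(0, 0, 10, 10)], 1: [(0, 0, 10, 10)]}
    dets = [
        (0, 0.9, (0, 0, 10, 10)),    # correct detection in frame 0
        (1, 0.8, (20, 20, 30, 30)),  # false positive in frame 1
        (1, 0.7, (0, 0, 10, 10)),    # correct detection in frame 1
    ]
    print(round(frame_ap(dets, gts), 4))
```

In this toy input the false positive ranked between the two correct detections lowers precision at the second recall step, so the class AP falls below 1 even though both ground-truth boxes are eventually found. Video-mAP follows the same pattern but matches linked action tubes across frames rather than individual boxes.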