Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences

Πανάγος, Ιάσων - Ιωάννης

Please use this identifier to cite or link to this item: https://olympias.lib.uoi.gr/jspui/handle/123456789/39749

Full metadata record

DC Field	Value	Language
dc.contributor.author	Πανάγος, Ιάσων - Ιωάννης	el
dc.date.accessioned	2026-02-02T09:55:17Z	-
dc.date.available	2026-02-02T09:55:17Z	-
dc.identifier.uri	https://olympias.lib.uoi.gr/jspui/handle/123456789/39749	-
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Speech recognition	en
dc.subject	Machine learning	en
dc.subject	Lip reading	en
dc.subject	Computer vision	en
dc.title	Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences	en
dc.type	doctoralThesis	en
heal.type	doctoralThesis	el
heal.type.en	Doctoral thesis	en
heal.type.el	Διδακτορική διατριβή	el
heal.classification	Speech recognition
heal.classification	Machine learning
heal.classification	Lip reading
heal.classification	Computer vision
heal.dateAvailable	2026-02-02T09:56:18Z	-
heal.language	en	el
heal.access	free	el
heal.recordProvider	Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή	el
heal.publicationDate	2025	-
heal.bibliographicCitation	Panagos, I. I. (2025). Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences, [Doctoral dissertation], University of Ioannina.	en
heal.abstract	Visual Speech Recognition (VSR) is a computer vision problem that aims to decode spoken words of one or more speakers from visual media without the presence of sound. Applications of VSR are found in numerous domains, with profound impacts on various aspects of everyday life. A notable application lies in the field of accessibility in medicine, where a VSR system can assist individuals with speech impairments, significantly enhancing their quality of life. Other applications include, but are not limited to, video captioning and personal security systems, each with its own value. While recently there has been a steady increase in research interest regarding VSR, the issue of practicality has not been adequately explored. More specifically, the proposed models and methods often fail to consider the computational costs associated with their architectures, which severely limits or outright prevents their applicability in real-world scenarios. In this dissertation, we focus on addressing this oversight by developing lightweight and efficient end-to-end models for practical Visual Speech Recognition of isolated words. To realize this objective, we explore a multitude of approaches to reduce network size and complexity using a wide variety of methods. Owing to these reduced hardware requirements, such models can be applied to a broader range of applications and cover a sizable number of practical real-life scenarios, offering a series of benefits. The fundamental design of a VSR system follows a two-step structure that employs expensive components such as deep convolutional neural networks with large hardware overheads that are prohibitively expensive to deploy. Our goal is to reduce these resource requirements while maintaining acceptable recognition rates. To that end, we first employ techniques that exploit efficient formulations and low-cost operations to shrink model sizes without severely compromising performance.
We replace the standard, resource-intensive components in existing networks with more efficient ones, achieving significant reductions in model parameter counts as well as in computational complexity. Moreover, we design a lightweight temporal block blueprint that can be adapted to the resources at hand, and use it to develop highly efficient networks with minimal hardware demands. Next, we shift our attention to a more holistic approach, by designing a lightweight VSR model using efficient components. A systematic study is conducted evaluating multiple networks and structures for visual feature extraction as well as sequence modeling. We select the best-performing components and combine them in a unified end-to-end architecture that achieves very high recognition accuracy while being compact, outperforming all other lightweight approaches in the literature. Finally, using this model as a baseline, we explore techniques to improve its performance without raising its complexity, attempting to bridge the gap with larger models. To that end, we incorporate channel attention in its temporal blocks to enhance feature representation, while we refine its training process by introducing regularization that allows the networks to learn more descriptive features from the data. Combining these additions, we achieve significant recognition uplifts without affecting the network overhead.	en
heal.advisorName	Νίκου, Χριστόφορος	el
heal.committeeMemberName	Νίκου, Χριστόφορος	el
heal.committeeMemberName	Σφήκας, Γεώργιος	el
heal.committeeMemberName	Κόντης, Λυσίμαχος	el
heal.committeeMemberName	Λύκας, Αριστείδης	el
heal.committeeMemberName	Μπλέκας, Κωνσταντίνος	el
heal.committeeMemberName	Κεσίδης, Αναστάσιος	el
heal.committeeMemberName	Κακογεωργίου, Ιωάννης	el
heal.academicPublisher	Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή. Τμήμα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής	el
heal.academicPublisherID	uoi	el
heal.numberOfPages	250	el
heal.fullTextAvailability	true	-
Appears in Collections:	Διδακτορικές Διατριβές - ΜΗΥΠ

Files in This Item:

File	Description	Size	Format
Δ.Δ. Πανάγος Ιάσων - Ιωάννης (2025).pdf		1.8 MB	Adobe PDF

This item is licensed under a Creative Commons License

Repository of UOI "Olympias"