Please use this identifier to cite or link to this item: https://olympias.lib.uoi.gr/jspui/handle/123456789/39749
Full metadata record
DC Field | Value | Language
dc.contributor.author | Πανάγος, Ιάσων - Ιωάννης | el
dc.date.accessioned | 2026-02-02T09:55:17Z | -
dc.date.available | 2026-02-02T09:55:17Z | -
dc.identifier.uri | https://olympias.lib.uoi.gr/jspui/handle/123456789/39749 | -
dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | *
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | *
dc.subject | Speech recognition | en
dc.subject | Machine learning | en
dc.subject | Lip reading | en
dc.subject | Computer vision | en
dc.title | Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences | en
dc.type | doctoralThesis | en
heal.type | doctoralThesis | el
heal.type.en | Doctoral thesis | en
heal.type.el | Διδακτορική διατριβή | el
heal.classification | Speech recognition
heal.classification | Machine learning
heal.classification | Lip reading
heal.classification | Computer vision
heal.dateAvailable | 2026-02-02T09:56:18Z | -
heal.language | en | el
heal.access | free | el
heal.recordProvider | Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή | el
heal.publicationDate | 2025 | -
heal.bibliographicCitation | Panagos, I. I. (2025). Lightweight Methods and Models for Practical Visual Speech Recognition from Video Sequences, [Doctoral dissertation], University of Ioannina. | en
heal.abstract | Visual Speech Recognition (VSR) is a computer vision problem that aims to decode the spoken words of one or more speakers from visual media without the presence of sound. Applications of VSR are found in numerous domains, with profound impacts on various aspects of everyday life. A notable application lies in the field of accessibility in medicine, where a VSR system can assist individuals with speech impairments, significantly enhancing their quality of life. Other applications include, but are not limited to, video captioning and personal security systems, each with its own value. While research interest in VSR has been steadily increasing in recent years, the issue of practicality has not been adequately explored. More specifically, the proposed models and methods often fail to consider the computational costs associated with their architectures, which severely limits or outright prevents their applicability in real-world scenarios. In this dissertation, we focus on addressing this oversight by developing lightweight and efficient end-to-end models for practical Visual Speech Recognition of isolated words. To realize this objective, we explore a variety of approaches to reduce network size and complexity. Owing to these reduced hardware requirements, such models can serve a broader range of applications and cover a sizable number of practical, real-life scenarios. The fundamental design of a VSR system follows a two-step structure that employs expensive components, such as deep convolutional neural networks, whose large hardware overheads make them prohibitively expensive to deploy. Our goal is to reduce these resource requirements while maintaining acceptable recognition rates. To that end, we first employ techniques that exploit efficient formulations and low-cost operations to shrink model sizes without severely compromising performance. We replace the standard, resource-intensive components in existing networks with more efficient ones, achieving significant reductions in model parameter counts as well as in computational complexity. Moreover, we design a lightweight temporal block blueprint that is flexible and can be adapted to the resources at hand, and we use it to develop highly efficient networks with minimal hardware demands. Next, we shift our attention to a more holistic approach by designing a lightweight VSR model from efficient components. We conduct a systematic study evaluating multiple networks and structures for visual feature extraction as well as for sequence modeling. We select the best-performing components and combine them in a unified end-to-end architecture that achieves very high recognition accuracy while remaining compact, outperforming all other lightweight approaches in the literature. Finally, using this model as a baseline, we explore techniques to improve its performance without raising its complexity, attempting to bridge the gap with larger models. To that end, we incorporate channel attention in its temporal blocks to enhance feature representation, and we refine its training process by introducing regularization that allows the network to learn more descriptive features from the data. We then combine these additions to achieve significant recognition gains without increasing the network overhead. | en
heal.advisorName | Νίκου, Χριστόφορος | el
heal.committeeMemberName | Νίκου, Χριστόφορος | el
heal.committeeMemberName | Σφήκας, Γεώργιος | el
heal.committeeMemberName | Κόντης, Λυσίμαχος | el
heal.committeeMemberName | Λύκας, Αριστείδης | el
heal.committeeMemberName | Μπλέκας, Κωνσταντίνος | el
heal.committeeMemberName | Κεσίδης, Αναστάσιος | el
heal.committeeMemberName | Κακογεωργίου, Ιωάννης | el
heal.academicPublisher | Πανεπιστήμιο Ιωαννίνων. Πολυτεχνική Σχολή. Τμήμα Μηχανικών Ηλεκτρονικών Υπολογιστών και Πληροφορικής | el
heal.academicPublisherID | uoi | el
heal.numberOfPages | 250 | el
heal.fullTextAvailability | true | -
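
The abstract above describes a two-stage lightweight VSR design: an efficient convolutional frontend for visual feature extraction, followed by lightweight temporal blocks with channel attention for sequence modeling over isolated words. The PyTorch sketch below only illustrates that general idea under assumptions of our own: the class names (LightweightVSR, TemporalBlock, SqueezeExcite1d), the use of depthwise-separable temporal convolutions and squeeze-and-excitation channel attention, and all layer sizes and the 500-word vocabulary are hypothetical and are not taken from the dissertation's actual architecture.

```python
# Illustrative sketch only: a minimal two-stage VSR classifier for isolated
# words (efficient frontend + temporal blocks with channel attention).
# All names, layer sizes, and design choices here are assumptions for
# illustration, not the architecture proposed in the dissertation.
import torch
import torch.nn as nn


class SqueezeExcite1d(nn.Module):
    """Channel attention over a temporal feature sequence of shape (B, C, T)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=2))        # global average pool over time
        return x * w.unsqueeze(-1)        # re-weight channels


class TemporalBlock(nn.Module):
    """Depthwise-separable temporal convolution with channel attention."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm1d(channels)
        self.se = SqueezeExcite1d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.bn(self.pointwise(self.depthwise(x)))
        return torch.relu(x + self.se(y))  # residual connection


class LightweightVSR(nn.Module):
    """Frontend extracts per-frame features; temporal blocks model the sequence."""
    def __init__(self, num_words: int = 500, feat_dim: int = 128):
        super().__init__()
        # 3D conv stem over (B, 1, T, H, W) grayscale mouth crops,
        # followed by a cheap per-frame 2D encoder.
        self.stem = nn.Conv3d(1, 32, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.temporal = nn.Sequential(TemporalBlock(feat_dim),
                                      TemporalBlock(feat_dim))
        self.classifier = nn.Linear(feat_dim, num_words)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 1, T, H, W)
        x = torch.relu(self.stem(video))               # (B, 32, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # per-frame encoding
        x = self.frame_encoder(x).view(b, t, -1)       # (B, T, feat_dim)
        x = self.temporal(x.transpose(1, 2))           # (B, feat_dim, T)
        return self.classifier(x.mean(dim=2))          # logits over word classes


if __name__ == "__main__":
    model = LightweightVSR()
    clip = torch.randn(2, 1, 29, 88, 88)  # e.g. 29-frame grayscale mouth crops
    print(model(clip).shape)              # torch.Size([2, 500])
```
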
Appears in Collections: Διδακτορικές Διατριβές - ΜΗΥΠ

Files in This Item:
File | Description | Size | Format
Δ.Δ. Πανάγος Ιάσων - Ιωάννης (2025).pdf | | 1.8 MB | Adobe PDF


This item is licensed under a Creative Commons License.