Deep Learning Based Text Extraction from Video Using CNN, LSTM, and Transformer Models
DOI: https://doi.org/10.70454/JRICST.2025.20304
Keywords: Optical Character Recognition, Text Extraction, Deep Learning, Convolutional Neural Networks, Long Short-Term Memory
Abstract
This study presents a deep learning-based method for extracting text from video frames that addresses motion blur, variable text orientation, and background noise. Traditional optical character recognition (OCR) tools such as Tesseract struggle under these conditions, while contemporary deep learning models offer notable improvements. The proposed model uses Convolutional Neural Networks (CNNs) to identify text regions, Long Short-Term Memory (LSTM) networks to preserve sequence order, and Transformer-based models to increase recognition accuracy. Several tests show that the CNN + LSTM architecture outperforms conventional OCR algorithms by balancing accuracy with real-time performance. Transformer-based methods achieve the highest accuracy but also incur the highest computational cost. Because CNNs, LSTMs, and Transformers handle spatial detection, temporal sequencing, and contextual recognition respectively, they are particularly well suited to video text extraction. In contrast to traditional OCR, this hybrid approach maintains high accuracy even on video frames that are noisy, blurry, or multilingual.
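To make the sequence-modelling role of the LSTM concrete, the following is a minimal pure-Python sketch of a single LSTM layer run over a sequence of feature vectors, such as the per-column features a CNN might produce for a detected text region. It is an illustration under stated assumptions, not the authors' implementation: the function names (`lstm_step`, `init_params`, `run_lstm`), the random weight initialisation, and the toy dimensions are all hypothetical, and a real system would use a trained model from a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h, c, W, b):
    """Run one LSTM time step.

    x: input feature vector (length D)
    h, c: previous hidden and cell state (length H each)
    W: gate name -> H rows of (D + H) weights; b: gate name -> H biases
    """
    z = x + h  # concatenate the input with the previous hidden state

    def gate(name, act):
        return [act(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(W[name], b[name])]

    i = gate("i", sigmoid)    # input gate: how much new information to admit
    f = gate("f", sigmoid)    # forget gate: how much old cell state to keep
    o = gate("o", sigmoid)    # output gate: how much cell state to expose
    g = gate("g", math.tanh)  # candidate cell values

    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def init_params(D, H, seed=0):
    """Hypothetical random initialisation; a real model would load trained weights."""
    rng = random.Random(seed)
    W = {k: [[rng.uniform(-0.5, 0.5) for _ in range(D + H)] for _ in range(H)]
         for k in "ifog"}
    b = {k: [0.0] * H for k in "ifog"}
    return W, b


def run_lstm(sequence, H, W, b):
    """Feed a sequence of feature vectors through the LSTM, one step at a time."""
    h, c = [0.0] * H, [0.0] * H
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return outputs


# Toy stand-in for CNN features: three columns, two features each (assumed sizes).
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W, b = init_params(D=2, H=3)
hidden_states = run_lstm(features, H=3, W=W, b=b)
```

Each hidden state depends on all earlier inputs through the cell state, so feeding the same columns in a different order yields a different result. That order sensitivity is what lets the recurrent stage keep characters in reading order.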
Journal of Recent Innovations in Computer Science and Technology
License
Copyright (c) 2025 Manender Dutt, Ritu Sharma (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits all use, distribution, and reproduction in any medium, provided the work is properly cited.