Deep Learning Based Text Extraction from Video Using CNN, LSTM, and Transformer Models

Authors

  • Manender Dutt, Assistant Professor, Unitedworld Institute of Technology, Karnavati University, Gandhinagar, Gujarat
  • Ritu Sharma, Assistant Professor, Unitedworld Institute of Technology, Karnavati University, Gandhinagar, Gujarat

DOI:

https://doi.org/10.70454/JRICST.2025.20304

Keywords:

Optical Character Recognition, Text Extraction, Deep Learning, Convolutional Neural Networks, Long Short-Term Memory

Abstract

This study presents a deep learning-based method for extracting text from video frames that addresses motion blur, variable text orientation, and background noise. Traditional optical character recognition (OCR) tools such as Tesseract struggle under these conditions, whereas contemporary deep learning models handle them considerably better. The proposed model uses Convolutional Neural Networks (CNNs) to detect text regions, Long Short-Term Memory (LSTM) networks to preserve sequence information, and Transformer-based models to improve recognition accuracy. Experiments show that the CNN + LSTM architecture outperforms conventional OCR algorithms by balancing accuracy against real-time performance. Transformer-based methods achieve the highest accuracy but also incur the highest computational cost. Because CNNs, LSTMs, and Transformers respectively handle spatial detection, temporal sequencing, and contextual recognition, they are particularly well suited to video text extraction. In contrast to traditional OCR, this hybrid approach maintains high accuracy even on video frames that are noisy, blurry, or multilingual.
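To make the CNN + LSTM combination described in the abstract more concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation. It assumes the text region has already been cropped from a frame as a 32-pixel-high grayscale image. The class name CnnLstmRecognizer, the layer sizes, and the 37-class output (26 letters, 10 digits, one blank) are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class CnnLstmRecognizer(nn.Module):
    """Hypothetical CNN + LSTM recognizer for one cropped text region."""

    def __init__(self, num_classes: int = 37, hidden_size: int = 64):
        super().__init__()
        # CNN stage: extracts spatial features; each pooling halves height and width.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # LSTM stage: reads the feature columns in both directions.
        self.lstm = nn.LSTM(
            input_size=64 * 8,  # channels * feature-map height for 32-pixel-high crops
            hidden_size=hidden_size,
            bidirectional=True,
            batch_first=True,
        )
        # Per-column character scores (assumed alphabet size, see lead-in).
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (batch, 1, 32, width) grayscale text regions from video frames.
        features = self.cnn(crops)  # (batch, 64, 8, width // 4)
        batch, channels, height, width = features.shape
        # Treat each feature-map column as one time step of the sequence.
        sequence = features.permute(0, 3, 1, 2).reshape(batch, width, channels * height)
        outputs, _ = self.lstm(sequence)  # (batch, width // 4, 2 * hidden_size)
        return self.classifier(outputs)  # (batch, width // 4, num_classes)


model = CnnLstmRecognizer()
dummy_crops = torch.randn(2, 1, 32, 128)  # random tensors standing in for real crops
logits = model(dummy_crops)
print(logits.shape)  # torch.Size([2, 32, 37])
```

Each of the 32 output columns carries one score vector over the assumed character set. A real system would still need a text detector in front of this model, a decoding step such as CTC, and training data, none of which are specified in the abstract.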

Journal of Recent Innovations in Computer Science and Technology

Published

2025-07-29

Section

Article

How to Cite

Dutt, M., & Sharma, R. (2025). Deep Learning Based Text Extraction from Video Using CNN, LSTM, and Transformer Models. Journal of Recent Innovations in Computer Science and Technology, 2(3), 34-44. https://doi.org/10.70454/JRICST.2025.20304
