Duoduaah, Doreen (2022) Pre-trained CNN and bi-directional LSTM for no-reference video quality assessment. Masters thesis, Concordia University.
Text (application/pdf), 1 MB: Duoduaah_MASc_F2022.pdf - Accepted Version. Available under License: Spectrum Terms of Access.
Abstract
A challenge in objective no-reference video quality assessment (VQA) research is incorporating the memory effects and long-term dependencies observed in subjective VQA studies. To address this challenge, we propose to use a stack of six bi-directional Long Short-Term Memory (LSTM) layers, each with a different number of units, to model the temporal characteristics of video sequences. We feed this bi-directional LSTM network with spatial features extracted from video frames using a pre-trained convolutional neural network (CNN); we assess three pre-trained CNNs, MobileNet, ResNet-50, and Inception-ResNet-V2, as feature extractors and select ResNet-50 because it shows the best performance. In this thesis, we assess the stability of our VQA method and conduct an ablation study to highlight the importance of the bi-directional LSTM layers. Furthermore, we compare the performance of the proposed method with state-of-the-art VQA methods on three publicly available datasets, KoNViD-1k, LIVE-Qualcomm, and CVD2014; these experiments, using the same set of parameters across all datasets, demonstrate that our method outperforms these VQA methods by a significant margin in terms of Spearman's Rank-Order Correlation Coefficient (SROCC), Pearson's Linear Correlation Coefficient (PLCC), and Root Mean Square Error (RMSE).
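As a minimal sketch of the pipeline the abstract describes (not the thesis code), the Keras snippet below uses a frozen ImageNet-pretrained ResNet-50 as a per-frame spatial feature extractor, feeds the resulting feature sequence through a stack of six bi-directional LSTM layers, and regresses a quality score. The clip length, per-layer unit counts, temporal pooling, optimizer, and loss are illustrative assumptions; the abstract states only that the six layers have different units.

```python
# A minimal sketch, assuming 30 sampled frames per clip and assumed
# LSTM widths; the thesis specifies only "six bi-directional LSTM
# layers of different units" fed by pre-trained ResNet-50 features.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

NUM_FRAMES = 30            # assumed number of sampled frames per clip
FRAME_SHAPE = (224, 224, 3)

# Frozen ImageNet-pretrained ResNet-50: transfer learning, no fine-tuning.
# With pooling="avg" it yields one 2048-dim feature vector per frame.
backbone = ResNet50(include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False

frames = layers.Input(shape=(NUM_FRAMES, *FRAME_SHAPE))
feats = layers.TimeDistributed(backbone)(frames)     # (batch, T, 2048)

# Six stacked bi-directional LSTMs; the unit counts are assumptions.
x = feats
for units in (256, 256, 128, 128, 64, 64):
    x = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(x)

x = layers.GlobalAveragePooling1D()(x)               # temporal pooling
score = layers.Dense(1)(x)                           # predicted quality score

model = models.Model(frames, score)
model.compile(optimizer="adam", loss="mse")
```

The three evaluation criteria named above can be computed with standard SciPy/NumPy calls; the predicted and subjective scores below are toy values used purely for illustration.

```python
# SROCC, PLCC, and RMSE between predicted scores and subjective MOS.
import numpy as np
from scipy.stats import spearmanr, pearsonr

pred = np.array([3.1, 4.2, 2.5, 3.8])   # toy model predictions
mos  = np.array([3.0, 4.5, 2.2, 4.0])   # toy ground-truth MOS

srocc = spearmanr(pred, mos).correlation
plcc  = pearsonr(pred, mos)[0]
rmse  = float(np.sqrt(np.mean((pred - mos) ** 2)))
```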
Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Duoduaah, Doreen
Institution: Concordia University
Degree Name: M.A.Sc.
Program: Electrical and Computer Engineering
Date: 24 June 2022
Thesis Supervisor(s): Amer, Maria
Keywords: Video quality assessment, pre-trained CNN, transfer learning, bi-directional LSTM, long-term dependencies, deep spatial and temporal features
ID Code: 990683
Deposited By: Doreen Duoduaah
Deposited On: 27 Oct 2022 14:30
Last Modified: 31 Aug 2023 00:00