IMAGE CAPTIONING USING MACHINE LEARNING

Authors

  • Ivan Činčurak

DOI:

https://doi.org/10.24867/22BE20Cincurak

Keywords:

image captioning, text similarity metrics, encoder-decoder architecture, attention mechanism

Abstract

Image captioning has become an attractive topic in the last few years. There is a substantial need for machine-generated descriptions of situations in the automotive industry, and Google Image search could also be improved. In addition, surveillance systems could be enhanced: instead of a person constantly monitoring the cameras and waiting for a certain situation to occur, an operator would only need to look when the generated description of a video frame is close to some predefined set of texts. In this paper, three methods for automatic image description were tested: the first applies the encoder-decoder architecture with an attention mechanism, the second uses the same architecture without this mechanism, and the third uses recurrent neural networks. The solutions were evaluated with the BLEU, ROUGE, and Doc2Vec metrics. The models were trained and tested on the MSCOCO dataset, and additionally tested on data scraped from Google Image search.
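The BLEU metric mentioned above scores a candidate caption by its clipped n-gram precision against a reference caption, multiplied by a brevity penalty. A minimal single-reference sketch (simplified relative to the full BLEU of Papineni et al., which supports multiple references and smoothing; the function and variable names here are illustrative) might look like:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        precisions.append(overlap / total)
    # Brevity penalty discourages captions shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "a cat sits on the mat".split()
candidate = "a dog sits on the mat".split()
print(sentence_bleu(reference, candidate))
```

An identical candidate scores 1.0, while a caption sharing only some n-grams with the reference scores strictly between 0 and 1, which is the behavior the evaluation in the paper relies on.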

References

[1] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. PMLR.
[2] Bai, S., & An, S. (2018). A survey on automatic image caption generation. Neurocomputing, 311, 291-304.
[3] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663.
[4] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.
[5] Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).
[6] Xia, P., Zhang, L., & Li, F. (2015). Learning similarity with cosine similarity ensemble. Information Sciences, 307, 39-52.
[7] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740-755). Springer.

Published

2023-03-06

Issue

Section

Electrotechnical and Computer Engineering