
Electrotechnical and Computer Engineering

Vol. 36 No. 07 (2021): Proceedings of the Faculty of Technical Sciences

MICROSERVICE FOR TEXT EXTRACTION FROM WORD AND PDF DOCUMENTS

  • Dejan Bešić
DOI:
https://doi.org/10.24867/13BE26Besic
Submitted
July 4, 2021
Published
2021-07-04

Abstract

This paper describes a solution for extracting text from documents in Word and PDF format. The solution is implemented using a microservice architecture. In addition to the implementation itself, the paper discusses the libraries required for extraction and for format conversion. It also describes the structure of PDF documents, why they are in use, and the possible disadvantages of extracting text from this type of document. The problem of extracting text from both PDF and Word documents is reduced to the problem of extracting text from PDF documents.
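The reduction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `extract_text`, the two callables `word_to_pdf` and `pdf_to_text`, and the set of accepted file suffixes are all hypothetical and chosen only for illustration. The sketch assumes that a Word document is first converted to PDF and that a single PDF extraction step then serves both formats.

```python
from pathlib import Path
from typing import Callable

# Hypothetical types: a converter turns Word bytes into PDF bytes,
# and an extractor turns PDF bytes into plain text.
Converter = Callable[[bytes], bytes]
PdfExtractor = Callable[[bytes], str]

# Assumed set of Word file suffixes; the paper does not list them.
WORD_SUFFIXES = {".doc", ".docx"}


def extract_text(filename: str, data: bytes,
                 word_to_pdf: Converter,
                 pdf_to_text: PdfExtractor) -> str:
    """Extract text by reducing every supported input to the PDF case."""
    suffix = Path(filename).suffix.lower()
    if suffix in WORD_SUFFIXES:
        # Word input: convert to PDF first, then fall through.
        data = word_to_pdf(data)
    elif suffix != ".pdf":
        raise ValueError(f"unsupported file type: {filename}")
    # Single extraction path shared by both formats.
    return pdf_to_text(data)
```

In a real service the two callables would wrap a conversion tool and a PDF text-extraction library. The reference list names Gotenberg and Apache PDFBox, but mapping them onto these two roles is an assumption made here, not something the abstract states.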

References

[1] Service-oriented architecture and integration of business applications, https://www2.masfak.ni.ac.rs/uploads/articles/www2_5._soa_skraceno.pdf
[2] Service-oriented architecture, IBM, https://www.ibm.com/cloud/learn/soa
[3] Microservice Architecture, https://microservices.io/patterns/microservices.html
[4] Apache POI, https://en.wikipedia.org/wiki/Apache_POI
[5] Gotenberg, https://thecodingmachine.github.io/gotenberg
[6] Apache PDFBox, https://en.wikipedia.org/wiki/Apache_PDFBox