Research of text recognition and data extraction systems for Ukrainian-language documents

DOI: 10.31673/2412-9070.2020.066163

Authors

  • К. О. Гордієнко (Hordiyenko K. O.), State University of Telecommunications, Kyiv
  • А. Б. Коба (Koba A. B.), State University of Telecommunications, Kyiv
  • Т. П. Довженко (Dovzhenko T. P.), State University of Telecommunications, Kyiv

Abstract

This article reviews existing software whose main task is to extract information from digitized documents. From the available software, products based on neural network technology and deep learning were selected. Information is currently extracted from documents in two ways. The first is manual work by personal computer operators, which takes a long time and does not exclude the human factor. The second is digitization of documents followed by processing in software that matches documents against templates and rules; its drawbacks are the data processing speed and the need to change the settings whenever the document type changes. The article aims to investigate existing software for extracting data from digital documents based on neural network technology and its applicability to Ukrainian-language documents. To do this, a simple set of invoices was created and uploaded to the systems under study. Developing a system for extracting information from digitized Ukrainian-language documents using neural networks will speed up data processing and make it possible to process the data according to the user's field of application. It is established that at present there are no systems that can independently determine which data need to be extracted from Ukrainian-language documents. Existing systems expose their functionality through a REST API, so using them requires creating software that acts as a wrapper around that functionality. Google Form Parser is considered to be the best system, but it requires a constant Internet connection, which can be a serious obstacle to using such a product in certain areas of activity.

Keywords: optical character recognition; neural network; deep learning; machine learning; data extraction.

References
1. Lebourgeois F., Henry J.-L., Emptoz H. An OCR System for Printed Documents. 1992. P. 83–86.
2. Sudharshan Chandra Babu (Nanonets). Automating Receipt Digitization with OCR and Deep Learning. 2020 [Electronic resource]. URL: https://nanonets.com/blog/receipt-ocr/
3. Semenov S. How to teach a machine to understand invoices and extract data from them (in Russian) [Electronic resource]. URL: https://habr.com/ru/company/abbyy/blog/440310/
4. Intellix – End-User Trained Information Extraction for Document Archiving / D. Schuster, K. Muthmann, D. Esser [et al.] // Proceedings of the International Conference on Document Analysis and Recognition, ICDAR. DOI: 10.1109/ICDAR.2013.28.
5. Azure vs AWS vs GCP (Part 2: Form Recognizers) [Electronic resource]. URL: https://cazton.com/blogs/executive/form-recognition-azure-aws-gcp
6. Form Recognizer documentation [Electronic resource]. URL: https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer
7. Document AI Documentation [Electronic resource]. URL: https://cloud.google.com/document-ai/docs
8. Amazon Textract Developer Guide [Electronic resource]. URL: https://docs.aws.amazon.com/textract/latest/dg/what-is.html
9. Nanonets [Electronic resource]. URL: https://nanonets.com/

Published

2021-03-25

Section

Articles