13th IAPR International Workshop on Document Analysis Systems

24-27 April
Vienna, Austria


DAS 2018 is the 13th edition of the 100% participation single-track workshop focusing on system-level issues and approaches in document analysis and recognition. The workshop comprises invited speaker presentations, oral, poster, and tutorial sessions, as well as working group discussions. Conference Publishing Services (CPS) will publish the workshop proceedings.

DAS 2018 will be organized at TU Wien (Vienna University of Technology), in the heart of Vienna’s city center, which places the attendees within walking distance of a large variety of world-famous historical and cultural attractions.

DAS 2018 will include both long and short papers, posters, and demonstrations of working or prototype systems. All submissions will undergo a rigorous single-blind review process with a minimum of three reviews, considering the originality of the work, the quality of research or analysis of experience, the relevance to document analysis systems, and the quality of presentation.



Of the 131 submissions received in an open call for papers, 77 were accepted for presentation at the workshop (58.8%). Of these, 32 papers were designated for oral presentation (24.4%) and 45 for poster presentation (34.4%) after a rigorous review process directed by the four Program Chairs (Basilis Gatos, Koichi Kise, Dan Lopresti and Jean-Marc Ogier). All submissions received at least two reviews from the 67 members of the program committee; most papers received three. In addition, 12 short papers describing emerging ideas and work in progress were also accepted for poster presentation.

The IAPR Nakano Best Paper Award, recognizing the best overall paper presented at the workshop, was presented during the workshop banquet to Praveen Krishnan, Kartik Dutta and C. V. Jawahar for “Word Spotting and Recognition using Deep Embedding”. The IAPR Best Student Paper Award was presented to Daniel Stromer, Vincent Christlein, Andreas Maier, Xiaolin Huang, Patrick Zippert, Eric Helmecke and Tino Hausotte for their paper “Non-Destructive Digitization of Soiled Historical Chinese Bamboo Scrolls”.


Tutorials will be held on Tuesday, April 24th. The welcome reception takes place on Tuesday at 7pm at TUtheSky.

Wednesday, April 25th
08:00 Registration
09:00 Opening
09:30 Keynote 1: Lawrence O'Gorman
10:30 Coffee Break
11:00 Oral Session 1: Word Spotting
12:00 Oral Session 2: Handwriting Recognition
13:00 Lunch
14:30 Oral Session 3: Historical Document Analysis
16:00 Teasers
16:30 Poster Session 1
17:30 Oral Session 4: Databases and Benchmarking

Thursday, April 26th
08:30 Registration
09:00 Oral Session 5: Scene Text Detection and Recognition
10:30 Coffee Break
11:00 Discussion groups
13:00 Lunch
14:30 Oral Session 6: Document Analysis Applications
16:00 Teasers
16:30 Poster Session 2
18:00 Social Event
19:00 Workshop banquet

Friday, April 27th
08:30 Registration
09:00 Keynote 2: Rolf Ingold
10:00 Oral Session 7: Document Understanding and Table Recognition
11:00 Coffee Break
11:30 Oral Session 8: Graphics Recognition
13:00 Lunch
14:30 Oral Session 9: Forensic Document Analysis
16:00 Discussion group reports
17:00 Conclusion and awards

Please let us know which tutorials you plan to attend.

From Digital Libraries to Kind Cameras

Making Sense of Multimedia Signals While (Unsuccessfully) Avoiding Security

Wednesday, April 25th 09:30 - 10:30
Lawrence O'Gorman, Nokia Bell Labs, Murray Hill, NJ USA

In the last 30 years, we have made great strides in computer analysis and understanding of signals from images to documents to video. In this talk, I describe projects whose initial objective was a useful and disruptive - and sometimes fun - multimedia recognition system, but for which security issues were discovered that complicated design and usability.

The first project involves document layout analysis methods to facilitate one of the first digital libraries, Bell Labs RightPages. However, publishers would not offer material through the system until we developed watermarking methods to assert their ownership. The second project is a voice-only system for healthcare workers to enable hands-free communications. But the system was impractical without authentication: how do you securely speak a password? The third project was for security purposes only, to design a counterfeit-resistant photo-ID card that can be retrofitted to current non-secure cards, printed on paper, and even duplicated. We accomplished this in the early days of public key cryptography. Finally, I will describe current work on "Kind Cameras", for which video analytics methods have been developed to extend past security cameras to interactive cameras for fun and art. The slides are available.

Lawrence O'Gorman is a Fellow at Nokia Bell Labs Research in Murray Hill, NJ. He works in the areas of video analysis and multimedia signal processing. Previously he was Chief Scientist at Veridicom, a biometric company, spun off from Lucent, and before that a Distinguished Member of Technical Staff at Bell Labs. He has taught in the area of multimedia security at Cooper Union and NYU/Poly. His video analytics work is the basis of the "Pixelpalooza" exhibit at the Liberty Science Center in New Jersey, and other public art and game exhibits.

He has published over 70 technical papers and 8 book chapters, holds over 25 patents, and is co-author of the books "Practical Algorithms for Image Analysis" (Cambridge University Press) and "Document Image Processing" (IEEE Press). He is a Fellow of the IEEE and of the IAPR. In 1996, he won the Best Industrial Paper Award at the ICPR and an R&D 100 Award for one of "the top 100 innovative technologies of that year." He has been on the editorial boards of 4 journals, and has served on US government advisory committees for NIST, NSF, NIJ, and NAE, and for France's INRIA.

He received the B.A.Sc., M.S., and Ph.D. degrees in electrical engineering from the University of Ottawa, the University of Washington, and Carnegie Mellon University, respectively.

Lessons from 10 Years of Experience on Historical Document Analysis

Friday, April 27th 09:00 - 10:00
Rolf Ingold, DIVA Group, University of Fribourg, Switzerland

Libraries and archives all around the world continuously increase their efforts in digitizing historical manuscripts. Integrating such manuscripts into digital libraries requires meaningful information for indexing. To support the extraction of the needed meta-information or to provide full-text transcription, advanced pattern recognition and machine learning methods are required. This talk will describe the outcome of a series of "HisDoc" research projects funded by the Swiss National Foundation, covering pioneering attempts to study the whole processing chain from layout analysis to information retrieval of historical manuscripts, including script analysis, word spotting and handwriting recognition. This description will be complemented with an overview of other related research projects, in order to convey the current state of the art in the field and outline future trends.

Rolf Ingold is a full professor of computer science at the University of Fribourg, Switzerland. He graduated in mathematics and received a PhD degree in computer science from the Swiss Federal Institute of Technology Lausanne (EPFL). He is an international expert in the field of document analysis and modelling. Since 2008, he has been Chair of the Swiss Pattern Recognition Association and a member of the Governing Board of the International Association for Pattern Recognition (IAPR). At the national level, he was highly involved in the Swiss National Center of Competence on Interactive Multimodal Information Management. He is also a member of the Swiss Academy of Science and Technology.

His expertise covers several aspects of multimodal signal processing, pattern recognition, and machine learning, with applications in image and sound analysis, biometrics, and gesture recognition. During the last fifteen years, he has concentrated his research on historical document analysis and cultural heritage preservation. Since 2009, he has been leading a series of research projects funded by the Swiss National Foundation on historical document analysis. In 2011, he received the James A. Lindner Prize, awarded by the International Association of Sound and Audiovisual Archives (IASA), for "VisualAudio", a solution to recover sound from degraded sound tracks.

Introduction to DL in Theory and Practice

Tuesday, April 24th 09:30 - 10:30
René Donner, contextflow, Austria

In this tutorial we will look at the basics of Deep Learning (DL), the reasons for its recent success, and the tasks it can be applied to. Participants will also get an overview of the current landscape of DL frameworks and teaching resources, and of how to get started with DL in their own work. The slides are available.

With a background in electrical engineering, René worked for 8 years at the Medical University of Vienna as a researcher in computer vision, focusing on anatomical structure localization and content-based image retrieval. He is now CTO at contextflow, applying deep learning to large-scale medical image data and developing smart tools to aid radiologists in their challenging tasks.

Keyword Spotting for Large-Scale Indexing and Search in Massive Document Collections

Tuesday, April 24th 11:00 - 13:30
Alejandro H. Toselli, Emilio Granell, Joan Puigcerver
Universitat Politècnica de València

Libraries, archives and other cultural institutions all over the world are making accessible large amounts of digitized handwritten documents, most of which lack transcripts. This fact has motivated the development of handwritten text processing technologies, namely automatic/assisted handwritten text recognition (HTR) and keyword spotting (KWS), to provide access to the textual content of the images. Along these lines, this tutorial presents, from both theoretical and practical perspectives, a comprehensive view of two state-of-the-art technologies based on deep learning approaches: "Laia: A deep learning toolkit for HTR, based on Torch and Kaldi" and "Keyword Spotting for Large-Scale Indexing and Search in massive Document collections". Besides obtaining highly accurate transcripts, Laia produces primary structures called "word-graphs", which are subsequently used by the KWS techniques to produce probabilistically sound word confidence scores for indexing. The tutorial is planned to run for 2 hours. It is assumed that attendees will have general prior knowledge of pattern recognition and some experience in handwritten text recognition and deep learning.

Slides are available here.

Alejandro H. Toselli received the M.S. degree in electrical engineering from the Universidad Nacional de Tucumán (Argentina) in 1997 and the Ph.D. degree in Computer Science from the Universitat Politècnica de València (Spain) in 2004. He did a post-doc at the Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), France. His current research interests lie in computer-assisted interaction in pattern recognition systems, in particular handwritten text recognition and keyword spotting applications. He has taught several tutorial courses on handwritten text recognition.

Emilio Granell obtained his BSc in Telecommunications Engineering with a speciality in Sound and Image in 2006, his MSc degree in Artificial Intelligence, Pattern Recognition, and Digital Image in 2011, and his Ph.D. degree in Computer Science in 2017, all from the Universitat Politècnica de València (UPV). Dr. Granell belongs to the Pattern Recognition and Human Language Technology (PRHLT) research center, where he works on speech and handwriting recognition, dialogue systems, and interactive and multimodal systems. Since 2010, he has participated in several research projects related to artificial intelligence, speech and handwriting recognition, and smart cities.

Joan Puigcerver is a PhD candidate in Computer Science at the Universitat Politècnica de València, at the Pattern Recognition and Human Language Technology Research Center. He previously received the Engineer's degree in Computer Science (2012) and the Master's degree in Pattern Recognition and Artificial Intelligence (2014) from the same institution. He is broadly interested in statistical pattern recognition and machine learning, and their applications to computer vision, handwritten text recognition and keyword spotting. He is a member of the IEEE Computer Society and the Spanish Society for Pattern Recognition and Image Analysis (AERFAI).

Deep Learning for Document Analysis, Text Recognition, and Language Modeling

Tuesday, April 24th 14:30 - 18:30
Thomas Breuel, NVIDIA Research, USA


The tutorial will cover applications of deep learning to problems in document analysis:

  • convolutional, one-dimensional, and multidimensional layers
  • the relationship between filters and deep learning models
  • different types of sequence models: LSTM, seq2seq, attention, CTC
  • DL models for noise removal, upscaling, skew correction
  • DL models for layout analysis and semantic segmentation
  • DL models for OCR, handwriting recognition, text recognition
  • DL models for language modeling and OCR post-correction
  • preprocessing, scaling, and GPU-based computing
The course will present numerous examples and workbooks based on PyTorch. Basic familiarity with deep learning and Python is recommended.
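As a small taste of the CTC topic listed above, best-path (greedy) CTC decoding simply merges consecutive repeated labels and then removes blanks. A minimal pure-Python sketch (an illustration only, not part of the tutorial workbooks):

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: merge consecutive repeated labels,
    then drop the blank symbol."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# frames predicting "a a - a b b -" with 0 = blank, 1 = 'a', 2 = 'b'
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note how the blank between the first and second 'a' is what allows a genuinely repeated character to survive the collapse.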


Thomas Breuel works on deep learning and its applications at NVIDIA Research. Before that, he was a researcher at Google Brain, IBM, and Xerox PARC. He was a professor of computer science and head of the Image Understanding and Pattern Recognition (IUPR) group at the University of Kaiserslautern. He has published numerous papers in document analysis, computer vision, and machine learning and is a contributor to several open source projects in OCR, document analysis, and machine learning.

Reproducible Research in Document Image Analysis

Tuesday, April 24th 11:00 - 13:30
Marcel Würsch, Michele Alberti, Vinaychandran Pondenkandath, Marcus Liwicki
DIVA Group, University of Fribourg, Switzerland


An important topic, not only in Document Image Analysis but in Machine Learning in general, is the reproducibility of scientific results. Many papers published today are, for various reasons, increasingly difficult for the reader to independently reproduce and verify. In this tutorial we provide introductions to, and hands-on sessions with, two existing solutions: DeepDIVA, a deep learning toolkit with a focus on creating reproducible experiments, and DIVAServices, a Web Service framework providing access to DIA methods.

The slides are available online.

Marcel Würsch is a PhD student in the DIVA research group at the University of Fribourg. The main work of his dissertation is DIVAServices, a Web Service framework for providing Document Image Analysis methods as RESTful Web Services.

Full Papers

Word Spotting

Wednesday, April 25th 11:00 - 12:00

Praveen Krishnan, Kartik Dutta and C. V. Jawahar. Word Spotting and Recognition using Deep Embedding
Abstract: Deep CNN features for word images and textual embedding schemes have shown great success in word spotting. Following these motivations, we propose an end-to-end embedding framework which jointly learns both the text and image embeddings using state-of-the-art CNN architectures. The three major contributions of this work are: (i) an end-to-end embedding scheme to learn a common representation for word images and their labels, (ii) building a state-of-the-art word image descriptor and demonstrating its utility as an off-the-shelf feature for word spotting and recognition, and (iii) the use of synthetic data as a complementary modality to further enhance word spotting. We evaluate the proposed method on popular handwriting datasets. On the challenging IAM handwriting dataset, we report an mAP of 0.9585 for the query-by-string retrieval task. Under lexicon-based word recognition, our proposed method reports a CER of 2.66 and a WER of 5.10.
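For context, the mAP figure reported above is the mean over queries of average precision on the ranked retrieval lists. A generic sketch of the metric (not the authors' evaluation code):

```python
def average_precision(relevance):
    """AP for one ranked retrieval list; relevance[i] is True when the
    item retrieved at rank i+1 is relevant to the query."""
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """mAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# query 1 finds relevant words at ranks 1 and 3; query 2 at rank 2
print(mean_average_precision([[True, False, True], [False, True]]))
```

An mAP of 0.9585 thus means the relevant word images are retrieved very near the top of the ranking, averaged over all query strings.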
Sebastian Sudholt, Neha Gurjar and Gernot Fink. Learning Deep Representations for Word Spotting Under Weak Supervision
Abstract: Convolutional Neural Networks have made their mark in various fields of computer vision in recent years. They have achieved state-of-the-art performance in the field of document analysis as well. However, CNNs require a large amount of annotated training data and, hence, great manual effort. In our approach, we introduce a method to drastically reduce the manual annotation effort while retaining the high performance of a CNN for word spotting in handwritten documents. The model is learned with weak supervision using a combination of synthetically generated training data and a small subset of the training partition of the handwritten data set. We show that the network achieves results highly competitive to the state-of-the-art in word spotting with shorter training times and a fraction of the annotation effort.
Abstract: Deep convolutional neural networks are today the new baseline for a wide range of machine vision tasks. The problem of keyword spotting is no exception to this rule. Many successful network architectures and learning strategies have been adapted from other vision tasks to create successful keyword spotting systems. In this paper, we argue that various details concerning this adaptation could be re-examined, to the end of building stronger spotting models. In particular, we examine the usefulness of a pyramidal spatial pooling layer versus a simpler approach, and show that a zoning strategy combined with fixed-size inputs can be just as effective while less computationally expensive. We also examine the usefulness of augmentation, class balancing and ensemble learning strategies and propose an improved network. Our hypotheses are tested with numerical experiments on the IAM document collection, where the proposed network outperforms all other existing models.

Handwriting Recognition

Wednesday, April 25th 12:00 - 13:00

Abstract: In this paper, we present a study on sample preselection in large training data sets for CNN-based classification. To do so, we structure the input data set in a network representation, namely the Relative Neighbourhood Graph, and then extract vectors of interest. The proposed preselection method is evaluated in the context of handwritten character recognition on two data sets of up to several hundred thousand images. It is shown that the graph-based preselection can reduce the training data set without degrading the recognition accuracy of a non-pre-trained shallow CNN model.
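The Relative Neighbourhood Graph mentioned above keeps an edge between two samples only if no third sample is closer to both of them than they are to each other. A brute-force sketch of that standard definition (illustrative only, not the paper's implementation, which would need a feature-space distance rather than 2-D points):

```python
from math import dist  # Euclidean distance, Python 3.8+

def relative_neighbourhood_graph(points):
    """Edge (i, j) is kept unless some third point k satisfies
    max(d(i, k), d(j, k)) < d(i, j), i.e. k lies 'between' i and j."""
    edges = []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = dist(points[i], points[j])
            blocked = any(
                max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
                for k in range(n) if k not in (i, j)
            )
            if not blocked:
                edges.append((i, j))
    return edges

# the midpoint-ish third point blocks the long edge (0, 1)
print(relative_neighbourhood_graph([(0, 0), (1, 0), (0.5, 0.1)]))
```

Samples whose edges cross class boundaries in such a graph are natural candidates for preselection, since they sit near decision borders.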
Kartik Dutta, Praveen Krishnan, Minesh Mathew and C V Jawahar. Unconstrained Handwriting Recognition on Devanagari Script using a new Benchmark Dataset
Abstract: Handwriting recognition in Indic scripts like Devanagari is very challenging due to the subtleties of the scripts, variations in rendering, and the cursive nature of the handwriting. The lack of public handwriting datasets in Indic scripts has long stymied the development of deep-learning-based offline handwritten word recognizers and made comparison across different methods a tedious task. In this paper, we release a new handwritten word dataset for Devanagari, IIIT-HW-Dev, to alleviate some of these issues. We benchmark the IIIT-HW-Dev dataset using a CNN-RNN hybrid architecture. Furthermore, using this architecture, we empirically show that synthetic data and cross-lingual transfer learning help alleviate the lack of training data. We apply the proposed pipeline to a public dataset, RoyDB, and achieve state-of-the-art results.
Abstract: In this paper, we propose an Arabic handwriting recognition system based on multiple BLSTM-CTC combination architectures. Given several feature sets, low-level fusion consists in projecting them into a single feature space. Mid-level combination is performed using two techniques: the first averages the a-posteriori probabilities of the individual BLSTMs and injects them into the CTC decoding; the second trains a new BLSTM-CTC system on the sum of the a-posteriori probabilities generated by the individual systems. High-level fusion is based on combining the individual decoding outputs; lattice combination and ROVER strategies were evaluated in this context. The experiments conducted on the KHATT database show that the high-level combination method significantly improves the recognition rate compared to the other fusion strategies.

Historical Document Analysis

Wednesday, April 25th 14:30 - 16:00

Michael Fink, Thomas Layer, Georg Mackenbrock and Michael Sprinzl. Baseline Detection in Historical Documents using Convolutional U-Nets
Abstract: Baseline detection is still a challenging task for heterogeneous collections of historical documents. We present a novel approach to baseline extraction in such settings, which turned out to be the winning entry to the ICDAR 2017 Competition on Baseline Detection (cBAD).
It utilizes deep convolutional neural networks (CNNs) both for the actual extraction of baselines and for a simple form of layout analysis in a pre-processing step. To the best of our knowledge, it is the first CNN-based system for baseline extraction that applies a U-Net architecture with sliding-window detection, profiting from the high local accuracy of the extracted candidate lines. A final baseline post-processing step complements our approach, compensating for inaccuracies mainly due to missing context information during sliding-window detection.
We experimentally evaluate the components of our system individually on the cBAD dataset. Moreover, we investigate how it generalizes to different data by means of the dataset used for the baseline extraction task of the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts (HisDoc). A comparison with the results reported for HisDoc shows that our system also outperforms the contestants of that competition.
Abstract: Historical documents present many challenges for offline handwriting recognition systems, among them the segmentation and labeling steps. Carefully annotated text lines are needed to train an HTR system. In some scenarios, transcripts are only available at the paragraph level, with no text-line information. In this work, we demonstrate how to train an HTR system with little labeled data. Specifically, we train a deep convolutional recurrent neural network (CRNN) on only 10% of the manually labeled text-line data from a dataset and propose an incremental training procedure that covers the rest of the data. Performance is further increased by augmenting the training set with specially crafted multi-scale data. We also propose a model-based normalization scheme which accounts for the variability of the writing scale at recognition time. We apply this approach to the publicly available READ dataset. Our system achieved the second-best result in the ICDAR 2017 competition.
Abstract: Scanned handwritten historical documents are often not well accessible due to the limited feasibility of automatic full transcriptions. Thus, Keyword Spotting (KWS) has been proposed as an alternative to retrieve arbitrary query words from this kind of documents. In the present paper, word images are represented by means of graphs. That is, a graph is used to represent the inherent topological characteristics of handwriting. The actual keyword spotting is then based on matching a query graph with all document graphs. In particular, we make use of a fast graph matching algorithm that considers the contextual substructure of nodes. The motivation for this inclusion of node context is to increase the overall KWS accuracy. In an experimental evaluation on four historical documents, we show that the proposed procedure clearly outperforms diverse other template-based reference systems. Moreover, our novel framework keeps up or even outperforms many state-of-the-art learning-based KWS approaches.
Abstract: For about 2,000 years, paper was not used as a writing medium in China; instead, writings and drawings were captured on bamboo and wooden slips. Several slips were bound together with strips and rolled up into a scroll. The writings and drawings were either brushed on or even carved into the wood. These documents are very precious for cultural heritage and research, but due to aging, the discovered pieces are sometimes in poor condition and soiled. Because cleaning the slips is not only challenging but can also erase the writing, we developed a method to digitize such historical documents without the need for cleaning. We perform a 3-D X-ray micro-CT scan, resulting in a 3-D volume of the complete document. With our approach, we were able to investigate a scroll without cleaning or unwrapping it by hand. We show that the method also works for heavily soiled scrolls where nothing is readable with the naked eye. This can help conservators record all writings before they may be erased by the cleaning process. Finally, we present a manual technique to virtually unwrap and post-process the documents, resulting in a 2-D image of all bamboo slips.

Databases and Benchmarking

Wednesday, April 25th 17:30 - 18:30

Abstract: In this paper we propose a system for reading utility meters. We propose a convolutional network, trained in an end-to-end manner, that is able to directly output the reading without any explicit digit localization step. We present results on a very large dataset of images acquired by different users and devices.
Abstract: Computation of document image quality metrics often depends upon the availability of a ground truth image corresponding to the document. This limits the applicability of quality metrics in applications such as hyperparameter optimization of image processing algorithms that operate on-the-fly on unseen documents. This work proposes the use of surrogate models to learn the behavior of a given document quality metric on existing datasets where ground truth images are available. The trained surrogate model can later be used to predict the metric value on previously unseen document images without requiring access to ground truth images. The surrogate model is empirically evaluated on the Document Image Binarization Competition (DIBCO) and the Handwritten Document Image Binarization Competition (H-DIBCO) datasets.
Abstract: We present a metadata labelling framework for datasets, software tools, and workflows. An ontology for document image analysis was developed with deep support for historical data. An accompanying open source software framework was implemented to enable ontology editing, data and method annotation, workflow composition, and semantic search. A wide range of examples is used to illustrate real-world application.

Scene Text Detection and Recognition

Thursday, April 26th 9:00 - 10:30

Wafa Khlif, Nibal Nayef, Jean-Christophe Burie, Adel Alimi and Jean-Marc Ogier. Learning Text Component Features via Convolutional Neural Networks for Scene Text Detection
Abstract: Reading the text embedded in natural scene images is essential to many applications. In this paper, we propose a method for detecting text in scene images based on multi-level connected component (CC) analysis and learning text component features via convolutional neural networks (CNN), followed by a graph-based grouping of overlapping text boxes.
The multi-level CC analysis allows the extraction of redundant text and non-text components at multiple binarization levels to minimize the loss of any potential text candidates. The features of the resulting raw text/non-text components of different granularity levels are learned via a CNN. Those two modules eliminate the need for complex ad-hoc preprocessing steps for finding initial candidates, and the need for hand-designed features to classify such candidates into text or non-text.
The components classified as text at different granularity levels are grouped in a graph based on the overlap of their extended bounding boxes; the connected graph components are then retained. This eliminates redundant text components and forms words or text lines.
When evaluated on the "Robust Reading Competition" dataset for natural scene images, our method achieved better detection results than state-of-the-art methods. In addition to its efficacy, our method can easily be adapted to detect multi-oriented or multi-lingual text, as it operates on low-level initial components and does not require such components to be characters.
Abstract: Text in images can be utilized in many image understanding applications thanks to its exact semantic information. In this paper, we propose a text line extraction method based on integrated k-shortest-paths optimization. First, candidate text components are extracted by the Maximally Stable Extremal Region (MSER) algorithm on the gray, red, green and blue channels. Then, a directed graph on the gray channel and an integrated directed graph on the red, green, and blue channels are constructed over the candidate text components. The integrated directed graph is transformed, guided by the text lines extracted in the gray channel. Finally, we use the k-shortest-paths optimization algorithm to extract the text lines, taking advantage of the particular structure of the integrated directed graph. Experimental results demonstrate the effectiveness of the proposed method in comparison with state-of-the-art methods.
Abstract: TextBoxes is one of the most advanced text detection methods in terms of both accuracy and efficiency, but it is still not very sensitive to small text in natural scenes and often cannot localize text regions precisely. To tackle these problems, we first present a Bidirectional Information Aggregation (BIA) architecture that effectively aggregates multi-scale feature maps to enhance local details and strengthen context information, making the detector not only work reliably on small text but also predict more precise boxes. This architecture also results in a single classifier network, which allows our model to be trained faster and more easily, with better generalization power. We then propose to use symmetrical feature maps for feature extraction in both the training and test stages to further improve performance on small text. To further promote precise predicted boxes, we present a statistical grouping method that operates on the training-set bounding boxes to generate adaptive aspect ratios for the default boxes. Our model not only outperforms TextBoxes without much time overhead, but also provides promising performance compared to recent state-of-the-art methods on the ICDAR 2011 and 2013 databases.
Abstract: The ICDAR Robust Reading Competition (RRC), initiated in 2003 and re-established in 2011, has become the de-facto evaluation standard for the international community. Concurrent with its second incarnation in 2011, a continuous effort started to develop an online framework to facilitate the hosting and management of competitions. This paper outlines the Robust Reading Competition Annotation and Evaluation Platform, the backbone of the Robust Reading Competition, comprising a collection of tools and processes that aim to simplify the management and annotation of data, and to provide online and offline performance evaluation and analysis services.

Document Analysis Applications

Thursday, April 26th 14:30 - 16:00

Xue-Hang Yang, Fei Yin and Cheng-Lin Liu. Online Video Text Detection with Markov Decision Process
Abstract: Online video text detection is important in many applications, such as real-time translators and wearable camera systems for the visually impaired. Existing methods for video text detection perform unsatisfactorily, mainly because of inferior text detection results and insufficient use of spatial and temporal information; besides, the majority of them work in offline mode. In this paper, we propose an online video text detection method which works nearly in real time. We detect text in each frame using an EAST-based text detector, and formulate online text tracking as decision making in a Markov Decision Process (MDP). The similarity function in the tracking stage is learned by reinforcement learning, and text detection and tracking are naturally unified by state transitions in the MDP. Extensive experiments on three benchmark datasets, ICDAR 2015, Minetto, and YouTube Video Text, verify the effectiveness of our method.
Abstract: We present the Vocabulometer, a reading assistant system designed to record the reading activity of a user with an eye tracker and to extract mutual information about the users and the read documents. The Vocabulometer stands as a web platform and can be used for analyzing the comprehension of the user, the comprehensibility of the document, predicting the difficult words, recommending document according to the reader's skills, etc.
In recent years, with the development of low-cost eye trackers, the technology has become accessible to many people, which enables the use of data mining and machine learning algorithms for the mutual analysis of documents and readers.
Abstract: Transcription of handwritten text can be sped up by using off-line Handwritten Text Recognition techniques, which produce an initial draft transcription of an image of handwritten text. However, this draft transcription usually contains errors that must be amended by the transcriber by providing a feedback signal. The usual approach is post-edition, where each error is corrected without modifying the rest of the current transcription. A more sophisticated approach can employ the current modification to provide a new whole transcription, hopefully with fewer errors. Apart from that, feedback can be provided in different modalities: keyboard input, on-line handwritten text, or speech. Each of these modalities presents different features with respect to ambiguity, derived errors, and final transcription time. In this work, we study how the different modalities behave in the assisted transcription of a historical handwritten document in Spanish, and we evaluate their transcription productivity.
Abstract: In this paper, we present a novel solution to the problem of text extraction from PDF documents with incorrect encoding of embedded fonts. Instead of using a full-scale optical character recognition (OCR) process for such documents, our system detects problematic fonts and uses a convolutional neural network (CNN) for optical character recognition of individual glyphs, in combination with a bigram language model, to recover character codes. Using a large collection of PDF documents with valid font encoding allowed us to train and test the recognition model with an error rate of 0.97%. We demonstrate the benefits of our approach over a full-scale OCR approach by comparing our system with Tesseract on a new dataset of multilingual PDF files, and we share the dataset with the research community.
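Combining per-glyph recognizer scores with a bigram language model, as the abstract above describes, is commonly resolved with Viterbi-style decoding. The sketch below is a minimal illustrative version, not the authors' implementation; the glyph scores and bigram log-probabilities are hypothetical toy values.

```python
def decode(glyph_scores, bigram_logp, default=-9.0):
    """Viterbi decoding: choose the character sequence maximising the
    sum of per-glyph recogniser log-scores and bigram log-probabilities."""
    best = dict(glyph_scores[0])  # best path score ending in each character
    backptrs = []
    for scores in glyph_scores[1:]:
        new_best, ptr = {}, {}
        for c, s in scores.items():
            # best predecessor for character c under the bigram model
            prev = max(best, key=lambda p: best[p] + bigram_logp.get((p, c), default))
            new_best[c] = best[prev] + bigram_logp.get((prev, c), default) + s
            ptr[c] = prev
        backptrs.append(ptr)
        best = new_best
    # backtrack from the best final character
    last = max(best, key=best.get)
    out = [last]
    for ptr in reversed(backptrs):
        out.append(ptr[out[-1]])
    return "".join(reversed(out))

# a glyph ambiguous between 'b' and 'h' is resolved by the bigram model:
glyph_scores = [{"h": -1.0, "b": -0.5}, {"e": -0.1}]
bigram_logp = {("h", "e"): -0.2, ("b", "e"): -3.0}
print(decode(glyph_scores, bigram_logp))  # → he
```

Note how the locally best glyph hypothesis ('b') is overridden because the bigram ('b', 'e') is far less probable than ('h', 'e').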

Document Understanding and Table Recognition

Friday, April 27th 10:00 - 11:00

Abstract: Repetitively formatted historical books are tokenized and tagged according to eight token types (capitalized words, numbers, punctuation …). To extract family information, templates of short sequences of tags are generated around frequent proper nouns and specified tokens like “born”. Each template is associated with a user-assigned class (head of household, father, mother, spouse, geographic location …) and a pointer to an overlapping or nearby fragment of text to be extracted. Matching the template against the book text yields class-labeled factoids. In an interaction cycle, new extraction templates are proposed for user approval or editing. Each edit-then-extract cycle typically yields thousands of factoids and a dozen new templates. With five approximately half-hour interactive sessions, 44,000 genealogical factoids were extracted from a 17th century Scottish register of marriages and births and from published 19th–20th century Ohio funeral parlor records. The experience indicates that this method quickly yields quality results and is about two orders of magnitude faster than previously reported proposals for hand-constructed rule templates.
Hervé Déjean, Jean-Luc Meunier, Stéphane Clinchant, Eva Maria Lang and Florian Kleber. Comparing Machine Learning Approaches for Table Recognition in Historical Register Books
Abstract: We present in this paper experiments on table recognition in handwritten register books. We first explain how the problem of row and column detection is modelled, and then compare two machine learning approaches (Conditional Random Field and Graph Convolutional Network) for detecting these table elements. Evaluation was conducted on death records provided by the Archive of the Diocese of Passau. Both methods show similar results. The software and dataset are released as open source and open data.
Elvis Koci, Maik Thiele, Oscar Romero and Wolfgang Lehner. Table Recognition in Spreadsheets via a Graph Representation
Abstract: Spreadsheets are one of the most popular data management tools. Their ease of use and abundant functionality equip novices and professionals alike with the means to generate, transform, analyze, and visualize data. As a result, a great amount of information resides in this format, which accentuates the need for automatic understanding and extraction of spreadsheet contents. In this paper, we focus our efforts on recognizing tables in spreadsheets, as they are a great source of factual and structured information. Having inferred the layout role of the individual cells in previous work, we build a graph representation per sheet. This graph encodes the spatial interrelations between layout regions (i.e., non-overlapping sets of adjacent cells having the same layout role). Based on this representation, we propose a rule-based method for table identification. The experimental evaluation shows that we achieve high accuracy on a dataset of various tables from different domains.

Graphics Recognition

Friday, April 27th 11:30 - 13:00

Abstract: In this paper we propose a structured-prediction-based system for text/non-text classification and printed/handwritten text separation at the connected component (CC) level in complex documents. We formulate the separation of different elements as joint classification problems and use conditional random fields (CRFs) to integrate both local and contextual information to improve classification accuracy. Both our unary and pairwise potentials are formulated as neural networks to better exploit contextual information. Considering the different properties of text/non-text classification and printed/handwritten text separation, we use a multilayer perceptron (MLP) and a convolutional neural network (CNN) as potentials, respectively. To evaluate the performance of the proposed method, we provide a test paper document database named TestPaper1.0, which can be used for many other tasks as well. Our method achieves impressive results for both tasks on the TestPaper1.0 dataset. Moreover, even with very shallow CNNs as potentials, our method achieves state-of-the-art performance for writing type (printed/handwritten) separation on the highly heterogeneous Maurdor dataset, surpassing the Maurdor 2013 and Maurdor 2014 campaign winners. This demonstrates the superiority and effectiveness of our method.
Abstract: Symbol detection techniques in online handwritten graphics (e.g. diagrams and mathematical expressions) usually consist of methods specifically designed for a single graphic type. In this work, we evaluate the Faster R-CNN object detection algorithm as a general method for detection of symbols in handwritten graphics. We evaluate different configurations of the Faster R-CNN method, and point out issues relative to the handwritten nature of the data. Considering the online recognition context, we evaluate efficiency and accuracy trade-offs of using Deep Neural Networks of different complexities as feature extractors. For experimentation, we use publicly available flowchart and mathematical expressions (CROHME-2016) datasets. We obtained encouraging results on both datasets.
Penghui Sun, Yan Chen, Xiaoqing Lu, Bei Wang, Jingwei Qu and Zhi Tang. A Free-sketch Recognition Method for Chemical Structural Formula
Abstract: Chemical Structural Formula (CSF) recognition plays an important role in molecular design and component retrieval. However, sketch-based CSF recognition remains an obstacle in current retrieval systems. This paper introduces a system for sketched CSF recognition on smart mobile devices. A dual-mode-based method is proposed to distinguish gestures for character inputs from non-character inputs, instead of the ordinary segmentation approaches. An attribute graph model is established to effectively describe all necessary information of a sketched CSF. Chemical knowledge is adopted to refine the candidate structural relationships among elements. The experimental results demonstrate that the proposed method outperforms existing methods for free-sketch CSFs in effectiveness and flexibility.
Alexander Pacha, Kwon-Young Choi, Bertrand Coüasnon, Yann Ricquebourg, Richard Zanibbi and Horst Eidenberger. Handwritten Music Object Detection: Open Issues and Baseline Results
Abstract: Optical Music Recognition (OMR) is the challenge of understanding the content of musical scores. Accurate detection of individual music objects is a critical step in processing musical documents, because a failure at this stage corrupts any further processing. So far, all proposed methods were either limited to typeset music scores or were built to detect only a subset of the available classes of music symbols. In this work, we propose an end-to-end trainable object detector for music symbols that is capable of detecting almost the full vocabulary of modern music notation in handwritten music scores. By training deep convolutional neural networks on the recently released MUSCIMA++ dataset which has symbol-level annotations, we show that a machine learning approach can be used to accurately detect music objects with a mean average precision of up to 80%.

Forensic Document Analysis

Friday, April 27th 14:30 - 16:00

Abstract: The encoding of local features is an essential part for writer identification and writer retrieval. While CNN activations have already been used as local features in related works, the encoding of these features has attracted little attention so far.
In this work, we compare the established VLAD encoding with triangulation embedding. We further investigate generalized max pooling as an alternative to sum pooling and the impact of decorrelation and Exemplar SVMs. With these techniques, we set new standards on two publicly available datasets (ICDAR13, KHATT).
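For context, the VLAD encoding compared above aggregates residuals of local descriptors against their nearest codebook centroid. The following is a minimal generic sketch of standard VLAD with power- and L2-normalisation, not the paper's CNN-activation pipeline; descriptor and codebook contents are arbitrary illustrative data.

```python
import numpy as np

def vlad_encode(descriptors, codebook):
    """VLAD: per-centroid sum of residuals of the assigned descriptors,
    flattened, then power- and L2-normalised."""
    K, D = codebook.shape
    # hard-assign each local descriptor to its nearest centroid
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    enc = np.zeros((K, D))
    for k in range(K):
        members = descriptors[assign == k]
        if len(members):
            enc[k] = (members - codebook[k]).sum(axis=0)
    v = enc.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))  # power (signed square-root) normalisation
    n = np.linalg.norm(v)
    return v / n if n > 0 else v
```

The per-centroid sum here is the sum pooling that the paper contrasts with generalized max pooling.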
Abstract: In this paper, the problem of finding a metric to separate previously unseen writer hands in a forensic setting is addressed using statistical learning. An unsupervised feature learning approach, based on dense contour descriptor sampling, is combined with a novel way of learning a general space for clustering writer hands. The metric learning inference is based on multi-class Gaussian process classification. Evaluation is performed using the popular datasets IAM and CVL combined, containing close to 1000 writer hands. This paper builds on earlier work from our group on building a system for estimating the production dates of medieval manuscripts, and aims to include writer identification for future use on historical data.
Abstract: In the literature, document image watermarking schemes in the spatial domain mainly focus on text content, so they need to be further improved before they can be applied to general document content. In this paper, we propose a blind, invisible watermarking approach for grayscale document images in the spatial domain, based on stable regions and object fill. In order to detect stable regions, the document is transformed into an intermediate form using image processing operations prior to applying the nonsubsampled contourlet transform (NSCT). Next, the separated objects in stable regions are obtained by object segmentation. The stroke and fill of the obtained objects are detected, and only the locations of object fill are marked as referential ones for mapping to gray-level values where data hiding and detection are conducted. Then, the watermarking algorithm is developed by using every group of gray-level values corresponding to the locations of each object fill to carry one watermark bit. The experiments are performed with various document contents, and our approach shows high performance in terms of imperceptibility, capacity and robustness against distortions like JPEG compression, geometric transformation and print-and-scan noise.
Ulrich Scherhag, Christian Rathgeb and Christoph Busch. Towards detection of morphed face images in electronic travel documents
Abstract: The vulnerability of face recognition systems to attacks based on morphed biometric samples has been established in the recent past. Such attacks pose a severe security threat to a biometric recognition system in particular within the widely deployed border control applications. However, so far a reliable detection of morphed images has remained an unsolved research challenge.
In this work, automated morph detection algorithms based on general-purpose pattern recognition algorithms are benchmarked for two scenarios relevant in the context of fraud detection for electronic travel documents, i.e. single-image (no-reference) and image-pair (differential) morph detection. In the latter scenario, a trusted live capture from an authentication attempt serves as an additional source of information, and hence the difference between features obtained from this bona fide face image and a potential morph can be estimated. A dataset of 2,206 ICAO-compliant bona fide face images from the FRGCv2 face database is used to automatically generate 4,808 morphs. It is shown that, in the differential scenario, morph detectors which utilize a score-level fusion of detection scores obtained from a single image and from differences between image pairs generally outperform no-reference morph detectors across the employed algorithms and parameters. On average, a relative improvement of more than 25% in detection equal error rate is achieved.

Poster Session I (Wednesday 25th April, 16:00-17:30)

Zhen Zhu, Minghui Liao, Baoguang Shi and Xiang Bai. Feature Fusion for Scene Text Detection
Abstract: A significant challenge in scene text detection is the large variation in text sizes. In particular, small text is usually hard to detect. This paper presents an accurate oriented text detector based on Faster R-CNN. We observe that Faster R-CNN is suitable for general object detection but inadequate for scene text detection due to the large variation in text size. We apply feature fusion both in the RPN and in Fast R-CNN to alleviate this problem and, furthermore, to enhance the model's ability to detect relatively small text. Our text detector achieves comparable results on ICDAR 2015 and MSRA-TD500, showing its advantage and applicability.
Abstract: Cuneiform tablets are among the oldest textual artifacts, in use for more than three millennia, and are comparable in amount and relevance to texts written in Latin or ancient Greek. We present a complete digital analysis workflow enabling modern text processing on this complex and non-linear script. Our tools encompass the whole pipeline, from digitization and wedge extraction to word-spotting and frequent pattern mining facilities.
Tablets are acquired from different sources requiring different digitization methods, and each representation is typically processed with its own tool-set. To homogenize these data sources, we introduce a unifying minimal wedge constellation description. For this representation, we develop similarity metrics based on the optimal assignment of wedge configurations.
We combine our wedge features with work on segmentation-free word spotting using part-structured models. The presented search and similarity facilities enable the development of advanced linguistic tools for cuneiform sign indexing and spatial n-gram mining of signs.
Piercarlo Dondi, Alessandro Danani, Luca Lombardi, Marco Malagodi and Maurizio Licchelli. Handwriting identification of short historical manuscripts
Abstract: The study of historical documents is an important and complex branch of document analysis. Ancient manuscripts are often altered by time or ambient conditions, and the availability of large data sets is not always guaranteed. In this work, we propose a new solution for the attribution of short historical documents that uses both allographic and texture features. Paleographic data was used as a reference during development. Tests were conducted on Antonio Stradivari’s relics, a series of technical drawings and artifacts of the famous violin maker that contain his working notes, currently held in the “Museo del Violino” in Cremona (Italy). This data set is very complex: it contains only short writings (a few words or text lines), faded or damaged areas, different supports (wood or paper), and various annotations added by the owners of the collection over the centuries. Experimental results are promising, showing an accuracy greater than 90% using short texts both for training and as targets.
Abstract: In 2014, Google announced its object-recognition and OCR service (launched only for Latin-based languages) for images in the cloud, extending its keyword searching capabilities beyond the conventional limitations of text-based search. The implications of this announcement were huge for content-based querying, the need for which grows each day with the exponential increase in the amount of data at our disposal. In this paper, by proposing an OCR framework for recognizing ticker text in Urdu news videos, we take a step toward automating the transcription, indexing and captioning of Urdu news video content. We make four contributions in this paper. Firstly, we produce an extensive database of Urdu news tickers collected from 41 different news channels. In a part of the world where the adoption of Digital Television still remains a distant reality, the inclusion of low-quality, blurred and distorted tickers makes this dataset a robust test case for any automatic Urdu news recognition system arriving in the near future. Secondly, we identify the key challenges in Urdu news ticker text recognition, while also addressing some of them. As a third contribution, we propose an adjustment to the ground-truth labeling strategy to increase the readability of the recognized output. Lastly, we present our recognition results obtained by training a Recurrent Neural Network, with a one-dimensional (1-d) Bi-Directional Long Short-Term Memory (BDLSTM) architecture, on a subset of the dataset.
Abstract: Historical manuscripts act as a source of information about many incidents that happened in the past. In recent years, digitization of large quantities of historical handwritten documents has been in vogue. This trend has generated a plethora of information about our medieval past. However, such digital archives are only useful if automatic indexing and retrieval of document images is provided to the end users of a digital library. An automatic transcription of the full digital archive using traditional Optical Character Recognition (OCR) is still not possible with sufficient accuracy. When full transcription is not available, end users are generally interested in indexing and retrieving particular document pages of interest; hence, recognition of certain keywords from within the corpus will suffice for their needs. Recently, deep-learning-based methods have shown competence in image classification problems. However, one bottleneck of deep-learning-based techniques is that they require a huge number of training samples per class. Since the number of samples per word class is scarce for freshly scanned collections, this is a serious hindrance to the direct use of deep learning for word image recognition in historical document images. This paper investigates the problem of recognizing words in historical document images using a deep-learning-based framework for feature extraction and classification, while countering the low number of image samples with off-line data augmentation techniques. Encouraging results (a highest accuracy of 90.47%) were obtained while dealing with 365 different word classes.
Abstract: Word-spotting methods based on the Bag-of-Visual-Words framework have demonstrated good retrieval performance even when used in a completely unsupervised manner. Although unsupervised approaches are suitable for large document collections due to the cost of acquiring labeled data, these methods also have some drawbacks. For instance, training a suitable “codebook” for a given dataset has a high computational cost. Therefore, in this paper we present a database-agnostic codebook trained from synthetic data. The aim of the proposed approach is to generate a codebook for which the only information required is the type of script used in the document. The use of synthetic data also makes it easy to incorporate semantic information into the codebook generation, so the proposed method is able to determine which set of codewords has a semantic representation in the descriptor feature space. Experimental results show that the resulting codebook obtains better results than the traditional unsupervised method while yielding a more compact representation.
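In the Bag-of-Visual-Words framework mentioned above, the "codebook" is a set of cluster centres over local descriptors; quantizing each descriptor to its nearest codeword yields the word-image representation. A minimal k-means sketch follows, for illustration only; the paper's contribution is training such a codebook on synthetic renderings of a script rather than on the target dataset.

```python
import numpy as np

def train_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means over local descriptors: the resulting
    cluster centres serve as the visual-word codebook."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign every descriptor to its nearest centre, then re-estimate
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def quantize(descriptor, codebook):
    """Map a descriptor to the index of its nearest codeword."""
    return int(np.linalg.norm(codebook - descriptor, axis=1).argmin())
```

A word image is then represented by the histogram of codeword indices of its local descriptors, which is what makes retrieval tractable at collection scale.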
Reem Alaasam, Berat Kurar and Jihad El-Sana. Word Spotting Using Convolutional Siamese Network
Abstract: We present a method for word spotting using a convolutional siamese network, which employs two identical convolutional networks to rank the similarity between two input word images. Once the network is trained, it can be used to spot not only words with varying writing styles and backgrounds, but also out-of-vocabulary words that are not in the training set. Experiments on the historical Arabic manuscript dataset VML and on the George Washington dataset show results comparable with the state of the art.
Han Xiao, Jun Sun and Xiaoyi Yu. Compact Binary Feature for Open Set Recognition
Abstract: Most pattern recognition systems are closed-set recognition systems, in which any input sample is classified as belonging to one of the given classes. This paper, however, addresses the open-set recognition problem, in which a test sample may either come from one of the labeled “known classes” or from an unknown class; the number of unknown classes is potentially unlimited. A compact binary feature is proposed to solve the open-set recognition problem. By randomly partitioning the known classes into two groups, training a binary classifier to separate them, and repeating this procedure many times, rich information is extracted from the training set in the form of an ensemble binary classifier. This ensemble associates any test sample with a compact binary feature that can be matched by Hamming distance, which is very efficient to compute. In experiments on the Dunhuang ancient Chinese character dataset, this ensemble binary classifier boosts recognition performance significantly compared with a single feedforward convolutional neural network (CNN). Moreover, such a compact binary feature is very efficient to store and saves considerable time in feature matching, at the cost of more computation in the training phase.
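The Hamming-distance matching mentioned above is cheap because binary codes can be packed into machine words and compared with XOR plus a popcount. An illustrative sketch follows; the 8-bit codes are made-up values, not features from the paper.

```python
def bits_to_int(bits):
    """Pack a list of 0/1 decisions into a single integer code."""
    out = 0
    for b in bits:
        out = (out << 1) | (b & 1)
    return out

def hamming_distance(a, b):
    """Hamming distance between two equal-length codes: XOR, then popcount."""
    return bin(a ^ b).count("1")

# each random two-group partition of the known classes contributes one bit;
# a test sample's code is the concatenation of the binary decisions
code_a = bits_to_int([1, 0, 1, 1, 0, 0, 1, 0])
code_b = bits_to_int([1, 0, 0, 1, 0, 1, 1, 0])
print(hamming_distance(code_a, code_b))  # → 2
```

Because the comparison reduces to a couple of word-level instructions, matching against a large gallery of stored codes is orders of magnitude faster than comparing dense float vectors.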
Vincent Poulain d'Andecy, Aurélie Joseph and Jean-Marc Ogier. InDUS: Incremental Document Understanding System, focus on document Classification
Abstract: Our concern is to propose a Document Understanding System for digital mailroom applications that can cope with two challenges: (1) not requiring a lot of configuration work from expert users or skilled engineers, and (2) adapting the system incrementally, in quasi real time, to new kinds of documents. We describe an end-to-end system based on existing incremental algorithms for document understanding, i.e., both document classification and field extraction. In this paper, we mainly focus on the document classification issue. The main contribution is to adapt the Incremental Growing Neural Gas (A2IGN) with a dynamic incremental feature vector and to embed this algorithm in a more general framework. Moreover, our generic framework automatically selects textual descriptors based on a performance measure in order to make the A2IGN converge.
Ahmed Hamdi, Mickaël Coustaty, Aurelie Joseph, Vincent Poulain d'Andecy, Antoine Doucet and Jean-Marc Ogier. Feature Selection for Document Flow Segmentation
Abstract: In this paper, we describe a method to restore a flow of continuous documents.
The flow is a collection of consecutive scanned pages without explicit separation marks between documents. Our method is based on contextual and layout descriptors meant to specify the relationship between each pair of consecutive pages. The relationships are represented using feature vectors with boolean values indicating the presence or absence of descriptors on the concerned pages. The segmentation task therefore consists in classifying such vectors as continuities or breaks. The continuity class indicates that the pages belong to the same document, while the break class ends the ongoing document and starts a new one. The experimental part is based on a large collection of real administrative documents.
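As a concrete illustration of the boolean pair-of-pages vectors described above, one might build and classify them as follows; the attribute names, weights and threshold are hypothetical placeholders, not the paper's actual descriptors or classifier.

```python
def page_pair_features(page_a, page_b):
    """Boolean descriptors over two consecutive scanned pages.
    The attributes used here are illustrative placeholders."""
    return [
        int(page_a["sender"] == page_b["sender"]),                # same sender?
        int(page_b["page_number"] == page_a["page_number"] + 1),  # numbering continues?
        int(page_b["has_letterhead"]),                            # letterhead often opens a new document
    ]

def classify(features, weights, bias):
    """A linear separator over the boolean vector:
    'continuity' keeps the ongoing document, 'break' starts a new one."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return "continuity" if score > 0 else "break"

pair = page_pair_features(
    {"sender": "ACME", "page_number": 1, "has_letterhead": True},
    {"sender": "ACME", "page_number": 2, "has_letterhead": False},
)
print(classify(pair, [1.0, 1.0, -2.0], -1.5))  # → continuity
```

In practice the weights would be learned from labeled page pairs; the point here is only the shape of the representation, a fixed-length boolean vector per consecutive pair.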
Vincent Poulain d'Andecy, Emmanuel Hartmann and Marçal Rusiñol. Field Extraction by hybrid incremental and a priori structural templates
Abstract: In this paper, we present an incremental framework for extracting field information from administrative documents. First, we demonstrate some limitations of existing state-of-the-art methods, such as the delay before the system becomes efficient, which is a concern in an industrial context where only a few samples of each document class are available. Based on this analysis, we propose a hybrid system combining incremental learning by means of tf-idf statistics with a-priori generic models. We report in the experimental section our results obtained on a large dataset of real invoices.
Yuki Daiku, Motoi Iwata, Olivier Augereau and Koichi Kise. Comics Story Representation System Based on Genre
Abstract: Comics are usually classified into broad categories called "genres" according to their contents, such as comedy, horror, science fiction, etc. Because a genre expresses a comic's story briefly, people find comics matching their interests by relying on genres. However, assigning only one genre to a comic cannot express the detailed differences between stories. In this paper, we propose a system for generating a comics story representation as a sub-sequence of genres. Our comics story representation can be applied to a new search engine based on stories, or to a recommendation system that analyzes the tastes of a user's favorite comics by finding comics with similar story representations. We use a deep neural network to classify each comics page into the corresponding genre. Experimental results confirm the performance advantage of the proposed system.
Abstract: In the context of document image analysis, image binarization is an important preprocessing step for other document analysis algorithms, but it is also relevant on its own, as it improves the readability of images of historical documents. While historical document image binarization is challenging due to common image degradations such as bleed-through, faded ink or stains, achieving good binarization performance in a timely manner is a worthwhile goal to facilitate efficient information extraction from historical documents.
In this paper, we propose a recurrent neural network based algorithm using Grid Long Short-Term Memory cells for image binarization, as well as a pseudo F-Measure based weighted loss function. We evaluate the binarization and execution performance of our algorithm for different choices of footprint size, scale factor and loss function. Our experiments show a significant trade-off between binarization time and quality for different footprint sizes. However, we see no statistically significant difference when using different scale factors and only limited differences for different loss functions. Lastly, we compare the binarization performance of our approach with the best performing algorithm in the 2016 handwritten document image binarization contest and show that both algorithms perform equally well.
Abstract: With advancements in deep learning, artificial neural networks have been used increasingly in various document analysis problems such as character recognition, layout analysis, and orientation identification of documents. However, because of the ambiguity of document images (caused by complicated appearances, multiple languages, etc.), it is difficult to use Convolutional Neural Networks (CNNs) directly for orientation identification. To solve this problem, we present offset neural networks (ONNs), a new type of neural network designed specifically for orientation identification. The ONN successfully reduces the negative influence of ambiguous parts whose orientation cannot be distinguished, while the distinguishable parts of the document are enhanced, further improving the performance of the whole model. In our experiments, the ONN shows better performance and robustness than an ordinary CNN, and even in some extreme cases it is still able to find the correct orientation. To our knowledge, no dedicated neural network for orientation identification has been proposed before, which makes our work practical and innovative.
Abstract: Nowadays, the number of mobile applications is growing fast. Among them, mobile applications based on Optical Character Recognition (OCR) play an important part. One of the main challenges such applications must overcome is that the image acquisition procedure is inherently unreliable and may introduce many distortions. As a consequence, a suitable OCR output requires effort to enhance the quality of the captured image, which increases computation time and cost. In this paper, we focus on the prediction of an image's sharpness quality, because blur distortions seriously alter readability for both humans and computers. Our contribution consists of a method combining focus and sharpness measures as image descriptors with a Support Vector Machine to classify an image's sharpness quality. This approach is fast, reliable, and can be easily implemented on a mobile device. Experimental results show that our method is efficient for OCR-based mobile-captured document images.
Muhammad Hanif, Anna Tonazzini, Pasquale Savino, Emanuele Salerno and Greg Tsagkatakis. Document Bleed-through Removal using Sparse Image Inpainting
Abstract: Bleed-through is a pervasive degradation in ancient documents, caused by ink from the opposite side of the sheet that has seeped through the paper fiber, and appears as extra, interfering text. Bleed-through severely impairs document readability and makes it difficult to decipher the information on the page. Digital image restoration techniques have been successfully employed to remove or significantly reduce this distortion. The main theme of such techniques is to identify the bleed-through pixels and estimate an appropriate replacement for them, in accordance with their surroundings. This paper proposes a two-step image restoration method exploiting information from the recto and verso images. First, based on a non-stationary linear model of the two texts overlapped in the recto-verso pair, the bleed-through pixels are identified. In the second step, a sparse-representation-based image inpainting technique with a non-negative sparsity constraint is used to find an appropriate replacement for the bleed-through pixels. Thanks to the power of dictionary learning and sparse image reconstruction methods, the natural texture of the background is well reproduced in the bleed-through areas, and even a possible overestimation of the bleed-through areas is effectively corrected, so that the original appearance of the document is preserved. The experiments are conducted on the images of a popular database of ancient documents, and the results validate the performance of the proposed method compared to the state of the art.
Abstract: We propose a high-performance fully convolutional neural network (FCN) for historical handwritten document segmentation that is designed to process a single page in one step. The advantage of this model, besides its speed, is its ability to learn directly from raw pixels instead of relying on preprocessing steps, e.g., feature computation or superpixel generation. We show that this network yields better results than existing methods on different public datasets. To evaluate this model, we introduce a novel metric that is independent of ambiguous ground truth, called Foreground Pixel Accuracy (FgPA). This pixel-based measure counts only foreground pixels in the binarized page; any background pixel is omitted. The major advantage of this metric is that it enables researchers to compare different segmentation methods on their ability to successfully segment text or pictures, and not on their ability to learn, and possibly overfit, the peculiarities of an ambiguous hand-made ground truth segmentation.
Abstract: In this paper, we explore the use of second-order derivatives to detect text lines in handwritten document images. Taking advantage of the fact that the second derivative gives a minimum response when a dark linear element over a bright background has the same orientation as the filter, we use this operator to create a map of the local orientation and strength of putative text lines in the document. Then, we detect line segments by selecting and merging the filter responses that have a similar orientation and scale. Finally, text lines are found by merging the segments that lie within the same text region. The proposed segmentation algorithm is learning-free while showing performance similar to state-of-the-art methods on publicly available datasets.
Abstract: Layout analysis, mainly comprising binarization and page/line segmentation, is one of the most important performance-determining steps of an OCR system for complex medieval historical document images, which contain noise, distortions and irregular layouts. In this paper, we present a novel layout error correction technique which includes a VGG Net to classify non-text lines and an adversarial-network approach to obtain the layout bounding mask. The presented layout error correction technique is applied to a collection of 15th-century Latin documents, achieving more than 75% accuracy for the segmentation techniques.
Xiaode Zhang, Liangcai Gao, Yilun Huang, Yaoxiong Huang, Lianwen Jin, Dong An and Zhi Tang. A Sequence Labeling Based Approach for Character Segmentation of Historical Documents
Abstract: As an important prerequisite of historical document image analysis, character segmentation is fundamental but challenging. In this paper, we propose a novel approach to the character segmentation of historical documents by treating it as a sequence labeling problem. In more detail, the proposed model first segments the document image into lines; each column in a line image is then given a label indicating whether or not it is a segmentation position. The segmentation labeling is achieved by a neural model which combines a CNN for feature extraction, an LSTM for sequence modeling and a CRF for sequence labeling. The performance of our method has been evaluated on a 300-page dataset comprising 96,479 characters. The experimental results demonstrate that the proposed method achieves superior or highly competitive performance compared with other methods.
Abstract: The complete track system of Germany covers more than 42,000 kilometers, some of it built before 1970. As a consequence, technical drawings are typically of manual origin. Newer plans are generated in a computer-aided way but remain drawings in the sense that semantics are not captured in the electronic files themselves: the engineer decides the meaning of a symbol while viewing the document. For project realization (e.g., engineering of an interlocking system), these plans are digitized manually into a machine-interpretable format.
In this paper, we propose an approach to digitize track layouts (semi-)automatically, using fingerprint recognition techniques to digitize manually created track plans efficiently. First, we detect tracks by detecting line endings and bifurcations. Second, we eliminate false candidates and irregularities. Finally, we translate the resulting graph into RailML, an interchange format.
We evaluate our method by comparing our results with different track plans. Our results indicate that the proposed method is a promising candidate, reducing the effort of digitization.
Gantugs Atarsaikhan, Brian Kenji Iwana and Seiichi Uchida. Contained Neural Style Transfer for Decorated Logo Generation
Abstract: Making decorated logos requires image-editing skills; without sufficient skills it can be a time-consuming task. Although there are many online web services for creating new logos, they offer limited designs and may produce duplicates. We propose using neural style transfer with clip art and text for the creation of new and genuine logos. We introduce a new loss function based on the distance transform of the input image, which allows the silhouettes of text and objects to be preserved. The proposed method confines style transfer to a designated area only. We demonstrate the characteristics of the proposed method and, finally, show the results of logo generation with various input images.

Poster Session II (Thursday 26th April, 16:00-17:30)

Syed Saqib Bukhari, Manabendra Saha, Praveen Kumar Badimala Giridhara, Manesh Kumar Lohano and Andreas Dengel. anyAlign: An Intelligent and Interactive Text-Alignment Web-Application for Historical Document
Abstract: Text alignment is an important performance-determining step of an OCR system for printed and historical documents. With the increase in available transcripts, it becomes important to align the text with that of the transcripts, which is time- and labor-intensive work for many paleographers. Here we present an end-to-end, semi-automatic, interactive text-alignment system for historical documents. OCRopus [14] is used for binarization and line segmentation of the historical document image. Text line segmentation followed by text alignment is carried out automatically by the system using ORB (Oriented FAST and Rotated BRIEF) local image feature descriptors, which are matched by k-NN. The system provides an interactive user interface for rectifying wrong text segmentation and text alignment. The results are discussed in the evaluation section.
Abstract: State-of-the-art Handwritten Text Recognition techniques are based on statistical models, such as hidden Markov models or recurrent neural networks for the optical modeling of characters, and N-grams for language modeling. These models are trained using well-known learning techniques, most of them based on the Expectation-Maximization algorithm, and therefore training data is needed to build them. For the optical models, the training data consist of text line images with their corresponding transcripts. When the transcript of a handwritten document is available, automatically putting the physical lines in the images into correspondence with the lines of the transcript is not an easy task. We present a method for automatically aligning handwritten text images with their respective transcripts. The approach automatically segments the images into lines and then recognizes them. An alignment confidence is obtained using the Levenshtein distance between the recognition results and the transcripts. The most confident lines are then used for training. Experiments carried out on a historical document show encouraging results.
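As a rough illustration of the confidence measure described in this abstract (an editorial sketch only, not the authors' implementation), the Levenshtein distance between a recognized line and its transcript can be normalized into an alignment score:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, using two rows of memory."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def alignment_confidence(recognized: str, transcript: str) -> float:
    """1.0 for a perfect match, decreasing toward 0.0 as the texts diverge."""
    if not recognized and not transcript:
        return 1.0
    d = levenshtein(recognized, transcript)
    return 1.0 - d / max(len(recognized), len(transcript))
```

Lines whose confidence exceeds a chosen threshold would then be kept as training pairs, while low-confidence alignments are discarded.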
Abstract: Web Services are being adopted into the workflows of many Document Image Analysis researchers. However, so far, there is no common platform for providing access to algorithms in the community. DIVAServices aims to become this platform by being open to the whole community to provide their own methods as Web Services. In this paper we present updates and enhancements to the existing DIVAServices platform. These include a new computational backend, a revamped execution workflow based on asynchronous communication, and the possibility for methods to specify their outputs. Furthermore, we discuss the importance of an "ecosystem" for such a platform. We argue that providing a RESTful API alone is not enough: users need tools and services around the framework that support them in adopting the Web Services, and we introduce some of the tools that we built around DIVAServices.
Mathias Seuret, Manuel Bouillon, Fotini Simistira, Marcel Würsch, Marcus Liwicki and Rolf Ingold. A Semi-Automatized Modular Annotation Tool for Ancient Manuscript Annotation
Abstract: In this paper, we present DIVAnnotation, an ancient-document annotation tool which is freely available as open source. The software is easily modular thanks to the splitting of the different annotation steps across a tabbed graphical user interface. State-of-the-art document image analysis methods are included through web services, allowing users to generate annotations automatically and correct them manually when needed. The annotations are stored in a highly structured TEI file, which makes data access and manipulation simple. A Java library for managing TEI files generated by DIVAnnotation is also provided as open source.
Aliona Dangla, Elodie Puybareau, Guillaume Tochon and Jonathan Fabrizio. A first step toward a fair comparison of evaluation protocols for text detection algorithms
Abstract: Text detection is an important topic in pattern recognition, but evaluating the reliability of detection algorithms is challenging. While many evaluation protocols have been developed for that purpose, they often show dissimilar behaviors when applied in the same context. As a consequence, their usage may lead to misinterpretations, potentially yielding erroneous comparisons between detection algorithms or incorrect parameter tuning. This paper is a first attempt to derive a methodology for comparing evaluation protocols. We apply it to five state-of-the-art protocols and show that inconsistencies do exist among their evaluation criteria. Our aim here is not to rank the investigated evaluation protocols, but rather to raise awareness in the community that we should carefully reconsider them in order to converge to their optimal usage.
Tobias Grüning, Markus Diem, Florian Kleber, Stefan Fiel and Roger Labahn. READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents
Abstract: Text line detection is crucial for any application associated with Automatic Text Recognition or Keyword Spotting. Modern algorithms perform well on well-established datasets, since these comprise either clean data or simple/homogeneous page layouts. We have collected and annotated 2036 archival document images from different locations and time periods. The dataset contains varying page layouts and degradations that challenge text line segmentation methods. Well-established text line segmentation evaluation schemes, such as the Detection Rate or Recognition Accuracy, require binarized data annotated at the pixel level. Producing ground truth by these means is laborious and not needed to determine a method's quality. In this paper we propose a new evaluation scheme based on baselines. The proposed scheme has no need for binarization and can handle skewed as well as rotated text lines. The ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts used this evaluation scheme. Finally, we present results achieved by a recently published text line detection algorithm.
Abstract: This paper proposes a new evaluation metric based on the existing ZoneMap metric. The ZoneMap method, designed to perform zone segmentation evaluation and classification, is considered in the context of OCR evaluation. Its limits are identified and described, and a new algorithm, ZoneMapAlt (ZoneMap Alternative), is proposed to address them while keeping the properties of the original. To validate the new metric, experiments have been carried out on a dataset of scientific articles. The results demonstrate that the ZoneMapAlt algorithm provides greater detail on segmentation errors and is able to detect critical segmentation errors.
Abstract: Based on the hypothesis that a good/poor quality document image is most probably a readable/unreadable document, document image quality and readability have been used interchangeably in the literature. These two terms, however, have different meanings, implying two different perspectives from which human beings look at document images. In document images, the level of quality and the degree of readability may have a relation/correlation in terms of human perception. However, to the best of our knowledge there is no specific study that characterizes this relation and validates the abovementioned hypothesis. In this work, we first created a dataset composed mostly of camera-based document images with various distortion levels. Each document image was then assessed by different individuals with regard to two different measures: the level of quality and the degree of readability. A detailed Normalised Cross Correlation analysis, along with statistical analyses based on the Shapiro-Wilk and Wilcoxon tests, is provided to demonstrate how document image quality and readability are linked. Our findings indicate that the correlation between quality and readability was as high as 0.99, with a very low p-value (2.2e-16). The statistical tests further revealed that quality and readability were somewhat different in terms of the medians and distributions of the measures.
Abstract: As digitised documents normally contain a large variety of structures, a page segmentation- and layout-free method for document image retrieval is preferable. In this research work, therefore, the wavelet transform is initially used as a transform-based approach to provide different under-sampled images from the original image. Then, the Gist operator is employed as a feature extraction technique to extract a set of global features from the original image as well as from the sub-images obtained from the wavelet transform. Moreover, the column-wise variances of the values in each sub-image are computed and concatenated to obtain another set of features. For each feature set, locality-sensitive hashing is employed to compute similarity distances between a query and the document images in the database. Finally, a classifier fusion technique using the mean function is applied to produce the document image retrieval result. The combination of these features and a clustering score fusion strategy provides higher document image retrieval accuracy. Two different document image databases are considered for experimentation, and the results obtained from the experimental study are encouraging.
Abstract: Identity document verification consists of checking a document's conformity to one, or possibly a set of, authentic documents. This verification is usually performed through visible-pattern matching. In this paper, we propose a new, efficient visual descriptor for pattern comparison. Whereas most existing descriptors incorporate either color or spatial information, the proposed descriptor, called Grid-3CD, includes both. It is based on color connected components (CCs) extracted from a color-quantized image and consists of a set of 6-tuples computed on a grid of pixels sampled from that image. The 6-tuple of a given pixel describes the density, the mass center, the bounding box and the color of the CC that contains this pixel. The efficiency of this descriptor for identity document verification is shown using two pattern-comparison strategies. The first is unsupervised and based on a distance measure, whereas the second is supervised and based on a one-class Support Vector Machine (SVM). Experiments with the new descriptor on four identity-document datasets totaling 3250 documents show an average accuracy of about 90%, outperforming state-of-the-art descriptors.
Chandranath Adak, Simone Marinai, Bidyut B. Chaudhuri and Michael Blumenstein. Offline Bengali Writer Verification by PDF-CNN and Siamese Net
Abstract: Automated handwriting analysis is a popular area of research owing to its evolving variation of patterns. Writer verification is one of the most challenging branches of this research, with a direct impact on biometrics and forensics. In this paper, we deal with offline writer verification on complex handwriting patterns. We therefore choose a comparatively complex script, the Indic Abugida script Bengali (or Bangla), which has more than 250 compound characters. From a handwritten sample, probability distribution functions (PDFs) of some handcrafted features are obtained and input to a convolutional neural network (CNN). For such a CNN architecture, we coin the term "PDF-CNN", in which handcrafted feature PDFs are hybridized with auto-derived CNN features. These hybrid features are then fed into a Siamese neural network for writer verification. The experiments are performed on an offline Bengali handwriting dataset of 100 writers. Our system achieves encouraging results which sometimes exceed those of state-of-the-art techniques on writer investigation.
Abstract: Smartphones have become an easy and convenient means of acquiring documents. In this paper, we focus on the automatic segmentation of identity documents in smartphone photos or videos using visual saliency (VS). VS-based approaches, which pertain to computer vision, have not yet been considered for this particular task. Here we compare different VS methods, and we propose a new VS scheme based on a recent distance belonging to the scope of mathematical morphology. We show that the saliency maps we obtain are competitive with state-of-the-art visual saliency methods and that such approaches are very promising for identity document detection and segmentation, even without taking into account any prior knowledge about document contents. In particular, they can work in real time on smartphones.
Abstract: Multispectral imaging allows for the analysis of images in multiple spectral bands. Over the past three decades, airborne and satellite multispectral imaging have been the focus of extensive research in remote sensing. In recent years, ground-based multispectral imaging has gained immense interest in the areas of computer vision, medical imaging, and art and archaeology. The rich information content in multispectral images allows forensic experts to examine the chemical composition of forensic traces. Due to its rapid, non-contact and non-destructive characteristics, multispectral imaging is an effective tool for the visualization, age estimation, detection and identification of forensic traces in document images. Ink mismatch is a key indicator of forgery in a document. Inks of different materials exhibit different spectral signatures even if they have the same color, so multispectral analysis of questioned document images allows the identification and discrimination of visually similar inks. In this paper, an efficient automatic ink mismatch detection technique is proposed which uses Fuzzy C-Means Clustering (FCM) to divide the spectral responses of ink pixels in handwritten notes into clusters corresponding to the unique inks used in the document. Sauvola's local thresholding technique is employed to efficiently segment foreground text from the document image. Furthermore, feature selection is used to optimize the performance of the proposed method. The proposed method gives a clear discrimination between inks in questioned documents. This research contributes to exploring the potential of multispectral imaging in document analysis and forensic science.
Abstract: When extracting information from handwritten documents, text transcription and named entity recognition are usually treated as separate, subsequent tasks. This has the disadvantage that errors in the first module heavily affect the performance of the second. In this work we propose to perform both tasks jointly, using a single neural network with a common architecture used for plain text recognition. Experimental results are presented showing the effect on performance of different ways of encoding the information, of using or not using transfer learning, and of processing at text-line or region level. The results are comparable to those obtained in the ICDAR 2017 Information Extraction competition, even though the proposed technique does not use any dictionaries, language modeling or post-processing.
Abstract: Offline handwriting recognition systems often use LSTM networks trained with line or word images. Multi-line text makes it necessary to use segmentation to explicitly obtain these images. Skewed, curved, overlapping or incorrectly written text, as well as noise, can lead to errors during the segmentation of multi-line text and reduce the overall recognition capacity of the system. Last year saw the introduction of deep learning methods capable of segmentation-free recognition of whole paragraphs. Our method uses Conditional Random Fields to represent text and align it with the network output in order to calculate a loss function for training. Experiments are promising and show that the technique is capable of training an LSTM multi-line text recognition system.
Abstract: Deep neural network architectures have been used successfully in various document analysis studies. Their strength in producing human-like performance has already been explored in handwritten English numeral recognition tasks. In this context, a natural question often arises in practitioners' minds: does an increase in the depth of the network eventually lead to improved recognition performance on unknown samples? A goal of the present work is to search for an answer through a case study of a larger-class handwriting recognition problem: the recognition of handwritten Devanagari characters. In this study, we have implemented convolutional neural network (CNN) architectures of five different depths. We have also implemented additional neural architectures by adding two Bidirectional Long Short-Term Memory (BLSTM) layers between the convolutional stack and the fully connected part of each of these five CNN networks. Simulations have been performed on two different databases of handwritten Devanagari characters, consisting of 30408 and 36172 samples, and on a combined set of 58451 samples. The recognition accuracy obtained in the best case significantly improves the existing state of the art for this handwriting recognition problem, and further analysis of our simulation results provides an answer to the above question. Additionally, we have trained a BLSTM network alone using Histogram of Oriented Gradients (HOG) features; the performance of this latter architecture failed to compete with that of the CNN-BLSTM hybrid architecture.
Shah Nawaz, Alessandro Calefati, Nisar Ahmed and Ignazio Gallo. Hand Written Characters Recognition via Deep Metric Learning
Abstract: Deep metric learning plays an important role in measuring similarity through distance metrics among arbitrary groups of data. The MNIST dataset is typically used to measure similarity; however, this dataset has few seemingly similar classes, making it less effective for deep metric learning methods. In this paper, we created a new handwritten dataset named Urdu-Characters, with an interesting set of classes suitable for deep metric learning. With this work, we compare the performance of two state-of-the-art deep metric learning methods, i.e., Siamese and Triplet networks. We show that a Triplet network is more powerful than a Siamese network. In addition, we show that the performance of a Triplet or Siamese network can be improved using more powerful underlying Convolutional Neural Network architectures.
Christian Reul, Uwe Springmann, Christoph Wick and Frank Puppe. Improving OCR Accuracy on Early Printed Books by utilizing Cross Fold Training and Voting
Abstract: In this paper we introduce a method that significantly reduces the character error rates of OCR text obtained from OCRopus models trained on early printed books. The method uses a combination of cross-fold training and confidence-based voting. After allocating the available ground truth into different subsets, several training processes are performed, each resulting in a specific OCR model. The OCR texts generated by these models are then voted on to determine the final output, taking into consideration the recognized characters, their alternatives, and the confidence values assigned to each character. Experiments on seven early printed books show that the proposed method considerably outperforms the standard approach, reducing the number of errors by up to 50% and more.
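The voting step described in this abstract can be pictured, in a much-simplified form, as summing per-model confidences for each character position and keeping the highest-scoring character. The sketch below is an editorial illustration under the strong assumption that the model outputs have already been aligned to equal length; the actual method also considers character alternatives:

```python
from collections import defaultdict

def vote_character(candidates):
    """candidates: list of (char, confidence) pairs, one per OCR model.
    Returns the character with the highest summed confidence."""
    scores = defaultdict(float)
    for ch, conf in candidates:
        scores[ch] += conf
    return max(scores, key=scores.get)

def vote_line(per_model_outputs):
    """per_model_outputs: one (char, confidence) sequence per model,
    assumed pre-aligned so all sequences have equal length."""
    return "".join(vote_character(pos) for pos in zip(*per_model_outputs))
```

For example, if one fold's model misreads "l" as "1" with low confidence while the other folds agree on "l" with high confidence, the summed score keeps the correct character.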
Abstract: Although the performance of state-of-the-art OCR systems is very high, they can still introduce errors for various reasons, and when it comes to historical documents with old manuscripts, the performance of such systems gets even worse. That is why post-OCR error correction has been an open problem for many years, and many state-of-the-art approaches have been introduced in recent years.
This paper contributes to the field of post-OCR error correction by introducing two novel deep learning approaches to improve the accuracy of OCR systems, and a post-processing technique that can further enhance the quality of the output. These approaches are based on Neural Machine Translation and were motivated by the great success that deep learning has brought to the field of Natural Language Processing. Finally, we compare state-of-the-art approaches to post-OCR error correction with the newly introduced systems and discuss the results.
Abstract: This paper presents our recent study of segmentation-free recognition of unconstrained cursive online handwriting in Devanagari and Bangla, the two most popular Indic scripts. We have devised an efficient algorithm for obtaining the core (or busy) region of a handwritten word and compute a robust feature set that includes certain measures based on knowledge of this core region. It also includes first- and second-order discrete derivatives, which estimate certain geometric properties of discrete curves along the pen trajectory of the input online handwritten word. The proposed feature vector is obtained at each point on the trajectory of the preprocessed word sample, based on a window (called a vicinity) centered at the point. A hybrid deep neural network architecture, consisting of a convolutional neural network (CNN), a bidirectional recurrent neural network (BiRNN) with Long Short-Term Memory (LSTM) cells, and a connectionist temporal classification (CTC) layer, receives this high-level feature vector as input for labelling the character sequence in the online word sample. This study shows that the present hybrid architecture recognizes online handwriting more efficiently than a BLSTM network alone.
Rathin Radhakrishnan Nair, Nishant Sankaran, Bhargava Urala Kota, Sergey Tulyakov, Srirangaraj Setlur and Venu Govindaraju. Knowledge Transfer using Neural network based approach for Handwritten Text Recognition
Abstract: The goal of a writer-adaptive handwriting recognition system is to build a model that improves the recognition of a generic recognition model for a specific author. In this work, we show how structural representations learned by a generic, writer-independent handwriting recognition model can be customized to individual authors. Convolutional Neural Networks (CNNs) have shown outstanding performance in learning image-based representations used for classification in areas like visual recognition. Additionally, they have been used together with Recurrent Neural Networks (RNNs), or variations such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) layers, to analyze and understand sequences in handwriting recognition, sentence analysis, speech recognition, etc. In most cases, the CNNs serve as a feature extractor, replacing the low-level hand-designed features previously used for the above-mentioned classification tasks.
We design a method to reuse layers trained on the IAM offline handwriting dataset to compute mid-level image representations for text in the Washington and Bentham datasets. We show that, despite differences in writing style, fonts, etc. across these datasets, the transferred representation is able to capture a spatiotemporal representation, leading to significantly improved recognition results. Additionally, we hypothesize that the performance is not solely dependent on the number of samples, and our experimental evaluation tests the model with varying amounts of fine-tuning samples, showing promising results.
Abstract: Ancient Sanskrit manuscripts are a rich source of knowledge about Science, Mathematics, Hindu mythology, Indian civilization, and culture. It therefore becomes critical that access to these manuscripts is made easy, to share this knowledge with the world and to facilitate further research on this Ancient literature. In this paper, we propose a Convolutional Neural Network (CNN) based Optical Character Recognition system (OCR) which accurately digitizes Ancient Sanskrit manuscripts (Devanagari Script) that are not necessarily in good condition. We use an image segmentation algorithm for calculating pixel intensities to identify letters in the image. The OCR considers typical compound characters (half letter combinations) as separate classes in order to improve the segmentation accuracy. The novelty of the OCR is its robustness to image quality, image contrast, font style and font size, which makes it an ideal choice for digitizing soiled and poorly maintained Sanskrit manuscripts.

Short Papers

Short Papers Booklet

Poster Session I (Wednesday 25th April, 16:00-17:30)

Abstract: This paper presents work that has been carried out at the National Library of Finland to detect names of locations and persons in a Finnish historical newspaper and journal collection covering 1771–1929. The work and results reported in the paper are based on a 500,000-word ground truth (GT) sample of the Finnish-language part of the whole collection, with varying Optical Character Recognition quality.
Named Entity Recognition (NER), i.e., the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts (newspapers, fiction, historical records) and many types of entities (persons, locations, chemical compounds, protein families, animals, etc.). The performance of a NER system is usually heavily genre- and domain-dependent. The entity categories used in NER may also vary; the most commonly used set of named entity categories is some version of the tripartite categorization into locations, persons and organizations [1].
In our work we use a standard trainable statistical NER engine, Stanford NER. Considering the quality of our data and the complexities of the Finnish language, our NER results can be considered good. With our ground truth data we achieve an F-score of 0.89 for locations and 0.81 for persons. With re-OCRed Tesseract v. 3.04.01 output, the F-scores are 0.79 and 0.72, respectively, for locations and persons.
Abstract: With Document Image Analysis gaining a strong foothold in the domain of paleography and the emergence of Human-Document Interaction as a narrative, we effectively need to reimagine the way scholars in humanities currently interact with digitized manuscripts. We initially evaluate related work in the current context of post-WIMP interfaces and then introduce our proposed system AMAP. With a strong focus on manuscript exploration, it attempts to harness the current state-of-the-art interaction paradigms to develop an intuitive system to engage with the manuscripts.
Shubham Gupta, Jayanta Mukherjee, Dipali Bhattacharya, Himadri Majumder, Rahul Roy and Bidyut Chaudhuri. An Efficient approach for designing Deep Learning Network on Title extraction for Architecture, Engineering & Construction Documents
Abstract: To date, the Architectural, Engineering and Construction industry has relied heavily on paper documents and their management. Before recent technical advances, digitization of such documents was a time-intensive manual process. For design documents, the title block is a unique identifier, which facilitates features like search and automatic hyperlinking of documents. In this paper, automation of title extraction is achieved through OCR on the document image. This process involves multiple steps: scanning of documents, automatic title block selection, applying OCR on the block, and extraction of the relevant information. However, the user still has to review all the extracted information to obtain error-free data, and the practical challenges at each step add up to a considerable cumulative error rate. We propose a hybrid system capable of learning in real time, using machine/deep learning applied to user-specific data. This approach substantially reduces user intervention, improving the user experience and the efficiency of working on the platform. The learning algorithm is based on a simple feed-forward neural network designed to work around the common drawbacks of deep learning, namely the need for large datasets and high computational power during training.
Mohammad Mohsin Reza, Md. Ajraf Rakib, Syed Saqib Bukhari and Andreas Dengel. A High-Performance Document Image Layout Analysis for Invoices
Abstract: Layout analysis is an important step in the OCR pipeline, and intensive research is currently under way on extracting searchable full text from scanned images. Invoices differ in nature from pages of books, magazines, loan documents and other materials: they contain tables, headers, footers, large white spaces, currencies, item names, item amounts and logos. Standard layout analysis proves inefficient on invoices. In this paper we propose an advanced layout analysis for invoices that adds the following steps to the standard layout analysis: (i) removal of table cell lines, (ii) reassigning the page frame, (iii) merging blocks, and (iv) cropping full text lines from blocks. Additionally, we integrated the proposed layout analysis into the anyOCR system, which was mainly developed for standard document layouts. In the performance evaluation section, we compare our advanced layout analysis pipeline with the standard anyOCR pipeline and with a commercial OCR system (ABBYY). Our advanced layout analysis achieves better OCR accuracy than the other systems mentioned.
Anoop R Katti, Johannes Hoehne, Steffen Bickel and Jean Baptiste Faddoul. Applying Sequence-to-Mask Models for Information Extraction from Invoices
Abstract: Automatically extracting information from scanned invoices can result in significant time and cost savings in Accounts Payable processing. We present an RNN model for extracting information from scanned invoices. Our model significantly outperforms a Random Forest baseline.

Poster Session II (Thursday 26th April, 16:00-17:30)

Abstract: This paper presents work that has been carried out at the National Library of Finland to improve the optical character recognition (OCR) quality of a Finnish historical newspaper and journal collection, 1771–1929. The work and results reported in the paper are based on a 500 000 word sample of the Finnish-language part of the whole collection. The sample has three parallel parts: a manually corrected ground truth version, the original OCR with ABBYY FineReader v. 7 or v. 8, and an ABBYY FineReader v. 11 re-OCRed version. Based on this sample and its page image originals we have developed a re-OCRing procedure using the open source software package Tesseract v. 3.04.01. Our re-OCR methods include image preprocessing techniques, usage of a morphological analyzer, and a set of weighting rules for the resulting words. Besides results based on the GT sample, we also present re-OCR results for a 10-year period of one newspaper in our collection, Uusi Suometar.
Abstract: Scene character recognition is challenging owing to various factors such as rotation, geometric distortion, uncontrolled lighting, blur, and noise. In addition, Japanese character recognition requires a large amount of training data, since thousands of character classes exist in the language. In order to enhance Japanese scene character recognition, we have developed a training data augmentation method and a recognition system using multi-scale classifiers. Experimental results show that the multi-scale scheme effectively improves the recognition accuracy.
Abstract: Reducing the computational cost of handwritten Chinese character recognition is crucial, especially on mobile devices with limited processor performance. In this paper, we propose a candidate reduction technique based on the combination of a CNN-based feature extractor and Hierarchical Overlapping Clustering. Experimental results show that the CNN can be successfully combined with our previous candidate reduction method, yielding 7.73% higher accuracy at a 16% faster speed.
Martin Schall, Haiyan Buehrig, Marc-Peter Schambach and Matthias Franz. LSTM Networks for Edit Distance Calculation with Exchangeable Dictionaries
Abstract: Algorithms for calculating the string edit distance are used, e.g., in information retrieval and document analysis systems, or for the evaluation of text recognizers. Text recognition based on CTC-trained LSTM networks includes a decoding step to produce a string, possibly using a language model, and evaluation using the string edit distance. The decoded string can further be used as a query for database search, e.g. in document retrieval. We propose to closely integrate dictionary search with text recognition and to train both combined in a continuous fashion. This work shows that LSTM networks are capable of calculating the string edit distance while allowing for an exchangeable dictionary to separate the learned algorithm from the data. This could be a step towards integrating text recognition and dictionary search in one deep network.
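For context, the string edit (Levenshtein) distance that such a network would learn to approximate is classically computed by dynamic programming; a standard textbook sketch, independent of the paper's implementation:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```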
Abstract: This paper introduces a continuous competition and the underlying system that enables it based on the ICDAR Competition on Recognition of Documents with Complex Layouts – the most recent being RDCL2017. It is shown how researchers can perform the evaluation of their results using new functionality of the Aletheia system and how the outcome can be published on the competition website for comparison with other evaluated approaches.
Jake Walker, Yasuhisa Fujii and Ashok Popat. A Web-Based OCR Service for Documents
Abstract: Google has developed a system capable of high-accuracy OCR over many languages. It is available for general use via the Google Cloud Vision API. This paper outlines the most recent instantiation of the system behind the API: its structure and organization and the basic functioning of its components. We explain some design decisions that relate to providing OCR as a high-capacity web service, a scenario which presents specific challenges. Accuracy results are provided using an internal evaluation dataset, comparing against the Tesseract open-source system. Some limitations of the current approach are noted.


Abstract: Nowadays, there are still many handwritten historical documents in archives waiting to be transcribed and indexed. Since manual transcription is tedious and time-consuming, automatic transcription seems the path to follow. However, the performance of current handwriting recognition techniques is not perfect, so manual validation is mandatory. Given that crowdsourcing tasks are usually boring, we propose experiences based on gamification to increase the interest of users. Concretely, we propose to validate the automatic transcription via gamesourcing, through an application for Android mobile devices.

Attendee Information

You can enjoy some wine or beer at the welcome reception, which takes place at TUtheSky.

Registration Information

All participants need to register in order to attend the workshop. Admission to the workshop is not allowed without registration. Registration for the workshop, the tutorials and the social events is available online.

Register Here

In order to correctly organize the workshop (social events, goodies, ...), the regular registration process will close after Monday 16th April 2018. The registration fee (both regular and student) includes admission to the workshop, coffee-breaks, lunch, USB proceedings, banquet and welcome reception.

In order to include the papers in the proceedings, at least one author of each paper must complete the registration form by February 16th, 2018. Student participants presenting more than one paper in the workshop are required to pay the regular registration fee (not student fee). According to IAPR's policy, should an author have more than one paper accepted, only one registration is required for publication although other authors are encouraged to register and participate in the workshop.

Each paper (oral or poster) must be presented by an author at the workshop. Failure to present a paper during the workshop will likely result in withdrawal of the paper from the conference digital library (Xplore).

Registration Fees

For early planning purposes, all attendees are encouraged to complete online registration before February 2, 2018, to enjoy the early bird discount registration fee.
              Student*   IAPR    Standard
Early Bird**  400€       550€    600€
Regular       450€       600€    650€

* Master or PhD students
** until Feb 2, 2018

Social Events

The welcome reception takes place at TUtheSky (see picture above) which is in the heart of Vienna. The conference banquet will be held at the city hall (details will follow soon).  Don’t miss the announcement of Best Paper and Best Student Paper winners and join your DAS colleagues for a dinner of wining, dining, and shining examples of research quality.

Discussion Groups

According to a long-standing tradition at the DAS workshops, we will hold small-group discussions on topics of special interest to attendees. It is a nice opportunity to meet other researchers and discuss topics relevant to the community. Everyone is welcome to participate in the discussions. Moreover, each group needs a moderator and a scribe. Their roles are:

  • The moderator encourages everyone to speak and helps to focus and clarify the discussion.
  • The scribe takes written notes of the discussion and summarizes the results in a plenary session.

The moderator and the scribe will co-author a short summary report after the workshop is over, which will be posted on the workshop website and on the TC-11 website. The names of the moderators and scribes will also be listed on the website.

Please fill in this form, choosing the topics of your interest, and feel free to propose new ones.

We are looking forward to your participation! Please contact the Discussion Groups coordinators with your suggestions and ideas:


Travel Information

If you fly to Vienna, you will arrive at Vienna International Airport (VIE), located at the city border, about 20 minutes from the center by public transportation. Conference attendees can take the City Airport Train (CAT), which directly connects the airport with the city. Alternatively, you can take the bus (Vienna AirportLines) to Schwedenplatz or the S7 railway. Vienna has a very dense and efficient public transportation network, so it is easy to travel between the conference site, hotels and social venues; attendees will not need to rent a car. The network consists of five underground lines, trains, trams, and buses, and runs around the clock. The central part of the city is relatively compact, so you can reach most of Vienna’s attractions on foot.

Visa support letters

For those of you who require a visa support letter to attend the conference, please send us an e-mail: das2018@caa.tuwien.ac.at.

Austrian Airlines is Official Carrier of DAS 2018

If you book your flights to DAS 2018 online at www.austrian.com you can save 15% on all applicable fares. To do so, simply enter the following code in the eVoucher field on the Austrian homepage booking engine:

Code: DAS18

  • Booking period: now until April 27th, 2018
  • Valid for flights to Vienna and return from April 17th, 2018 until May 4th, 2018 (final date for the return flight) on flights operated by Austrian Airlines.

Book your flights here!

Travelling by train to Vienna

Vienna has direct connections to most European cities nearby, including Bratislava, Munich, Frankfurt, Budapest, Hamburg, Prague, Warsaw and overnight trains to Berlin, Venice, Rome, Warsaw and Zürich. Most long-distance trains run from the Hauptbahnhof (main station); many of these trains also serve Meidling. Some semi-fast services towards Salzburg start and end at Westbahnhof. Don’t confuse the main station (Hauptbahnhof) with the central station (Wien Mitte); the latter is only served by local and regional trains.

Travelling within Vienna

Vienna has an extensive public transport system (www.wienerlinien.at) that consists of five underground (U-Bahn) lines (U1, U2, U3, U4 and U6), trams and buses, which makes it easy to reach the conference venue, even if you are staying on the other side of the city. A single trip costs €2.20 and is valid on any reasonable route to your destination; changes are permitted. If you are going to spend time sightseeing or need to commute from your hotel, consider buying a pass. These are available for 24 hours (€7.60), 48 hours (€13.30) or 72 hours (€16.50). A weekly pass (€16.20) is even better value, but is only valid from Monday midnight (00:00hrs) to the following Monday, 9am. In common with many other European cities, Vienna’s public transport uses the honour system; the penalty for not having a valid ticket is €103. Note that many tickets require validation (stamping) before entering the platform; these are marked “Bitte entwerten/Please validate”.
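For readers weighing passes against single tickets, the break-even point for each pass (the number of single trips at which the pass becomes cheaper) follows directly from the prices listed above; a back-of-the-envelope sketch:

```python
SINGLE = 2.20  # single trip fare in euros (prices as listed above)
PASSES = {"24h": 7.60, "48h": 13.30, "72h": 16.50, "weekly": 16.20}

def break_even(pass_price: float, single: float = SINGLE) -> int:
    """Smallest number of single trips at which the pass is strictly cheaper."""
    return int(pass_price // single) + 1

for name, price in PASSES.items():
    print(f"{name} pass pays off from {break_even(price)} trips")
```

So, for example, a 24-hour pass is worthwhile from the fourth trip within a day.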

Transfer from the airport to the city center

The Vienna International Airport (VIE) in Schwechat is about 20 km southeast of Vienna. Taking a taxi directly at the airport is rather expensive (about €45), but you can get better value by pre-booking at airportdriver.at, flughafentaxi-wien.at (around €30), or myDriver.

Express trains (Railjet and Intercity) run at half-hourly intervals from the airport to the Hauptbahnhof (main railway station, line U1) and Meidling (U6) stations and take 15-18 minutes. If you are staying near the conference venue, you will probably want to use this service. A stopping service (S7) also runs across the city via Wien Mitte (Landstraße U3, U4), Praterstern (U1, U2) and Handelskai (U6), and connects with all underground lines; the travelling time to Wien Mitte is approx. 24 minutes. Regardless of which route you take, the fare within the integrated tariff system is €3.90 and this includes onward travel via subway, tram, bus, etc. to your destination in Vienna. If you already have a pass for Vienna, you need to purchase an extension ticket from the city boundary (ab Stadtgrenze, €2.20).

More information on the railway connections, including the timetable, is available in the following leaflet from the Austrian Federal Railways (ÖBB).

Premium services also include the City Airport Train (CAT, €12) and the Vienna Airport Lines buses (€8). The CAT runs non-stop to Wien Mitte, where it terminates; its main advantage is being able to check in luggage at the railway station (select airlines only) on the day of your return flight. Departures are at 06 and 36 minutes past the hour in both directions. Note that tickets purchased for these services are not valid for onward travel, nor on the regular trains if you miss your connection.



TU Wien
Institute of Computer Aided Automation
Computer Vision Lab

Karlsplatz 13
A-1040 Vienna, Austria
Phone: +43-1-58801-18376
e-mail: das2018@cvl.tuwien.ac.at


General Chairs

Robert Sablatnig (Austria)
Florian Kleber (Austria)

Publicity Chairs

Markus Diem (Austria)
David Doermann (USA)
Gernot A. Fink (Germany)

Discussion Group Chairs

Alicia Fornés (Spain)
Marcus Liwicki (Germany)

Publication Chair

Stefan Pletschacher (UK)

Program Chairs

Basilis Gatos (Greece)
Koichi Kise (Japan)
Dan Lopresti (USA)
Jean-Marc Ogier (France)

Tutorial Chairs

Michael Blumenstein (Australia)
Stefan Fiel (Austria)
Cheng-Lin Liu (China)

Program Committee

Adel Alimi (Germany)
Apostolos Antonacopoulos (UK)
Oliver Augereau (Japan)
Elisa H. Barney Smith (USA)
Abdel Belaid (France)
Vincent Christlein (Germany)
Hervé Déjean (France)
Andreas Dengel (Germany)
Rafael Dueire Lins (Brazil)
Véronique Eglin (France)
Jihad El-Sana (Israel)
Andreas Fischer (Switzerland)
Volkmar Frinken (USA)
Utpal Garain (India)
Lluis Gomez (Spain)
Venu Govindaraju (USA)
Masakazu Iwamura (Japan)
Motoi Iwata (Japan)
Dimosthenis Karatzas (Spain)
Bart Lamiroy (France)
Laurence Likforman-Sulem (France)
Josep Lladós (Spain)
George Louloudis (Greece)
Andreas Maier (Germany)
R. Manmatha (USA)
Simone Marinai (Italy)

Program Committee (continued)

Jean-Luc Meunier (France)
Guenter Muehlberger (Austria)
Masaki Nakagawa (Japan)
Premkumar Natarajan (USA)
Umapada Pal (India)
Shivakumara Palaiahnakote (Malaysia)
Thierry Paquet (France)
Vincent Poulain D'Andecy (France)
Ioannis Pratikakis (Greece)
Jean-Yves Ramel (France)
Oriol Ramos Terrades (Spain)
Marcal Rusinol (Spain)
Joan Andreu Sanchez (Spain)
Marc-Peter Schambach (Germany)
Srirangaraj Setlur (USA)
Faisal Shafait (Pakistan)
Fotini Simistira (Switzerland)
Nikolaos Stamatopoulos (Greece)
Karl Tombre (France)
Alejandro Toselli (Spain)
Seiichi Uchida (Japan)
Mauricio Villegas (Turkey)
Berrin Yanikoglu (Turkey)
Konstantinos Zagoris (Greece)
Richard Zanibbi (USA)

Author Information

Important Dates

All deadlines are at 11:59pm UTC

Full papers
  abstracts due: Nov 30, 2017
  manuscripts due: Dec 10, 2017
  acceptance notice: Jan 15, 2018
  camera ready: Mar 02, 2018

Short papers
  manuscripts due: Jan 24, 2018
  acceptance notice: Jan 31, 2018

Tutorial proposals
  proposals due: Dec 18, 2017

Important Links

Camera Ready Submissions

Full paper
Short paper


Full papers describe complete works of research (up to 6 pages). They will undergo a rigorous single-blind review process with a minimum of 3 reviews considering the originality of the work, the quality of research or analysis of experience, the relevance to document analysis systems, and the quality of presentation.

Short papers provide an opportunity to report on research in progress and to present novel positions on document analysis systems (up to 2 pages). Short papers will also undergo review and will appear in an extra booklet, not in the official DAS2018 proceedings.

Demonstrations of novel systems, to be presented to the workshop as a whole during breaks between paper sessions (up to 2 pages). Demonstration proposals will undergo review and abstracts of accepted demonstrations will appear in the extra DAS2018 booklet, as in the case of short papers.

Submission of a paper involves a commitment that, if accepted, one or more authors will attend the workshop. Templates and detailed formatting instructions will be made available here.


Camera Ready Submission

For accepted full papers, the camera-ready version should be submitted following the guidelines provided by IEEE. Authors of accepted short papers should prepare the camera-ready version, formatted according to the templates below, by updating their EasyChair submission. Both camera-ready versions must be submitted before March 02, 2018. Please read carefully the information provided on the following page.

Full paper submission
Short paper submission


All papers must conform to the IEEE Conference Proceedings format; please download the template here. Manuscript length: 6 pages maximum for full papers, 2 pages maximum for short papers. These are strict limits. Papers must be submitted in PDF format. Please make sure to use the IEEE Conference Proceedings template above.


DAS2018 is an IAPR Sponsored event and it requires that Authors comply with the IAPR Ethical Requirements. By submitting a paper to DAS2018, Authors implicitly certify their acceptance of those requirements and their full compliance with them.

By submitting a manuscript to DAS2018, authors acknowledge that it has not been previously published or accepted for publication in substantially similar form in any peer-reviewed venue, including journals, conferences and workshops. Furthermore, no paper substantially similar in content has been or will be submitted to a journal, another conference or a workshop during the review period (Dec 2017 - Jan 2018). Note that this definition does not count as publications Technical Reports or items in self-archive repositories (departmental, arXiv.org, etc.) that are put up without any form of direct peer review.

The DAS2018 proceedings (full papers) will be published through the IEEE Computer Society's Conference Publishing Services (CPS) and will later on be available through the IEEE-Xplore digital library.

Paper Presentation Instructions

Each paper appearing in the DAS technical program must be presented by one of its authors at the workshop. The organizers and the IEEE reserve the right to exclude a paper from distribution after the workshop (e.g., removal from IEEE Xplore) if the paper is not presented at the workshop by an author or a suitable proxy.

Oral Presentation

Each paper in an oral session is allocated 20 minutes. This includes the time required for the introduction of the speaker as well as time for questions from the audience; authors are therefore advised to prepare a 15-minute talk. The lecture room will be equipped with a laptop and a data projector; alternatively, authors may use their own laptops. Authors who want to use the organisers’ laptop need to copy their presentations to it well in advance of the session in which their paper is to be presented. Authors presenting in a specific session must be in the room and present themselves to the session chair at least 15 minutes before the start of the session (or before the start of the previous session if there is no break between sessions).

Poster Presentation

Each poster will be allocated a board which will fit an A0 poster in portrait orientation (i.e. 841 x 1189 mm or 33.1 x 46.8 in). An index of where each poster should be mounted will be available in the venue. Board mounting materials will be provided by the organisers. Poster presenters are required to prepare a 1-minute “teaser” announcement of their poster and present it in the corresponding session preceding their poster session. Teaser announcements must be emailed to das2018@caa.tuwien.ac.at by April 23rd. Download the for detailed instructions.

It is recommended that presenters mount their poster in the morning of the day of their poster session so that workshop participants can visit it at other times in addition to the allocated session. A presenting author is expected to be available in front of the poster for the entire duration of the poster session listed in the technical program. Posters must be removed at the end of the allocated poster session.