- HTK - Hidden Markov Model Toolkit
- Implementation of Bidirectional Long Short-Term Memory Networks (BLSTM) combined with Connectionist Temporal Classification (CTC), including examples for Arabic recognition
- SRILM - A Toolkit for generating language models
- Torch5 - A Toolkit for HMM and GMM and many other machine learning algorithms
- uptools: Tools for reading and processing files in the UNIPEN file format.
- Comparison Tools for Handwriting Recognizers using the UNIPEN format (Gene Ratzlaff, IBM)
- HUE: a software toolkit which supports the rapid development and re-use of handwriting and document analysis systems (Univ. of Essex, UK).
- OCRopus - The OCRopus(tm) open source document analysis and OCR system
- NHocr - OCR engine for Japanese language
- Public domain OCR software (Univ. of Maryland, USA)
- Source code at the DIMUND server (Univ. of Maryland, USA)
- Optical Character Recognition sources
- RWTH OCR - The RWTH Aachen University Optical Character Recognition System
Pixels vs Vectors
- AutoTrace: bitmap-to-vector conversion
- SVMlight: A well-designed, lightweight package for experimenting with the support-vector classifier. Several kernel functions are supported. Uses ASCII data files.
- SVM Torch-II is a new implementation of Vapnik's Support Vector Machine that works both for classification and regression problems, and that has been specifically tailored for large-scale problems (such as more than 20000 examples, even for input dimensions higher than 100).
- Discrete-HMM kernel in C++: Originally developed for speech recognition, this generic package (ASCII data files) allows quick experimentation with discrete hidden Markov modeling. The main program handles a single HMM, so multi-class recognition can be realized with (Unix) scripts.
- AutoClass: An unsupervised Bayesian classification program (NASA). Some data modeling (e.g., specifying all feature scale types) and structuring of the (ASCII) files is required.
- PCA: Principal Components Analysis, a compact single main program written in C. Reads ASCII input files.
- SMART 11.0: A package implementing the keyword vector-space approach for IR as introduced by Salton (1961). Source code is for SunOS, but has been ported to Linux by several groups. Extensive documentation is available on the web.
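The Discrete-HMM entry above notes that one model is handled per run and that multi-class recognition is left to scripts. As a rough sketch of what such a script does, and not the package's own code, the Python below scores a symbol sequence against one discrete HMM per class with the scaled forward algorithm and picks the best-scoring class. The function names, the `MODELS` table and its toy probabilities are all hypothetical and chosen only for illustration.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (scaled forward algorithm).

    obs   : non-empty list of symbol indices
    start : start[i]    = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(symbol k | state i)
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p

def classify(obs, models):
    """Pick the class label whose HMM gives obs the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))

# Hypothetical toy models over the symbols {0, 1}: (start, trans, emit).
MODELS = {
    "class_a": ([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], [[0.9, 0.1], [0.8, 0.2]]),
    "class_b": ([0.5, 0.5], [[0.7, 0.3], [0.3, 0.7]], [[0.1, 0.9], [0.2, 0.8]]),
}

if __name__ == "__main__":
    print(classify([0, 0, 1, 0], MODELS))
```

In a script-based setup, each class model would instead be trained and scored by a separate run of the HMM program, with the script comparing the reported likelihoods.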
Tools for (linguistic) post processing
- Word lists of a few Western languages.
- Link Grammar 4.1: A parser for English, written in C, by Temperley, Sleator and Lafferty at Carnegie Mellon.
- Ontolingua: Semantic modeling tool on the WWW by Stanford University. There is a European mirror site. Ontologies can be exported in a number of formats, including KIF, CLIPS, Loom and Prolog. This is a generic tool, but it can be used for content-related or document-related modeling in the context of machine reading.
- Algoval: Internet-based algorithm evaluation. Several benchmarks in the area of TC-11 are already present (digit recognition, dictionary search, region-of-interest (ROI) detection). Algorithms in Java can be uploaded and compared (Simon Lucas, Univ. of Essex).
- PinkPanther: document-segmentation benchmarking.
Learning and Optimization
The software packages mentioned on this page are, mostly and preferably, available in source-code format (C, C++, Tcl/Tk, Java) and require standard ASCII input files. Please do not hesitate to send me a hint about free source code in the area of text processing on the Internet.