Stefan Weil edited this page Oct 5, 2023 · 84 revisions

Tesseract at UB Mannheim

The Mannheim University Library (UB Mannheim) uses Tesseract for optical character recognition (OCR) of historical German newspapers (Allgemeine Preußische Staatszeitung, Deutscher Reichsanzeiger). The latest results, with text from more than 700,000 pages, are available online.

Tesseract installer for Windows

We normally run Tesseract on Debian GNU/Linux, but we also needed a Windows version. That is why we built a Tesseract installer for Windows.

WARNING: Install Tesseract either in the directory suggested by the installer or in a new directory. The uninstaller removes the whole installation directory. If you installed Tesseract in an existing directory, that directory will be removed together with all its subdirectories and files.

The latest installer can be downloaded here:

Older versions for 32-bit and 64-bit Windows are also available.

We also provide documentation generated by Doxygen.

Tesseract models for historic prints

Historic books printed in Fraktur script are supported by the standard models frk and script/Fraktur. In addition, we trained our own models (Fraktur_5000000, frak2021) on a wide range of historic books. They are available online. In many cases these models give better results than the standard models.
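As a rough illustration of how one of these models might be selected, the following Python sketch builds a Tesseract command line and runs it. It assumes that the `tesseract` executable can be found and that the model has been saved as `<model>.traineddata` in the tessdata directory. The helper names `build_command` and `recognize` and the file names in the usage example are hypothetical; only the `-l` and `--tessdata-dir` options come from the Tesseract command line itself.

```python
import subprocess


def build_command(image, output_base, model="frak2021", tessdata_dir=None):
    """Return the argument list for one Tesseract run (hypothetical helper).

    model is the name passed to -l, for example "frk", "script/Fraktur",
    "Fraktur_5000000" or "frak2021". It is assumed that a matching
    .traineddata file exists in the tessdata directory.
    """
    command = ["tesseract", str(image), str(output_base), "-l", model]
    if tessdata_dir is not None:
        # Point Tesseract at a directory that holds the downloaded models.
        command += ["--tessdata-dir", str(tessdata_dir)]
    return command


def recognize(image, output_base, model="frak2021", tessdata_dir=None):
    """Run Tesseract; the text is written to <output_base>.txt."""
    subprocess.run(
        build_command(image, output_base, model, tessdata_dir),
        check=True,  # raise CalledProcessError if Tesseract fails
    )


if __name__ == "__main__":
    # Only print the command here; call recognize() to actually run OCR.
    print(" ".join(build_command("page.png", "page", model="frak2021")))
```

Running the same page once with a standard model and once with one of our models, then comparing the two text files, is a simple way to see which model suits a given print.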

History

  • 2023-10-05 Update Tesseract 5.3.3.
  • 2023-04-01 Update Tesseract 5.3.1. Now uses msys packages. 32 bit installer is no longer provided.
  • 2022-12-22 Update Tesseract 5.3.0. Now signed with new key.
  • 2022-12-14 Update Tesseract 5.3.0-rc1.
  • 2022-07-12 Update Tesseract 5.2.0. Fixed support for image URL with https.
  • 2022-07-08 Update Tesseract 5.2.0. Broken support for image URL with https.
  • 2022-05-10 Update Tesseract 5.1.0.
  • 2022-01-18 Update Tesseract 5.0.1. Fixed model download.
  • 2022-01-07 Update Tesseract 5.0.1. Model download is broken.
  • 2021-12-01 Update Tesseract 5.0.0.
  • 2021-10-30 Update Tesseract 5.0.0 release candidate 1.
  • 2021-08-11 Update Tesseract 5.0.0 (alpha). Faster (uses 32 bit float instead of 64 bit double).
  • 2021-05-06 Update Tesseract 5.0.0 (alpha). Now supports image URL with https.
  • 2020-11-27 Update Tesseract 5.0.0 (alpha).
  • 2020-03-28 Update Tesseract 5.0.0 (alpha).
  • 2020-02-23 Update Tesseract 5.0.0 (alpha).
  • 2019-10-30 Update Tesseract 5.0.0 (alpha). Added support for OCR from URL. Fixed installation for Lao traineddata.
  • 2019-10-10 Update Tesseract 5.0.0 (alpha). Uninstall no longer recursively removes the installation directory.
  • 2019-07-08 Update Tesseract 5.0.0 (alpha). Supports result output on Windows command line.
  • 2019-06-23 Update Tesseract 5.0.0 (alpha). Supports Windows XP again. Much faster (removed OpenMP).
  • 2019-05-26 Update Tesseract 5.0.0 (alpha).
  • 2019-05-09 Special edition for #elag2019. Training executables which require ICU fail.
  • 2019-03-17 Special edition for #bibtag19.
  • 2019-03-14 Update Tesseract 4.1.0 (RC1). Added support for ALTO output. Missing ICU DLL for training.
  • 2018-10-30 Update Tesseract 4.0.0.
  • 2018-10-24 Update Tesseract 4.0.0 (RC4).
  • 2018-10-14 Update Tesseract 4.0.0 (RC3).
  • 2018-10-10 Update Tesseract 4.0.0 (RC2).
  • 2018-10-02 Update Tesseract 4.0.0 (RC1).
  • 2018-09-17 Fixed the previous 64 bit installer by adding two missing DLL files.
  • 2018-09-12 Update Tesseract 4.0.0. Mainly bug fixes, see list of commits. For the 64 bit installation, some executables don't work because of missing DLL files.
  • 2018-06-21 Update Tesseract 3.05.02. Also updates the DLL files.
  • 2018-06-08 Update Tesseract 4.0.0. Fix ICU DLL files for 64 bit installer.
  • 2018-04-14 Update Tesseract 4.0.0. Also updates some DLL files. Now also with 64 bit installer.
  • 2018-01-09 Update Tesseract 4. Also updates some DLL files.
  • 2017-08-04 Update Tesseract 4. Now supports best traineddata.
  • 2017-06-02 Update Tesseract 3.05.01.
  • 2017-05-10 Update Tesseract 3.05.00 (+ later fixes). Removed buggy setting of PATH.
  • 2017-05-10 Update Tesseract 4. Now includes AVX support.
  • 2017-02-16 Update Tesseract 4. Fixed broken AVX support.
  • 2017-02-02 Update Tesseract 4. Removed broken AVX support.
  • 2017-01-30 Update Tesseract 4, added new training tools. AVX support is not working.
  • 2016-11-29 First version with LSTM (still experimental).
  • 2016-11-11 Update with latest bug fixes.
  • 2016-08-31 Update with latest bug fixes for text2image.
  • 2016-08-28 Update with latest bug fixes.
  • 2016-07-11 TIFF warnings are now shown on the console (no longer disturbing message windows).
  • 2016-05-13 The new installer now includes the executables needed for training, too. It is based on the latest Tesseract sources.

Hint: Old versions of the installer offered an option to add Tesseract to the PATH environment variable. That option was disabled by default. If it was enabled and PATH was very long, the new PATH could end up empty. We recommend not using that option and have disabled it in our latest version.
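Because the installer does not modify PATH, scripts may need to locate the executable themselves. The following Python sketch shows one possible approach: it looks on PATH first and then falls back to a list of candidate directories. The directory `C:\Program Files\Tesseract-OCR` is an assumption about a typical installation and should be adjusted to the directory chosen during setup. The helper name `find_tesseract` is hypothetical.

```python
import os
import shutil

# Assumed installation directory; replace it with the one chosen during setup.
ASSUMED_DIRS = [r"C:\Program Files\Tesseract-OCR"]


def find_tesseract(name="tesseract", extra_dirs=ASSUMED_DIRS):
    """Return the path of the executable, or None if it cannot be found."""
    # First try the regular PATH lookup.
    found = shutil.which(name)
    if found:
        return found
    # Then try the candidate directories, with and without the .exe suffix.
    for directory in extra_dirs:
        for filename in (name, name + ".exe"):
            candidate = os.path.join(directory, filename)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    print(find_tesseract() or "tesseract not found")
```

The returned path can be used in place of the bare program name when starting Tesseract from a script, so PATH does not have to be changed at all.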