Digitization in “text” format was born in the United States in 1971 on the initiative of Michael Hart, a student at the University of Illinois. On July 4, 1971, the U.S. national holiday, he typed the Declaration of Independence of the United States (signed July 4, 1776) on the keyboard of his computer. The text was entered entirely in capital letters, the only characters then commonly available on computer systems.
In 1989, he launched Project Gutenberg with the ambition of digitizing the literary heritage. The project, which counted no more than ten texts in its infancy, had reached more than 45,000 documents by 2014. Digitization was done at first by retyping the texts at the keyboard, and later with document scanners. The books digitized are texts in the public domain, not protected by copyright.
The particularity of this initiative lies in its use of rudimentary technologies that produce a simple representation of the data. This avoids the risk of incompatibility with future systems.
In the wake of Project Gutenberg, similar projects have emerged.
The Universal Library, a project of the Conservatoire national des arts et métiers (CNAM) in France, inactive since 2002, reproduced about a hundred texts that are still available.
More ambitious, the project initially called “Project Sourceberg” aims to create a multilingual digital library. Supported by the Wikimedia Foundation, it offers free access to texts, without advertising, built by volunteers using wiki technology. On December 26, it officially changed its name to “Wikisource”. It now offers 55,715 free texts.
Comparable projects rely on a different strategy: optical character recognition (OCR). Starting from a reproduction of the page produced by a scanner, in image form, the text is reconstructed automatically. This technique allows switching from image mode to text mode, but it has the disadvantage of leaving many recognition errors, which must then be corrected by proofreading.
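As an illustration of the proofreading step that follows OCR, the sketch below applies a few pattern-based corrections to raw OCR output. The specific confusions shown (the letter “O” read in place of the digit “0”, “Tbe” for “The”) are hypothetical examples of typical recognition errors, not the correction rules of any actual project.

```python
import re

# Hypothetical examples of common OCR confusions (illustrative only).
CORRECTIONS = [
    (r"\bTbe\b", "The"),   # "h" misread as "b" in "The"
    (r"(?<=\d)O", "0"),    # letter O following a digit is likely a zero
    (r"(?<=\d)l", "1"),    # letter l following a digit is likely a one
]

def post_correct(ocr_text: str) -> str:
    """Apply simple pattern-based fixes to raw OCR output."""
    for pattern, replacement in CORRECTIONS:
        ocr_text = re.sub(pattern, replacement, ocr_text)
    return ocr_text

raw = "Tbe Declaration was signed on July 4, 1776, page 1O"
print(post_correct(raw))
# → The Declaration was signed on July 4, 1776, page 10
```

Real projects apply far larger rule sets and dictionaries, and still rely on human proofreaders for the errors no pattern can catch.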