/*
 * readme.txt
 * ----------
 *
 * Description of program "phonet.c"  ("Hannoveraner Phonetik").
 *
 * Copyright (c):
 * 1999-2008:  Joerg MICHAEL, Adalbert-Stifter-Str. 11, 30655 Hannover, Germany
 *
 * SCCS: @(#) readme.txt  1.5  2008-11-30
 *
 * This file is subject to the GNU Lesser General Public License (LGPL)
 * (formerly known as GNU Library General Public Licence)
 * as published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 * This file is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * You should have received a copy of the GNU Library General Public License
 * along with this file; if not, write to the    
 * Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *
 * Actually, the LGPL is __less__ restrictive than the better known GNU General 
 * Public License (GPL). See the GNU Library General Public License or the file 
 * LIB_GPLP.TXT for more details and for a DISCLAIMER OF ALL WARRANTIES.
 *
 * There is one important restriction: If you modify this program in any way 
 * (e.g. add or change phonetic rules or modify the underlying logic or 
 * translate this program into another programming language), you must also 
 * release the changes under the terms of the LGPL.
 * That means you have to give out the source code to your changes,
 * and a very good way to do so is mailing them to the address given below.
 * I think this is the best way to promote further development and use
 * of this software.
 *
 * If you have any remarks, feel free to e-mail to:  
 *     ct@ct.heise.de
 *
 * The author's email address is:
 *    astro.joerg@googlemail.com
 */


This file contains some important information, which cannot be found 
in the German "c't" article.

Index:

1. Overview on the program "phonet"
2. Syntax of phonetic rules
3. Hash algorithms
4. Consistency check

5. Frequently Asked Questions
    a) Is it possible to use this program in a commercial software project?
    b) Is a Java version of this program available?
    c) What is the speed of this program?
    d) How did the author develop his phonetic rules? 
    e) Why did the author develop his own version of regular expressions?

6. Java version of "phonet"
7. Perl version of "phonet"
8. WWW resources for this program
9. References
    a) Phonetic conversion
    b) Checking and correcting first names
    c) Error-tolerant database selects (for mail addresses)
    d) Error-tolerant search routines
    e) Levenshtein function
10. Checking and correcting German mail addresses
11. History of the program


========================================================================


Overview on the program "phonet"


The program "phonet" is designed for phonetic string conversion.
Functionally, the program consists of three main parts:
a) A syntax for context-dependent phonetic rules that can be parsed 
   by the function "phonet", which is the main "engine" of the program.
b) Phonetic rules for one or more languages (in the file "phonet.h").
c) A check function that checks all phonetic rules for consistency.

List of source files:
a)  ph_ext.h   (contains macros and prototypes; may be changed)
b)  umlaut_p.h (contains lists of umlauts)
c)  phonet.h   (contains all phonetic rules)
d)  phonet.c   (this is the "workhorse" of the program)

The file "phonet.h", which contains all phonetic rules, uses the 
char set "iso8859-1".

If you want to use "phonet.c" as a library, delete the line
"#define PHONET_EXECUTABLE"  from the file "ph_ext.h".
(Note: This will also disable the function "check_rules", which is needed
solely for development purposes.)

Notice:
The exe file included in this download is a DOS exe, so you have to obey
the rules for 8.3 filenames under DOS.


========================================================================


Syntax of phonetic rules


The syntax for phonetic rules is as follows:
   <search_string>  <1.rule>  <2.rule>
Syntax for search strings:
   <word> [<->..] [<] [<0-9>] [^[^]] [$]

Constraints:
a) All phonetic rules must be written in upper case. 
b) The end of "word" may contain one optional simple regular expression:
   an array of letters (or umlauts) enclosed in '(' and ')'.
c) Rules with a '<' demand that the replacement string may not be longer 
   than the search string.
d) The placement of rules determines their priority. Therefore, the rules
   for "SH" must be placed before the rules for "S" 
   (otherwise, a conversion error will occur for "SH").

Note that although the tokens '^' and '$' look like common unix regular 
expressions, their meaning is not exactly the same.
The difference is important if you convert texts consisting of more than
one word.
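
The order of tokens in a search string can be illustrated with a small 
parser. This is only a sketch of the grammar line shown above, not the 
code used in "phonet.c": the struct "search_parts", the function 
"parse_search_string" and the example strings are hypothetical, and the 
sketch checks the token order only, without interpreting what the 
tokens mean.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: the parts of one search string. */
struct search_parts {
    size_t word_len;   /* length of <word>, incl. an optional "(...)" array */
    int    minus;      /* number of '-' characters                          */
    int    has_less;   /* 1 if '<' is present                               */
    int    priority;   /* digit 0-9, or -1 if absent                        */
    int    carets;     /* number of '^' characters (0, 1 or 2)              */
    int    has_dollar; /* 1 if '$' is present                               */
};

/*
 * Split a search string according to
 *    <word> [<->..] [<] [<0-9>] [^[^]] [$]
 * Returns 0 on success and -1 if the tokens are missing or out of order.
 */
static int parse_search_string(const char *s, struct search_parts *p)
{
    const char *q = s;

    assert(s != NULL && p != NULL);

    /* <word>: everything up to the first modifier character */
    while (*q != '\0' && strchr("-<0123456789^$()", *q) == NULL) {
        q++;
    }
    if (q == s) {
        return -1;                       /* empty word */
    }
    /* optional array of letters at the end of the word */
    if (*q == '(') {
        const char *close = strchr(q, ')');
        if (close == NULL || close == q + 1) {
            return -1;                   /* unterminated or empty array */
        }
        q = close + 1;
    }
    p->word_len = (size_t)(q - s);

    p->minus = 0;
    while (*q == '-') { p->minus++; q++; }

    p->has_less = (*q == '<');
    if (p->has_less) q++;

    p->priority = -1;
    if (isdigit((unsigned char)*q)) { p->priority = *q - '0'; q++; }

    p->carets = 0;
    while (*q == '^' && p->carets < 2) { p->carets++; q++; }

    p->has_dollar = (*q == '$');
    if (p->has_dollar) q++;

    return (*q == '\0') ? 0 : -1;        /* anything left over is an error */
}
```

For the made-up search string "SCH(AEIOU)-<3^$", this sketch reports a 
word of length 10, one '-', a '<', priority 3, one '^' and a '$'. 
A string like "A$^" is rejected because '^' must not follow '$'.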


========================================================================


Hash algorithms


This program contains two hash algorithms. The second hash algorithm 
has been implemented in version 1.3, thereby tripling the speed of the
function "phonet".

Both of them require that all phonetic rules be sorted by first char, 
but the second one also uses the second char (if the second "char" is
an array, every letter in the array is evaluated).

Hence, the sorting of rules can significantly influence the performance
of the program. 

While the sorting order for the first char is irrelevant, the sorting 
order for the second char should be:

1. "normal" letters (i.e. 'A' - 'Z')
2. umlauts 
3. all other chars (e.g. '.').
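
Why sorting by first char helps can be shown with a minimal sketch. 
It is not the hash code from "phonet.c"; it only assumes that rules with 
the same first char are stored next to each other, so that a table with 
256 entries can point to the start of each group. The names 
"build_first_char_index", "rules_to_inspect" and "toy_search" as well as 
the rule strings are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_RULE (-1)

/* Hypothetical search strings, grouped by their first char. */
static const char *const toy_search[] = { "A", "AH", "SCH", "SH", "S", "Z" };

/*
 * Fill first_rule[c] with the index of the first rule whose search
 * string starts with byte c, or with NO_RULE if there is no such rule.
 */
static void build_first_char_index(const char *const *search, int n_rules,
                                   int first_rule[256])
{
    int i;

    assert(n_rules >= 0);
    for (i = 0; i < 256; i++) {
        first_rule[i] = NO_RULE;
    }
    for (i = 0; i < n_rules; i++) {
        unsigned char c = (unsigned char)search[i][0];
        if (first_rule[c] == NO_RULE) {
            first_rule[c] = i;
        }
    }
}

/*
 * Return how many rules must be inspected at the start of "text":
 * only the group of rules that share the first char of the text.
 */
static int rules_to_inspect(const char *const *search, int n_rules,
                            const int first_rule[256], const char *text)
{
    unsigned char c = (unsigned char)text[0];
    int i = first_rule[c];
    int count = 0;

    if (i == NO_RULE) {
        return 0;
    }
    while (i < n_rules && (unsigned char)search[i][0] == c) {
        count++;
        i++;
    }
    return count;
}
```

With the six toy rules, a text starting with 'S' needs only the three 
"S" rules to be inspected, and a text starting with 'B' needs none. 
A second table indexed by the second char would narrow the group further 
in the same way.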


========================================================================


Consistency check


If you add or modify some phonetic rules, you should check them for 
consistency with the function "check_rules". Due to the high number 
of rules and all their mutual dependencies, a manual check would be 
virtually "hopeless".

Rule checking involves several steps which are done for search and 
replacement strings of every rule. First, a syntax check is done, which 
verifies the correct syntax of search strings (e.g. correct sequence 
of '-', '<', priority, '^' and '$').

Then, both the search string and the replacement string are converted by 
the function "phonet". In both cases, the result must be identical to the 
replacement string. This finds all errors that stem from a wrong order of 
rules or from overlooked dependencies. Alas, this method is sometimes too 
"narrow-minded" and outputs warnings that can be ignored. Some of these 
have been included as exceptions in the function "check_rules".
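
The idea of the second step can be sketched with a toy converter. 
This is not the code of "check_rules": the types and functions below 
("toy_rule", "toy_convert", "count_inconsistent_rules") and the two rule 
tables are hypothetical, and the toy converter knows nothing about 
priorities, '-', '<', '^' or '$'. It simply applies the first matching 
rule at every position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy rule: a search string and its replacement. */
struct toy_rule {
    const char *search;
    const char *replace;
};

/* Same two rules in a correct and in a wrong order. */
static const struct toy_rule good_order[] = { { "SH", "X" }, { "S", "Z" } };
static const struct toy_rule bad_order[]  = { { "S", "Z" }, { "SH", "X" } };

/*
 * Convert src into dest (capacity dest_size). At every position the
 * rule table is scanned from top to bottom and the first matching rule
 * is applied. Characters without a matching rule are copied unchanged.
 */
static void toy_convert(const struct toy_rule *rules, size_t n_rules,
                        const char *src, char *dest, size_t dest_size)
{
    size_t out = 0;

    assert(dest_size > 0);
    while (*src != '\0') {
        const char *rep = NULL;
        size_t used = 1;
        size_t i;

        for (i = 0; i < n_rules; i++) {
            size_t len = strlen(rules[i].search);
            if (len > 0 && strncmp(src, rules[i].search, len) == 0) {
                rep = rules[i].replace;
                used = len;
                break;                       /* first match wins */
            }
        }
        if (rep == NULL) {
            if (out + 1 < dest_size) dest[out++] = *src;
        } else {
            while (*rep != '\0' && out + 1 < dest_size) dest[out++] = *rep++;
        }
        src += used;
    }
    dest[out] = '\0';
}

/*
 * Count the rules for which converting the search string or the
 * replacement string does not yield the replacement string.
 */
static int count_inconsistent_rules(const struct toy_rule *rules,
                                    size_t n_rules)
{
    char out[64];
    int bad = 0;
    size_t i;

    for (i = 0; i < n_rules; i++) {
        int ok;

        toy_convert(rules, n_rules, rules[i].search, out, sizeof out);
        ok = (strcmp(out, rules[i].replace) == 0);
        toy_convert(rules, n_rules, rules[i].replace, out, sizeof out);
        ok = ok && (strcmp(out, rules[i].replace) == 0);
        if (!ok) bad++;
    }
    return bad;
}
```

For "good_order" the sketch reports no inconsistent rule. For "bad_order" 
it reports one: the rule for "S" is applied first, so "SH" is converted 
to "ZH" instead of "X". This is the kind of ordering error that 
a manual check easily misses.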


========================================================================


Frequently Asked Questions


a)
Is it possible to use this program in a commercial software project?

This is exactly what the Library GPL has been designed for.
See the file "LIB_GPLP.TXT" for details.

Probably the "safest" way to comply with the rules of the LGPL is 
to put the program "phonet" in a separate library (e.g. Windows-DLL).
In this way, only the phonetic library is subject to the LGPL and 
the "rest" of the project still has the "old" rights of its owner, 
so you don't have to give away your source code.


b)
Is a Java version of this program available?

Yes, a native Java version is available.
See:  https://opensource.softmethod.de/trac/opensource
      and click on "phonet4j".

Alternatively, you can also write a wrapper class in Java which uses 
the Java Native Interface to call a C library.
There is an excellent article (in German) telling you how to do it:
"Kaffee mit Vitamin C", c't, issue 20/2000, pp.242-247
or:  www.heise.de/ct, soft-link 0020242. 


c)
What is the speed of this program?

Due to the consistent use of very fast pointer operations (e.g. the 
comparison "s1 == s2") instead of relatively slow string functions 
(e.g. "strcpy"), the C code runs very fast even on an old 486 notebook.

In order to get measurable running times, you usually have to do thousands 
or even tens of thousands of phonetic conversions.
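
A simple way to obtain such a measurement is a loop around the 
conversion, timed with the standard C function "clock". The following 
is a minimal sketch: the type "convert_fn" and the functions 
"time_conversions" and "copy_convert" are hypothetical, and 
"copy_convert" is only a stand-in that has to be replaced by a call to 
the real conversion function.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the conversion routine under test. */
typedef void (*convert_fn)(const char *src, char *dest, size_t dest_size);

/*
 * Call "convert" on "word" "repetitions" times and return the elapsed
 * CPU time in seconds, as measured by clock().
 */
static double time_conversions(convert_fn convert, const char *word,
                               long repetitions)
{
    char dest[256];
    clock_t start, stop;
    long i;

    assert(repetitions >= 0);
    start = clock();
    for (i = 0; i < repetitions; i++) {
        convert(word, dest, sizeof dest);
    }
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Stand-in for the real conversion: copies the input unchanged. */
static void copy_convert(const char *src, char *dest, size_t dest_size)
{
    size_t i = 0;

    if (dest_size == 0) {
        return;
    }
    while (src[i] != '\0' && i + 1 < dest_size) {
        dest[i] = src[i];
        i++;
    }
    dest[i] = '\0';
}
```

A call like time_conversions(copy_convert, "MUELLER", 100000L) returns 
the total time for 100000 conversions; dividing by the number of 
repetitions gives the time per conversion. With an optimizing compiler, 
a trivial stand-in may be optimized away, so the numbers are only 
meaningful for the real conversion function.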


d)
How did the author develop his phonetic rules? 

As a start, the author adopted a rule set from an old article in c't 
(G. Wilde, C. Meyer: "Nicht wörtlich genommen", c't, 10/1988, pp.126-131
 - see chapter "References"), which contained about 30 rules.

These rules were relatively crude and one of them was even found to be 
faulty. All other rules were developed by the author.

Several approaches have been combined for the final development of the 
German rules:

- Because one of the most common applications for phonetic analysis is 
  searching in address databases, several rules for common first names 
  and common family names have been added.
- Since the author, of course, had a "natural intuition" for difficult 
  words or missing exceptions, any such word that he encountered was noted 
  and checked with "phonet".
- For dictionary applications, several rules for the new German orthography
  (e.g. "viel versprechend" vs. "vielversprechend") have been added.
- The final "brute force proofreading" has been done using the "Duden". 
  Difficult first letters like 'C' and (to my surprise) 'V' were checked
  most thoroughly.


e)
Why did the author develop his own version of regular expressions?

From the start, high performance was one of the main goals, and this 
requires a good hash algorithm and efficient parsing of the phonetic rules. 
If you do a "grep" with (e.g.) 500 regular expressions, the speed will be 
quite slow.
Secondly, using common regular expressions would mean that the rules could 
not use priorities, '<' or '-', thereby inflating the number of rules.

During the development of the program, the first syntax ideas were 
priorities, '^' and '$'. 
Later on, minus chars ('-'), '<' and arrays of letters (e.g. "(XYZ)") were 
added to the syntax to curb the number of rules.


========================================================================


Java version of "phonet"


Due to the efforts of Andreas Meyer and his team at company Softmethod, 
a native Java version of "phonet" is now available. This implementation 
is also subject to the GNU Lesser General Public License (LGPL).

See:  https://opensource.softmethod.de/trac/opensource
      and click on "phonet4j".


========================================================================


Perl version of "phonet"


Due to the efforts of Michael Maretzke from Muenchen (Munich), Germany,
a Perl version of phonet is now available. This version is also subject 
to the GNU Lesser General Public License (LGPL).

If you have any questions concerning the Perl version, please e-mail
Michael Maretzke (michael@maretzke.de).

To install the Perl version, uncompress the file "phonet.tar.gz" 
with "tar xzf phonet.tar.gz" (works at least under Solaris and Linux)
or with "winzip". Then, follow the instructions in the accompanying 
file "readme_perl.txt".

In fact, the Perl version calls the C program, thereby avoiding 
porting errors. As a further advantage, you probably do not 
have to worry about running times.


========================================================================


WWW resources for this program


Program and article are available from: 
http://www.heise.de/ct/ftp/99/25/252

Java version of "phonet":
Go to https://opensource.softmethod.de/trac/opensource
      and click on "phonet4j".



Dictionary of first names (program "gender"):

http://www.heise.de/ct/ftp/07/17/182
or  http://www.heise.de/ct, soft-link 0717182  (please use version 1.2 or higher)



Error-tolerant database selects (program "addr", 
incl. "phonetic" Levenshtein function):

http://www.heise.de/ct/ftp/07/20/214
or  http://www.heise.de/ct, soft-link 0720214


========================================================================


References


a)
Phonetic conversion:

G. Wilde, C. Meyer: Nicht wörtlich genommen, "Schreibweisentolerante" 
Suchroutinen in dBase, c't, issue 10/1988, pp. 126-131 
[article on "soundex" and a soundex version called "phonem"].

J. Michael: Doppelgänger gesucht, Ein Programm für kontextsensitive 
phonetische Textumwandlung, c't, issue 25/1999, pp. 252-261
["Hannoveraner Phonetik"].


b)
Checking and correcting first names:

J. Michael: 40000 Namen, Anredebestimmung anhand des Vornamens, 
c't, issue 17/2007, pp. 182-183 [current dictionary size: 44000+ entries].


c)
Error-tolerant database selects:

J. Michael: Von Hinz und Kuntz, Ein Programmpaket zur fehlertoleranten 
Anschriftensuche, c't, issue 20/2007, pp. 214-219.


d)
Error-tolerant search routines:

U. Manber, S. Wu: Approximate Pattern Matching: Agrep finds patterns 
even when you can't remember the exact spelling, Byte, issue 11/1992, 
pp. 281-292.

G. Gronek: Ähnlichkeiten gesucht, Fehlertoleranter Suchalgorithmus 
"Shift-AND", c't, issue 05/1995, pp. 294-301 [article on "agrep"].

R. Rapp: Text-Detektor, Fehlertolerantes Retrieval ganz einfach, 
c't, issue 04/1997, pp. 386-392 [article on trigrams].


e)
Levenshtein function:

Vladimir I. Levenshtein: Binary Codes Capable of Correcting Deletions, 
Insertions and Reversals, Soviet Physics Doklady, vol. 10, pp. 707-709 
(1965).

G. Ebner: Wort-Arithmetik, Phonetische Ähnlichkeiten mit der 
Levenshtein-Distanz errechnet, c't, issue 07/1989, pp. 192-208.

J. Michael: Joker im Spiel, Erweiterung der Levenshtein-Funktion 
auf Wildcards, c't, issue 03/1994, pp. 230-239.

J. Michael: Von Hinz und Kuntz, Ein Programmpaket zur fehlertoleranten 
Anschriftensuche, c't, issue 20/2007, pp. 214-219.


========================================================================


Checking and correcting German mail addresses


(in German:)  Anschriftenprüfung und -korrektur


The author of "phonet" and "addr" has developed a C program for checking 
and correcting German postal addresses.
Here, an address means the combination of street, postal code (PLZ) and 
town name.

As you would expect, the quality and the speed of the program are 
very good.

In a comparison with a commercially available standard application, 
this program was not only faster, but also clearly superior at error 
correction.

The program is intended for commercial use.
If you are interested in the program, please contact the author at 
"astro.joerg@googlemail.com".


Quality of error correction:

A special feature of this program is that it can correct not only 
single errors (which is still easy), but also double errors and 
(if the similarity is high enough) even triple errors 
(i.e. street, postal code and town are all wrong).

Examples 
(to anonymize them, all house numbers have been set to "1"):

Street, postal code and town wrong, but correctable:
a) Feuersicht 1, 31295 Stobrenau

Other kinds of errors (a correction is suggested):
b) Neuhof 1, 35792 Löhnberg-Niedershausen
c) Maxstr. 1, 83274 Traunstein
d) Klerstr. 1, 38120 Braunschweig
e) Dorfstr. 1, 07806 Neunhofen

========================================================================


History of the program


1998-09-01:  Start of program development.
1999-03-30:  The first version of this program is submitted for 
             publication in a German computer magazine (c't).

1999-12-06:  Version 1.0 is published in c't (issue 25/1999, pp.252-261).
1999-12-22:  Due to some problems in VC++ 6.0 and gcc, all phonetic rules 
             were converted to upper case
             (The functions "initialize_phonet" and "check_rules" were 
              adapted).

2000-01-04:  Multi-language support (for natural languages, e.g. German)
             has been added, and all comments were translated into English.
2000-01-10:  Version 1.1 of the program is placed under the GNU Lesser
             General Public License.
2000-05-20:  Some phonetic rules and an FAQ list have been added.
2000-06-06:  Function "check_rules" outputs the total number of rules.

2000-09-26:  Version 1.2:
             A Perl version is available, 
             and some phonetic rules have been added.

2000-11-14:  Version 1.3:
             A second hash algorithm has been implemented.

2001-04-28:  Phonetic rules for all "missing" iso-umlauts have been added,
             and the macro "CHAR" has been deleted.
2001-05-26:  Some phonetic rules have been added.

2002-01-18:  Some phonetic rules have been added.
2003-08-10:  Some phonetic rules have been added.

2005-06-11:  Version 1.4:
             Phonetic rules for "no language" have been implemented.

2007-05-17:  Some phonetic rules have been added.
2007-08-27:  One phonetic rule has been added.

2008-11-30:  Version 1.5:
             50+ phonetic rules (most of them for important Russian, 
             Arabic and French first names) have been added
             and a list of references (in readme-file) has been added.


========================================================================
(End of file "readme.txt")
