/*
 * readme.txt
 * ----------
 *
 * Description of "phonet.c".
 *
 * Copyright (c):
 * 1999-2007:  Joerg MICHAEL, Adalbert-Stifter-Str. 11, 30655 Hannover, Germany
 *
 * SCCS: @(#) readme.txt  1.4.2  2007-08-27
 *
 * This file is subject to the GNU Lesser General Public License (LGPL)
 * (formerly known as GNU Library General Public Licence)
 * as published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 * This file is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * You should have received a copy of the GNU Library General Public License
 * along with this file; if not, write to the    
 * Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *
 * Actually, the LGPL is __less__ restrictive than the better known GNU General 
 * Public License (GPL). See the GNU Library General Public License or the file 
 * LIB_GPLP.TXT for more details and for a DISCLAIMER OF ALL WARRANTIES.
 *
 * There is one important restriction: If you modify this program in any way 
 * (e.g. add or change phonetic rules or modify the underlying logic or 
 * translate this program into another programming language), you must also 
 * release the changes under the terms of the LGPL.
 * That means you have to give out the source code to your changes,
 * and a very good way to do so is mailing them to the address given below.
 * I think this is the best way to promote further development and use
 * of this software.
 *
 * If you have any remarks, feel free to e-mail to:  
 *     ct@ct.heise.de
 *
 * The author's email address is:
 *    astro.joerg@googlemail.com
 */


This file contains important information that cannot be found 
in the German "c't" article.

Index:

1. Overview of the program "phonet"
2. Syntax of phonetic rules
3. Hash algorithms
4. Consistency check

5. Frequently Asked Questions
6. Java version of "phonet"
7. Perl version of "phonet"
8. WWW resources for this program
9. History of the program


========================================================================


Overview of the program "phonet"


The program "phonet" is designed for phonetic string conversion.
Functionally, the program consists of three main parts:
a) A syntax for context-dependent phonetic rules that can be parsed 
   by the function "phonet", which is the main "engine" of the program.
b) Phonetic rules for one or more languages (in the file "phonet.h").
c) A check function that checks all phonetic rules for consistency.

List of source files:
a)  ph_ext.h   (contains macros and prototypes; may be changed)
b)  umlaut_p.h (contains lists of umlauts)
c)  phonet.h   (contains all phonetic rules)
d)  phonet.c   (this is the "workhorse" of the program)

The file "phonet.h", which contains all phonetic rules, uses the 
character set "iso8859-1".

If you want to use "phonet.c" as a library, delete the line
"#define PHONET_EXECUTABLE"  from the file "ph_ext.h".
(Note: This will also disable the function "check_rules", which is needed
solely for development purposes.)

Notice:
The exe file included in this download is a DOS exe, so you have to obey
the rules for 8.3 filenames under DOS.


========================================================================


Syntax of phonetic rules


The syntax for phonetic rules is as follows:
   <search_string>  <1.rule>  <2.rule>
Syntax for search strings:
   <word> [<->..] [<] [<0-9>] [^[^]] [$]

Constraints:
a) All phonetic rules must be written in upper case. 
b) The end of "word" may contain, as an optional simple regular expression,
   one array of letters (or umlauts) enclosed in '(' and ')'.
c) Rules with a '<' require that the replacement string be no longer 
   than the search string.
d) The placement of rules determines their priority. Therefore, the rules
   for "SH" must be placed before the rules for "S" 
   (otherwise, a conversion error will occur for "SH").
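
Constraint d) can be illustrated with a toy converter. The following is
only a sketch and not the code from "phonet.c": the type "toy_rule", the
function "toy_convert" and the replacement strings "X" and "Z" are
hypothetical and chosen for illustration only. The sketch assumes that
the first matching rule in the list always wins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy rule: a plain search string and its replacement.
   The real rules in "phonet.h" carry more information. */
struct toy_rule {
    const char *search;
    const char *replace;
};

/* Convert "src" into "dest" (at most dest_size-1 chars), always applying
   the FIRST rule in the list that matches at the current position.
   Chars without a matching rule are copied unchanged. */
static void toy_convert(const struct toy_rule *rules, size_t n_rules,
                        const char *src, char *dest, size_t dest_size)
{
    size_t out = 0;

    assert(rules != NULL && src != NULL && dest != NULL && dest_size > 0);

    while (*src != '\0') {
        const char *rep = NULL;
        size_t used = 1;
        size_t i;

        for (i = 0; i < n_rules; i++) {
            size_t len = strlen(rules[i].search);
            if (len > 0 && strncmp(src, rules[i].search, len) == 0) {
                rep = rules[i].replace;
                used = len;
                break;                      /* first match wins */
            }
        }
        if (rep == NULL) {
            if (out + 1 < dest_size) dest[out++] = *src;
        } else {
            while (*rep != '\0' && out + 1 < dest_size) dest[out++] = *rep++;
        }
        src += used;
    }
    dest[out] = '\0';
}

/* Correct order: the longer search string "SH" comes first. */
static const struct toy_rule good_order[] = { {"SH", "X"}, {"S", "Z"} };

/* Wrong order: "S" shadows "SH", so the rule for "SH" can never match. */
static const struct toy_rule bad_order[]  = { {"S", "Z"}, {"SH", "X"} };
```

With "good_order", the word "SHOP" is converted to "XOP". With
"bad_order", the rule for "S" fires first and the result is "ZHOP".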

Note that although the tokens '^' and '$' look like the anchors of common 
Unix regular expressions, their meaning is not exactly the same.
The difference is important if you convert texts consisting of more than
one word.
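
The syntax for search strings given above can be split into its parts
with a small parser. The following is a minimal sketch under stated
assumptions, not the parser used in "phonet.c": the type "search_spec"
and the function "parse_search_string" are hypothetical names, and the
sketch assumes that "word" itself contains none of the chars '-', '<',
'0'-'9', '^' and '$'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; the names are not taken from "phonet.c". */
struct search_spec {
    char word[32];     /* leading letters, incl. an optional "(...)" array */
    int  minus_count;  /* number of '-' chars                              */
    int  has_less;     /* 1 if '<' is present                              */
    int  priority;     /* digit 0-9, or -1 if none is given                */
    int  caret_count;  /* 0, 1 ('^') or 2 ('^^')                           */
    int  has_dollar;   /* 1 if '$' is present                              */
};

/* Split a search string into its parts.
   Returns 1 on success, 0 if the string violates the sequence
   <word> [<->..] [<] [<0-9>] [^[^]] [$]. */
static int parse_search_string(const char *s, struct search_spec *out)
{
    size_t n;

    assert(s != NULL && out != NULL);

    /* <word>: everything up to the first modifier char */
    n = strcspn(s, "-<0123456789^$");
    if (n == 0 || n >= sizeof out->word) return 0;
    memcpy(out->word, s, n);
    out->word[n] = '\0';
    s += n;

    out->minus_count = 0;
    while (*s == '-') { out->minus_count++; s++; }

    out->has_less = (*s == '<');
    if (out->has_less) s++;

    out->priority = -1;
    if (*s >= '0' && *s <= '9') { out->priority = *s - '0'; s++; }

    out->caret_count = 0;
    while (*s == '^' && out->caret_count < 2) { out->caret_count++; s++; }

    out->has_dollar = (*s == '$');
    if (out->has_dollar) s++;

    /* anything left over is out of sequence */
    return *s == '\0';
}
```

For the (made-up) search string "SCH(AEIOU)--<3^$", this sketch yields
the word "SCH(AEIOU)", two minus chars, a '<', priority 3, one '^' and
a '$'. A string like "S$^" is rejected because '^' follows '$'.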


========================================================================


Hash algorithms


This program contains two hash algorithms. The second hash algorithm 
has been implemented in version 1.3, thereby tripling the speed of the
function "phonet".

Both of them require that all phonetic rules be sorted by first char, 
but the second one also uses the second char (if the second "char" is
an array, every letter in the array is evaluated).

Hence, the sorting of rules can significantly influence the performance
of the program. 

While the sorting order for the first char is irrelevant, the sorting 
order for the second char should be:

1. "normal" letters (i.e. 'A' - 'Z')
2. umlauts 
3. all other chars (e.g. '.').
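
To show why the sorting matters, here is a minimal sketch of a lookup by
first char. It is one plausible way to build such a table and not
necessarily how "phonet.c" does it: the names "build_first_char_index",
"first_candidate" and "NO_RULE" are hypothetical. The sketch covers only
the first char; the second hash algorithm additionally uses the second
char, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#define NO_RULE (-1)

/* Fill "first_index" so that first_index[c] is the position of the first
   search string starting with byte c, or NO_RULE if there is none.
   Assumes that search strings with the same first char are adjacent. */
static void build_first_char_index(const char *const *search, int n_rules,
                                   int first_index[256])
{
    int i;

    assert(search != NULL && first_index != NULL && n_rules >= 0);

    for (i = 0; i < 256; i++) first_index[i] = NO_RULE;

    for (i = 0; i < n_rules; i++) {
        unsigned char c = (unsigned char) search[i][0];
        if (first_index[c] == NO_RULE) first_index[c] = i;
    }
}

/* Return the position of the first candidate rule for the text at "s",
   so that a converter can skip all rules with another first char. */
static int first_candidate(const int first_index[256], const char *s)
{
    assert(first_index != NULL && s != NULL);
    return first_index[(unsigned char) *s];
}
```

With the (made-up) search strings "AE", "A", "SCH", "SH", "S" and "Z" in
this order, the lookup for a text starting with 'S' returns position 2,
and a converter only has to test the rules from there up to the first
rule with a different first char.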


========================================================================


Consistency check


If you add or modify some phonetic rules, you should check them for 
consistency with the function "check_rules". Due to the high number 
of rules and all their mutual dependencies, a manual check would be 
virtually "hopeless".

Rule checking involves several steps which are done for search and 
replacement strings of every rule. First, a syntax check is done, which 
verifies the correct syntax of search strings (e.g. correct sequence 
of '-', '<', priority, '^' and '$').

Then, search string and replacement string are converted by the function
"phonet". The results must be identical to the replacement string.
In this way, all errors are found which stem from a wrong succession of 
rules or from overlooked dependencies. Alas, sometimes this method is too 
"narrow-minded" and issues warnings which can be ignored. Some of these 
have been included as exceptions in the function "check_rules".
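
The second step can be sketched as follows. This is only an illustration
of the idea and not the code of "check_rules": the type "toy_rule", the
function "count_inconsistent_rules" and the stand-in converter
"toy_convert" (which simply maps 'V' and 'W' to 'F') are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy rule: a plain search string and its replacement. */
struct toy_rule {
    const char *search;
    const char *replace;
};

/* Type of a conversion function: writes the conversion of "src" to "dest". */
typedef void (*convert_fn)(const char *src, char *dest, size_t dest_size);

/* Toy stand-in for "phonet": maps 'V' and 'W' to 'F', copies the rest. */
static void toy_convert(const char *src, char *dest, size_t dest_size)
{
    size_t out = 0;

    assert(src != NULL && dest != NULL && dest_size > 0);

    for (; *src != '\0' && out + 1 < dest_size; src++) {
        dest[out++] = (*src == 'V' || *src == 'W') ? 'F' : *src;
    }
    dest[out] = '\0';
}

/* Count the rules for which the conversion of the search string or of
   the replacement string differs from the replacement string. */
static int count_inconsistent_rules(const struct toy_rule *rules,
                                    size_t n_rules, convert_fn convert)
{
    char buf[64];
    int bad = 0;
    size_t i;

    assert(rules != NULL && convert != NULL);

    for (i = 0; i < n_rules; i++) {
        int ok;

        convert(rules[i].search, buf, sizeof buf);
        ok = (strcmp(buf, rules[i].replace) == 0);

        convert(rules[i].replace, buf, sizeof buf);
        ok = ok && (strcmp(buf, rules[i].replace) == 0);

        if (!ok) bad++;
    }
    return bad;
}
```

In this toy setting, the rules "V" -> "F" and "W" -> "F" pass the check.
A rule "W" -> "V" is reported, because its replacement "V" is itself
changed to "F" by the converter.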


========================================================================


Frequently Asked Questions


1.
Is it possible to use this program in a commercial software project?

This is exactly what the Library GPL has been designed for.
See the file "LIB_GPLP.TXT" for details.

Probably the "safest" way to comply with the rules of the LGPL is 
to put the program "phonet" in a separate library (e.g. Windows-DLL).
In this way, only the phonetic library is subject to the LGPL and 
the "rest" of the project still has the "old" rights of its owner, 
so you don't have to give away your source code.


2.
Is a Java version of this program available?

Yes, a native Java version is available.
See:  https://opensource.softmethod.de/trac/opensource
      and click on "phonet4j".

Alternatively, you can also write a wrapper class in Java which uses 
the Java Native Interface to call a C library.
There is an excellent article (in German) telling you how to do it:
"Kaffee mit Vitamin C", c't, issue 20/2000, pp.242-247
or:  www.heise.de/ct, soft-link 0020242. 


3.
What is the speed of this program?

Due to the consistent use of very fast pointer operations (e.g. "s1 == s2")
instead of relatively slow string functions (e.g. "strcpy"), the C code 
runs very fast even on an old 486 notebook.

In order to get measurable running times, you usually have to do thousands 
or even tens of thousands of phonetic conversions.
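
A simple way to measure this is to repeat the conversions in a loop and
to time the whole loop. The following is a minimal sketch: the function
"time_conversions" is hypothetical, and "copy_convert" is merely a
stand-in that copies its input; in a real measurement it would be
replaced by a small wrapper around the function "phonet".

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Type of a conversion function: writes the conversion of "src" to "dest". */
typedef void (*convert_fn)(const char *src, char *dest, size_t dest_size);

/* Stand-in converter which only copies the input (placeholder). */
static void copy_convert(const char *src, char *dest, size_t dest_size)
{
    size_t out = 0;

    assert(src != NULL && dest != NULL && dest_size > 0);

    while (*src != '\0' && out + 1 < dest_size) dest[out++] = *src++;
    dest[out] = '\0';
}

/* Convert every word "repeats" times and return the CPU time in seconds.
   The total number of conversions is stored in "*n_calls". */
static double time_conversions(convert_fn convert, const char *const *words,
                               size_t n_words, long repeats, long *n_calls)
{
    char buf[64];
    clock_t start, stop;
    long r;
    size_t i;

    assert(convert != NULL && words != NULL && n_calls != NULL);

    *n_calls = 0;
    start = clock();
    for (r = 0; r < repeats; r++) {
        for (i = 0; i < n_words; i++) {
            convert(words[i], buf, sizeof buf);
            (*n_calls)++;
        }
    }
    stop = clock();

    return (double) (stop - start) / CLOCKS_PER_SEC;
}
```

Calling "time_conversions" with three words and 10000 repeats performs
30000 conversions; dividing the returned time by "*n_calls" gives the
average time per conversion.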


4.
How did the author develop his phonetic rules? 

As a start, the author adopted a rule set from an old article in c't 
(issue 10/1988, pp.126), which contained about 30 rules. 
These rules were relatively crude and one of them was even found to be 
faulty. All other rules were developed by the author.

Several approaches have been combined for the final development of the 
German rules:

- Because one of the most common applications for phonetic analysis is 
  searching in address databases, several rules for common first names 
  and common family names have been added.
- Since the author, of course, had a "natural intuition" for difficult 
  words or missing exceptions, any such word that he encountered was noted 
  and checked with "phonet".
- For dictionary applications, several rules for the new German orthography
  (e.g. "viel versprechend" vs. "vielversprechend") have been added.
- The final "brute force proofreading" has been done using the "Duden". 
  Difficult first letters like 'C' and (to the author's surprise) 'V' were
  checked most thoroughly.


5.
Why did the author develop his own version of regular expressions?

From the start, high performance was one of the main goals, and this 
requires a good hash algorithm and efficient parsing of the phonetic rules. 
If you do a "grep" with (e.g.) 500 regular expressions, the search will be 
quite slow.
Secondly, using common regular expressions would mean that the rules could 
not use priorities, '<' or '-', thereby inflating the number of rules.

During the development of the program, the first syntax ideas were 
priorities, '^' and '$'. 
Later on, minus chars ('-'), '<' and arrays of letters (e.g. "(XYZ)") were 
added to the syntax to curb the number of rules.


========================================================================


Java version of "phonet"


Due to the efforts of Andreas Meyer and his team at the company Softmethod, 
a native Java version of "phonet" is now available. This implementation 
is also subject to the GNU Lesser General Public License (LGPL).

See:  https://opensource.softmethod.de/trac/opensource
      and click on "phonet4j".


========================================================================


Perl version of "phonet"


Due to the efforts of Michael Maretzke from Muenchen (Munich), Germany,
a Perl version of phonet is now available. This version is also subject 
to the GNU Lesser General Public License (LGPL).

To install the Perl version, uncompress the file "phonet.tar.gz" 
with "tar xzf phonet.tar.gz" (works at least under Solaris and Linux)
or with "winzip". Then, follow the instructions in the accompanying 
file "readme_perl.txt".

Actually, the Perl version uses a connection to the C program, thereby 
avoiding porting errors. As a further advantage, you probably do not 
have to worry about running times.

If you have any questions concerning the Perl version, please e-mail
Michael Maretzke (michael@maretzke.de).


========================================================================


WWW resources for this program

www.heise.de/ct, soft-link 9925252

Java version of "phonet":
Go to https://opensource.softmethod.de/trac/opensource
      and click on "phonet4j".


========================================================================


History of the program


1998-09-01:  Start of program development.
1999-03-30:  The first version of this program is submitted for 
             publication in a German computer magazine (c't).
1999-12-06:  Version 1.0 is published in c't (issue 25/1999, pp.252-261).
1999-12-22:  Due to some problems in VC++ 6.0 and gcc, all phonetic rules 
             were converted to upper case
             (The functions "initialize_phonet" and "check_rules" were 
              adapted).

2000-01-04:  Multi-language support (for natural languages, e.g. German)
             has been added, and all comments were translated into English.
2000-01-10:  Version 1.1 of the program is placed under the GNU Lesser
             General Public License.
2000-05-20:  Some phonetic rules and an FAQ list have been added.
2000-06-06:  Function "check_rules" outputs the total number of rules.

2000-09-26:  Version 1.2: A Perl version and some phonetic rules have 
             been added.
2000-11-14:  Version 1.3: A second hash algorithm has been implemented.

2001-04-28:  Phonetic rules for all "missing" iso-umlauts have been added
             and the macro "CHAR" has been deleted.
2001-05-26:  Some phonetic rules have been added.

2002-01-18:  Some phonetic rules have been added.
2003-08-10:  Some phonetic rules have been added.

2005-06-11:  Version 1.4:
             Phonetic rules for "no language" have been implemented.

2007-05-17:  Some phonetic rules have been added.
2007-08-27:  One phonetic rule has been added.

========================================================================
(End of file "readme.txt")
