/*
 * readme.txt
 * ----------
 *
 * Description of program "address comparison".
 *
 * Copyright (c):
 * 2007-2008:  Jörg MICHAEL, Adalbert-Stifter-Str. 11, 30655 Hannover, Germany
 *
 * SCCS: @(#) readme.txt  1.2  2008-11-30
 *
 * This file is subject to the GNU Lesser General Public License (LGPL)
 * (formerly known as GNU Library General Public Licence)
 * as published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 * This file is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * You should have received a copy of the GNU Library General Public License
 * along with this file; if not, write to the
 * Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 *
 * Actually, the LGPL is __less__ restrictive than the better known GNU General
 * Public License (GPL). See the GNU Library General Public License or the file
 * LIB_GPLA.TXT for more details and for a DISCLAIMER OF ALL WARRANTIES.
 *
 * There is one important restriction: If you modify this program in any way
 * (e.g. modify the underlying logic or translate this program into another
 * programming language), you must also release the changes under the terms
 * of the LGPL.
 * That means you have to give out the source code to your changes,
 * and a very good way to do so is mailing them to the address given below.
 * I think this is the best way to promote further development and use
 * of this software.
 *
 * If you have any remarks, feel free to e-mail to:
 *     ct@ct.heise.de
 *
 * The author's email address is:
 *    astro.joerg@googlemail.com
 */


This file contains some important information, which cannot be found
in the German "c't" article.

Index:

1. Overview of the program "addr"
2. Important configuration options
3. Using the functions "get_gender" and "check_equiv_names"
4. Using the function "phonet" for phonetic analysis
5. Phonetically extended Levenshtein function
6. Program dependencies
7. Format of "struct MAIL_ADDR"
8. Using phonetic variables
9. Date format for birthdays

10. Options for "run_mode"
11. The address comparison
12. Calculating weights from a file

13. Error-tolerant database selects with program "dbselect.c"
14. Searching duplicates with program "dedupl.c"
15. Return values of "dbselect.c" and "dedupl.c"
16. Standalone test
17. Doing database selects from other programming languages
18. Using check sums for customer numbers

19. Frequently Asked Questions
    a) Is it possible to use this program in a commercial software project?
    b) Is a Java version of this program available?
    c) What is the speed of this program?

20. WWW resources for this program
21. References
    a) Levenshtein function
    b) Phonetic conversion
    c) Error-tolerant database selects (for mail addresses)
    d) Error-tolerant search routines
    e) Check digits
    f) Dictionary of first names
22. Checking and correcting German mail addresses
23. History of the program


========================================================================


Overview of the program "addr"


The program "addr" is a software package for error tolerant address
comparisons (e.g. database searches or searching duplicates in a database).

List of files:

a)  addr_ext.h (contains macros and prototypes; may be changed)
b)  umlaut_a.h (contains system-dependent umlauts)
c)  lev100ph.c (phonetically extended Levenshtein functions and the like)
d)  addr.c     (functions for address comparison)

e)  demo.c     (a simple demo program showing you "how to do it")
f)  dbselect.c (program for error-tolerant database selects;
                may be changed)
g)  dedupl.c   (program for searching duplicates in address databases;
                may be changed)
h)  dummy_db.h (dummy database functions which are needed for a standalone
                test of the programs "dbselect.c" and "dedupl.c")

Add-ons:
i)  s_dieder.c (program for check sums based on dieder groups,
                from c't, issue 4/1997; in German)
j)  s_mod11.c  (program for check sums based on mod 11,
                from c't, issue 7/1996; in German)


This software uses the char set "iso8859-1".

The program "demo.c" is intended to get a "feeling" for the address
comparison. In this program, comparing two mail addresses is done
with trace option, so you will see how many points each field gets.
Calling the program with option "-?" will show you the available options.
For example, the option "-lev <name_1> <name_2>" will show you the inner 
workings of the Levenshtein function. 

Most important, of course, are the programs "dbselect.c" (for error-tolerant
database selects) and "dedupl.c" (for searching duplicates in a database).
These two programs cover two important applications for address comparisons.

As you might expect, both programs can be customized for individual needs.
The corresponding places are marked by "TO-DO" comments in the file
"addr_ext.h" and the programs "dbselect.c" and "dedupl.c". 
Since the programs "addr_ext.h", "dbselect.c" and "dedupl.c" are intended 
to be customized, all changes covered by TO-DO comments are free.

If you want to use "dbselect.c" or "dedupl.c" as a library, delete the
line  "#define SELECT_EXECUTABLE"  or  "#define DEDUPL_EXECUTABLE",
respectively, in the corresponding "*.c"-file. And do not forget to call
the function "cleanup_addr()" when you exit the program.

Notice:
The exe files included in this download are DOS exes, so you have to obey
the rules for 8.3 filenames under DOS.


========================================================================


Important configuration options


The following macros should be activated or changed, if necessary:

#define FILE_FOR_STANDALONE_TEST   (see chapter 16 of this file)
                          De-activate this macro for production use.

#define USE_GENDER        (see chapter 3 of this file)
                          Activate this macro, if you use "gender.c".

#define USE_PHONET        (see chapter 4 of this file)
                          Activate this macro, if you use "phonet.c".

#define DEFAULT_COUNTRY   (in file "addr_ext.h")
                          Users outside Germany must change this macro.

#define NON_EXISTING_BIRTHDAY   (in file "dbselect.c")
                          This macro depends on the date format of your
                          database. Change it, if necessary.


Activate the following macro(s), if you want to supply the corresponding 
phonetic variable(s) by yourself:
   #define ACTIVATE_FIRST_NAME_PHONET   (in file "dbselect.c")
   #define ACTIVATE_FAM_NAME_PHONET     (in file "dbselect.c")
   #define ACTIVATE_STREET_PHONET       (in file "dbselect.c")
Warning: This will override the option "USE_PHONET".


========================================================================


Using the functions "get_gender" and "check_equiv_names"


This program optionally uses the program "gender.c" by the same author
(available from http://www.heise.de/ct, soft-link 0717182). 
"gender" is a program for determining the gender of a given first name. 
Please use version 1.2 or higher.

The corresponding dictionary file "nam_dict.txt" contains (among others) 
44,000+ first names and some 700 pairs of "equivalent" names. 
The program is subject to the LGPL while the dictionary file "nam_dict.txt"
is placed under the "GNU Free Documentation License".

If you want to use "gender", add all files of "gender" to your project
(i.e. "gen_ext.h" and "gender.c") and delete the macro "GENDER_EXECUTABLE" 
from the file "gen_ext.h".
Secondly, activate the line "#define USE_GENDER" in the file "addr_ext.h".

Note:
The return values of "get_gender" can be customized by changing the 
corresponding macros in the file "gen_ext.h".


========================================================================


Using the function "phonet" for phonetic analysis


This program optionally uses the program "phonet.c" by the same author
(available from http://www.heise.de/ct). "phonet" is a powerful program 
containing more than 900 phonetic rules for the German language - and is 
also subject to the LGPL.

If you want to use "phonet", add all files of "phonet" to your project
(i.e. "ph_ext.h", "phonet.h" and "phonet.c") and delete the macro
"PHONET_EXECUTABLE" from the file "ph_ext.h".

Secondly, activate the line "#define USE_PHONET" in the file "addr_ext.h".
And use "run_mode" with option "COMPARE_LANGUAGE_GERMAN" for address 
comparisons.


========================================================================


Phonetically extended Levenshtein function


In addition to the function "phonet", this program also uses a phonetically
extended Levenshtein function. Most of the extensions apply to German only,
but some of them are independent of language.

New in version 1.2:
In version 1.2, the Levenshtein function has been extended to "true" phonetic 
rules (for German).
If you want to inspect them, see the array "l_rules_german" in file "lev100ph.c".

The ordering of the rules is to a large degree determined by the use of hash 
algorithms to speed up the function.
Therefore, be careful if you want to change them.


========================================================================


Program dependencies


dbselect.c, dedupl.c and demo.c:
    addr.c, addr_ext.h, lev100ph.c, umlaut_a.h
    [ gender, phonet ]     (if used, which is the default)

addr.c:
    addr_ext.h, lev100ph.c, umlaut_a.h
    [ gender, phonet ]     (if used, which is the default)

lev100ph.c:
    addr_ext.h, umlaut_a.h
    [ phonet ]           (if used, which is the default)


Note:
All these programs can be put together into a library or DLL.

For this, of course, you have to disable all macros  "<prog>_EXECUTABLE" 
in the programs "dbselect.c", "dedupl.c" and "demo.c". For production use,
it is also advisable to disable the macro "FILE_FOR_STANDALONE_TEST"
and use real SQL commands instead (which you have to provide yourself).


========================================================================


Format of "struct MAIL_ADDR"


Functions for address comparisons need a format for mail addresses 
as arguments. In this program, mail addresses are represented as arrays 
of "struct MAIL_ADDR":

struct MAIL_ADDR
  {
    char *text;
    int info;
  };

Hence, all variables must be character strings, while ".info" serves 
as the corresponding field identifier. The macros "IS_*" from the file 
"addr_ext.h" are defined for this purpose.
The last line of this array must contain null values. 

Birthdays may be given in full format, or as separate strings for day, 
month and year.
And you don't have to define all variables, because "missing" variables 
(with the exception of phonetic variables) are automatically interpreted 
as being empty strings.

Macros have been defined for the corresponding field lengths without '\0'. 
With the exception of the fixed values for  "LENGTH_GENDER" (= 1), 
"LENGTH_BIRTH_MONTH" (= 2) and "LENGTH_BIRTH_DAY" (= 2),  all field lengths 
may be changed individually. 
However, the lengths of the phonetic variables must be the same as for 
the corresponding "normal" variables.

The following example of a "MAIL_ADDR" variable is taken from the program 
"dedupl.c":

struct MAIL_ADDR db_addr[] =
   { { db_gender,        IS_GENDER      },
     { db_first_name,    IS_FIRST_NAME  },
     { db_fam_name,      IS_FAM_NAME    },
     { db_c_o_name,      IS_C_O_NAME    },
     { db_street,        IS_STREET      },
     { db_first_name_ph, IS_FIRST_NAME_PHONET },
     { db_fam_name_ph,   IS_FAM_NAME_PHONET },
     { db_street_ph,     IS_STREET_PHONET },
     { db_zip_code,      IS_ZIP_CODE    },
     { db_city,          IS_CITY        },
     { db_country,       IS_COUNTRY     },
     { db_b_day,         IS_BIRTH_DAY   },
     { db_b_month,       IS_BIRTH_MONTH },
     { db_b_year,        IS_BIRTH_YEAR  },
     { db_cust_number,   IS_CUST_NUMBER },
     {   NULL,              0           }
   };
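
The following is a minimal sketch of how such an array can be read, assuming
only the layout shown above. The identifiers "SKETCH_IS_*" and the function
"sketch_get_field" are hypothetical and exist only for this illustration;
the real "IS_*" macros and their values are defined in "addr_ext.h".

```c
#include <stddef.h>

/* Same layout as shown above; the real definition is in "addr_ext.h". */
struct MAIL_ADDR
  {
    char *text;
    int info;
  };

/* Placeholder field identifiers for this sketch only.
   The real IS_* macros (and their values) are defined in "addr_ext.h". */
#define SKETCH_IS_FIRST_NAME  1
#define SKETCH_IS_FAM_NAME    2
#define SKETCH_IS_CITY        3

/* Return the text of the first entry tagged with "field_id".
   The loop stops at the terminating entry (text == NULL).
   A missing field is reported as an empty string. */
const char *sketch_get_field (const struct MAIL_ADDR addr[], int field_id)
{
  size_t i;

  for (i = 0; addr[i].text != NULL; i++)
    {
      if (addr[i].info == field_id)
        return addr[i].text;
    }
  return "";
}
```

Returning an empty string for a missing field mirrors the rule that
"missing" variables are interpreted as empty strings.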


========================================================================


Using phonetic variables


As you might have noticed, the "struct MAIL_ADDR" has "room" for phonetic
variables. Of course, if you use this option, you must do so consistently.

If you use phonetic variables in (e.g.) the program "dbselect.c", 
you must either:
a) store and extract them from your database
   (this means adding them to your "declare_cursor" function)
or
b) create them "on the fly" in the "fetch_cursor" function. 
   You can create them using "phonet" or any other program for phonetic 
   conversion

Notice:
If you call "lev_ph" with user-supplied phonetic variables, this will 
override the option "USE_PHONET".


========================================================================


Date format for birthdays


In general, there are two different ways to store birthdays in a database:
You can store them in one "date" field or in three separate fields for
year, month and day of birth.

There are good reasons for both ways. If birthdays may be incomplete,
that is, if you might just know the year of birth, or month and year,
it is advisable to store birthdays as three different database fields.
If, on the other hand, you can guarantee that birthdays are always
complete, you can make good use of the date format of your database.

Hence, the structure "MAIL_ADDR" supports both options: 
You can define separate fields for year, month and day of birth. As an 
alternative, you can use birthdays in full format and even define an 
individual date format. The "date_format" can be (e.g.) "ddmmyyyy" or 
"MM/DD/YYYY" or "D.M.Y". If fewer than two delimiters are given in 
"date_format", the field lengths must match. 

If "date_format" is NULL or empty, one of the formats 
    "<year>-<month>-<day>"  (ISO standard)
or  "<day>.<month>.<year>"  (European format) 
or  "<month>/<day>/<year>"  (American format)
is expected. 

Years in "full_birthday" may be given as "yyyy" or "yy" or even "y", 
and month or day may be two digits or just one. Hence, this function 
accepts not only birthdays like "1975-06-23" and "23.06.1975" as valid 
dates, but also "75-6-23" and "23.6.75".
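
To illustrate the three default formats, here is a minimal sketch that 
splits a birthday string into numbers by looking at the delimiter. 
The function "sketch_split_birthday" is hypothetical and is not the parser 
used in "addr.c"; it assumes a trimmed input string and does not guess 
a century for short years.

```c
#include <stdio.h>

/* Split a birthday in ISO ("y-m-d"), European ("d.m.y") or American
   ("m/d/y") notation. Returns 1 on success and 0 otherwise.
   Short years such as "75" are returned as written (here: 75).
   The output arguments are only changed on success. */
int sketch_split_birthday (const char *text, int *year, int *month, int *day)
{
  int y = 0, m = 0, d = 0;
  char extra;   /* receives a trailing character, if there is one */
  int ok;

  /* Each sscanf() must convert exactly three numbers; a result of 4
     means that something follows the third number. */
  ok = (sscanf (text, "%d-%d-%d%c", &y, &m, &d, &extra) == 3)
    || (sscanf (text, "%d.%d.%d%c", &d, &m, &y, &extra) == 3)
    || (sscanf (text, "%d/%d/%d%c", &m, &d, &y, &extra) == 3);

  if (! ok || y < 0 || m < 1 || m > 12 || d < 1 || d > 31)
    return 0;

  *year = y;
  *month = m;
  *day = d;
  return 1;
}
```

The range check is deliberately coarse (it accepts, e.g., 31 February); 
a production parser would also validate the day against the month.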


========================================================================


Options for "run_mode"


The parameter "run_mode" controls the available run-time options.
The following values, which can be "OR"-ed, are available for users:


COMPARE_NORMAL             -  normal mail address comparison
DATABASE_SELECT            -  do an "asymmetric" comparison
                              needed for database selects
DB_WILDCARDS_FOR_LIKE      -  use wildcards for SQL select with "like"
                              (i.e. wildcards are '%' and '_'
                               instead of the usual '*' and '?')
SEARCH_FAMILY_MEMBERS      -  search for family members, too
ACCEPT_SIMILAR_BIRTHDAYS   -  accept similar birthdays (i.e. "15 May"
                              and "1 June" are seen as similar)
COMPARE_LANGUAGE_GERMAN    -  German mail address formatting
                              and German phonetic comparison
DO_UNWEIGHTED_COMPARISON   -  do a comparison with "naive" weights
SKIP_BLANKCUT              -  do not call "blank_cut" for formatting 
                              (this function deletes leading and trailing
                              blanks and compresses multiple blanks)
SKIP_UPEXPAND              -  do not call "up_expand" for formatting 
                              (this function converts all chars
                              to upper case and expands umlauts)
TRACE_ERRORS               -  trace errors
TRACE_ADDR                 -  activate trace option for "compare_addr"
TRACE_LEV                  -  activate trace option for the Levenshtein
                              function

Other values defined in "lev_ext.h" are reserved for internal use 
and are not discussed here.
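
As a minimal sketch of what "OR"-ing means here: each option occupies its 
own bit, so several options can be combined into one "run_mode" value and 
tested individually. The macros "SKETCH_*" and their bit values are 
placeholders for this illustration only; the real option macros and their 
values come from the header files of this package.

```c
/* Placeholder bit values for this sketch only.
   The real run-mode macros are defined in the package's header files. */
#define SKETCH_DATABASE_SELECT         0x01
#define SKETCH_SEARCH_FAMILY_MEMBERS   0x02
#define SKETCH_TRACE_ADDR              0x04

/* Return 1 if "option" is switched on in "run_mode", else 0. */
int sketch_has_option (int run_mode, int option)
{
  return (run_mode & option) != 0;
}

/* Combine two options into one run-mode value with bitwise OR. */
int sketch_select_with_family (void)
{
  return SKETCH_DATABASE_SELECT | SKETCH_SEARCH_FAMILY_MEMBERS;
}
```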


========================================================================


The address comparison


The actual comparison is done by the function "compare_addr" (in file 
"addr.c"):

int compare_addr (struct MAIL_ADDR addr_1[], struct MAIL_ADDR addr_2[],
      int min_points, int run_mode);

"min_points" is the minimum number of points for the address comparison.
Recommended values are in the range of 80 to 93 points.
This function returns a measure for the similarity of two addresses.
Since at most 100 points can be reached, this amounts to a measure in
percent.

For each field, the weight (= maximum number of points) is calculated 
by the amount of information contained in the field (= filter factor 
of the most common entry).

The formula is:
   points = 10 * log (filter_factor)
"log" is the decimal (base-10) logarithm. For example, for month_of_birth, 
the filter factor used in the program is 11.78, which gives 10.7 points 
for this field.

For "international" addresses, the default weights are:
   gender       :   3   points
   first name   :  16.5 points
   family name  :  19   points
   street       :  37   points
   ZIP code/city:  21   points
   country      :   3   points
   full birthday:  39.7 points
   cust. number :  50   points

The database option ("DATABASE_SELECT") leads to an "asymmetric" comparison,
i.e. empty strings in the first address are "ignored" (by setting the error
limit to "MATCHES_ALL").

The comparison algorithm enables you to look for "family members", defined
by having (nearly) the same family name and address, without compromising
the performance of the program. If this is what you want, use the option
"SEARCH_FAMILY_MEMBERS". Then, family members will be marked by the special
return value "IS_FAMILY_MEMBER".

The address comparison requires all strings to be "trimmed", i.e. leading
and trailing blanks and multiple blanks "inside" the variables are not
allowed. If you do the formatting yourself, you may use the run-mode option 
"SKIP_BLANKCUT".


========================================================================


Calculating weights from a file


You can re-calculate any weight, if necessary, with the program "demo":
   demo -calculate_weight <sorted_unload_file>

Caveats:
1. The unload file must be sorted and must contain data of the required 
   field only. 
   Hence, if you calculate the weight for (e.g.) year of birth, your 
   unload file should look something like this:
      1954
      1967
      1967
      1983
   or:
      54
      67
      67
      83
2. The unload file should NOT contain field delimiters or abbreviations 
   (e.g. in a list of first names, you have to delete entries like 
   "A." or "Chr.").

As a final step, you have to manually replace the macro definition for 
the weight in the file "addr_ext.h".
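
Why the unload file must be sorted can be seen from the following sketch. 
It assumes that the filter factor is the total number of entries divided 
by the count of the most common entry; this is a reading of the 
description in this file, not code taken from "demo.c", and the function 
name "sketch_filter_factor" is hypothetical. In a sorted list, equal 
entries are adjacent, so one sequential pass is enough.

```c
#include <stddef.h>
#include <string.h>

/* Return (number of entries) / (count of the most common entry).
   "entries" must be sorted, so that equal values are adjacent.
   Returns 0.0 for an empty list. */
double sketch_filter_factor (const char *entries[], size_t n)
{
  size_t i;
  size_t run = 0;       /* length of the current run of equal entries */
  size_t longest = 0;   /* length of the longest run seen so far */

  for (i = 0; i < n; i++)
    {
      if (i > 0 && strcmp (entries[i], entries[i - 1]) == 0)
        run++;
      else
        run = 1;

      if (run > longest)
        longest = run;
    }
  return (n == 0) ? 0.0 : (double) n / (double) longest;
}
```

For the four sample years 1954, 1967, 1967, 1983 the most common entry 
occurs twice, so the filter factor is 4 / 2 = 2. The weight then follows 
from the formula "points = 10 * log (filter_factor)".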

If you have addresses from different countries in your database, 
you should calculate the weight separately for each (major) country
and include the values in the array "WEIGHTS_ADDR_COMPARE" 
(also in the file "addr_ext.h").

Notes:
1. 
Since the correct weight for year_of_birth strongly depends on the contents
of your database, it is strongly recommended that you re-calculate the weight
for year_of_birth.
2.
If you want to calculate weights for the file "samples.txt", you must 
first do an unload for the respective field. Call the function "demo"
for this:
   demo -unload_field_from_sample_file  <field_no>  [ <sample_file> ]
The valid range for <field_no> is 1 - 10  (1 = gender,  10 = year_of_birth).
However, since the "original" samples file is rather small, it will NOT 
give statistically valid results.
Therefore, if you use this option, you should use a samples file filled 
with your own data.


========================================================================


Error-tolerant database selects with program "dbselect.c"


An error-tolerant database select must meet several conflicting demands. 
First of all, users demand good answering times, which means using a 
database index. But if you allow for errors in all fields, you must also 
allow for wrong first letters in index fields - which is not what database 
indexes are made for.

In order to comply with these conflicting demands, the program "dbselect.c"
does the following:

First, select strings for family name, first name, ZIP code, birthday and 
customer number are created. The necessary indexes are:
   - ZIP code + family name
   - full birthday + first name
   - customer number
To allow for variations due to e.g. umlauts, several search names are 
generated for family name and first name.
In case of missing or empty fields, the corresponding select is omitted.
Note:
In this program, you must use DB wildcards for like (i.e. '%' and '_').
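
If your search strings still contain the usual wildcards '*' and '?', 
they can be converted with a few lines of C. This is a minimal sketch; 
the function "sketch_to_like_wildcards" is hypothetical and not part of 
"dbselect.c". It does not escape '%' or '_' characters that are already 
present in the string.

```c
/* Replace '*' by '%' and '?' by '_' in place.
   Literal '%' and '_' characters in the input are left untouched,
   i.e. they will also act as wildcards in a "like" condition. */
void sketch_to_like_wildcards (char *pattern)
{
  for ( ; *pattern != '\0'; pattern++)
    {
      if (*pattern == '*')
        *pattern = '%';
      else if (*pattern == '?')
        *pattern = '_';
    }
}
```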

All database entries found in the select are then compared with the search
address via the function "compare_addr".
If you want to do phonetic comparisons, activate the corresponding lines 
of code (as marked by "TO-DO" comments) in the file "dbselect.c".


The function doing the actual database select is defined as follows:

   int database_select (struct MAIL_ADDR search_addr[],
          int min_points, struct DB_SEARCH_RESULT *search_results[],
          struct DB_SEARCH_RESULT storage_area[],
          int max_useful_found, int run_mode)

Here, "search_addr" is the address to be searched for.
"min_points" is the minimum number of points for a match.
Recommended values are between 80 and 90.

For "run_mode", all predefined values can be used. The standalone test of 
the program "dbselect.c" uses the following default value:
   #define RUN_MODE  (COMPARE_NORMAL | COMPARE_LANGUAGE_GERMAN | DATABASE_SELECT | SEARCH_FAMILY_MEMBERS | DB_WILDCARDS_FOR_LIKE)

The structure "struct DB_SEARCH_RESULT" is defined as follows:
   struct DB_SEARCH_RESULT
     {
      long matchcode;
      int  points;
      char first_name [LENGTH_FIRST_NAME +1];
      char fam_name [LENGTH_FAM_NAME +1];
      char city [LENGTH_CITY +1];
      char full_birthday [10 +1];
      char cust_number [LENGTH_CUST_NUMBER +1];
     };

All matching addresses (or more precisely, number of points, matchcode 
and certain database fields) are stored in the arrays "search_results" 
and "storage_area". Both arrays _MUST_ be of size "max_useful_found".
The array "storage_area" is used internally for unsorted storage and 
should not be read by the user.

Instead, use the array "search_results". It contains pointers to all 
entries and is sorted by decreasing number of points. Hence, the first
entries will be the best-matching ones. If searching for family members
is activated, they will be the last entries.
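
The relation between the two arrays can be sketched as follows: the 
records stay where they are in the storage array, and only an array of 
pointers is sorted. The struct "SKETCH_RESULT" is a reduced stand-in for 
"struct DB_SEARCH_RESULT", and the function names are hypothetical; the 
special placement of family members is not covered by this sketch.

```c
#include <stdlib.h>

/* Reduced stand-in for "struct DB_SEARCH_RESULT" (sketch only). */
struct SKETCH_RESULT
  {
    long matchcode;
    int  points;
  };

/* qsort() callback: order pointers by decreasing number of points. */
static int sketch_by_points_desc (const void *a, const void *b)
{
  const struct SKETCH_RESULT *ra = *(const struct SKETCH_RESULT * const *) a;
  const struct SKETCH_RESULT *rb = *(const struct SKETCH_RESULT * const *) b;

  return (rb->points > ra->points) - (rb->points < ra->points);
}

/* Let results[i] point to storage[i], then sort the pointers only.
   The storage array itself stays in its original (unsorted) order. */
void sketch_sort_results (struct SKETCH_RESULT *results[],
      struct SKETCH_RESULT storage[], size_t count)
{
  size_t i;

  for (i = 0; i < count; i++)
    results[i] = &storage[i];

  qsort (results, count, sizeof (results[0]), sketch_by_points_desc);
}
```

Sorting pointers instead of records avoids copying the (much larger) 
result structures.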

This function returns the number of matches.

Exceptions: A return value < 0 indicates an error.
This can be:
1. The return value  ADDR_INSUFFICIENT_SEL_CRITERIA  indicates that 
   the select has been aborted because of insufficient select criteria.
2. If the number of matches exceeds "max_useful_found", the function 
   returns  ADDR_TOO_MANY_MATCHES_FOUND,  thereby indicating 
   that the select has been aborted, because not all matches can be
   stored in the arrays and the arrays are "filled".
   This mainly happens if you search for incomplete addresses.
      In this case, you have two options:
      a) Ask the user to narrow down the search criteria.
      b) Use the first "max_useful_found" matches as stored in "search_results".
3. Other errors are mostly SQL errors.

On the other hand, it is less critical if "min_points" is "too low", 
because this will only lead to the following:
1. The "least matching" entries are deleted from the list of matches and
2. "min_points" is raised.

Hence, if you are unsure about "good" values for "min_points", use the 
default as a starting point - together with a low value for "max_useful_found" 
(e.g. 5).


========================================================================


Searching duplicates with program "dedupl.c"


The program "dedupl.c" is designed for searching duplicates in an address
database. Searching for duplicates is a process with several steps:

1.
Unload the database and format the unload file. Both steps can be done
by calling the program "dedupl.c" with option "-unload":

   dedupl  -unload  <unload_file_1>  [ .. <unload_file_7> ]

In order to "catch" different sources of errors, up to seven unload files
can be created. If you specify "-" as a file name, the corresponding file
will not be created.

Each line of the unload file contains the following fields (separated
by '|'):

   <sort_header>|<family_name_phonet>|<first_name_phonet>|
       <full_birthday_and_gender>|<family_name>|<first_name>|<c_o_name>|
       <ZIP_code>|<country>|<street_phonet>|<street>|<customer_number>|
       <database_matchcode>|

Note:
"Double" names are written in "full format" and also in "split" form 
(e.g. for "Miller-Smith" the entries "Miller-Smith", "Miller" and "Smith"
are created).
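
The handling of double names can be sketched like this. The sketch assumes 
that a double name is recognized by a hyphen, which is an assumption made 
for illustration; "dedupl.c" may use other or additional criteria. The 
names "sketch_split_double_name" and "SKETCH_NAME_LEN" are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_NAME_LEN  40   /* placeholder length for this sketch */

/* Write the variants of "name" to "out" and return their number:
     1 for a simple name    (out[0] = name),
     3 for a double name    (out[0] = name, out[1] = first part,
                             out[2] = everything after the first hyphen),
     0 if the name is too long for the buffers. */
int sketch_split_double_name (const char *name,
      char out[3][SKETCH_NAME_LEN + 1])
{
  const char *hyphen = strchr (name, '-');
  size_t first_len;

  if (strlen (name) > SKETCH_NAME_LEN)
    return 0;

  strcpy (out[0], name);

  /* no hyphen, or hyphen at the very beginning or end: simple name */
  if (hyphen == NULL || hyphen == name || hyphen[1] == '\0')
    return 1;

  first_len = (size_t) (hyphen - name);
  memcpy (out[1], name, first_len);
  out[1][first_len] = '\0';
  strcpy (out[2], hyphen + 1);
  return 3;
}
```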

2.
Sort each unload file. The sort header is an 80-byte string that determines
the sort order of the file:

File_1:  Sort by name (i.e.: phonetic family name, phonetic first name),
         birthday, gender and c/o-name.
File_2:  Sort by mail address (i.e. ZIP code and phonetic street), birthday,
         gender and phonetic first name.
File_3:  "Mixed" sort by ZIP code, phonetic family name, phonetic street,
         phonetic first name and c/o name.
File_4:  Sort by birthday, gender, first name, ZIP code and phonetic street.
File_5:  Sort by customer number.
File_6:  Sort by phone number.
File_7:  Sort by IBAN number and bank account number.

Sorting the unload file means that duplicates can be found by sequentially
reading the sorted unload file.
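
When reading the sorted unload file line by line, each line has to be 
split at the '|' delimiters before neighbouring entries can be compared. 
The following sketch shows one way to do this; the function 
"sketch_split_unload_line" is hypothetical and not taken from "dedupl.c". 
It keeps empty fields, which "strtok" would silently skip.

```c
#include <string.h>

/* Split "line" in place at every '|' and store pointers to the fields
   in "fields". Empty fields are kept as empty strings.
   Since every field (including the last one) is followed by '|',
   text after the last delimiter (e.g. a newline) is ignored.
   Returns the number of fields stored (at most "max_fields"). */
int sketch_split_unload_line (char *line, char *fields[], int max_fields)
{
  int n = 0;
  char *start = line;
  char *bar;

  while (n < max_fields && (bar = strchr (start, '|')) != NULL)
    {
      *bar = '\0';          /* terminate the current field */
      fields[n++] = start;
      start = bar + 1;      /* next field starts after the delimiter */
    }
  return n;
}
```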

3.
After sorting, the actual search for duplicates is done by calling the
program "dedupl.c" with option "-search_duplicates":

   dedupl -search_duplicates <sorted_unload_file> <dest_file>  [ <min_points> ]

<min_points> is the default value for the minimum number of points needed
in the address comparison. Recommended values range from 84 to 93 points.
The default value is 86 points. 
The results of the search are stored in the file <dest_file>.

Note:
For an "initial" deduping, the first one or two unloads (i.e. <file_1>
[ and <file_2> ]) are sufficient, because most of the "obvious" duplicates
will be found.
File number 6 is also strongly recommended.

4.
The last step is cleaning up the duplicates. 
This, of course, must still be done manually.

Note:
Due to the handling of "double names", the result file may contain more 
than one line for certain database entries.


========================================================================


Return values of "dbselect.c" and "dedupl.c"


Return values < 0 indicate an error. The following macros are defined
in the file "addr_ext.h":

ADDR_INSUFFICIENT_SEL_CRITERIA    Insufficient select criteria given
ADDR_TOO_MANY_MATCHES_FOUND       Too many matches found
                                  (narrow down your search criteria)
ADDR_CANNOT_CREATE_FILE           Cannot create file
ADDR_CANNOT_READ_FILE             Cannot read file
ADDR_INTERNAL_ERROR               Internal error (this should NOT happen)

ADDR_SQL_DECLARE_ERROR            SQL error in declare cursor statement
ADDR_SQL_FETCH_ERROR              SQL error in fetch cursor statement
ADDR_SQL_CLOSE_ERROR              SQL error in close cursor statement


========================================================================


Standalone test


By default, the programs "dbselect.c" and "dedupl.c" do a standalone 
test. That is, all database calls are simulated. The program "demo.c" 
is also intended to "give you a feeling" of the address comparison and 
the Levenshtein function.

For this purpose, the file "samples.txt", which may be edited, and the
program "dummy_db.h" are used to simulate a database table.

Note:
If you want to switch to production mode, deactivate the macro
"FILE_FOR_STANDALONE_TEST" in the programs "dbselect.c" and "dedupl.c".


========================================================================


Doing database selects from other programming languages


You can do error-tolerant database selects using other programming languages
(e.g. Cobol) as follows:

1.
Use the functions "create_search_strings" and "check_search_field" from
the file "dbselect.c" to create search strings for a database select.
You might find it necessary to emulate the first half of the function
"database_select" from the file "dbselect.c".

2.
Use the search strings for a "primary" database select.

3.
Use the function "compare_addr" to do the final filtering.

4.
Store and process the results.


========================================================================


Using check sums for customer numbers


There are two excellent c't articles (in German) telling you "how to do it":

1. J. Michael, Mit Sicherheit, Prüfziffernverfahren auf Modulo-Basis, 
   c't 7/1996, pp.264-268

2. J. Michael, Blütenrein, Prüfziffernverfahren auf der Basis von Diedergruppen, 
   c't 4/1997, pp.448-452

Since the programs originally published with the above articles are currently 
unavailable, I have added them to this project:

1. The program "s_mod11.c" (in German) contains a check sum algorithm based 
   on mod 11.
2. The program "s_dieder.c" (in German) contains a check sum algorithm based 
   on dieder groups.

Both articles and the corresponding programs are based on the dissertation
by J. Verhoeff:  Error Detecting Decimal Codes, Mathematical Centre Tracts,
Vol. 29 (Mathematisch Centrum, Amsterdam, 1969).
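
To give an idea of how a mod 11 check works, here is a minimal sketch of 
a generic weighted mod 11 scheme (weights 2, 3, 4, ... counted from the 
right, as used e.g. for ISBN-10). It is not the algorithm from 
"s_mod11.c", which may use different weights or rules; the function name 
"sketch_mod11_check_value" is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the mod 11 check value (0..10) for a string of digits,
   or -1 if the string contains a non-digit character.
   The rightmost digit gets weight 2, the next one weight 3, etc.
   A result of 10 cannot be written as one decimal digit; schemes
   either use a letter (such as 'X') or do not issue such numbers. */
int sketch_mod11_check_value (const char *digits)
{
  size_t i;
  long sum = 0;
  int weight = 2;

  for (i = strlen (digits); i > 0; i--)
    {
      unsigned char ch = (unsigned char) digits[i - 1];

      if (! isdigit (ch))
        return -1;

      sum += (long) (ch - '0') * weight;
      weight++;
    }
  return (int) ((11 - sum % 11) % 11);
}
```

For the digits "030640615" the weighted sum is 130, 130 mod 11 is 9, 
and the check value is therefore 2.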


========================================================================


Frequently Asked Questions


a)
Is it possible to use this program in a commercial software project?

This is exactly what the Library GPL has been designed for.
See the file "LIB_GPLA.TXT" for details.

Probably the "safest" way to comply with the rules of the LGPL is
to put the program in a separate library (e.g. Windows-DLL).
In this way, only this library is subject to the LGPL and the "rest"
of the project still has the "old" rights of its owner, so you don't
have to give away your source code.


b)
Is a Java version of this program available?

The company which has written a Java version of my program "phonet"
has also promised to develop a native Java version of "addr" under
the LGPL - see:  https://opensource.softmethod.de/trac/opensource
The project name will be "addr4j".

Alternatively, you can also write a wrapper class in Java which uses 
the Java Native Interface to call a C library.
There is an excellent article (in German) telling you how to do it:
"Kaffee mit Vitamin C", c't, issue 20/2000, pp.242-247
or:  www.heise.de/ct, soft-link 0020242. 


c)
What is the speed of this program?

The speed should be good enough for most uses of this program.

Experience indicates that the answering time in database queries is
mainly due to the database select. The dominating factor in searches
for duplicates is the time needed for the various sort commands.


========================================================================


WWW resources for this program


http://www.heise.de/ct/ftp/07/20/214
or  http://www.heise.de/ct, soft-link 0720214



Dictionary of first names (program "gender"):
http://www.heise.de/ct/ftp/07/17/182
or  http://www.heise.de/ct, soft-link 0717182  (please use version 1.2 or higher).


Phonetic conversion for German ("Hannoveraner Phonetik"):
   J. Michael: Doppelgänger gesucht, Ein Programm für kontextsensitive 
   phonetische Textumwandlung, c't, issue 25/1999, pp.252-261
   (program and article available from http://www.heise.de/ct/ftp/99/25/252).
(Java version of "phonet":
 go to https://opensource.softmethod.de/trac/opensource
    and click on "phonet4j")


========================================================================


References


a)
Levenshtein function:

Vladimir I. Levenshtein: Binary Codes Capable of Correcting Deletions, 
Insertions and Reversals, Soviet Physics Doklady, vol. 10, pp. 707-709 
(1965).

G. Ebner: Wort-Arithmetik, Phonetische Ähnlichkeiten mit der 
Levenshtein-Distanz errechnet, c't, issue 07/1989, pp. 192-208.

J. Michael: Joker im Spiel, Erweiterung der Levenshtein-Funktion 
auf Wildcards, c't, issue 03/1994, pp. 230-239.



b)
Phonetic conversion:

G. Wilde, C. Meyer: Nicht wörtlich genommen, "Schreibweisentolerante" 
Suchroutinen in dBase, c't, issue 10/1988, pp. 126-131 
[article on "soundex" and a soundex version called "phonem"].

J. Michael: Doppelgänger gesucht, Ein Programm für kontextsensitive 
phonetische Textumwandlung, c't, issue 25/1999, pp. 252-261 
[phonetic conversion for German; also called "Hannoveraner Phonetik"].



c)
Error-tolerant database selects (for mail addresses):

J. Michael: Von Hinz und Kuntz, Ein Programmpaket zur fehlertoleranten 
Anschriftensuche, c't, issue 20/2007, pp. 214-219.



d)
Error-tolerant search routines:

U. Manber, S. Wu: Approximate Pattern Matching: Agrep finds patterns 
even when you can't remember the exact spelling, Byte, issue 11/1992, 
pp. 281-292.

G. Gronek: Ähnlichkeiten gesucht, Fehlertoleranter Suchalgorithmus 
"Shift-AND", c't, issue 05/1995, pp. 294-301 [article on "agrep"].

R. Rapp: Text-Detektor, Fehlertolerantes Retrieval ganz einfach, 
c't, issue 04/1997, pp. 386-392 [article on trigrams].



e)
Check digits:

J. Verhoeff: Error Detecting Decimal Codes, Mathematical Centre Tracts, 
vol. 29 (Mathematisch Centrum, Amsterdam, 1969).

R.-H. Schulz: Codierungstheorie. Eine Einführung, Vieweg Verlag, 
Braunschweig, Wiesbaden, 1991 (see chapter 8: Prüfzeichenverfahren).

J. Michael: Mit Sicherheit, Prüfziffernverfahren auf Modulo-Basis, 
c't, issue 07/1996, pp. 264-268.

J. Michael: Blütenrein, Prüfziffernverfahren auf der Basis von 
Diedergruppen, c't, issue 04/1997, pp. 448-452.



f)
Dictionary of first names (with 40000+ entries):

J. Michael: 40000 Namen, Anredebestimmung anhand des Vornamens, 
c't, issue 17/2007, pp. 182-183 [current dictionary size: 44000+ entries].


========================================================================


Checking and correcting German postal addresses


(in German:  Anschriftenprüfung und -korrektur)


The author of "phonet" and "addr" has developed a C program for checking
and correcting German postal addresses.
Here, an address is the combination of street, postal code and city name.

As you would surely expect, both the quality and the speed of the
program are very good.

In a comparison with a commercially available standard application,
this program was not only faster, but also clearly superior in
error correction.

It is planned to use the program commercially.
If you are interested in the program, please contact
"astro.joerg@googlemail.com".


Quality of the error correction:

A special feature of this program is that it can correct not only
single errors (which is comparatively easy), but also double errors
and, if the similarity is high enough, even triple errors
(i.e. street, postal code and city are all wrong).

Examples
(for anonymization, all house numbers have been set to "1"):

Street, postal code and city wrong, but correctable:
a) Feuersicht 1, 31295 Stobrenau

Other kinds of errors (a correction is suggested):
b) Neuhof 1, 35792 Löhnberg-Niedershausen
c) Maxstr. 1, 83274 Traunstein
d) Klerstr. 1, 38120 Braunschweig
e) Dorfstr. 1, 07806 Neunhofen


========================================================================


History of the program


2007-05-23:  The first version of this program was submitted for
             publication in a German computer magazine (c't).
2007-09-17:  Version 1.0 was published in c't.

2007-11-15:  Version 1.1:
             1. The following fields have been added in "dedupl.c",
                "dummy_db.h", and "samples.txt":
                phone_number, mobile_number, iban_code, bank_account,
                cust_number
             2. "dedupl.c": unload files no. 6 and 7 (for phone number
                and bank account) have been added.

2008-11-30:  Version 1.2:
             1. The Levenshtein function ("lev_diff") has been extended
                to trigrams and "true" rule-based phonetics (for German).
                This leads to (e.g.) better detection of duplicates.

             2. In acknowledgement of these significant improvements,
                the module "lev100.c" has been renamed "lev100ph.c".

             3. The optional add-on programs "gender" and "phonet"
                are now activated by default.
             4. Documentation has been improved.
                Readme-file: chapter "References" has been added.


========================================================================
(End of file "readme.txt")
