OME soundex_dm()
From Ingres Community Wiki
Introduction
The standard Ingres soundex function is a typical Knuth coded version of the Russell soundex (circa 1918). The Daitch-Mokotoff soundex (circa 1980) is an upgraded soundex more suitable to larger databases.
No special libraries are needed to support this function, so the code should be portable to non-Linux systems.
This implementation of the Daitch-Mokotoff soundex was written by Martin Bowes.
Syntax
(varchar )rdv = soundex_dm((varchar )string)
A simple Daitch-Mokotoff soundex is a six character code. It is composed entirely of digits and it may have a leading zero.
Note that unlike the standard soundex function, the Daitch-Mokotoff soundex allows for hard/soft sounds. For example, is the 'ch' soft as in 'cheese' or hard as in 'christmas'? As a consequence, an input string may have multiple possible return codes. To handle this, the return string is composed of all the possibilities separated by commas. The list is not sorted, but all elements are unique.
| Word | soundex(Word) | soundex_dm(Word) |
| Moskowitz | M232 | 645740 |
| Peterson | P362 | 739460,734600 |
| Jackson | J250 | 154600,454600,145460,445460 |
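The way a single ambiguous sound doubles the number of codes can be sketched in a few lines of C. This is only an illustration and not the implementation given later on this page: the `dm_part` struct and the `dm_expand()` helper are hypothetical names, and the coded parts are supplied by hand rather than derived from a name.

```c
#include <string.h>

#define DM_LEN 6   /* digits in one Daitch-Mokotoff code */

/* One coded sound. right == NULL means there is no alternative sound. */
struct dm_part {
    const char *left;
    const char *right;
};

/* Hypothetical helper: write every route through parts[] into out[][].
** Each route is a DM_LEN digit code, zero padded and NUL terminated.
** Returns the number of routes written (at most max_out).
*/
static int dm_expand(const struct dm_part *parts, int n_parts,
                     char out[][DM_LEN + 1], int max_out)
{
    int n_choices = 0, routes, r, p;

    for (p = 0; p < n_parts; p++)
        if (parts[p].right != NULL) n_choices++;
    if (n_choices > 15) n_choices = 15;    /* keep the shifts below in range */

    routes = 1 << n_choices;
    if (routes > max_out) routes = max_out;

    for (r = 0; r < routes; r++)
    {
        int j = 0, bit = 0;

        memset(out[r], '0', DM_LEN);
        out[r][DM_LEN] = '\0';
        for (p = 0; p < n_parts && j < DM_LEN; p++)
        {
            const char *s = parts[p].left;

            /* Bit number 'bit' of r selects left (0) or right (1) */
            if (parts[p].right != NULL)
            {
                if (bit < 15 && (r & (1 << bit))) s = parts[p].right;
                bit++;
            }
            while (*s != '\0' && j < DM_LEN) out[r][j++] = *s++;
        }
    }
    return routes;
}
```

If the parts for 'Peterson' are entered as 7, 3, then either 94 or 4 for the 'rs', then 6, the sketch produces 739460 and 734600, the two codes shown in the table.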
Return Value
The return is a varchar(111).
As mentioned above, the implementation returns all possible codes in a comma-separated list.
The implementation given here can handle as many as 16 possibilities in the return before it would generate an error. The length can be adjusted by altering the macro: AD_SOUNDEX_DM_MAX_OUTPUT
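The figure of 111 follows from the layout of the list: the first code takes 6 characters, and each further code takes 7 (a comma plus 6 digits), so 16 codes need 6 + 15 * 7 = 111 characters. The sketch below repeats that arithmetic; `dm_list_len()` and `DM_CODE_LEN` are hypothetical names used only for this illustration.

```c
#include <stddef.h>

#define DM_CODE_LEN 6   /* digits per code, as AD_LEN_SOUNDEX_DM */

/* Characters needed for n_codes codes joined by single commas */
static size_t dm_list_len(size_t n_codes)
{
    if (n_codes == 0) return 0;
    return DM_CODE_LEN + (n_codes - 1) * (DM_CODE_LEN + 1);
}
```

Raising AD_SOUNDEX_DM_MAX_OUTPUT therefore lengthens the result by 7 characters per additional code. The FID shown later adds sizeof(short) on top of this for the varchar length prefix.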
Example
It's probably no exaggeration to say that almost everyone in the Western world knows the name 'Arnold Schwarzenegger'. But how many of them could spell it?
Consider the following table:
| Name | soundex(Name) | soundex_dm(Name) |
| Schwarzenegger | S625 | 479465,474659 |
| Shwarzenegger | S625 | 479465,474659 |
| Schwartsenegger | S632 | 479465 |
The American Soundex code for Schwarzenegger is S625.
If the name was misspelled as Shwarzenegger, the code would still be S625, so any search application based on American Soundex would still find the match in spite of that misspelling.
However if the name was misspelled as Schwartsenegger, the American Soundex code would be S632, so a search application based on American Soundex would not find the match with that misspelling.
Under DM Soundex, the correct spelling, Schwarzenegger, has two codes, namely 474659 and 479465. The incorrect spelling, Shwarzenegger, has the same two codes, and the incorrect spelling, Schwartsenegger, has the DM code of 479465, which is one of the two codes for the correct spelling. So a search application based on DM Soundex would find the match with either of these misspellings.
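A search application has to treat two names as a match when their code lists share at least one code. One way this comparison might be written in C is sketched below. `dm_lists_match()` is a hypothetical helper, not part of Ingres, and it assumes both arguments are well-formed lists of 6 digit codes separated by single commas.

```c
#include <string.h>

#define DM_CODE_LEN 6   /* digits per code */

/* Return 1 if the two comma separated code lists share at least one code,
** otherwise 0.
*/
static int dm_lists_match(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t i, j;

    /* Step through each list one code (plus its comma) at a time */
    for (i = 0; i + DM_CODE_LEN <= la; i += DM_CODE_LEN + 1)
        for (j = 0; j + DM_CODE_LEN <= lb; j += DM_CODE_LEN + 1)
            if (strncmp(a + i, b + j, DM_CODE_LEN) == 0) return 1;
    return 0;
}
```

With the values from the table, the list for Schwarzenegger (479465,474659) matches the single code for Schwartsenegger (479465), because 479465 appears in both.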
FOD
Add the following definition to the fod_id enum set: UDF_SOUNDEX_DM
Then add the following to the Function_Definitions array:
static IIADD_FO_DFN Function_Definitions[]={
...
{
II_O_OPERATION, /*fod_object_type*/
{"soundex_dm"}, /*fod_name*/
UDF_SOUNDEX_DM, /*fod_id*/
II_NORMAL /*fod_type*/
},
...
}; /*Function_Definitions*/
FIDs
Add the following definitions to the fid_id enum set:
UDF_FI_SOUNDEX_DM_1,
The FIDs rely on the following definition of parameter types.
static II_DT_ID UD_2_VC[] = {II_VARCHAR, II_VARCHAR};
The FIDs are:
static IIADD_FI_DFN Function_Instances[] = {
...
{/* soundex_dm(varchar) */
II_O_FUNCTION_INSTANCE, /* fid_object_type */
UDF_FI_SOUNDEX_DM_1, /* fid_id*/
II_NO_FI, /* fid_cmplmnt*/
UDF_SOUNDEX_DM, /* fid_opid=fod_id from function definition
** This is the minor sort field for this array
*/
II_NORMAL, /* fid_optype
** This is the major sort field for this array
*/
II_FID_F0_NOFLAGS, /* fid_attributes*/
0, /* fid_wslength*/
1, /* fid_numargs*/
UD_2_VC, /* fid_args, a pointer to an array of datatypes*/
II_VARCHAR, /* fid_result, result is a varchar */
II_RES_FIXED, /* fid_rltype*/
AD_SOUNDEX_DM_OUTPUT_LEN + sizeof(short), /* fid_rlength */
0, /* fid_rprec */
soundex_dm, /* fid_routine */
0 /* lenspec_routine */
}, /* soundex_dm(varchar) */
...
}; /*Function_Instances*/
Executor Code
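/* Standard headers for isblank(), isalpha(), toupper(), sprintf(),
** memcpy() and strncmp(). Harmless if the OME source file already
** includes them.
*/
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Stuff used in soundex_dm() function */
#define AD_SOUNDEX_DM_INT_BUFFER 64
#define AD_SOUNDEX_DM_PAD_BUFFER 20
#define AD_LEN_SOUNDEX_DM 6
#define AD_SOUNDEX_DM_MAX_OUTPUT 16
#define AD_SOUNDEX_DM_MAX_ENCODE 20
/* Parenthesised so the macro expands safely inside larger expressions */
#define AD_SOUNDEX_DM_OUTPUT_LEN \
    (AD_LEN_SOUNDEX_DM + ((AD_SOUNDEX_DM_MAX_OUTPUT - 1) * (AD_LEN_SOUNDEX_DM + 1)))

/* Support function, defined after soundex_dm() */
int soundex_dm_vowelage(char *buffer, int b_ptr, int b_len, int skip);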
II_STATUS
soundex_dm (
II_SCB *scb,
II_DATA_VALUE *string, /* Generate the Daitch-Mokotoff soundex for this
** string
*/
II_DATA_VALUE *rdv
/* varchar is returned, length is fixed
** by AD_SOUNDEX_DM_OUTPUT_LEN. This is set to
** a size sufficient to store 16 (ie. AD_SOUNDEX_DM_MAX_OUTPUT)
** comma separated 6 (ie. AD_LEN_SOUNDEX_DM) character strings.
*/
)
{
/* Sundry initialisation */
int hyphen=0;
int start_word=0;
int oi, ei=0, encoded=0;
int before_a_vowel=0;
int total_choices;
int choice, total_routes, unique;
int i, j, true_length;
char a_char;
/* Used for error processing */
char msg[256];
/* The internal buffer to hold the uppercase'd and stripped input string.
** This is sized to more than easily allow for enough characters to
** generate the 6 character D-M soundex value
*/
char buffer[AD_SOUNDEX_DM_INT_BUFFER + AD_SOUNDEX_DM_PAD_BUFFER];
/* destination initialization field */
char zeroes[] = "0000000";
/* These are fixed codes used in the D-M soundex algorithm */
char FourtyThree[] = "43";
char FourtyFive[] = "45";
char FiftyFour[] = "54";
char SixtySix[] = "66";
char NinetyFour[] = "94";
/* Define the structures for the encoding array and the buffer of possible
* output values
*/
struct _dm_element_bit
{
short length;
char string[2];
};
struct _dm_element
{
enum _dm_element_source
{character, phrase} source; /* On occasion we need to know the source of
** the element. This is for characters with
** two possible sounds, where we must allow
** for the chance of two successive
** characters with the same sound number.
** See following note on Rule 4.
*/
unsigned short choice_mask; /* If non zero, there are two choices.
** The mask is a bit pattern, essentially
** indicating the choice number as a power
** of two, which allows the following task
** of printing all choices to navigate the
** encoded array.
*/
struct _dm_element_bit left;
struct _dm_element_bit right;
};
struct _dm_prior_element /* This is used in some specialised handling */
{
enum _dm_element_source source;
short length;
char string[2];
} prior;
struct _dm_element encode[AD_SOUNDEX_DM_MAX_ENCODE];
/* I'm not going to initialise all elements of the encode[] array. This will
** save time but we have to be careful when processing the array later!
*/
char output[AD_SOUNDEX_DM_MAX_OUTPUT][AD_LEN_SOUNDEX_DM];
/* Pad the internal buffer with spaces */
for (i=0; i< AD_SOUNDEX_DM_INT_BUFFER + AD_SOUNDEX_DM_PAD_BUFFER; i++)
buffer[i]=' ';
/* Preprocess: Fill internal buffer[]
** Point to first alpha char in input. None? Return an error!
** Convert char to upper until first non-alpha. If non-alpha is one or more
** blanks then skip these and continue. Ignore first occurrence of hyphen
** apostrophe or period characters.
*/
true_length=*(short *)string->db_data;
for (i=0,j=0; i< true_length && j< AD_SOUNDEX_DM_INT_BUFFER; i++)
{
/*ignore spaces*/
if (isblank(*(char *)(string->db_data + sizeof(short) + i))) continue;
/* Ignore first hyphen, apostrophe or period only */
if ((*(char *)(string->db_data + sizeof(short) + i) == '-'
|| *(char *)(string->db_data + sizeof(short) + i) == '\''
|| *(char *)(string->db_data + sizeof(short) + i) == '.')
&& hyphen==0)
{
hyphen=1;
continue;
};
/* Process Alpha characters */
if (isalpha(*(char *)(string->db_data + sizeof(short) + i)))
{
/* Convert to uppercase and store in buffer */
a_char=(char )toupper((int )(*(char *)(string->db_data + sizeof(short) + i)));
memcpy((char *)(buffer + j), (II_PTR) &a_char, 1);
start_word=1; j++;
continue;
};
break; /* Anything else breaks the pre-process loop */
}; /* For */
if (!start_word)
{
*(short *)(rdv->db_data)=6;
rdv->db_length=6+sizeof(short);
memcpy((II_PTR)(rdv->db_data + sizeof(short)), (II_PTR) zeroes, AD_LEN_SOUNDEX_DM);
return (II_OK); /* Nothing to do, return "000000" */
};
buffer[j]='\0'; /* Terminate the buffer */
/* Now process the data stored in the buffer[].
* This for loop loads the encode array with the list of possibilities.
*/
/* Sundry Initialisation */
start_word=1; total_choices=0;
for (i=0, ei=0;
i<j && ei<AD_SOUNDEX_DM_MAX_ENCODE && encoded<=AD_LEN_SOUNDEX_DM + 1;
ei++, start_word=0
)
{
/* The most likely settings, which will be overridden when required */
encode[ei].choice_mask=0;
encode[ei].source=phrase;
encode[ei].left.length=1; /* Most likely one char long */
encode[ei].right.length=1;
encode[ei].right.string[0]='-'; /*'-' means 'Not Coded' */
/* The most likely 'before a vowel' scenario, which will be retested if
** required
*/
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 1);
The 'A' cases
if (!strncmp((char *)(buffer +i), "AI", 2)
|| !strncmp((char *)(buffer +i), "AJ", 2)
|| !strncmp((char *)(buffer +i), "AY", 2))
{
if (start_word) {
encode[ei].left.string[0]='0';
encoded ++;
}
else {
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 2);
if (before_a_vowel) {
encode[ei].left.string[0]='1';
encoded ++;
}
else
{
encode[ei].left.string[0]='-';
};
};
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "AU", 2))
{
if (start_word) {
encode[ei].left.string[0]='0';
encoded ++;
}
else
{
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 2);
if (before_a_vowel) {
encode[ei].left.string[0]='7';
encoded ++;
}
else
{
encode[ei].left.string[0]='-';
};
};
i+=2;
continue;
};
if (buffer[i]=='A')
{
encode[ei].source=character;
if (start_word) {
encode[ei].left.string[0]='0';
encoded ++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'B', 'V', 'W' cases
if (buffer[i]=='B' || buffer[i]=='V' || buffer[i]=='W')
{
encode[ei].source=character;
encode[ei].left.string[0]='7';
encoded++;
i++;
continue;
};
The 'C' cases
if (!strncmp((char *)(buffer +i), "CHS", 3))
{
if (start_word) {
encode[ei].left.string[0]='5';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FiftyFour, 2);
};
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "CSZ", 3)
|| !strncmp((char *)(buffer +i), "CZS", 3))
{
encode[ei].left.string[0]='4';
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "CH", 2)) /* Try KH(5) and TCH(4) */
{
encode[ei].choice_mask=1<<total_choices; total_choices++;
encode[ei].left.string[0]='5'; /* As KH */
encode[ei].right.string[0]='4'; /* As TCH */
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "CK", 2)) /* Try K(5) and TSK(45) */
{
encode[ei].choice_mask=1<<total_choices; total_choices++;
encode[ei].left.string[0]='5'; /* As K */
encode[ei].right.length=2;
memcpy((II_PTR )(encode[ei].right.string), (II_PTR )FourtyFive, 2);
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "CS", 2)
|| !strncmp((char *)(buffer +i), "CZ", 2))
{
encode[ei].left.string[0]='4';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='C') /* Try K(5) and TZ(4) */
{
encode[ei].source=character;
encode[ei].choice_mask=1<<total_choices; total_choices++;
encode[ei].left.string[0]='5'; /* As K */
encode[ei].right.string[0]='4'; /* As TZ */
encoded++;
i++;
continue;
};
The 'D' cases
if (!strncmp((char *)(buffer +i), "DRZ", 3)
|| !strncmp((char *)(buffer +i), "DRS", 3)
|| !strncmp((char *)(buffer +i), "DSH", 3)
|| !strncmp((char *)(buffer +i), "DSZ", 3)
|| !strncmp((char *)(buffer +i), "DZH", 3)
|| !strncmp((char *)(buffer +i), "DZS", 3))
{
encode[ei].left.string[0]='4';
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "DS", 2)
|| !strncmp((char *)(buffer +i), "DZ", 2))
{
encode[ei].left.string[0]='4';
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "DT", 2))
{
encode[ei].left.string[0]='3';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='D')
{
encode[ei].source=character;
encode[ei].left.string[0]='3';
encoded++;
i++;
continue;
};
The 'E' cases
if (!strncmp((char *)(buffer +i), "EI", 2)
|| !strncmp((char *)(buffer +i), "EJ", 2)
|| !strncmp((char *)(buffer +i), "EY", 2))
{
if (start_word) {
encode[ei].left.string[0]='0';
encoded ++;
}
else {
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 2);
if (before_a_vowel) {
encode[ei].left.string[0]='1';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
};
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "EU", 2))
{
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 2);
if (start_word || before_a_vowel) {
encode[ei].left.string[0]='1';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i+=2;
continue;
};
if (buffer[i]=='E')
{
encode[ei].source=character;
if (start_word) {
encode[ei].left.string[0]='0';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'F' cases
if (!strncmp((char *)(buffer +i), "FB", 2))
{
encode[ei].left.string[0]='7';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='F')
{
encode[ei].source=character;
encode[ei].left.string[0]='7';
encoded++;
i++;
continue;
};
The 'G' and 'Q' cases
if (buffer[i]=='G' || buffer[i]=='Q')
{
encode[ei].source=character;
encode[ei].left.string[0]='5';
encoded++;
i++;
continue;
};
The 'H' cases
if (buffer[i]=='H')
{
encode[ei].source=character;
if (start_word || before_a_vowel)
{
encode[ei].left.string[0]='5';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'I' cases
if (!strncmp((char *)(buffer +i), "IA", 2)
|| !strncmp((char *)(buffer +i), "IE", 2)
|| !strncmp((char *)(buffer +i), "IO", 2)
|| !strncmp((char *)(buffer +i), "IU", 2))
{
if (start_word) {
encode[ei].left.string[0]='1';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i+=2;
continue;
};
if (buffer[i]=='I')
{
encode[ei].source=character;
if (start_word) {
encode[ei].left.string[0]='0';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'J' cases
if (buffer[i]=='J') /* Try Y(1) and DZH(4) */
{
encode[ei].source=character;
encode[ei].choice_mask=1<<total_choices; total_choices++;
encode[ei].right.string[0]='4'; /* As DZH(4) */
if (start_word)
{
encode[ei].left.string[0]='1'; /* As Y(1) */
encoded++;
}
else
{
encode[ei].left.string[0]='-'; /* ie. Not Coded */
};
i++;
continue;
};
The 'K' cases
if (!strncmp((char *)(buffer +i), "KS", 2))
{
if (start_word) {
encode[ei].left.string[0]='5';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FiftyFour, 2);
};
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "KH", 2))
{
encode[ei].left.string[0]='5';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='K')
{
encode[ei].source=character;
encode[ei].left.string[0]='5';
encoded++;
i++;
continue;
};
The 'L' cases
if (buffer[i]=='L')
{
encode[ei].source=character;
encode[ei].left.string[0]='8';
encoded++;
i++;
continue;
};
The 'M' cases
if (!strncmp((char *)(buffer +i), "MN", 2))
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )SixtySix, 2);
encoded++;
i+=2;
continue;
};
if (buffer[i]=='M')
{
encode[ei].source=character;
encode[ei].left.string[0]='6';
encoded++;
i++;
continue;
};
The 'N' cases
if (!strncmp((char *)(buffer +i), "NM", 2))
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )SixtySix, 2);
encoded++;
i+=2;
continue;
};
if (buffer[i]=='N')
{
encode[ei].source=character;
encode[ei].left.string[0]='6';
encoded++;
i++;
continue;
};
The 'O' cases
if (!strncmp((char *)(buffer +i), "OI", 2)
|| !strncmp((char *)(buffer +i), "OJ", 2)
|| !strncmp((char *)(buffer +i), "OY", 2))
{
if (start_word) {
encode[ei].left.string[0]='0';
encoded++;
}
else {
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 2);
if (before_a_vowel) {
encode[ei].left.string[0]='1';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
};
i+=2;
continue;
};
if (buffer[i]=='O')
{
encode[ei].source=character;
if (start_word) {
encode[ei].left.string[0]='0';
encoded ++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'P' cases
if (!strncmp((char *)(buffer +i), "PF", 2)
|| !strncmp((char *)(buffer +i), "PH", 2))
{
encode[ei].left.string[0]='7';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='P')
{
encode[ei].source=character;
encode[ei].left.string[0]='7';
encoded++;
i++;
continue;
};
The 'R' cases
if (!strncmp((char *)(buffer +i), "RTZ", 3))
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )NinetyFour, 2);
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "RS", 2) /* Try RTZ(94) and ZH(4) */
|| !strncmp((char *)(buffer +i), "RZ", 2))
{
encode[ei].choice_mask=1<<total_choices; total_choices++;
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )NinetyFour, 2); /* Try RTZ(94) */
encode[ei].right.string[0]='4'; /* Try ZH(4) */
encoded++;
i+=2;
continue;
};
if (buffer[i]=='R')
{
encode[ei].source=character;
encode[ei].left.string[0]='9';
encoded++;
i++;
continue;
};
The 'S' cases
if (!strncmp((char *)(buffer +i), "SCHTSCH", 7))
{
encode[ei].source=phrase;
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=7;
continue;
};
if (!strncmp((char *)(buffer +i), "SCHTSH", 6)
|| !strncmp((char *)(buffer +i), "SCHTCH", 6))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=6;
continue;
};
if (!strncmp((char *)(buffer +i), "SHTCH", 5)
|| !strncmp((char *)(buffer +i), "SHTSH", 5)
|| !strncmp((char *)(buffer +i), "STSCH", 5))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=5;
continue;
};
if (!strncmp((char *)(buffer +i), "SHCH", 4)
|| !strncmp((char *)(buffer +i), "STRZ", 4)
|| !strncmp((char *)(buffer +i), "STRS", 4)
|| !strncmp((char *)(buffer +i), "STSH", 4))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=4;
continue;
};
if (!strncmp((char *)(buffer +i), "SCHT", 4)
|| !strncmp((char *)(buffer +i), "SCHD", 4))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
};
encoded++;
i+=4;
continue;
};
if (!strncmp((char *)(buffer +i), "STCH", 4)
|| !strncmp((char *)(buffer +i), "SZCZ", 4)
|| !strncmp((char *)(buffer +i), "SZCS", 4))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=4;
continue;
};
if (!strncmp((char *)(buffer +i), "SCH", 3))
{
encode[ei].left.string[0]='4';
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "SHT", 3)
|| !strncmp((char *)(buffer +i), "SZT", 3)
|| !strncmp((char *)(buffer +i), "SHD", 3)
|| !strncmp((char *)(buffer +i), "SZD", 3))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
};
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "SH", 2)
|| !strncmp((char *)(buffer +i), "SZ", 2))
{
encode[ei].left.string[0]='4';
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "SC", 2))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "ST", 2)
|| !strncmp((char *)(buffer +i), "SD", 2))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
};
encoded++;
i+=2;
continue;
};
if (buffer[i]=='S')
{
encode[ei].source=character;
encode[ei].left.string[0]='4';
encoded++;
i++;
continue;
};
The 'T' cases
if (!strncmp((char *)(buffer +i), "TTSCH", 5))
{
encode[ei].left.string[0]='4';
encoded++;
i+=5;
continue;
};
if (!strncmp((char *)(buffer +i), "TTCH", 4)
|| !strncmp((char *)(buffer +i), "TSCH", 4)
|| !strncmp((char *)(buffer +i), "TTSZ", 4))
{
encode[ei].left.string[0]='4';
encoded++;
i+=4;
continue;
};
if (!strncmp((char *)(buffer +i), "TSK", 3))
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyFive, 2);
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "TCH", 3)
|| !strncmp((char *)(buffer +i), "TRZ", 3)
|| !strncmp((char *)(buffer +i), "TRS", 3)
|| !strncmp((char *)(buffer +i), "TSH", 3)
|| !strncmp((char *)(buffer +i), "TTS", 3)
|| !strncmp((char *)(buffer +i), "TTZ", 3)
|| !strncmp((char *)(buffer +i), "TSZ", 3)
|| !strncmp((char *)(buffer +i), "TZS", 3))
{
encode[ei].left.string[0]='4';
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "TH", 2))
{
encode[ei].left.string[0]='3';
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "TC", 2)
|| !strncmp((char *)(buffer +i), "TZ", 2)
|| !strncmp((char *)(buffer +i), "TS", 2))
{
encode[ei].left.string[0]='4';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='T')
{
encode[ei].source=character;
encode[ei].left.string[0]='3';
encoded++;
i++;
continue;
};
The 'U' cases
if (!strncmp((char *)(buffer +i), "UI", 2)
|| !strncmp((char *)(buffer +i), "UJ", 2)
|| !strncmp((char *)(buffer +i), "UY", 2))
{
if (start_word) {
encode[ei].left.string[0]='0';
encoded++;
}
else {
before_a_vowel=soundex_dm_vowelage(buffer, i, j, 2);
if (before_a_vowel) {
encode[ei].left.string[0]='1';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
};
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "UE", 2))
{
if (start_word) {
encode[ei].left.string[0]='0';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i+=2;
continue;
};
if (buffer[i]=='U')
{
encode[ei].source=character;
if (start_word) {
encode[ei].left.string[0]='0';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'X' cases
if (buffer[i]=='X')
{
encode[ei].source=character;
if (start_word)
{
encode[ei].left.string[0]='5';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FiftyFour, 2);
};
encoded++;
i++;
continue;
};
The 'Y' cases
if (buffer[i]=='Y')
{
encode[ei].source=character;
if (start_word) {
encode[ei].left.string[0]='1';
encoded++;
}
else
{
encode[ei].left.string[0]='-';
};
i++;
continue;
};
The 'Z' cases
if (!strncmp((char *)(buffer +i), "ZHDZH", 5))
{
if (start_word)
{
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=5;
continue;
};
if (!strncmp((char *)(buffer +i), "ZDZH", 4))
{
if (start_word)
{
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=4;
continue;
};
if (!strncmp((char *)(buffer +i), "ZSCH", 4))
{
encode[ei].left.string[0]='4';
encoded++;
i+=4;
continue;
};
if (!strncmp((char *)(buffer +i), "ZDZ", 3))
{
if (start_word)
{
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.string[0]='4';
};
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "ZHD", 3))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
};
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "ZSH", 3))
{
encode[ei].left.string[0]='4';
encoded++;
i+=3;
continue;
};
if (!strncmp((char *)(buffer +i), "ZD", 2))
{
if (start_word) {
encode[ei].left.string[0]='2';
}
else
{
encode[ei].left.length=2;
memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
};
encoded++;
i+=2;
continue;
};
if (!strncmp((char *)(buffer +i), "ZH", 2)
|| !strncmp((char *)(buffer +i), "ZS", 2))
{
encode[ei].left.string[0]='4';
encoded++;
i+=2;
continue;
};
if (buffer[i]=='Z')
{
encode[ei].source=character;
encode[ei].left.string[0]='4';
encoded++;
i++;
continue;
};
}; /* For each character in string */
Process the encode array into the output array
/* Process the encode array into the output array.
**
** Note that ei is the index of the last element in the encode array plus 1.
**
** At this point we do the final weed of adjacent characters with the same
** sound code.
*/
total_routes=(int )1<<total_choices;
for (oi=0; oi<total_routes && oi<AD_SOUNDEX_DM_MAX_OUTPUT; oi++)
{
memcpy((II_PTR) output[oi], (II_PTR) zeroes, AD_LEN_SOUNDEX_DM);
j=0; /* j is character position in output[oi] */
prior.source=phrase; /* init 'prior' case */
prior.length=1;
prior.string[0]='-';
for (i=0; i<ei && j<AD_LEN_SOUNDEX_DM; i++) /* i is index of encoded array; stop once output[oi] is full so we never write past its 6 characters */
{
if (encode[i].choice_mask == 0)
{
if (encode[i].left.string[0]!='-')
{
if (encode[i].source == character
&& prior.source == character
&& !strncmp(prior.string, encode[i].left.string,
encode[i].left.length)) {continue;};
/* Otherwise....*/
output[oi][j++]=encode[i].left.string[0];
if (encode[i].left.length > 1 && j<AD_LEN_SOUNDEX_DM)
output[oi][j++]=encode[i].left.string[1];
};
/* Save left as 'prior' case */
prior.source=encode[i].source;
prior.length=encode[i].left.length;
memcpy(prior.string, encode[i].left.string, prior.length);
}
else
{
choice=oi & encode[i].choice_mask;
/* Left if choice = 0, Right if choice = 1 */
if (!choice)
{
if (encode[i].left.string[0]!='-')
{
if (encode[i].source == character
&& prior.source == character
&& !strncmp(prior.string, encode[i].left.string,
encode[i].left.length)) {continue;};
/* Otherwise....*/
output[oi][j++]=encode[i].left.string[0];
if (encode[i].left.length > 1 && j<AD_LEN_SOUNDEX_DM)
output[oi][j++]=encode[i].left.string[1];
};
/* Save left as 'prior' case */
prior.source=encode[i].source;
prior.length=encode[i].left.length;
memcpy(prior.string, encode[i].left.string, prior.length);
}
else
{
if (encode[i].right.string[0]!='-')
{
if (encode[i].source == character
&& prior.source == character
&& !strncmp(prior.string, encode[i].right.string,
encode[i].right.length)) {continue;};
/* Otherwise....*/
output[oi][j++]=encode[i].right.string[0];
if (encode[i].right.length > 1 && j<AD_LEN_SOUNDEX_DM)
output[oi][j++]=encode[i].right.string[1];
};
/* Save right as 'prior' case */
prior.source=encode[i].source;
prior.length=encode[i].right.length;
memcpy(prior.string, encode[i].right.string, prior.length);
};
};
}; /* for each element in encoded array */
}; /* For each possible route through the array */
Build the rdv->db_data
/* Build the rdv->data from the output array, removing duplicates along
** the way.
*/
memcpy((II_PTR)(rdv->db_data + sizeof(short)), output[0],
AD_LEN_SOUNDEX_DM);
*(short *)(rdv->db_data)=AD_LEN_SOUNDEX_DM;
for (i=1; i<oi; i++)
{
unique=1;
for (j=0; j < i; j++) /* check for duplicates */
{
if (!strncmp(output[j], output[i], AD_LEN_SOUNDEX_DM))
{
unique=0;
break;
};
};
if (unique)
{
if (rdv->db_length >= (*(short *)rdv->db_data + AD_LEN_SOUNDEX_DM + 1)) {
*(char *)(rdv->db_data + sizeof(short) + *(short *)rdv->db_data)=',';
memcpy(
(II_PTR)(rdv->db_data + sizeof(short) + *(short *)rdv->db_data) + 1,
output[i], AD_LEN_SOUNDEX_DM);
*(short *)rdv->db_data+=AD_LEN_SOUNDEX_DM + 1;
}
else
{
sprintf(msg, "soundex_dm(): unexpected overflow in return string length.");
us_error(scb, 0x200000, msg);
return (II_ERROR);
};
};
};
return (II_OK);
}; /* soundex_dm() */
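The last step of the function, dropping duplicate codes and joining the rest with commas, can be tried outside the server with a small standalone sketch. It is written against a plain NUL-terminated C string rather than the varchar in rdv, and `dm_join_unique()` is a hypothetical name that does not appear in the OME source.

```c
#include <string.h>

#define DM_CODE_LEN 6   /* digits per code, as AD_LEN_SOUNDEX_DM */

/* Join the unique codes in codes[0..n-1] into dst as a comma separated,
** NUL terminated list. Returns the list length, or -1 if dst is too small.
*/
static int dm_join_unique(char codes[][DM_CODE_LEN], int n,
                          char *dst, size_t dst_size)
{
    size_t len = 0;
    int i, j, unique;

    for (i = 0; i < n; i++)
    {
        /* A code is a duplicate if any earlier code has the same digits */
        unique = 1;
        for (j = 0; j < i; j++)
        {
            if (memcmp(codes[j], codes[i], DM_CODE_LEN) == 0)
            {
                unique = 0;
                break;
            }
        }
        if (!unique) continue;

        /* Need room for an optional comma, the code and the terminator */
        if (len + (len ? 1 : 0) + DM_CODE_LEN + 1 > dst_size) return -1;
        if (len) dst[len++] = ',';
        memcpy(dst + len, codes[i], DM_CODE_LEN);
        len += DM_CODE_LEN;
    }
    if (len >= dst_size) return -1;
    dst[len] = '\0';
    return (int)len;
}
```

As in soundex_dm(), the order of first appearance is kept and the list is not sorted.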
Support function
/* soundex_dm_vowelage:
** Simply checks if the current code set is before a vowel.
** In this case a vowel is in the set: A, E, I, O, U, J and Y
*/
int
soundex_dm_vowelage (
char *buffer, /* The buffer of characters to check */
int b_ptr, /* The current position in the buffer */
int b_len, /* The length of the buffer */
int skip /* How far ahead to check for a vowel */
)
{
/* return (0) if we have exhausted the buffer */
if (b_ptr + skip >= b_len) {return ((int )0);};
/* return (1) if before a vowel */
if (buffer[b_ptr + skip]=='A' || buffer[b_ptr + skip]=='E'
|| buffer[b_ptr + skip]=='I' || buffer[b_ptr + skip]=='O'
|| buffer[b_ptr + skip]=='U' || buffer[b_ptr + skip]=='J'
|| buffer[b_ptr + skip]=='Y')
{return ((int )1);};
/* return (0) if NOT before a vowel */
return ((int )0);
}; /* soundex_dm_vowelage */
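In the calls made from soundex_dm(), the skip argument is the length of the letter group being coded (1 for a single letter, 2 for groups such as 'AI' or 'EU'), so the test looks at the first character after that group. The standalone sketch below shows the same check in a more compact form; `dm_before_vowel()` is a hypothetical name, and the use of strchr() is a choice made for this illustration, not something taken from the OME source.

```c
#include <string.h>

/* Return 1 if the character 'skip' places after position pos is one of
** A, E, I, O, U, J or Y, otherwise 0. len is the number of characters
** in buf.
*/
static int dm_before_vowel(const char *buf, int pos, int len, int skip)
{
    char c;

    if (pos + skip >= len) return 0;   /* ran off the end of the buffer */
    c = buf[pos + skip];
    /* strchr() would also match the terminating NUL, so exclude it */
    return c != '\0' && strchr("AEIOUJY", c) != NULL;
}
```

For the buffer "MAIER", the group 'AI' starts at position 1, and a skip of 2 lands on the 'E', so the group is before a vowel. For "BAUM", the group 'AU' is followed by 'M', so it is not.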

