OME soundex dm()

From Ingres Community Wiki

Introduction

The standard Ingres soundex function is a typical Knuth coded version of the Russell soundex (circa 1918). The Daitch-Mokotoff soundex (circa 1980) is an upgraded soundex more suitable to larger databases.

No special libraries are needed to support this function, so this code should be portable to non-Linux systems.

This implementation of the Daitch-Mokotoff soundex was written by Martin Bowes.

Syntax

(varchar )rdv = soundex_dm((varchar )string)

A simple Daitch-Mokotoff soundex is a six character code. It is composed entirely of digits and it may have a leading zero.

Note that, unlike the standard soundex function, the Daitch-Mokotoff soundex allows for hard and soft sounds. For example, is the 'ch' soft as in 'cheese' or hard as in 'christmas'? As a consequence, an input string may have multiple possible return codes. To handle this, the return string is composed of all the possibilities separated by commas. The list is not sorted, but all elements are unique.

Example soundex, soundex_dm output
Word        soundex(Word)   soundex_dm(Word)
Moskowitz   M232            645740
Peterson    P362            739460,734600
Jackson     J250            154600,454600,145460,445460
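
To illustrate how a name with one or more hard/soft choices expands into several codes, here is a minimal sketch in C. It is not the implementation given later on this page: struct element, expand_codes(), CODE_LEN and MAX_CODES are hypothetical names, and the sketch leaves out the rule that collapses adjacent identical sounds. Each branching element is given one bit of a route number, so counting from 0 up to 2^choices - 1 visits every combination, and a code is stored only if it has not been seen before.

```c
#include <assert.h>
#include <string.h>

#define CODE_LEN   6   /* a D-M code is six digits */
#define MAX_CODES 16   /* illustrative cap on stored codes */

/* Hypothetical element: one sound, optionally with an alternative. */
struct element {
    const char *left;   /* primary digits, "" when not coded */
    const char *right;  /* alternative digits, NULL when there is no choice */
};

/* Write every unique six-digit code reachable from el[0..n-1] into out
 * and return how many were written. */
int expand_codes(const struct element *el, int n, char out[][CODE_LEN + 1])
{
    int choices = 0, count = 0;

    for (int i = 0; i < n; i++)
        if (el[i].right) choices++;
    assert(choices <= 4);            /* 2^4 routes fit in MAX_CODES */

    for (unsigned route = 0; route < (1u << choices); route++) {
        char code[CODE_LEN + 1];
        int len = 0, bit = 0, seen = 0;

        for (int i = 0; i < n; i++) {
            const char *s = el[i].left;
            if (el[i].right) {       /* this element owns bit number 'bit' */
                if (route & (1u << bit)) s = el[i].right;
                bit++;
            }
            for (; *s && len < CODE_LEN; s++) code[len++] = *s;
        }
        while (len < CODE_LEN) code[len++] = '0';   /* pad with zeroes */
        code[len] = '\0';

        for (int k = 0; k < count; k++)             /* keep the list unique */
            if (strcmp(out[k], code) == 0) seen = 1;
        if (!seen && count < MAX_CODES) strcpy(out[count++], code);
    }
    return count;
}
```

Feeding it the sounds of 'Peterson' (7, not coded, 3, not coded, 94 or 4, not coded, 6) produces 739460 and 734600, the two codes shown in the table above.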

Return Value

The return is a varchar(111).

As mentioned above, the implementation will return all possible codes in a comma separated list.

The implementation given here can handle as many as 16 possibilities in the return before it generates an error. This limit can be adjusted by altering the macro AD_SOUNDEX_DM_MAX_OUTPUT.
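
The varchar(111) figure follows from the limit of 16 codes: sixteen six-digit codes plus fifteen commas. A small sketch of that arithmetic, assuming the same macro values used in the executor code below (dm_output_len() is a hypothetical helper, not part of the implementation):

```c
#include <assert.h>
#include <stddef.h>

#define AD_LEN_SOUNDEX_DM         6
#define AD_SOUNDEX_DM_MAX_OUTPUT 16

/* Characters needed for n comma separated codes:
 * one code of six digits, then (n - 1) further codes each preceded by a comma. */
size_t dm_output_len(size_t n)
{
    assert(n >= 1);
    return AD_LEN_SOUNDEX_DM + (n - 1) * (AD_LEN_SOUNDEX_DM + 1);
}
```

With n set to AD_SOUNDEX_DM_MAX_OUTPUT this gives 6 + 15 * 7 = 111. Doubling the limit to 32 would need 6 + 31 * 7 = 223 characters, so the declared result length has to grow with the macro.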

Example

It's probably no exaggeration to say that almost everyone in the Western world knows the name 'Arnold Schwarzenegger'. But how many of them could spell it?

Consider the following table:

Name              soundex(Name)   soundex_dm(Name)
Schwarzenegger    S625            479465,474659
Shwarzenegger     S625            479465,474659
Schwartsenegger   S632            479465

The American Soundex code for Schwarzenegger is S625.

If the name was misspelled as Shwarzenegger, the code would still be S625, so any search application based on American Soundex would still find the match in spite of that misspelling.

However if the name was misspelled as Schwartsenegger, the American Soundex code would be S632, so a search application based on American Soundex would not find the match with that misspelling.

Under DM Soundex, the correct spelling, Schwarzenegger, has two codes, namely 474659 and 479465. The incorrect spelling, Shwarzenegger, has the same two codes, and the incorrect spelling, Schwartsenegger, has the DM code of 479465, which is one of the two codes for the correct spelling. So a search application based on DM Soundex would find the match with either of these misspellings.
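
A search application of this kind only has to check whether two return strings have at least one code in common. The following is a minimal sketch of such a check, assuming both arguments are well-formed comma separated lists of six-digit codes as returned by soundex_dm(); the function name dm_codes_overlap() and the macro DM_CODE_LEN are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define DM_CODE_LEN 6

/* Return 1 if the comma separated lists a and b share at least one
 * six-digit code, 0 otherwise. */
int dm_codes_overlap(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    for (const char *p = a; strlen(p) >= DM_CODE_LEN; ) {
        for (const char *q = b; strlen(q) >= DM_CODE_LEN; ) {
            if (strncmp(p, q, DM_CODE_LEN) == 0)
                return 1;
            q += DM_CODE_LEN;        /* step over this code */
            if (*q == ',') q++;      /* and the separator, if any */
        }
        p += DM_CODE_LEN;
        if (*p == ',') p++;
    }
    return 0;
}
```

Using the values from the table, dm_codes_overlap("479465,474659", "479465") returns 1, so 'Schwartsenegger' matches 'Schwarzenegger' even though the American Soundex codes S632 and S625 differ.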

FOD

Add the following definition to the fod_id enum set: UDF_SOUNDEX_DM

Then add the following to the Function_Definitions array:

static IIADD_FO_DFN Function_Definitions[]={
    ...
   { 
   II_O_OPERATION,   /*fod_object_type*/
   {"soundex_dm"},   /*fod_name*/
   UDF_SOUNDEX_DM,   /*fod_id*/
   II_NORMAL         /*fod_type*/
   },
   ...
   }; /*Function_Definitions*/

FIDs

Add the following definitions to the fid_id enum set:

UDF_FI_SOUNDEX_DM_1,

The FIDs rely on the following definition of parameter types.

static II_DT_ID  UD_2_VC[]         = {II_VARCHAR,  II_VARCHAR};

The FIDs are:

static IIADD_FI_DFN Function_Instances[] = {
   ...
   {/* soundex_dm(varchar) */
   II_O_FUNCTION_INSTANCE,    /* fid_object_type */
   UDF_FI_SOUNDEX_DM_1,       /* fid_id*/
   II_NO_FI,                  /* fid_cmplmnt*/
   UDF_SOUNDEX_DM,            /* fid_opid=fod_id from function definition
                              ** This is the minor sort field for this array
                              */
   II_NORMAL,                 /* fid_optype
                              ** This is the major sort field for this array
                              */
   II_FID_F0_NOFLAGS,         /* fid_attributes*/
   0,                         /* fid_wslength*/
   1,                         /* fid_numargs*/
   UD_2_VC,                   /* fid_args, a pointer to an array of datatypes*/
   II_VARCHAR,                /* fid_result, result is a varchar */
   II_RES_FIXED,              /* fid_rltype*/
   AD_SOUNDEX_DM_OUTPUT_LEN + sizeof(short),         /* fid_rlength */
   0,                         /* fid_rprec */
   soundex_dm,                /* fid_routine */
   0                          /* lenspec_routine */
   }, /* soundex_dm(varchar) */
   ...
   }; /*Function_Instances*/

Executor Code

/* Stuff used in soundex_dm() function */
#define AD_SOUNDEX_DM_INT_BUFFER 64
#define AD_SOUNDEX_DM_PAD_BUFFER 20
#define AD_LEN_SOUNDEX_DM         6
#define AD_SOUNDEX_DM_MAX_OUTPUT 16
#define AD_SOUNDEX_DM_MAX_ENCODE 20
/* Parenthesised so the macro expands safely inside larger expressions */
#define AD_SOUNDEX_DM_OUTPUT_LEN (AD_LEN_SOUNDEX_DM + ((AD_SOUNDEX_DM_MAX_OUTPUT - 1) * (AD_LEN_SOUNDEX_DM + 1)))
II_STATUS
soundex_dm (
   II_SCB           *scb,
   II_DATA_VALUE    *string,  /* Generate the Daitch-Mokotoff soundex for this
                              ** string
                              */
   II_DATA_VALUE    *rdv      
       /* varchar is returned, length is fixed
       ** by AD_SOUNDEX_DM_OUTPUT_LEN. This is set to 
       ** a size sufficient to store 16 (ie. AD_SOUNDEX_DM_MAX_OUTPUT)
       ** comma separated 6 (ie. AD_LEN_SOUNDEX_DM) character strings.
       */
)
{
    /* Sundry initialisation */
   int         hyphen=0;
   int         start_word=0;
   int         oi, ei=0, encoded=0;
   int         before_a_vowel=0;
   int         total_choices;
   int         choice, total_routes, unique;
   int         i, j, true_length;
   char        a_char;
   /* Used for error processing */
   char        msg[256];
   /* The internal buffer to hold the uppercase'd and stripped input string.
   ** This is sized to more than easily allow for enough characters to
   ** generate the 6 character D-M soundex value
   */
   char        buffer[AD_SOUNDEX_DM_INT_BUFFER + AD_SOUNDEX_DM_PAD_BUFFER];
   /* destination initialization field */
   char        zeroes[] = "0000000";
   /* These are fixed codes used in the D-M soundex algorithm */
   char        FourtyThree[]   = "43";
   char        FourtyFive[]    = "45";
   char        FiftyFour[]     = "54";
   char        SixtySix[]      = "66";
   char        NinetyFour[]    = "94";
   /* Define the structures for the encoding array and the buffer of possible
    * output values
    */
   struct _dm_element_bit
   {
       short length;
       char  string[2];
   };
   struct _dm_element
   {
       enum _dm_element_source
       {character, phrase} source; /* On occasion we need to know the source of
                                   ** the element. This is for characters with
                                   ** two possible sounds, where we must allow
                                   ** for the chance of two successive
                                   ** characters with the same sound number.
                                   ** See following note on Rule 4.
                                   */
       unsigned short choice_mask; /* If non zero, there are two choices.
                                   ** The mask is a bit pattern, essentially
                                   ** indicating the choice number as a power
                                   ** of two, which allows the following task
                                   ** of printing all choices to navigate the
                                   ** encoded array.
                                   */
       struct _dm_element_bit left;
       struct _dm_element_bit right;
   };
   struct _dm_prior_element /* This is used in some specialised handling */
   {
       enum _dm_element_source source;
       short length;
       char  string[2];
   } prior;
   struct _dm_element encode[AD_SOUNDEX_DM_MAX_ENCODE];
   /* I'm not going to initialise all elements of the encode[] array. This will
   ** save time but we have to be careful when processing the array later!
   */
   char output[AD_SOUNDEX_DM_MAX_OUTPUT][AD_LEN_SOUNDEX_DM];
   /* Pad the internal buffer with spaces */
   for (i=0; i< AD_SOUNDEX_DM_INT_BUFFER + AD_SOUNDEX_DM_PAD_BUFFER; i++)
       buffer[i]=' ';
   /* Preprocess: Fill internal buffer[]
   ** Point to first alpha char in input. None? Return an error!
   ** Convert char to upper until first non-alpha. If non-alpha is one or more
   ** blanks then skip these and continue. Ignore first occurrence of hyphen
   ** apostrophe or period characters.
   */
   true_length=*(short *)string->db_data;
   for (i=0,j=0; i< true_length && j< AD_SOUNDEX_DM_INT_BUFFER; i++)
   {
       /*ignore spaces*/
       if (isblank(*(char *)(string->db_data + sizeof(short) + i))) continue;
       /* Ignore first hyphen, apostrophe or period only */
       if ((*(char *)(string->db_data + sizeof(short) + i) == '-'
         || *(char *)(string->db_data + sizeof(short) + i) == '\''
         || *(char *)(string->db_data + sizeof(short) + i) == '.')
         && hyphen==0)
       {
           hyphen=1;
           continue;
       };
       /* Process Alpha characters */
       if (isalpha(*(char *)(string->db_data + sizeof(short) + i)))
       {
           /* Convert to uppercase and store in buffer */
           a_char=(char )toupper((int )(*(char *)(string->db_data + sizeof(short) + i)));
           memcpy((char *)(buffer + j), (II_PTR) &a_char, 1);
           start_word=1; j++;
           continue;
       };
       break; /* Anything else breaks the pre-process loop */
   }; /* For */
   if (!start_word)
   {
       *(short *)(rdv->db_data)=6;
       rdv->db_length=6+sizeof(short);
       memcpy((II_PTR)(rdv->db_data + sizeof(short)), (II_PTR) zeroes, AD_LEN_SOUNDEX_DM);
       return (II_OK); /* Nothing to do, return "000000" */
   };
   buffer[j]='\0'; /* Terminate the buffer */
   /* Now process the data stored in the buffer[].
    * This for loop loads the encode array with the list of possibilities.
    */
   /* Sundry Initialisation */
   start_word=1; total_choices=0;
   for (i=0, ei=0;
       i<j && ei<AD_SOUNDEX_DM_MAX_ENCODE && encoded<=AD_LEN_SOUNDEX_DM + 1;
       ei++, start_word=0
       )
   {
       /* The most likely settings, which will be overridden when required */
       encode[ei].choice_mask=0;
       encode[ei].source=phrase;
       encode[ei].left.length=1;       /* Most likely one char long */
       encode[ei].right.length=1;
       encode[ei].right.string[0]='-'; /*'-' means 'Not Coded' */
       /* The most likely 'before a vowel' scenario, which will be retested if 
       ** required
       */
       before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 1);

The 'A' cases

       if (!strncmp((char *)(buffer +i), "AI", 2)
        || !strncmp((char *)(buffer +i), "AJ", 2)
        || !strncmp((char *)(buffer +i), "AY", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded ++;
           }
           else {
               before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 2);
               if (before_a_vowel) {
                   encode[ei].left.string[0]='1';
                   encoded ++;
               }
               else
               {
                   encode[ei].left.string[0]='-';
               };
           };
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "AU", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded ++;
           }
           else
           {
               before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 2);
               if (before_a_vowel) {
                   encode[ei].left.string[0]='7';
                   encoded ++;
               }
               else
               {
                   encode[ei].left.string[0]='-';
               };
           };
           i+=2;
           continue;
       };
       if (buffer[i]=='A')
       {
           encode[ei].source=character;
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded ++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i++;
           continue;
       };

The 'B', 'V', 'W' cases

       if (buffer[i]=='B' || buffer[i]=='V' || buffer[i]=='W')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='7';
           encoded++;
           i++;
           continue;
       };

The 'C' cases

       if (!strncmp((char *)(buffer +i), "CHS", 3))
       {
           if (start_word) {
               encode[ei].left.string[0]='5';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FiftyFour, 2);
           };
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "CSZ", 3)
         || !strncmp((char *)(buffer +i), "CZS", 3))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "CH", 2)) /* Try KH(5) and TCH(4) */
       {
           encode[ei].choice_mask=1<<total_choices; total_choices++;
           encode[ei].left.string[0]='5'; /* As KH */
           encode[ei].right.string[0]='4'; /* As TCH */
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "CK", 2)) /* Try K(5) and TSK(45) */
       {
           encode[ei].choice_mask=1<<total_choices; total_choices++;
           encode[ei].left.string[0]='5'; /* As K */
           encode[ei].right.length=2;
           memcpy((II_PTR )(encode[ei].right.string), (II_PTR )FourtyFive, 2);
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "CS", 2)
        || !strncmp((char *)(buffer +i), "CZ", 2))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='C') /* Try K(5) and TZ(4) */
       {
           encode[ei].source=character;
           encode[ei].choice_mask=1<<total_choices; total_choices++;
           encode[ei].left.string[0]='5'; /* As K */
           encode[ei].right.string[0]='4'; /* As TZ */
           encoded++;
           i++;
           continue;
       };

The 'D' cases

       if (!strncmp((char *)(buffer +i), "DRZ", 3)
        || !strncmp((char *)(buffer +i), "DRS", 3)
        || !strncmp((char *)(buffer +i), "DSH", 3)
        || !strncmp((char *)(buffer +i), "DSZ", 3)
        || !strncmp((char *)(buffer +i), "DZH", 3)
        || !strncmp((char *)(buffer +i), "DZS", 3))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "DS", 2)
        || !strncmp((char *)(buffer +i), "DZ", 2))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "DT", 2))
       {
           encode[ei].left.string[0]='3';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='D')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='3';
           encoded++;
           i++;
           continue;
       };

The 'E' cases

       if (!strncmp((char *)(buffer +i), "EI", 2)
        || !strncmp((char *)(buffer +i), "EJ", 2)
        || !strncmp((char *)(buffer +i), "EY", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded ++;
           }
           else {
               before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 2);
               if (before_a_vowel) {
                   encode[ei].left.string[0]='1';
                   encoded++;
               }
               else
               {
                   encode[ei].left.string[0]='-';
               };
           };
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "EU", 2))
       {
           before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 2);
           if (start_word || before_a_vowel) {
               encode[ei].left.string[0]='1';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i+=2;
           continue;
       };
       if (buffer[i]=='E')
       {
           encode[ei].source=character;
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i++;
           continue;
       };

The 'F' cases

       if (!strncmp((char *)(buffer +i), "FB", 2))
       {
           encode[ei].left.string[0]='7';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='F')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='7';
           encoded++;
           i++;
           continue;
       };

The 'G' and 'Q' cases

       if (buffer[i]=='G' || buffer[i]=='Q')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='5';
           encoded++;
           i++;
           continue;
       };

The 'H' cases

       if (buffer[i]=='H')
       {
           encode[ei].source=character;
           if (start_word || before_a_vowel)
           {
               encode[ei].left.string[0]='5';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           }; 
           i++;
           continue;
       };

The 'I' cases

       if (!strncmp((char *)(buffer +i), "IA", 2)
        || !strncmp((char *)(buffer +i), "IE", 2)
        || !strncmp((char *)(buffer +i), "IO", 2)
        || !strncmp((char *)(buffer +i), "IU", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='1';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i+=2;
           continue;
       };
       if (buffer[i]=='I')
       {
           encode[ei].source=character;
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded++; 
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i++;
           continue;
       }; 

The 'J' cases

       if (buffer[i]=='J') /* Try Y(1) and DZH(4) */
       {
           encode[ei].source=character;
           encode[ei].choice_mask=1<<total_choices; total_choices++;
           encode[ei].right.string[0]='4'; /* As DZH(4) */
           if (start_word)
           {
               encode[ei].left.string[0]='1'; /* As Y(1) */
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-'; /* ie. Not Coded */
           };
           i++;
           continue;
       };

The 'K' cases

       if (!strncmp((char *)(buffer +i), "KS", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='5';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FiftyFour, 2);
           };
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "KH", 2))
       {
           encode[ei].left.string[0]='5';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='K')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='5';
           encoded++;
           i++;
           continue;
       };

The 'L' cases

       if (buffer[i]=='L')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='8';
           encoded++;
           i++;
           continue;
       };

The 'M' cases

       if (!strncmp((char *)(buffer +i), "MN", 2))
       {
           encode[ei].left.length=2;
           memcpy((II_PTR )(encode[ei].left.string), (II_PTR )SixtySix, 2);
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='M')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='6';
           encoded++;
           i++;
           continue;
       };

The 'N' cases

       if (!strncmp((char *)(buffer +i), "NM", 2))
       {
           encode[ei].left.length=2;
           memcpy((II_PTR )(encode[ei].left.string), (II_PTR )SixtySix, 2);
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='N')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='6';
           encoded++;
           i++;
           continue;
       };

The 'O' cases

       if (!strncmp((char *)(buffer +i), "OI", 2)
        || !strncmp((char *)(buffer +i), "OJ", 2)
        || !strncmp((char *)(buffer +i), "OY", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded++;
           }
           else {
               before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 2);
               if (before_a_vowel) {
                   encode[ei].left.string[0]='1';
                   encoded++;
               }
               else
               {
                   encode[ei].left.string[0]='-';
               };
           }; 
           i+=2;
           continue;
       };
       if (buffer[i]=='O')
       {
           encode[ei].source=character;
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded ++; 
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i++;
           continue;
       };

The 'P' cases

       if (!strncmp((char *)(buffer +i), "PF", 2)
        || !strncmp((char *)(buffer +i), "PH", 2))
       {
           encode[ei].left.string[0]='7';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='P')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='7';
           encoded++;
           i++;
           continue;
       };

The 'R' cases

       if (!strncmp((char *)(buffer +i), "RTZ", 3))
       {
           encode[ei].left.length=2;
           memcpy((II_PTR )(encode[ei].left.string), (II_PTR )NinetyFour, 2);
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "RS", 2) /* Try RTZ(94) and ZH(4) */
        || !strncmp((char *)(buffer +i), "RZ", 2))
       {
           encode[ei].choice_mask=1<<total_choices; total_choices++;
           encode[ei].left.length=2;
           memcpy((II_PTR )(encode[ei].left.string), (II_PTR )NinetyFour, 2); /* Try RTZ(94) */
           encode[ei].right.string[0]='4'; /* Try ZH(4) */
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='R')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='9';
           encoded++;
           i++;
           continue;
       };

The 'S' cases

       if (!strncmp((char *)(buffer +i), "SCHTSCH", 7))
       {
           encode[ei].source=phrase;
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=7;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SCHTSH", 6)
        || !strncmp((char *)(buffer +i), "SCHTCH", 6))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=6;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SHTCH", 5)
        || !strncmp((char *)(buffer +i), "SHTSH", 5)
        || !strncmp((char *)(buffer +i), "STSCH", 5))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else 
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=5;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SHCH", 4)
        || !strncmp((char *)(buffer +i), "STRZ", 4)
        || !strncmp((char *)(buffer +i), "STRS", 4)
        || !strncmp((char *)(buffer +i), "STSH", 4))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else 
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=4;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SCHT", 4)
        || !strncmp((char *)(buffer +i), "SCHD", 4))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else 
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
           };
           encoded++;
           i+=4;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "STCH", 4)
        || !strncmp((char *)(buffer +i), "SZCZ", 4)
        || !strncmp((char *)(buffer +i), "SZCS", 4))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=4;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SCH", 3))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SHT", 3)
        || !strncmp((char *)(buffer +i), "SZT", 3)
        || !strncmp((char *)(buffer +i), "SHD", 3)
        || !strncmp((char *)(buffer +i), "SZD", 3))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
           };
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SH", 2)
        || !strncmp((char *)(buffer +i), "SZ", 2))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "SC", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ST", 2)
        || !strncmp((char *)(buffer +i), "SD", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
           };
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='S')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='4';
           encoded++;
           i++;
           continue;
       }; 

The 'T' cases

       if (!strncmp((char *)(buffer +i), "TTSCH", 5))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=5;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "TTCH", 4)
        || !strncmp((char *)(buffer +i), "TSCH", 4)
        || !strncmp((char *)(buffer +i), "TTSZ", 4))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=4;
           continue;
       }; 
       if (!strncmp((char *)(buffer +i), "TSK", 3))
       {
           encode[ei].left.length=2;
           memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyFive, 2);
           encoded++;
           i+=3;
           continue;
       }; 
       if (!strncmp((char *)(buffer +i), "TCH", 3)
        || !strncmp((char *)(buffer +i), "TRZ", 3)
        || !strncmp((char *)(buffer +i), "TRS", 3)
        || !strncmp((char *)(buffer +i), "TSH", 3)
        || !strncmp((char *)(buffer +i), "TTS", 3)
        || !strncmp((char *)(buffer +i), "TTZ", 3)
        || !strncmp((char *)(buffer +i), "TSZ", 3)
        || !strncmp((char *)(buffer +i), "TZS", 3))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "TH", 2))
       {
           encode[ei].left.string[0]='3';
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "TC", 2)
        || !strncmp((char *)(buffer +i), "TZ", 2)
        || !strncmp((char *)(buffer +i), "TS", 2))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='T')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='3';
           encoded++;
           i++;
           continue;
       };

The 'U' cases

       if (!strncmp((char *)(buffer +i), "UI", 2)
        || !strncmp((char *)(buffer +i), "UJ", 2)
        || !strncmp((char *)(buffer +i), "UY", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded++;
           }
           else {
               before_a_vowel=soundex_dm_vowelage(&buffer, i, j, 2);
               if (before_a_vowel) {
                   encode[ei].left.string[0]='1';
                   encoded++;
               }
               else
               {
                   encode[ei].left.string[0]='-';
               };
           };
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "UE", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i+=2;
           continue;
       };
       if (buffer[i]=='U')
       {
           encode[ei].source=character;
           if (start_word) {
               encode[ei].left.string[0]='0';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i++;
           continue;
       };

The 'X' cases

       if (buffer[i]=='X')
       {
           encode[ei].source=character;
           if (start_word)
           {
               encode[ei].left.string[0]='5';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FiftyFour, 2);
           };
           encoded++;
           i++;
           continue;
       };

The 'Y' cases

       if (buffer[i]=='Y')
       {
           encode[ei].source=character;
           if (start_word) {
               encode[ei].left.string[0]='1';
               encoded++;
           }
           else
           {
               encode[ei].left.string[0]='-';
           };
           i++;
           continue;
       };

The 'Z' cases

       if (!strncmp((char *)(buffer +i), "ZHDZH", 5))
       {
           if (start_word)
           {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=5;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZDZH", 4))
       {
           if (start_word)
           {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=4;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZSCH", 4))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=4;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZDZ", 3))
       {
           if (start_word)
           {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.string[0]='4';
           };
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZHD", 3))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
           };
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZSH", 3))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=3;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZD", 2))
       {
           if (start_word) {
               encode[ei].left.string[0]='2';
           }
           else
           {
               encode[ei].left.length=2;
               memcpy((II_PTR )(encode[ei].left.string), (II_PTR )FourtyThree, 2);
           };
           encoded++;
           i+=2;
           continue;
       };
       if (!strncmp((char *)(buffer +i), "ZH", 2)
        || !strncmp((char *)(buffer +i), "ZS", 2))
       {
           encode[ei].left.string[0]='4';
           encoded++;
           i+=2;
           continue;
       };
       if (buffer[i]=='Z')
       {
           encode[ei].source=character;
           encode[ei].left.string[0]='4';
           encoded++;
           i++;
           continue;
       };
   }; /* For each character in string */

Process the encode array into the output array

   /* Process the encode array into the output array.
   **
   ** Note that ei is the index of the last element in the encode array plus 1.
   **
   ** At this point we also do the final weeding-out of adjacent characters
   ** that share the same sound code.
   */
   total_routes=(int )1<<total_choices;
   for (oi=0; oi<total_routes && oi<AD_SOUNDEX_DM_MAX_OUTPUT; oi++)
   {
       memcpy((II_PTR) output[oi], (II_PTR) zeroes, AD_LEN_SOUNDEX_DM);
       j=0;                 /* j is character position in output[oi] */
       prior.source=phrase; /* init 'prior' case */
       prior.length=1;
       prior.string[0]='-';
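       /* i is the index into the encode array; stop once the code is full
       ** so the first character of an element is never written past the end.
       */
       for (i=0; i<ei && j<AD_LEN_SOUNDEX_DM; i++)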
       {
           if (encode[i].choice_mask == 0)
           {
               if (encode[i].left.string[0]!='-')
               {
                   if (encode[i].source == character
                    && prior.source == character
                    && !strncmp(prior.string, encode[i].left.string,
                        encode[i].left.length)) {continue;};
                   /* Otherwise....*/
                   output[oi][j++]=encode[i].left.string[0];
                   if (encode[i].left.length > 1 && j<AD_LEN_SOUNDEX_DM)
                       output[oi][j++]=encode[i].left.string[1];
               };
               /* Save left as 'prior' case */
               prior.source=encode[i].source;
               prior.length=encode[i].left.length;
               memcpy(prior.string, encode[i].left.string, prior.length);
           }
           else
           {
               choice=oi & encode[i].choice_mask;
               /* Left if choice is zero, right if it is non-zero */
               if (!choice)
               {
                   if (encode[i].left.string[0]!='-')
                   {
                       if (encode[i].source == character
                        && prior.source == character
                        && !strncmp(prior.string, encode[i].left.string,
                            encode[i].left.length)) {continue;};
                       /* Otherwise....*/
                       output[oi][j++]=encode[i].left.string[0];
                       if (encode[i].left.length > 1 && j<AD_LEN_SOUNDEX_DM)
                           output[oi][j++]=encode[i].left.string[1];
                   };
                   /* Save left as 'prior' case */
                   prior.source=encode[i].source;
                   prior.length=encode[i].left.length;
                   memcpy(prior.string, encode[i].left.string, prior.length);
               }
               else
               {
                   if (encode[i].right.string[0]!='-')
                   {
                       if (encode[i].source == character
                        && prior.source == character
                        && !strncmp(prior.string, encode[i].right.string,
                            encode[i].right.length)) {continue;};
                       /* Otherwise....*/
                       output[oi][j++]=encode[i].right.string[0];
                       if (encode[i].right.length > 1 && j<AD_LEN_SOUNDEX_DM)
                           output[oi][j++]=encode[i].right.string[1];
                   };
                   /* Save right as 'prior' case */
                   prior.source=encode[i].source;
                   prior.length=encode[i].right.length;
                   memcpy(prior.string, encode[i].right.string, prior.length);
               };
           };
       }; /* for each element in encoded array */
   }; /* For each possible route through the array */
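
The route enumeration above can be hard to follow inside the full function, so here is a minimal, self-contained sketch of the same idea. It is not the function's own code: elem_t, build_route and CODE_LEN are hypothetical names, the element list is hand-built to reproduce the two Peterson codes from the example table, and the suppression of adjacent duplicate codes is left out for brevity.

```c
#include <assert.h>
#include <string.h>

#define CODE_LEN 6              /* stand-in for AD_LEN_SOUNDEX_DM */

/* Hypothetical, simplified element of the encode array. */
typedef struct {
    const char *left;           /* code used when the route bit is clear */
    const char *right;          /* code used when the route bit is set   */
    int         choice_mask;    /* 0 = no choice, otherwise a single bit */
} elem_t;

/* Build the code for one route. out must hold CODE_LEN + 1 bytes. */
static void build_route(const elem_t *e, int n, int route, char *out)
{
    const char *p;
    int i, j = 0;

    assert(e != NULL && out != NULL);
    memset(out, '0', CODE_LEN);         /* pad short codes with zeroes */
    out[CODE_LEN] = '\0';
    for (i = 0; i < n && j < CODE_LEN; i++) {
        p = (route & e[i].choice_mask) ? e[i].right : e[i].left;
        if (p[0] == '-')                /* '-' marks a sound that is not coded */
            continue;
        while (*p != '\0' && j < CODE_LEN)
            out[j++] = *p++;
    }
}

/* Assumed breakdown of PETERSON, chosen to match the example table. */
static const elem_t peterson[] = {
    { "7",  "",  0 },   /* P                  */
    { "-",  "",  0 },   /* E, not coded       */
    { "3",  "",  0 },   /* T                  */
    { "-",  "",  0 },   /* E, not coded       */
    { "94", "4", 1 },   /* RS: either 94 or 4 */
    { "-",  "",  0 },   /* O, not coded       */
    { "6",  "",  0 }    /* N                  */
};
```

With a single choice there are 1<<1 = 2 routes. Calling build_route(peterson, 7, 0, out) takes the left branch and yields 739460; route 1 takes the right branch and yields 734600.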

Build the rdv->db_data

   /* Build the rdv->db_data from the output array, removing duplicates along
   ** the way.
   */
   memcpy((II_PTR)(rdv->db_data + sizeof(short)), output[0],
       AD_LEN_SOUNDEX_DM);
   *(short *)(rdv->db_data)=AD_LEN_SOUNDEX_DM;
   for (i=1; i<oi; i++)
   {
       unique=1;
       for (j=0; j < i; j++) /* check for duplicates */
       {
           if (!strncmp(output[j], output[i], AD_LEN_SOUNDEX_DM))
           {
               unique=0;
               break;
           };
       };
       if (unique)
       {
           if (rdv->db_length >= (*(short *)rdv->db_data + AD_LEN_SOUNDEX_DM + 1)) {
               *(char *)(rdv->db_data + sizeof(short) + *(short *)rdv->db_data)=',';
               memcpy(
                   (II_PTR)(rdv->db_data + sizeof(short) + *(short *)rdv->db_data + 1),
                   output[i], AD_LEN_SOUNDEX_DM);
               *(short *)rdv->db_data+=AD_LEN_SOUNDEX_DM + 1;
           }
           else
           {
               sprintf(msg, "soundex_dm(): unexpected overflow in return string length.");
               us_error(scb, 0x200000, msg);
               return (II_ERROR);
           };
       };
   };
   return (II_OK);

}; /* soundex_dm() */
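
The duplicate removal and comma joining done in the last step can be tried outside Ingres with a small stand-alone sketch. It works on a plain NUL-terminated string rather than the length-prefixed varchar in rdv->db_data, and join_unique and CODE_LEN are hypothetical names, not part of the OME code.

```c
#include <assert.h>
#include <string.h>

#define CODE_LEN 6              /* stand-in for AD_LEN_SOUNDEX_DM */

/* Join the unique fixed-width codes with commas into dst.
** Returns the length written, or -1 if dst is too small
** (dst may then hold a partial result).
*/
static int join_unique(const char codes[][CODE_LEN], int n,
                       char *dst, int dst_size)
{
    int i, j, len = 0, unique;

    assert(dst != NULL && dst_size > 0);
    for (i = 0; i < n; i++) {
        unique = 1;
        for (j = 0; j < i; j++) {       /* compare with every earlier code */
            if (!memcmp(codes[j], codes[i], CODE_LEN)) {
                unique = 0;
                break;
            }
        }
        if (!unique)
            continue;
        /* room needed: optional comma + code + terminating NUL */
        if (len + (len ? 1 : 0) + CODE_LEN + 1 > dst_size)
            return -1;
        if (len)
            dst[len++] = ',';
        memcpy(dst + len, codes[i], CODE_LEN);
        len += CODE_LEN;
    }
    dst[len] = '\0';
    return len;
}
```

As in the function above, the first occurrence of each code is kept and the order is not sorted, so the three codes 739460, 734600, 739460 come out as 739460,734600.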

Support function

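/* soundex_dm_vowelage:
**     Checks whether the character 'skip' positions ahead of the current
**     position is a vowel, i.e. whether the current code set comes before
**     a vowel. For this purpose the vowels are: A, E, I, O, U, J and Y.
**     Returns 1 if it is a vowel, 0 if it is not or if the buffer is exhausted.
*/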
int
soundex_dm_vowelage (
   char *buffer,   /* The buffer of characters to check  */
   int  b_ptr,     /* The current position in the buffer */
   int  b_len,     /* The length of the buffer           */
   int  skip       /* How far ahead to check for a vowel */
   )
{
   /* return (0) if we have exhausted the buffer */
   if (b_ptr + skip >= b_len) {return ((int )0);};

   /* return (1) if before a vowel */
   if (buffer[b_ptr + skip]=='A' || buffer[b_ptr + skip]=='E'
    || buffer[b_ptr + skip]=='I' || buffer[b_ptr + skip]=='O'
    || buffer[b_ptr + skip]=='U' || buffer[b_ptr + skip]=='J'
    || buffer[b_ptr + skip]=='Y')
   {return ((int )1);};

   /* return (0) if NOT before a vowel */
   return ((int )0);
}; /* soundex_dm_vowelage */
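
To show how the skip argument is meant to be used, here is a compact restatement of the same check that can be compiled on its own. before_vowel is a hypothetical name and the strchr lookup is a substitute for the chain of comparisons above; this is a sketch assumed to behave the same way for upper-case input, not a replacement for the original.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the character 'skip' positions after b_ptr is one of
** A, E, I, O, U, J or Y, and 0 otherwise (including past the end).
*/
static int before_vowel(const char *buffer, int b_ptr, int b_len, int skip)
{
    int pos = b_ptr + skip;

    assert(buffer != NULL);
    if (pos < 0 || pos >= b_len)
        return 0;                       /* buffer exhausted */
    /* strchr would also match the terminating NUL, so exclude it */
    return buffer[pos] != '\0' && strchr("AEIOUJY", buffer[pos]) != NULL;
}
```

A caller that has just matched a two-letter code such as "CH" at position 0 passes skip=2 to look at the letter that follows: before_vowel("CHEESE", 0, 6, 2) is 1, while before_vowel("CHRIS", 0, 5, 2) is 0.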
© 2011 Actian Corporation. All Rights Reserved