An algorithm for suffix stripping

M.F. Porter
1980

Originally published in Program, 14 no. 3, pp 130-137, July 1980.


1. INTRODUCTION

Removing suffixes by automatic means is an operation which is especially useful in the field of information retrieval. In a typical IR environment, one has a collection of documents, each described by the words in the document title and possibly by words in the document abstract. Ignoring the issue of precisely where the words originate, we can say that a document is represented by a vector of words, or terms. Terms with a common stem will usually have similar meanings, for example:

    CONNECT
    CONNECTED
    CONNECTING
    CONNECTION
    CONNECTIONS

Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT. In addition, the suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.

Many strategies for suffix stripping have been reported in the literature (e.g. [1-6]). The nature of the task will vary considerably depending on whether a stem dictionary is being used, whether a suffix list is being used, and of course on the purpose for which the suffix stripping is being done. Assuming that one is not making use of a stem dictionary, and that the purpose of the task is to improve IR performance, the suffix stripping program will usually be given an explicit list of suffixes, and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. This is the approach adopted here. The main merits of the present program are that it is small (less than 400 lines of BCPL), fast (it will process a vocabulary of 10,000 different words in about 8.1 seconds on the IBM 370/165 at Cambridge University), and reasonably simple. At any rate, it is simple enough to be described in full as an algorithm in this paper. The present version in BCPL is freely available from the author. BCPL is itself available on a wide range of different computers, but anyone wishing to use the program
should have little difficulty in coding it up in other programming languages. Given the speed of the program, it would be quite realistic to apply it to every word in a large file of continuous text, although for historical reasons we have found it convenient to apply it only to relatively small vocabulary lists derived from continuous text files.

In any suffix stripping program for IR work, two points must be borne in mind. Firstly, the suffixes are being removed simply to improve IR performance, and not as a linguistic exercise. This means that it would not be at all obvious under what circumstances a suffix should be removed, even if we could exactly determine the suffixes of a word by automatic means.

Perhaps the best criterion for removing suffixes from two words W1 and W2 to produce a single stem S is to say that we do so if there appears to be no difference between the two statements 'a document is about W1' and 'a document is about W2'. So if W1 = 'CONNECTION' and W2 = 'CONNECTIONS' it seems very reasonable to conflate them to a single stem. But if W1 = 'RELATE' and W2 = 'RELATIVITY' it seems perhaps unreasonable, especially if the document collection is concerned with theoretical physics. (It should perhaps be added that RELATE and RELATIVITY are conflated together in the algorithm described here.) Between these two extremes there is a continuum of different cases, and given two terms W1 and W2, there will be some variation in opinion as to whether they should be conflated, just as there is with deciding the relevance of some document to a query. The evaluation of the worth of a suffix stripping system is correspondingly difficult.

The second point is that with the approach adopted here, i.e. the use of a suffix list with various rules, the success rate for the suffix stripping will be significantly less than 100% irrespective of how the process is evaluated. For example, if SAND and SANDER get conflated, so most probably will WAND and WANDER. The error here is that the -ER of WANDER has been treated as a suffix when in fact it is part of the stem. Equally, a suffix may completely alter the meaning of a word, in which case its removal is unhelpful. PROBE and PROBATE, for example, have quite distinct meanings in
modern English. (In fact these would not be conflated in our present algorithm.) There comes a stage in the development of a suffix stripping program where the addition of more rules to increase the performance in one area of the vocabulary causes an equal degradation of performance elsewhere. Unless this phenomenon is noticed in time, it is very easy for the program to become much more complex than is really necessary. It is also easy to give undue emphasis to cases which appear to be important, but which turn out to be rather rare. For example, cases in which the root of a word changes with the addition of a suffix, as in DECEIVE/DECEPTION, RESUME/RESUMPTION, INDEX/INDICES, occur much more rarely in real vocabularies than one might at first suppose. In view of the error rate that must in any case be expected, it did not seem worthwhile to try and cope with these cases.

It is not obvious that the simplicity of the present program is any demerit. In a test on the well-known Cranfield 200 collection [7] it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research in Cambridge since 1971 [2,6]. The test was done as follows: the words of the titles and abstracts in the documents were passed through the earlier suffix stripping system, and the resulting stems were used to index the documents. The words of the queries were reduced to stems in the same way, and the documents were ranked for each query using term coordination matching of query against document. From these rankings, recall and precision values were obtained using the standard recall cutoff method. The entire process was then repeated using the suffix stripping system described in this paper, and the results were as follows (values in per cent):

        earlier system          present system
        --------------          --------------
      recall   precision      recall   precision
         0       57.24           0       58.60
        10       56.85          10       58.13
        20       52.85          20       53.92
        30       42.61          30       43.51
        40       42.20          40       39.39
        50       39.06          50       38.85
        60       32.86          60       33.18
        70       31.64          70       31.19
        80       27.15          80       27.52
        90       24.59          90       25.85
       100       24.59         100       25.85

Clearly, the performance is not very different. The important point is that the earlier, more elaborate system certainly performs no better than the present, simple system.

This test was done by Prof.
C.J. van Rijsbergen.


2. THE ALGORITHM

To present the suffix stripping algorithm in its entirety we will need a few definitions.

A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term 'consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.

A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms:

    CVCV ... C
    CVCV ... V
    VCVC ... C
    VCVC ... V

These may all be represented by the single form

    [C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC)^m to denote VC repeated m times, this may again be written as

    [C](VC)^m[V]

m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

    m=0    TR,  EE,  TREE,  Y,  BY.
    m=1    TROUBLE,  OATS,  TREES,  IVY.
    m=2    TROUBLES,  PRIVATE,  OATEN,  ORRERY.

The rules for removing a suffix will be given in the form

    (condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

    (m > 1) EMENT ->

Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.

The 'condition' part may also contain the following:

    *S   - the stem ends with S (and similarly for the other letters).

    *v*  - the stem contains a vowel.

    *d   - the stem ends with a double consonant (e.g. -TT, -SS).

    *o   - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

And the condition part may also contain expressions with and, or and not, so that

    (m>1 and (*S or *T))

tests for a stem with m>1 ending in S or T, while

    (*d and not (*L or *S or *Z))

tests for a stem ending with a double consonant other than L, S or Z. Elaborate conditions like this are required only rarely.

In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with
    SSES -> SS
    IES  -> I
    SS   -> SS
    S    ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1 = 'SS') and CARES to CARE (S1 = 'S').

In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:

Step 1a

    SSES -> SS                      caresses       ->  caress
    IES  -> I                       ponies         ->  poni
                                    ties           ->  ti
    SS   -> SS                      caress         ->  caress
    S    ->                         cats           ->  cat

Step 1b

    (m>0) EED -> EE                 feed           ->  feed
                                    agreed         ->  agree
    (*v*) ED  ->                    plastered      ->  plaster
                                    bled           ->  bled
    (*v*) ING ->                    motoring       ->  motor
                                    sing           ->  sing

If the second or third of the rules in Step 1b is successful, the following is done:

    AT -> ATE                       conflat(ed)    ->  conflate
    BL -> BLE                       troubl(ed)     ->  trouble
    IZ -> IZE                       siz(ed)        ->  size
    (*d and not (*L or *S or *Z))
       -> single letter             hopp(ing)      ->  hop
                                    tann(ed)       ->  tan
                                    fall(ing)      ->  fall
                                    hiss(ing)      ->  hiss
                                    fizz(ed)       ->  fizz
    (m=1 and *o) -> E               fail(ing)      ->  fail
                                    fil(ing)       ->  file

The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4.

Step 1c

    (*v*) Y -> I                    happy          ->  happi
                                    sky            ->  sky

Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.

Step 2

    (m>0) ATIONAL ->  ATE           relational     ->  relate
    (m>0) TIONAL  ->  TION          conditional    ->  condition
                                    rational       ->  rational
    (m>0) ENCI    ->  ENCE          valenci        ->  valence
    (m>0) ANCI    ->  ANCE          hesitanci      ->  hesitance
    (m>0) IZER    ->  IZE           digitizer      ->  digitize
    (m>0) ABLI    ->  ABLE          conformabli    ->  conformable
    (m>0) ALLI    ->  AL            radicalli      ->  radical
    (m>0) ENTLI   ->  ENT           differentli    ->  different
    (m>0) ELI     ->  E             vileli         ->  vile
    (m>0) OUSLI   ->  OUS           analogousli    ->  analogous
    (m>0) IZATION ->  IZE           vietnamization ->  vietnamize
    (m>0) ATION   ->  ATE           predication    ->  predicate
    (m>0) ATOR    ->  ATE           operator       ->  operate
    (m>0) ALISM   ->  AL            feudalism      ->  feudal
    (m>0) IVENESS ->  IVE           decisiveness   ->  decisive
    (m>0) FULNESS ->  FUL           hopefulness    ->  hopeful
    (m>0) OUSNESS ->  OUS           callousness    ->  callous
    (m>0) ALITI   ->  AL            formaliti      ->  formal
    (m>0) IVITI   ->  IVE           sensitiviti    ->  sensitive
    (m>0) BILITI  ->  BLE           sensibiliti    ->  sensible

The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps.

Step 3

    (m>0) ICATE ->  IC              triplicate     ->  triplic
    (m>0) ATIVE ->                  formative      ->  form
    (m>0) ALIZE ->  AL              formalize      ->  formal
    (m>0) ICITI ->  IC              electriciti    ->  electric
    (m>0) ICAL  ->  IC              electrical     ->  electric
    (m>0) FUL   ->                  hopeful        ->  hope
    (m>0) NESS  ->                  goodness       ->  good

Step 4

    (m>1) AL    ->                  revival        ->  reviv
    (m>1) ANCE  ->                  allowance      ->  allow
    (m>1) ENCE  ->                  inference      ->  infer
    (m>1) ER    ->                  airliner       ->  airlin
    (m>1) IC    ->                  gyroscopic     ->  gyroscop
    (m>1) ABLE  ->                  adjustable     ->  adjust
    (m>1) IBLE  ->                  defensible     ->  defens
    (m>1) ANT   ->                  irritant       ->  irrit
    (m>1) EMENT ->                  replacement    ->  replac
    (m>1) MENT  ->                  adjustment     ->  adjust
    (m>1) ENT   ->                  dependent      ->  depend
    (m>1 and (*S or *T)) ION ->     adoption       ->  adopt
    (m>1) OU    ->                  homologou      ->  homolog
    (m>1) ISM   ->                  communism      ->  commun
    (m>1) ATE   ->                  activate       ->  activ
    (m>1) ITI   ->                  angulariti     ->  angular
    (m>1) OUS   ->                  homologous     ->  homolog
    (m>1) IVE   ->                  effective      ->  effect
    (m>1) IZE   ->                  bowdlerize     ->  bowdler

The suffixes are now removed. All that remains is a little tidying up.

Step 5a

    (m>1) E     ->                  probate        ->  probat
                                    rate           ->  rate
    (m=1 and not *o) E ->           cease          ->  ceas

Step 5b

    (m > 1 and *d and *L) -> single letter
                                    controll       ->  control
                                    roll           ->  roll

The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix. For example, in the following two lists:

    list A        list B
    ------        ------
    RELATE        DERIVATE
    PROBATE       ACTIVATE
    CONFLATE      DEMONSTRATE
    PIRATE        NECESSITATE
    PRELATE       RENOVATE

-ATE is removed from the list B words, but not from the list A words. This means that the pairs DERIVATE/DERIVE, ACTIVATE/ACTIVE, DEMONSTRATE/DEMONSTRABLE, NECESSITATE/NECESSITOUS will conflate together. The fact that no attempt is made to identify prefixes can make the results look rather inconsistent. Thus PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL. In practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation.

Complex suffixes are removed bit by bit in the different steps. Thus GENERALIZATIONS is stripped to GENERALIZATION (Step 1), then to GENERALIZE (Step 2), then to GENERAL (Step 3), and then to GENER (Step 4). OSCILLATORS is stripped to OSCILLATOR (Step 1), then to OSCILLATE (Step 2), then to OSCILL (Step 4), and then to OSCIL (Step 5). In a vocabulary of 10,000 words, the reduction in size of the stem was distributed among the steps as follows:

    Suffix stripping of a vocabulary of 10,000 words
    ------------------------------------------------
    Number of words reduced in step 1:   3597
                  "                 2:    766
                  "                 3:    327
                  "                 4:   2424
                  "                 5:   1373
    Number of words not reduced:         3650

The resulting vocabulary of stems contained 6370 distinct entries. Thus the suffix stripping process reduced the size of the vocabulary by about one third.


REFERENCES

1. LOVINS, J.B. Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11 (1), March 1968, pp 23-31.

2. ANDREWS, K. The Development of a Fast Conflation Algorithm for English. Dissertation for the Diploma in Computer Science, Computer Laboratory, University of Cambridge, 1971.

3. PETRARCA, A.E. and LAY, W.M. Use of an automatically generated authority list to eliminate scattering caused by some singular and plural main index terms. Proceedings of the American Society for Information Science, 1969, pp 277-282.

4. DATTOLA, Robert T. FIRST: Flexible Information Retrieval System for Text. Webster, N.Y.: Xerox Corporation, 12 Dec 1975.

5. COLOMBO, D.S. and NIEHOFF, R.T. Final report on improved access to scientific and technical information through automated vocabulary switching. NSF Grant No. SIS75-12924 to the National Science Foundation.

6. DAWSON, J.L. Suffix Removal and Word Conflation. ALLC Bulletin, Michaelmas 1974, pp 33-46.

7. CLEVERDON, C.W., MILLS, J. and KEEN, M. Factors Determining the Performance of Indexing Systems, 2 vols. College of Aeronautics, Cranfield, 1966.
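
As an illustration of the definitions in section 2, the consonant test, the measure m, the *d and *o conditions and the longest-match convention for a rule set might be sketched in Python as follows. This is only a sketch under stated assumptions, not the BCPL program described in the paper: the function names (is_consonant, measure, ends_double_consonant, ends_cvc, apply_longest_match) are hypothetical, input words are assumed to be lower-case ASCII, and only Step 1a and one Step 4 rule are shown rather than the whole algorithm.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] is a consonant in the paper's sense (lower-case input assumed)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y counts as a vowel only when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC sequences in the form [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m

def ends_double_consonant(stem):
    """Condition *d: the stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    """Condition *o: the stem ends cvc, where the second c is not W, X or Y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def apply_longest_match(word, rules, condition=lambda stem: True):
    """Obey only the rule with the longest matching S1; rules are (S1, S2) pairs.

    For simplicity one condition is shared by the whole rule set (an
    assumption of this sketch; the paper attaches a condition to each rule).
    """
    for s1, s2 in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(s1):
            stem = word[:len(word) - len(s1)]
            # If the condition fails, no shorter suffix is tried.
            return stem + s2 if condition(stem) else word
    return word

# Step 1a of the paper, with null conditions.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

print(measure("troubles"))                       # 2
print(apply_longest_match("caresses", STEP_1A))  # caress
print(apply_longest_match("replacement", [("ement", ""), ("ment", ""), ("ent", "")],
                          condition=lambda s: measure(s) > 1))  # replac
```

Sorting the rules by the length of S1 is one simple way to obtain the longest match; the paper instead suggests switching on the penultimate letter of the word for speed, which this sketch does not attempt.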