from the World Wide Web

Marco Baroni
SSLMIT, University of Bologna
Corso della Repubblica 136
47100 Forlì, Italy
baroni@sslmit.unibo.it

Motoko Ueyama
SSLMIT, University of Bologna
Corso della Repubblica 136
47100 Forlì, Italy
motoko@sslmit.unibo.it

Abstract
The BootCaT toolkit (Baroni and Bernardini, 2004) is a suite of Perl programs implementing a procedure to bootstrap specialized corpora and terms from the web using minimal knowledge sources. In this paper, we report ongoing work in which we apply the BootCaT procedure to a Japanese corpus and term extraction task in the hotel terminology domain. The results of our experiments are very encouraging, indicating that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which we tested the procedure originally.
1 Introduction
The World Wide Web is a rich source of easily accessible language data (Kilgarriff and Grefenstette, 2003). Among those who can benefit from this resource, there are language professionals (language teachers, translators, interpreters, etc.) who routinely work with a variety of specialized languages, where new terms are introduced at a fast pace.
We recently introduced the BootCaT toolkit,[1] a suite of Perl programs implementing an iterative knowledge-poor procedure to bootstrap specialized corpora and term lists from the web.
In this paper, we report preliminary results from an ongoing study in which we use the BootCaT tools to extract Japanese hotel business terminology. The study started with practical motivations, that is, the interest of our Italian students of Japanese in this domain and the consequent need to build the relevant language resources for teaching. The study is also giving us a chance to test the cross-linguistic viability of the BootCaT tools by applying the procedure to a typologically (and orthographically) very different language.
[1] BootCaT stands for Bootstrapping Corpora and Terms. The toolkit is freely available from: http://sslmit.unibo.it/~baroni/bootcat.html
The rest of the paper is structured as follows: In section 2 we briefly review some related work. In section 3 and section 4 we describe the BootCaT procedure and how we tuned it for Japanese, respectively. In section 5 we present our experiments. We conclude in section 6 by sketching some future directions.
2 Related work
The idea of building a corpus using automated search engine queries originates from Ghani et al. (2001), who applied it to the creation of minority language corpora. Our corpus-comparison-based term extraction methodology was inspired by Rayson and Garside (2000). There is, of course, a large body of work on Japanese terminology, some of it involving web mining. For example, Fujii and Ishikawa (2000) use the web to search for definitions of pre-selected terms.
However, as far as we know, this is the first study presenting a full knowledge-poor procedure to extract Japanese terms and specialized corpora from the web.
3 The BootCaT procedure
The main corpus/term bootstrapping loop of the BootCaT procedure is illustrated in Figure 1. The bootstrapping process starts with a small list of seed terms representative of the investigated domain (hotel terminology in the present study). The seeds are randomly combined, and each combination is used as a Google query string. The top n pages returned from each query are retrieved and formatted as text. New seeds are extracted by comparing the frequency of words/terms in the retrieved corpus and in a reference corpus. In the current study, corpus comparison statistics are computed with the UCS toolkit (Evert, 2004). Random combinations of the newly extracted seed terms are then used for another round of Google queries, and a new corpus is created by retrieving and formatting the pages found in this round. The iterative procedure of terms/corpus extraction can be repeated as many times as desired (e.g., until the corpus reaches a certain size).
[Figure 1: The BootCaT loop. Select Initial Seeds (Terms) -> Combine Seeds Randomly -> Run Google Queries -> Retrieve Corpus -> Extract New Seeds (Terms) via Corpus Comparison -> back to Combine Seeds Randomly]
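The seed-combination step of the loop can be illustrated with a minimal Python sketch. This is not the BootCaT Perl code: the function names `random_tuples` and `to_query`, the default tuple size of 3, and the retry cap are assumptions made only for this example.

```python
import random

def random_tuples(seeds, n_tuples, tuple_size=3, rng=None):
    """Return up to n_tuples distinct random combinations of seed terms.

    Hypothetical helper, not part of BootCaT. Each combination is stored
    as a sorted tuple so the same set of seeds is never counted twice.
    """
    rng = rng or random.Random()
    unique = sorted(set(seeds))
    if tuple_size > len(unique):
        raise ValueError("not enough distinct seeds for the requested tuple size")
    tuples = set()
    attempts = 0
    max_attempts = n_tuples * 100  # assumed cap so the loop always terminates
    while len(tuples) < n_tuples and attempts < max_attempts:
        tuples.add(tuple(sorted(rng.sample(unique, tuple_size))))
        attempts += 1
    return sorted(tuples)

def to_query(combo):
    """Join the seeds of one combination into a single query string."""
    return " ".join(combo)

if __name__ == "__main__":
    seeds = ["yoyaku", "kyaku-sitsu", "ruumu-saabisu", "hoteru", "choushoku"]
    for combo in random_tuples(seeds, 5, rng=random.Random(0)):
        print(to_query(combo))
```

Each printed line would be submitted as one search engine query; the retrieval and corpus comparison steps are left out of this sketch.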
4 Adaptation to Japanese
There are two important issues in adapting the procedure described above to Japanese. First, Japanese web pages can be in different character sets (shift-jis, euc-jp, iso-2022-jp, utf-8); second, in Japanese words/tokens are not separated by whitespace or other delimiters. To solve the first problem, we changed the code of the BootCaT script that retrieves and formats web pages. This script now detects the character set declared in the HTML code of a page, and it converts the text of the page from the specified character set into utf-8. Since ChaSen (see below) expects input and output to be coded in euc-jp, we use the recode command line tool[2] to convert back and forth between utf-8 and euc-jp.
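The character set handling described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: it uses Python's built-in codecs rather than the modified BootCaT script or the recode tool, the regular expression is a simplified stand-in for real HTML parsing, and all function names are hypothetical.

```python
import re

# Simplified pattern for a charset declaration such as
# <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)

def declared_charset(raw_html, default="utf-8"):
    """Return the charset declared near the top of the page, or a default."""
    match = CHARSET_RE.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else default

def to_utf8(raw_html):
    """Decode a page with its declared charset and re-encode it as utf-8."""
    charset = declared_charset(raw_html)
    try:
        text = raw_html.decode(charset, errors="replace")
    except LookupError:
        # Charset name unknown to Python: fall back to utf-8 (an assumption).
        text = raw_html.decode("utf-8", errors="replace")
    return text.encode("utf-8")

def utf8_to_eucjp(data):
    """Convert utf-8 bytes to euc-jp, e.g. before calling a tokenizer."""
    return data.decode("utf-8").encode("euc_jp", errors="replace")

def eucjp_to_utf8(data):
    """Convert euc-jp bytes back to utf-8 after tokenization."""
    return data.decode("euc_jp", errors="replace").encode("utf-8")
```

A page declaring shift_jis, euc-jp or iso-2022-jp is thus normalized to utf-8 for storage, and converted to euc-jp only around the tokenization step.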
To split the retrieved text into tokens, we use ChaSen (Matsumoto et al., 2000), a powerful command line tool that performs Japanese tokenization, morphological analysis and POS tagging. The parsing and tokenization rules of ChaSen can be modified via parameter files. For our purposes, we added “under-segmenting” rules to preserve two complex templates, i.e., nominal compounds (e.g., yoyaku-kakunin ‘reservation-confirmation’; ryookin-hyoo ‘rate-chart’) and nouns prefixed by honorific markers (e.g., go-yoyaku ‘HONORIFIC-reservation’).
[2] http://recode.progiciels-bpi.ca/
By adding the nominal compound template to the ChaSen parameter file, we capture many candidate complex terms already in the tokenization phase. Thus, at the moment we do not distinguish between a simple and a complex term extraction phase (as we do, instead, when the BootCaT procedure is applied to Western languages). In future work, we would like to explore more sophisticated methods to extract complex terms in Japanese.
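The effect of the two under-segmenting templates can be illustrated on already tokenized input. The sketch below is not how ChaSen is configured (the study changes ChaSen's own parameter files); it is a post-hoc merge over (surface, tag) pairs, and the tag names `NOUN` and `PREFIX`, the hyphen joiner, and the function name are assumptions for illustration only.

```python
def merge_compounds(tokens, noun_tag="NOUN", prefix_tag="PREFIX"):
    """Merge runs of adjacent nouns, optionally led by one prefix, into one token.

    tokens is a list of (surface, tag) pairs; the tags are hypothetical
    and do not correspond to ChaSen's actual tagset.
    """
    merged, run = [], []

    def flush():
        # Emit the pending run as a single token and reset it.
        if run:
            surface = "-".join(s for s, _ in run)
            # A lone token keeps its own tag; a longer run counts as a noun.
            tag = run[0][1] if len(run) == 1 else noun_tag
            merged.append((surface, tag))
            run.clear()

    for surface, tag in tokens:
        if tag == noun_tag:
            run.append((surface, tag))
        elif tag == prefix_tag:
            flush()  # a prefix can only start a new run
            run.append((surface, tag))
        else:
            flush()
            merged.append((surface, tag))
    flush()
    return merged
```

With this merge, a sequence such as go + yoyaku + kakunin is kept as one candidate term instead of three separate tokens.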
5 Experiments
5.1 Preparation of materials
The second author, using her native speaker knowledge and manual web queries, prepared a list of 126 (simple and complex) terms typical of hotel terminology. Twenty of these 126 terms were used as initial seeds for the bootstrapping process: e.g., yoyaku ‘reservation’, kyaku-sitsu ‘guest room’, ruumu-saabisu ‘room service’. The remaining 106 terms are used for recall-oriented evaluation (see section 5.3.1 below).
Our procedure requires the comparison of the retrieved specialized corpora to a reference corpus. Since we did not own a Japanese corpus, we constructed one in the following way. We prepared a set of seeds by randomly selecting 100 words from the basic vocabulary list of an elementary Japanese textbook (Banno et al., 1999). The seeds were combined to form 100 random triplets, and these were used for Google queries. The corpus obtained by downloading and formatting the pages found in this way contains about 3.5M tokens. While, of course, it is not a balanced corpus, it does include texts belonging to a wide variety of topics, genres and styles.
5.2 Procedure
Using the BootCaT tools, we queried Google for 10 randomly constructed triplets of seeds. We retrieved 77 pages, and we tokenized the contents of those pages with ChaSen. We obtained a first corpus of about 100K tokens. We then used the UCS toolkit to find the most typical tokens of this corpus as compared to the reference corpus. In particular, we ranked the terms on the basis of two association measures, log-likelihood ratio and mutual information, computed on contingency tables of occurrences of terms in the specialized and reference corpora. Before computing mutual information, we filtered out terms that occurred fewer than 10 times in the specialized corpus.
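For readers unfamiliar with these measures, the following Python sketch computes both from a 2x2 contingency table (term vs. all other tokens, specialized vs. reference corpus). It uses the common textbook definitions; the exact variants implemented in the UCS toolkit may differ, and the function names and the `min_freq` default are assumptions chosen to mirror the frequency filter described above.

```python
import math

def contingency(freq_spec, size_spec, freq_ref, size_ref):
    """Observed and expected counts for one term in two corpora."""
    observed = [[freq_spec, freq_ref],
                [size_spec - freq_spec, size_ref - freq_ref]]
    total = size_spec + size_ref
    rows = [sum(row) for row in observed]
    cols = [size_spec, size_ref]
    expected = [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]
    return observed, expected

def log_likelihood(freq_spec, size_spec, freq_ref, size_ref):
    """Log-likelihood ratio: 2 * sum of O * ln(O / E) over the four cells."""
    observed, expected = contingency(freq_spec, size_spec, freq_ref, size_ref)
    total = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:  # empty cells contribute nothing
                total += observed[i][j] * math.log(observed[i][j] / expected[i][j])
    return 2.0 * total

def mutual_information(freq_spec, size_spec, freq_ref, size_ref):
    """Pointwise mutual information: log2 of observed over expected frequency."""
    if freq_spec == 0:
        return float("-inf")
    observed, expected = contingency(freq_spec, size_spec, freq_ref, size_ref)
    return math.log2(observed[0][0] / expected[0][0])

def rank_by_mi(spec_counts, ref_counts, min_freq=10):
    """Rank terms of the specialized corpus by MI, dropping rare terms first."""
    size_spec = sum(spec_counts.values())
    size_ref = sum(ref_counts.values())
    scored = [(mutual_information(f, size_spec, ref_counts.get(t, 0), size_ref), t)
              for t, f in spec_counts.items() if f >= min_freq]
    return [t for _, t in sorted(scored, reverse=True)]
```

A term that is equally frequent in both corpora scores zero on both measures; terms over-represented in the specialized corpus score above zero.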
Log-likelihood ratio and mutual information tend to find items at the opposite ends of the frequency scale. For example, at the top of the list ranked by log-likelihood ratio, we see frequent terms such as hoteru ‘hotel’ and choushoku ‘breakfast’; at the top of the list ranked by mutual information we see rarer terms such as karaoke-ruumu ‘karaoke room’ and yoyaku-kin ‘reservation fee’.
Combining the top 100 terms from the log-likelihood ratio and mutual information lists, we obtained a new set of 1 seed terms for the next run. In the second and third runs of the procedure, we built 50 triplets to be used as Google query strings. In the second run, we retrieved 236 pages which, again, we tokenized with ChaSen. The resulting corpus contained about 390K tokens. A new list of terms was extracted with the same corpus comparison method described above. This time, the combined list contained 194 terms. In the third run, we retrieved 225 pages, 865K tokens and 196 combined terms. In total, we retrieved 424 distinct terms. We decided to stop and analyze the data we collected up to this point.
5.3 Evaluation
5.3.1 Term quality
The second author rated all the extracted terms using a 3-point scale: irrelevant terms, somewhat relevant terms, completely relevant terms. The “somewhat relevant” category included toponyms and terms of closely related domains (e.g., travel and transportation). The results of this evaluation are summarized in table 1.
                    not        somewhat   very       total
                    relevant   relevant   relevant   terms
1st run, ll         13%        12%        75%        100
1st run, mi         7%         23%        70%        100
1st run, ll+mi      10.9%      16.4%      72.5%      1
2nd run, ll         18%        7%         75%        100
2nd run, mi         15%        25%        60%        100
2nd run, ll+mi      16.4%      16.4%      67%        194
3rd run, ll         23%        19%        58%        100
3rd run, mi         24%        30%        46%        100
3rd run, ll+mi      23.9%      25%        51%        196
combined, ll        16.9%      15.5%      67.4%      212
combined, mi        16.7%      28.2%      54.9%      262
combined, ll+mi     18.1%      23.3%      58.4%      424
Table 1: Relevance of retrieved terms
The results reported in this table are very promising: in the final combined list, almost 60% of the retrieved terms are very relevant, and less than 20% are completely irrelevant.[3]
A closer examination of the irrelevant items shows that most of them are grammatical morphemes/words (adverbial suffixes, conjunctions, conjugation endings, etc.).[4] This is particularly true in the log-likelihood lists, since grammatical morphemes tend to be high frequency items. Specifically, the most common grammatical elements extracted by the algorithm are those that are typical of interrogative/exhortative sentences in the polite register (for example, kudasai ‘please’). It is not surprising to find a high occurrence of such forms in pages addressed to tourists and potential hotel customers. Indeed, it may be useful to our target users (teachers and students of specialized languages) to be aware that the language of tourism is rich in this kind of expression.
We also performed recall-oriented evaluation by counting how many of the 106 non-seed items in our original list of manually picked terms (see section 5.1 above) were ranked by the automated procedure in the top 100/200 terms according to at least one measure. The results are reported in table 2.
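The recall computation just described amounts to taking the union of the top-ranked terms across measures and checking how many held-out terms it contains. A minimal Python sketch follows; the function name and the toy term lists are assumptions for illustration and do not reproduce the study's data.

```python
def recall_at_cutoff(ranked_lists, held_out, cutoff):
    """Share of held-out terms found in the top `cutoff` of at least one list.

    ranked_lists: one ranked term list per association measure.
    held_out: manually selected terms that were not used as seeds.
    """
    retrieved = set()
    for ranked in ranked_lists:
        retrieved.update(ranked[:cutoff])
    targets = set(held_out)
    if not targets:
        raise ValueError("held_out must contain at least one term")
    return len(targets & retrieved) / len(targets)

if __name__ == "__main__":
    ll_ranked = ["hoteru", "choushoku", "kudasai"]        # toy data
    mi_ranked = ["karaoke-ruumu", "yoyaku-kin", "hoteru"]  # toy data
    held_out = ["choushoku", "yoyaku-kin", "nakaniwa", "basu"]
    print(recall_at_cutoff([ll_ranked, mi_ranked], held_out, cutoff=2))
```

Passing a single list gives the per-measure figures, while passing both lists corresponds to the "ll+mi" setting.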
                    proportion of retrieved pre-selected terms
                    top 100 cutoff   top 200 cutoff
1st run, ll         15%              24.5%
1st run, mi         4.7%             16.9%
1st run, ll+mi      17.9%            26.4%
2nd run, ll         16.9%            26.4%
2nd run, mi         1.8%             4.7%
2nd run, ll+mi      17.9%            30.1%
3rd run, ll         6.6%             12.2%
3rd run, mi         1.8%             1.8%
3rd run, ll+mi      8.4%             14.1%
combined, ll        21.6%            32%
combined, mi        6.6%             19.8%
combined, ll+mi     24.5%            36.7%
Table 2: Recall of pre-selected terms
Even with the maximum recall setting (combined runs and measures, top 200 lists), just above one third of the manually selected terms were retrieved automatically. This is not necessarily bad, in light of our good precision results. It rather seems to suggest that the types of terms discovered by the algorithm tend to be complementary to those obtained on the basis of intuition. Interestingly, recall is decidedly lower in the mutual information lists than in the log-likelihood lists. This is probably due to the fact that mutual information is mostly picking up low frequency terms, whereas humans are more inclined to select high frequency terms as representative of a domain.
[3] If we select and combine the top 200 terms found with each measure and on each run, we obtain a total of 752 terms, 21.4% of which are irrelevant, 25.9% somewhat relevant and 52.6% very relevant.
[4] In an agglutinative language like Japanese, it is often hard to decide which elements should be considered independent function words and which elements should be treated as grammatical affixes.
Looking at the manually selected terms that were not in our final set, first of all we notice that some terms were missed since they are typical of Western hotels (e.g., nakaniwa ‘courtyard’), whereas the large majority of pages we retrieved pertain to Japanese hotels. Many terms are not present in the tokenized corpus because of segmentation issues. For example, the complex term yotsuboshi-hoteru ‘four star + hotel’ was incorrectly analyzed as yotsu-hoshihoteru = ‘four + star hotel’. Single terms such as basu ‘bath’ are often found only as part of (highly ranked) complex terms such as basu-taoru ‘bath towel’. For some missed terms, we found their equivalents prefixed by an honorific marker: e.g., go-yoyaku-torikeshi instead of yoyaku-torikeshi ‘reservation cancellation’. As we said, hotel sites tend to use a polite register, which is partly reflected in the frequent prefixation of the honorific marker go-.
5.3.2 Corpus quality
The retrieved corpora are used for term extraction, but they also constitute an important deliverable by themselves. To evaluate the quality of the corpora, we randomly selected 90 downloaded pages (30 pages from each of the three rounds). The second author judged these pages on a 3-point scale, assigning the highest score to pages that are highly informative, very reliable, and completely relevant. Out of the 30 web pages selected from the first corpus, 27 pages were assigned the highest rating, 1 page was assigned the intermediate rating, and 2 pages were assigned the lowest rating. Of the 30 web pages selected from the second corpus, 25 pages were assigned the highest rating, 3 pages were assigned the intermediate rating and 2 pages were assigned the lowest rating. Of the 30 web pages selected from the third corpus, 24 pages were assigned the highest rating, 2 pages were assigned the intermediate rating and 4 pages were assigned the lowest rating. These results indicate that the BootCaT procedure is able to find relevant pages with high precision, and that the increase in number of retrieved pages in the second and third runs does not appear to lower corpus quality too much.
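The page ratings above can be turned into proportions with a short calculation. The Python sketch below only restates the counts reported in this section; the dictionary layout and the function name are assumptions made for the example.

```python
# (highest, intermediate, lowest) rating counts per run, as reported above
ratings = {
    "run 1": (27, 1, 2),
    "run 2": (25, 3, 2),
    "run 3": (24, 2, 4),
}

def share_top_rated(counts):
    """Fraction of sampled pages that received the highest rating."""
    return counts[0] / sum(counts)

if __name__ == "__main__":
    for run, counts in ratings.items():
        print(f"{run}: {share_top_rated(counts):.1%} of {sum(counts)} pages top-rated")
    top = sum(c[0] for c in ratings.values())
    pages = sum(sum(c) for c in ratings.values())
    print(f"all runs: {top}/{pages} = {top / pages:.1%}")
```

On these counts the share of top-rated pages goes from 27/30 in the first run to 24/30 in the third, with 76 of the 90 sampled pages top-rated overall.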
6 Conclusion
Our experiments confirm that the BootCaT procedure, thanks to its modular and knowledge-poor nature, can be easily adapted to mine usable resources from typologically unrelated languages.
Future research will focus on the development of segmentation rules that avoid excessive over-segmentation and under-segmentation. We will also develop techniques to extract complex terms in a more systematic way. More generally, we would like to study how factors such as reference corpus and quality and number of iterations affect the results.
Given that the BootCaT tools and the other programs we used (ChaSen, UCS and recode) are freely available and open-source, we hope that interested researchers and language professionals will help to test, improve and extend the procedure.
References
E. Banno, Y. Onno, Y. Sakane and C. Shinagawa. 1999. Genki: An integrated course in elementary Japanese. Tokyo: The Japan Times.
M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004.
S. Evert. 2004. The Statistics of Word Cooccurrences: Bigrams and Collocations. Ph.D. thesis, University of Stuttgart.
A. Fujii and T. Ishikawa. 2000. Utilizing the World Wide Web as an encyclopedia: Extracting term descriptions from semi-structured texts. ACL 2000.
R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create minority language corpora. CIKM 2001, 279–286.
A. Kilgarriff and G. Grefenstette. 2003. Introduction to the special issue on the web as corpus. Computational Linguistics, 29:333–347.
Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda, K. Takaoka, and M. Asahara. 2000. Morphological analysis system ChaSen version 2.2.1 manual. NAIST Technical Report.
P. Rayson and R. Garside. 2000. Comparing corpora using frequency profiling. Proceedings of Workshop on Comparing Corpora of ACL 2000, 1–6.