
Retrieving Japanese specialized terms and corpora from the World Wide Web



Marco Baroni
SSLMIT, University of Bologna
Corso della Repubblica 136
47100 Forlì, Italy
baroni@sslmit.unibo.it

Motoko Ueyama
SSLMIT, University of Bologna
Corso della Repubblica 136
47100 Forlì, Italy
motoko@sslmit.unibo.it

Abstract

The BootCaT toolkit (Baroni and Bernardini, 2004) is a suite of Perl programs implementing a procedure to bootstrap specialized corpora and terms from the web using minimal knowledge sources. In this paper, we report ongoing work in which we apply the BootCaT procedure to a Japanese corpus and term extraction task in the hotel terminology domain. The results of our experiments are very encouraging, indicating that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which we tested the procedure originally.

1 Introduction

The World Wide Web is a rich source of easily accessible language data (Kilgarriff and Grefenstette, 2003). Among those who can benefit from this resource, there are language professionals (language teachers, translators, interpreters, etc.) who routinely work with a variety of specialized languages, where new terms are introduced at a fast pace.

We recently introduced the BootCaT toolkit,¹ a suite of Perl programs implementing an iterative knowledge-poor procedure to bootstrap specialized corpora and term lists from the web.

In this paper, we report preliminary results from an ongoing study in which we use the BootCaT tools to extract Japanese hotel business terminology. The study started with practical motivations, that is, the interest of our Italian students of Japanese in this domain and the consequent need to build the relevant language resources for teaching. The study is also giving us a chance to test the cross-linguistic viability of the BootCaT tools by applying the procedure to a typologically (and orthographically) very different language.

¹ BootCaT stands for Bootstrapping Corpora and Terms. The toolkit is freely available from: http://sslmit.unibo.it/~baroni/bootcat.html

The rest of the paper is structured as follows: In section 2 we briefly review some related work. In section 3 and section 4 we describe the BootCaT procedure and how we tuned it for Japanese, respectively. In section 5 we present our experiments. We conclude in section 6 by sketching some future directions.

2 Related work

The idea of building a corpus using automated search engine queries originates from Ghani et al. (2001), who applied it to the creation of minority language corpora. Our corpus-comparison-based term extraction methodology was inspired by Rayson and Garside (2000). There is, of course, a large body of work on Japanese terminology, some of it involving web mining. For example, Fujii and Ishikawa (2000) use the web to search for definitions of pre-selected terms.

However, as far as we know, this is the first study presenting a full knowledge-poor procedure to extract Japanese terms and specialized corpora from the web.

3 The BootCaT procedure

The main corpus/term bootstrapping loop of the BootCaT procedure is illustrated in Figure 1. The bootstrapping process starts with a small list of seed terms representative of the investigated domain (hotel terminology in the present study). The seeds are randomly combined, and each combination is used as a Google query string. The top n pages returned from each query are retrieved and formatted as text. New seeds are extracted by comparing the frequency of words/terms in the retrieved corpus and in a reference corpus. In the current study, corpus comparison statistics are computed with the UCS toolkit (Evert, 2004). Random combinations of the newly extracted seed terms are then used for another round of Google queries, and a new corpus is created by retrieving and formatting the pages found in this round. The iterative procedure of terms/corpus extraction can be repeated as many times as desired (e.g., until the corpus reaches a certain size).

[Flowchart: Select Initial Seeds (Terms) -> Combine Seeds Randomly -> Run Google Queries -> Retrieve Corpus -> Extract New Seeds (Terms) via Corpus Comparison]

Figure 1: The BootCaT loop
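The loop described in this section can be outlined in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the BootCaT Perl code: the names `random_tuples` and `bootstrap`, and the three callables `search`, `fetch_text` and `extract_terms` (standing in for the search engine query, the page retrieval/formatting step and the corpus comparison step), are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple


def random_tuples(seeds: Sequence[str], n_tuples: int, size: int,
                  rng: random.Random) -> List[Tuple[str, ...]]:
    """Draw up to n_tuples distinct random combinations of `size` seeds."""
    pool = sorted(set(seeds))
    if len(pool) < size:
        raise ValueError("not enough distinct seeds for one combination")
    # Never ask for more combinations than actually exist.
    target = min(n_tuples, math.comb(len(pool), size))
    combos: List[Tuple[str, ...]] = []
    seen = set()
    while len(combos) < target:
        combo = tuple(sorted(rng.sample(pool, size)))
        if combo not in seen:
            seen.add(combo)
            combos.append(combo)
    return combos


def bootstrap(seeds: Sequence[str],
              search: Callable[[str], List[str]],        # query string -> URLs (hypothetical)
              fetch_text: Callable[[str], str],          # URL -> plain text (hypothetical)
              extract_terms: Callable[[List[str]], List[str]],  # corpus -> new seeds (hypothetical)
              rounds: int, n_tuples: int = 10, size: int = 3, top_n: int = 10,
              rng: Optional[random.Random] = None) -> Tuple[List[str], List[str]]:
    """Run `rounds` iterations; return the last seed list and the last corpus."""
    rng = rng or random.Random()
    current = list(seeds)
    corpus: List[str] = []
    for _ in range(rounds):
        corpus, visited = [], set()          # a new corpus is built in every round
        for combo in random_tuples(current, n_tuples, size, rng):
            for url in search(" ".join(combo))[:top_n]:
                if url not in visited:       # skip pages already retrieved this round
                    visited.add(url)
                    corpus.append(fetch_text(url))
        current = extract_terms(corpus)
    return current, corpus
```

Passing the three steps in as functions keeps the sketch independent of any particular search API or tokenizer, which mirrors the modular design of the procedure.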

4 Adaptation to Japanese

There are two important issues in adapting the procedure described above to Japanese. First, Japanese web pages can be in different character sets (shift-jis, euc-jp, iso-2022-jp, utf-8); second, in Japanese words/tokens are not separated by whitespace or other delimiters. To solve the first problem, we changed the code of the BootCaT script that retrieves and formats web pages. This script now detects the character set declared in the HTML code of a page and converts the text of the page from that character set into utf-8. Since ChaSen (see below) expects input and output to be coded in euc-jp, we use the recode command line tool² to convert back and forth between utf-8 and euc-jp.
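As a rough illustration of this character set handling, the following Python sketch reads the charset declared in a page's `<meta>` tag and normalizes the page to utf-8. It is not the BootCaT script: the function names are hypothetical, the regular expression is a deliberately simple assumption about how the declaration appears, and Python's built-in codecs are used here in place of the recode tool for the utf-8/euc-jp conversion.

```python
import re

# Matches charset=... inside a <meta> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_\-]+)""", re.IGNORECASE)


def declared_charset(raw_html: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the HTML head, or `default` if none is found."""
    match = _META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def to_utf8(raw_html: bytes) -> bytes:
    """Decode a page using its declared charset and re-encode it as utf-8."""
    charset = declared_charset(raw_html)
    try:
        text = raw_html.decode(charset, errors="replace")
    except LookupError:                      # charset name unknown to Python
        text = raw_html.decode("utf-8", errors="replace")
    return text.encode("utf-8")


def utf8_to_eucjp(data: bytes) -> bytes:
    """Convert utf-8 bytes to euc-jp, e.g. before handing text to a tokenizer."""
    return data.decode("utf-8").encode("euc_jp", errors="replace")


def eucjp_to_utf8(data: bytes) -> bytes:
    """Convert euc-jp tokenizer output back to utf-8."""
    return data.decode("euc_jp", errors="replace").encode("utf-8")
```

Pages with a missing or wrong declaration would still need a fallback such as statistical encoding detection, which this sketch does not attempt.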

To split the retrieved text into tokens, we use ChaSen (Matsumoto et al., 2000), a powerful command line tool that performs Japanese tokenization, morphological analysis and POS tagging. The parsing and tokenization rules of ChaSen can be modified via parameter files. For our purposes, we added “under-segmenting” rules to preserve two complex templates, i.e., nominal compounds (e.g., yoyaku-kakunin ‘reservation-confirmation’; ryookin-hyoo ‘rate-chart’) and nouns prefixed by honorific markers (e.g., go-yoyaku ‘HONORIFIC-reservation’).

² http://recode.progiciels-bpi.ca/

By adding the nominal compound template to the ChaSen parameter file, we capture many candidate complex terms already in the tokenization phase. Thus, at the moment we do not distinguish between a simple and a complex term extraction phase (as we do, instead, when the BootCaT procedure is applied to Western languages). In future work, we would like to explore more sophisticated methods to extract complex terms in Japanese.
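The effect of the two under-segmenting templates can be imitated as a post-processing step over POS-tagged tokens. The sketch below is an assumption-laden illustration and not a ChaSen parameter file: the tag names `NOUN` and `HON_PREFIX` are hypothetical stand-ins for the tagger's real labels, and the romanized tokens are used only for readability.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag); tag names are hypothetical


def under_segment(tokens: List[Token]) -> List[Token]:
    """Merge noun sequences, optionally led by an honorific prefix, into one token."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        surface, pos = tokens[i]
        if pos in ("NOUN", "HON_PREFIX"):
            # A prefix must be followed by at least one noun to start a template.
            first_noun = i + 1 if pos == "HON_PREFIX" else i
            end = first_noun
            while end < len(tokens) and tokens[end][1] == "NOUN":
                end += 1
            if end > first_noun:
                parts = [t[0] for t in tokens[i:end]]
                merged.append(("-".join(parts), "NOUN"))
                i = end
                continue
        merged.append((surface, pos))        # anything else is kept unchanged
        i += 1
    return merged
```

For example, the token sequence go / yoyaku / torikeshi tagged as prefix, noun, noun would come out as the single candidate term go-yoyaku-torikeshi.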

5 Experiments

5.1 Preparation of materials

The second author, using her native speaker knowledge and manual web queries, prepared a list of 126 (simple and complex) terms typical of hotel terminology. Twenty of these 126 terms were used as initial seeds for the bootstrapping process: e.g., yoyaku ‘reservation’, kyaku-sitsu ‘guest room’, ruumu-saabisu ‘room service’. The remaining 106 terms were used for recall-oriented evaluation (see section 5.3.1 below).

Our procedure requires the comparison of the retrieved specialized corpora to a reference corpus. Since we did not own a Japanese corpus, we constructed one in the following way. We prepared a set of seeds by randomly selecting 100 words from the basic vocabulary list of an elementary Japanese textbook (Banno et al., 1999). The seeds were combined to form 100 random triplets, and these were used for Google queries. The corpus obtained by downloading and formatting the pages found in this way contains about 3.5M tokens. While, of course, it is not a balanced corpus, it does include texts belonging to a wide variety of topics, genres and styles.

5.2 Procedure

Using the BootCaT tools, we queried Google for 10 randomly constructed triplets of seeds. We retrieved 77 pages, and we tokenized the contents of those pages with ChaSen. We obtained a first corpus of about 100K tokens. We then used the UCS toolkit to find the most typical tokens of this corpus as compared to the reference corpus. In particular, we ranked the terms on the basis of two association measures, log-likelihood ratio and mutual information, computed on contingency tables of occurrences of terms in the specialized and reference corpora. Before computing mutual information, we filtered out terms that occurred fewer than 10 times in the specialized corpus.
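To make the corpus comparison step concrete, the sketch below computes both measures for one term from a 2x2 contingency table (term vs. all other tokens, specialized vs. reference corpus). It assumes the standard definitions, G² = 2 Σ O ln(O/E) for the log-likelihood ratio and log2(O11/E11) for mutual information; the UCS toolkit's exact formulas and options may differ. The function names, and the choice to drop terms that are not over-represented in the specialized corpus, are assumptions of this example.

```python
import math
from typing import Dict, List, Tuple


def association(f_spec: int, f_ref: int, n_spec: int, n_ref: int) -> Tuple[float, float]:
    """Return (log-likelihood ratio G2, mutual information) for one term."""
    observed = [[f_spec, f_ref], [n_spec - f_spec, n_ref - f_ref]]
    total = n_spec + n_ref
    rows = [f_spec + f_ref, total - f_spec - f_ref]   # term / all other tokens
    cols = [n_spec, n_ref]                            # specialized / reference
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:                                 # 0 * ln(0) is taken as 0
                g2 += o * math.log(o / (rows[i] * cols[j] / total))
    g2 *= 2
    e11 = rows[0] * cols[0] / total
    mi = math.log2(f_spec / e11) if f_spec > 0 else float("-inf")
    return g2, mi


def rank_terms(spec: Dict[str, int], ref: Dict[str, int],
               min_freq_mi: int = 10) -> Tuple[List[str], List[str]]:
    """Rank terms of the specialized corpus by each measure, best first."""
    n_spec, n_ref = sum(spec.values()), sum(ref.values())
    by_ll, by_mi = [], []
    for term, freq in spec.items():
        g2, mi = association(freq, ref.get(term, 0), n_spec, n_ref)
        if mi <= 0:            # G2 is two-sided: keep over-represented terms only
            continue
        by_ll.append((g2, term))
        if freq >= min_freq_mi:                       # frequency filter before MI
            by_mi.append((mi, term))
    return ([t for _, t in sorted(by_ll, reverse=True)],
            [t for _, t in sorted(by_mi, reverse=True)])
```

The frequency threshold matters because mutual information rewards any term that is absent from the reference corpus, however rare it is in the specialized one.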

Log-likelihood ratio and mutual information tend to find items at the opposite ends of the frequency scale. For example, at the top of the list ranked by log-likelihood ratio, we see frequent terms such as hoteru ‘hotel’ and choushoku ‘breakfast’; at the top of the list ranked by mutual information we see rarer terms such as karaoke-ruumu ‘karaoke room’ and yoyaku-kin ‘reservation fee’.

Combining the top 100 terms from the log-likelihood ratio and mutual information lists, we obtained a new set of 1 seed terms for the next run. In the second and third runs of the procedure, we built 50 triplets to be used as Google query strings. In the second run, we retrieved 236 pages which, again, we tokenized with ChaSen. The resulting corpus contained about 390K tokens. A new list of terms was extracted with the same corpus comparison method described above. This time, the combined list contained 194 terms. In the third run, we retrieved 225 pages, yielding 865K tokens and 196 combined terms. In total, we retrieved 424 distinct terms. We decided to stop and analyze the data we collected up to this point.

5.3 Evaluation

5.3.1 Term quality

The second author rated all the extracted terms using a 3-point scale: irrelevant terms, somewhat relevant terms, completely relevant terms. The “somewhat relevant” category included toponyms and terms of closely related domains (e.g., travel and transportation). The results of this evaluation are summarized in Table 1.

                    not         somewhat    very        total
                    relevant    relevant    relevant    terms
1st run, ll         13%         12%         75%         100
1st run, mi         7%          23%         70%         100
1st run, ll+mi      10.9%       16.4%       72.5%       1
2nd run, ll         18%         7%          75%         100
2nd run, mi         15%         25%         60%         100
2nd run, ll+mi      16.4%       16.4%       67%         194
3rd run, ll         23%         19%         58%         100
3rd run, mi         24%         30%         46%         100
3rd run, ll+mi      23.9%       25%         51%         196
combined, ll        16.9%       15.5%       67.4%       212
combined, mi        16.7%       28.2%       54.9%       262
combined, ll+mi     18.1%       23.3%       58.4%       424

Table 1: Relevance of retrieved terms

The results reported in this table are very promising: in the final combined list, almost 60% of the retrieved terms are very relevant, and less than 20% are completely irrelevant.³

A closer examination of the irrelevant items shows that most of them are grammatical morphemes/words (adverbial suffixes, conjunctions, conjugation endings, etc.).⁴ This is particularly true in the log-likelihood lists, since grammatical morphemes tend to be high frequency items. Specifically, the most common grammatical elements extracted by the algorithm are those that are typical of interrogative/exhortative sentences in the polite register (for example, kudasai ‘please’). It is not surprising to find a high occurrence of such forms in pages addressed to tourists and potential hotel customers. Indeed, it may be useful to our target users (teachers and students of specialized languages) to be aware that the language of tourism is rich in expressions of this kind.

We also performed recall-oriented evaluation by counting how many of the 106 non-seed items in our original list of manually picked terms (see section 5.1 above) were ranked by the automated procedure in the top 100/200 terms according to at least one measure. The results are reported in Table 2.
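The recall figure described here is simple to state in code. The following sketch assumes each ranking is a list of terms ordered from best to worst; the function name `recall_at` is hypothetical and the toy terms in the usage note are invented for illustration only.

```python
from typing import Iterable, List, Sequence


def recall_at(held_out: Iterable[str],
              rankings: Sequence[List[str]],
              cutoff: int) -> float:
    """Share of held-out terms found in the top `cutoff` of at least one ranking."""
    targets = set(held_out)
    if not targets:
        raise ValueError("held_out must contain at least one term")
    found = set()
    for ranking in rankings:
        found.update(ranking[:cutoff])       # union over measures (and/or runs)
    return len(targets & found) / len(targets)
```

With held-out terms a, b, c, d and two rankings [a, x, b] and [y, c, z], a cutoff of 2 finds a and c, so the function returns 0.5. Passing a single ranking gives the per-measure rows of the table, and passing both gives the ll+mi rows.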

                    proportion of retrieved pre-selected terms
                    top 100 cutoff    top 200 cutoff
1st run, ll         15%               24.5%
1st run, mi         4.7%              16.9%
1st run, ll+mi      17.9%             26.4%
2nd run, ll         16.9%             26.4%
2nd run, mi         1.8%              4.7%
2nd run, ll+mi      17.9%             30.1%
3rd run, ll         6.6%              12.2%
3rd run, mi         1.8%              1.8%
3rd run, ll+mi      8.4%              14.1%
combined, ll        21.6%             32%
combined, mi        6.6%              19.8%
combined, ll+mi     24.5%             36.7%

Table 2: Recall of pre-selected terms

Even with the maximum recall setting (combined runs and measures, top 200 lists), just above one third of the manually selected terms were retrieved automatically. This is not necessarily bad, in light of our good precision results. It rather seems to suggest that the types of terms discovered by the algorithm tend to be complementary to those obtained on the basis of intuition. Interestingly, recall is decidedly lower in the mutual information lists than in the log-likelihood lists. This is probably due to the fact that mutual information is mostly picking up low frequency terms, whereas humans are more inclined to select high frequency terms as representative of a domain.

³ If we select and combine the top 200 terms found with each measure and on each run, we obtain a total of 752 terms, 21.4% of which are irrelevant, 25.9% somewhat relevant and 52.6% very relevant.

⁴ In an agglutinative language like Japanese, it is often hard to decide which elements should be considered independent function words and which elements should be treated as grammatical affixes.

Looking at the manually selected terms that were not in our final set, first of all we notice that some terms were missed since they are typical of Western hotels (e.g., nakaniwa ‘courtyard’), whereas the large majority of pages we retrieved pertain to Japanese hotels. Many terms are not present in the tokenized corpus because of segmentation issues. For example, the complex term yotsuboshi-hoteru ‘four star + hotel’ was incorrectly analyzed as yotsu-hoshihoteru = ‘four + star hotel’. Single terms such as basu ‘bath’ are often found only as part of (highly ranked) complex terms such as basu-taoru ‘bath towel’. For some missed terms, we found their equivalents prefixed by an honorific marker: e.g., go-yoyaku-torikeshi instead of yoyaku-torikeshi ‘reservation cancellation’. As we said, hotel sites tend to use a polite register, which is partly reflected in the frequent prefixation of the honorific marker go-.

5.3.2 Corpus quality

The retrieved corpora are used for term extraction, but they also constitute an important deliverable by themselves. To evaluate the quality of the corpora, we randomly selected 90 downloaded pages (30 pages from each of the three rounds). The second author judged these pages on a 3-point scale, assigning the highest score to pages that are highly informative, very reliable, and completely relevant. Out of the 30 web pages selected from the first corpus, 27 pages were assigned the highest rating, 1 page was assigned the intermediate rating, and 2 pages were assigned the lowest rating. Of the 30 web pages selected from the second corpus, 25 pages were assigned the highest rating, 3 pages were assigned the intermediate rating and 2 pages were assigned the lowest rating. Of the 30 web pages selected from the third corpus, 24 pages were assigned the highest rating, 2 pages were assigned the intermediate rating and 4 pages were assigned the lowest rating. These results indicate that the BootCaT procedure is able to find relevant pages with high precision, and that the increase in number of retrieved pages in the second and third runs does not appear to lower corpus quality too much.
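The page counts above can be turned into per-round shares with a few lines of arithmetic. The snippet below is only a convenience for reading the figures; the counts are those reported in the text, while the dictionary layout and the function name `share_highest` are assumptions of this example.

```python
from typing import Dict, Tuple

# (highest, intermediate, lowest) ratings for the 30 sampled pages of each corpus
RATINGS: Dict[str, Tuple[int, int, int]] = {
    "first": (27, 1, 2),
    "second": (25, 3, 2),
    "third": (24, 2, 4),
}


def share_highest(counts: Tuple[int, int, int]) -> float:
    """Fraction of sampled pages that received the highest rating."""
    return counts[0] / sum(counts)


per_round = {name: share_highest(counts) for name, counts in RATINGS.items()}
overall = sum(c[0] for c in RATINGS.values()) / sum(sum(c) for c in RATINGS.values())
```

The share of top-rated pages goes from 27/30 in the first sample to 25/30 and 24/30 in the later ones, and 76 of the 90 sampled pages (about 84%) received the highest rating overall.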

6 Conclusion

Our experiments confirm that the BootCaT procedure, thanks to its modular and knowledge-poor nature, can be easily adapted to mine usable resources from typologically unrelated languages.

Future research will focus on the development of segmentation rules that avoid excessive over-segmentation and under-segmentation. We will also develop techniques to extract complex terms in a more systematic way. More generally, we would like to study how factors such as reference corpus and quality and number of iterations affect the results.

Given that the BootCaT tools and the other programs we used (ChaSen, UCS and recode) are freely available and open-source, we hope that interested researchers and language professionals will help to test, improve and extend the procedure.

References

E. Banno, Y. Ohno, Y. Sakane, and C. Shinagawa. 1999. Genki: An integrated course in elementary Japanese. Tokyo: The Japan Times.

M. Baroni and S. Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004.

S. Evert. 2004. The Statistics of Word Cooccurrences: Bigrams and Collocations. Ph.D. thesis, University of Stuttgart.

A. Fujii and T. Ishikawa. 2000. Utilizing the World Wide Web as an encyclopedia: Extracting term descriptions from semi-structured texts. ACL 2000.

R. Ghani, R. Jones, and D. Mladenic. 2001. Mining the web to create minority language corpora. CIKM 2001, 279–286.

A. Kilgarriff and G. Grefenstette. 2003. Introduction to the special issue on the web as corpus. Computational Linguistics, 29:333–347.

Y. Matsumoto, A. Kitauchi, T. Yamashita, Y. Hirano, H. Matsuda, K. Takaoka, and M. Asahara. 2000. Morphological analysis system ChaSen version 2.2.1 manual. NAIST Technical Report.

P. Rayson and R. Garside. 2000. Comparing corpora using frequency profiling. Proceedings of Workshop on Comparing Corpora of ACL 2000, 1–6.
