ContentslistsavailableatScienceDirect
Data&KnowledgeEngineering
journalhomepage:www.elsevier.com/locate/datak
OndetectionofemerginganomaloustrafficpatternsusingGPSdata
LinseyXiaolinPanga,d,⁎,SanjayChawlaa,WeiLiub,YuZhengcabcdSchoolofInformationTechnologies,UniversityofSydney,Australia
Dept.ofComputerScienceandSoftwareEngineering,UniversityofMelbourne,AustraliaMicrosoftResearchAsiaNICTA,Sydney,Australia
articleinfoabstract
Theincreasingavailabilityoflarge-scaletrajectorydataprovidesusgreatopportunitytoexplorethemforknowledgediscoveryintransportationsystemsusingadvanceddataminingtechniques.Nowadays,largenumberoftaxicabsinmajormetropolitancitiesareequippedwithaGPSdevice.Sincetaxisareontheroadnearly24haday(withdriverschangingshifts),theycannowactasreliablesensorstomonitorthebehavioroftraffic.Inthisarticle,weuseGPSdatafromtaxistomonitortheemergenceofunexpectedbehaviorintheBeijingmetropolitanarea,whichhasthepotentialtoestimateandimprovetrafficconditionsinadvance.Weadaptlikelihoodratioteststatistic(LRT)whichhavepreviouslybeenmostlyusedinepidemiologicalstudiestodescribetrafficpatterns.TothebestofourknowledgetheuseofLRTintrafficdomainisnotonlynovelbutresultsinaccurateandrapiddetectionofanomalousbehavior.
CrownCopyright©2013PublishedbyElsevierB.V.Allrightsreserved.
Availableonline18May2013Keywords:Datamining
MiningmethodsandalgorithmsSpatial/temporaldatabases
1.Introduction
WiththeincreasingavailabilityofhighresolutionGPStracesfromvehiclesinlargemetropolitanareas,thereisanopportunitytoinfersophisticatedpatternsandtrendswhichtillnowhasnotbeenpossible[1].Theinferredtrendscanthenbeusedasinputintopolicyplanningacrossavarietyofdomainsincludingtrafficmanagement,urbanplanningandenvironmentalmonitoring.
Inthisarticle,weapplystatisticalapproachonmassivetaxilocationtraces,whichareexploredtoextracttheoutliertrafficpatternintransportationsystems.WeknowthousandsoftaxisplytheroadsoflargemetropolitancitieslikeNewYork,London,BeijingandTokyoeveryday.Mosttaxisareontheroad24hadaywithdriverschangingshifts.ManyofthesetaxisarenowequippedwithGPSandtheirspatio-temporalcoordinatesareavailable.Thusifacityispartitionedintoagridthenatagiventimewecanestimatethecountofthenumberoftaxisinthegridcells.Overtime,thecellcountswillsettleintoapatternandvaryperiodically.Forexample,duringmorningrushhourmoretaxiswillbeconcentratedinbusinessdistrictsthanatothertimesoftheday.Similarlytaxicountsnearairportswillsynchronizewithaircraftarrivalanddepartureschedules.Occasionallytherewillbeadepartureofthecellcountsfromperiodicbehaviorduetounforeseeneventslikevehiclebreakdownsorone-timeeventslikebigsportingevents,fairsandconventions.
Ourobjectiveistoidentifycontiguoussetofcellsandtimeintervalswhichhavethelargeststatisticallysignificantdeparturefromexpectedbehavior.
Oncesuchregionsandtimeintervalshavebeendiscoveredthenexpertscanbeginidentifyingeventswhichmayhavecausedtheunexpectedbehavior.Thisinturncanhelpmakeprovisionstomanagefuturetrafficbehavior.Similarproblemsappearinmanyotherdomains.Forexample,governmenthealthcareagenciesareinterestedindetectingemergenceofdiseasepatternswhichdeviatefromexpectedbehavior.
⁎Correspondingauthorat:SchoolofInformationTechnologies,UniversityofSydney,Australia.
E-mailaddresses:qlinsey@it.usyd.edu.au(L.X.Pang),sanjay.chawla@sydney.edu.au(S.Chawla),wei.liu@unimelb.edu.au(W.Liu),yuzheng@microsoft.com(Y.Zheng).
0169-023X/$–seefrontmatter.CrownCopyright©2013PublishedbyElsevierB.V.Allrightsreserved.http://dx.doi.org/10.1016/j.datak.2013.05.002
358L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373
Thenumberofcontiguousregionsandtimeintervalsisverylarge.Forexample,ifthespatialgridcorrespondstoan×nmatrixandthereareTtimeintervals,thentherearepotentiallyO(n2T)spatio-temporalcellsandO(n4T2)cubicregions.1Thehugeamountofspatio-temporaldata,suchastaxicountacrossdifferentgridregionswithindifferenttimestepsfromminutestohourstodays,requiresanefficientapproachtodetectspatial–temporaloutliersforpredictingabnormaleventsandimplementingtrafficcontrolmeasuresinadvance.Forthismotivation,weapplytheroadnetworkofBeijingandpartitionitintogridtofindoutliers(Fig.1).
Inapaperofparticularrelevancetoourwork,theLRTframework[2]statesthecomputationcostforsinglestatisticvalueaswellasenumeratingallthespatialregionstobeexpensive.Toavoidperformingstatisticalcomputationsforeveryregion,itprovidesapruningstrategybasedonclassicallikelihoodteststatistic.Inthisarticle,weextendtheLRTframeworktodetectabnormaltrafficpattern.Morespecifically,thecontributionsare:••••
Ageneralandefficientpatternminingapproachforspatio-temporaloutlierdetectionisproposed.Persistentandemergingoutlierdetectionstatisticalmodelsareprovided.
Wegiveourproofthattheupper-boundingstrategyofLRTisapplicableto“persistent”and“emerging”outlierdetectionmodels.Experimentsareconductedonsyntheticdatatoverifytheextendedpruningapproachandshowthesignificantimprovementofsearchingwhendatasetsizeislarge;wealsoperformedrealdatavalidationinthedetectionofemergingtaxicounttrendduetosomemajorevents.
Therestofthisarticleisorganizedasfollows.Section2reviewsrelatedwork.Section3illustratesthestatisticalbackgroundandupper-boundingmethodologyforpruning.Section4proposesourapproach,inwhichthestatisticaldetectionmodelsareprovided.Theupper-boundingandpruningmechanisminthisframeworkbasedonourproofarepresentedinSection5.Computationalcomplexityisalsodiscussedinthissection.Section6showstheexperimentsandcasestudies.Finally,Section7concludesthisworkandSection8givesthefuturework.2.Relatedwork2.1.Trafficoutliers
Recently,quiteafewresearchprojectsstartedtofindouttrafficpatternsandanomaliesusingtaxitrajectories[3–6].Forinstance,toprovideauserwiththefastestroutetoagivendestinationatagivendeparturetime,Yuanetal.[4]minesmartdrivingdirectionsfromhistoricalGPStrajectoriesofalargenumberoftaxis.Itproposesatime-dependentlandmarkgraphtomodelthepropertiesofdynamicroadnetworksandappliestwo-stageroutingalgorithmtofindtheefficientdrivingdirections.InanotherworkofYuanetal.[3],itpresentsacloud-basedsystemtoretrievethefastestdrivingroutesbasedontrafficconditionsanddriverbehavior.TheCloudbuildsamodelincorporatingdayoftheweek,timeofday,weatherconditions,andindividualdrivingstrategies.Usingthismodel,thesystempredictsthetrafficconditionsofafuturetimebyagivenrouteandperformsaself-adaptivedrivingdirectionserviceforaparticularuser.
Miningtrafficpatternisanimportantresearchapproach,butdetectingtheoutliersfrommaintrafficflowisalsomeaningful.Forinstance,whenatrafficincidentorjamhappens,trafficflowchangessuddenlyandthiswillbereflectedbyoutliers.Trafficincidentscanbedetectedthroughrecognizingoutliers.Suchunusualtrafficpatternreflectsabnormaltrafficstreamsonroadnetworksandprovidesuseful,importantandvaluableinformation.Unknownbutpotentiallyimportantpatternscanbeforecastbyanalyzingtheseoutliers.Therefore,thedetectionofoutliers/anomaliesfromtrajectorydatacanhelpinsensingabnormaleventsandplanfortheirimpacttoensuresmootherflowoftraffic.InLiuetal.'s[5]work,algorithmsarepresentedfordiscoveringspatio-temporaloutliersandcausalrelationships.Thediscoveryofrelationships,especiallycausalinteractions,amongdetectedtrafficoutliersisinvestigated.ChawlaandZhengetal.[7]furtherdiagnosedetectedtrafficanomaliesbystudyingthetrafficflows(paths)thatleadtoananomaly.InZhengetal.'s[6]work,itdetectsflawedurbanplanningusingtheGPStrajectoriesoftaxicabstravelinginurbanareas.Itfindsanomalouspatternsinacityusingtaxitrajectorieswhichprovidesadeeperunderstandingoftheflawedplanning.Althoughthesetwoapproachesareusedtodetecttrafficoutliers,theyaredifferentfromourworksincewedetecttrafficoutliersusingstatistical-basedapproachandwehaveadifferentdefinitionforabnormaltrafficpatterns.2.2.Outlierdetectionmethods
Tillnow,manyanomalydetectiontechniqueshavebeenspecificallydevelopedforcertainapplicationdomains,whileothersaremoregeneric.Thetechniquescanbecategorizedasclassification-based,distanced-based,clustered-based,andstatistical-based,etc.Inthesesurveys[8,9],theygivecomprehensiveandhigh-leveloverviewofdifferentoutlierdetectiontechniquesandsomeoftheirapplications.
Inthisarticle,wefocusonthestatistic-basedapproachtodetectspatio-temporaloutlier.Theunderlyingprincipleofanystatisticaloutlierdetectiontechniqueis:“Ananomalyisanobservationwhichissuspectedofbeingpartiallyorwhollyirrelevant
Inthiswork,n×nspatialgridandTtimeintervalsaremappedtoathree-dimensionalgrid.Theunitcellisintheshapeofacubeandeverysub-regioninthegridiscalledcubicregion.
1L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373359
(a) Road Network of Beijing(b) Grid Map
Fig.1.AnexampleofthetrafficnetworkofBeijing.Basedonthelongitudeandlatitude,theentirecityispartitionedintoagridmap.Sub-figure(a)ispartitionedintosub-figure(b).
becauseitisnotgeneratedbythestochasticmodelassumed”[10].Itisbasedonthekeyassumption:Normaldatainstancesoccurinhighprobabilityregionsofastochasticmodel,whileanomaliesoccurinthelowprobabilityregionsofthestochasticmodel.Statisticaltechniquesfitastatisticalmodel(usuallyfornormalbehavior)tothegivendataandthenapplyastatisticalinferencetesttodetermineifanunseeninstancebelongstothismodelornot.Instancesthathavealowprobabilityfromtheappliedteststatisticaredeclaredasoutliers.Scoringtechniquesareusedtoassignananomalyscoretoeachinstanceinthetestdatadependingonthedegreetowhichthatinstanceisconsideredananomaly.Usually,theoutputofsuchtechniquesisarankedlistofoutliers.Wemaychoosetoeitheranalyzethetopfewoutliersoruseacut-offthresholdtoselecttheoutliers.Bothparametricaswellasnon-parametrictechniqueshavebeenappliedtofitastatisticalmodel[11,12].Inourwork,weonlyconsidertheparametricanomalydetectiontechniquesbasedontheclassicallikelihoodratioteststatistic(i.e.LRT).
Inthedomainofspatio-temporalapplications,moststatisticoutlierdetectionapproachesproposedsofarareonpurelyspatialsearching[13–15].Evenwhenconsideringtimeaspect,mostoftheexistingworkonspatio-temporaloutlierdetectiontreatsthetimedimensionsimply,eitherbyapplyingpurelyspatialoutlierdetectionmethodsateachtimestep,orbytreatingtimeasanotherspatialdimensionandthusapplyingspatialoutlierdetectioninonemoredimensionalspace(originalspatialdimensionsplustimedimension).Thedisadvantageofthefirstapproachisthatbyonlyexaminingonetimestepofdataatatime,moreslowlyemergingoutliersmaynotbedetected.Thedisadvantageofthesecondapproachisthatlessrelevantoutliersmaybedetected:thoseoutliersthathaveconstantlyexistedforalongtime,ratherthanthosethatarenewlyemerging[16,17].Inourfollowingwork,wedifferentiatetemporalpropertywithspatialproperty.Morespecifically,weinvestigatethespatialregionintwodifferentscenariosinwhichthetemporalpropertyispersistentorotherwiseemerging.Persistenttemporaldatareferstodatawhereitstemporalpropertyisconsistentovertime.Emergingtemporaldatareferstothedatawhereitstemporalpropertyisnon-decreasingovertime.Thesetwoconceptsprovidemoreinformationforpracticaloutlierdetections.
Amongthevariousstatisticalmethodsfordiscoveringoutlier,thespatialandspace–timescanstatistic,introducedbyKulldorff[18–21],hasbeenthemostwidelyadopted.However,itisoriginallydesignedforPoissonandBernoullidata.Lateron,thedifferentvariationsofordinal,exponentialandnormalmodelsareproposed[22–25].Theyhavebeenimplementedinthesoftware(SaTScan)[26].Inthespace–timescanstatisticofKulldorff,thekeyparameterisassumedtobeconsistentovertime.Thetechniquesimplyappliestimeasonemoredimension.Nielletal.[16]pointoutthedistinctfeatureoftimeaspectandproposeamodifiedteststatistictodetectlocalizedandglobalizedemergingcluster.Tangoetal.[27]alsoproposeaspace–timescanstatisticbasedonnegativebinomialmodelbytakingintoaccountthepossibilityofnonnegligibletime-to-timevariationofPoissonmean.Wuetal.[2]proposeagenericframeworkcalledLRTforanyunderlyingstatisticalmodel.Itusestheclassiclikelihoodratiotest(LRT)statisticasascoringfunctiontoevaluatethe“anomalousness”ofagivenspatialregionwithrespecttotherestofthespatialregion.Moreoveragenericpruningstrategywasproposedtogreatlyreducethenumberoflikelihoodratiotests.However,itisusedforspatialanomalydetectionwithoutconsideringthetemporalproperty.Liuetal.[5]proposeanapproachtodiscovercasualrelationshipsamongspatio-temporaloutliers.Here,weonlyfocusondetectingspatial–temporaloutliers.
2.3.Performanceissue
Furthermore,performanceissueisalsoabigprobleminspatialorspatio-temporaloutlierdetection.Thenaivecomputationofspatialoutlierdetectionisverytime-consuming,variousstrategieshavebeenproposedtospeeduptheprocess[18,28–30].
360L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373
TheseexistingmethodsintheliteraturearebasedonKulldorff'sspatialscanstatisticandtheyaimtoactuallyavoidconsideringallO(n4)rectangularareas.Alsotheyareonlyapplicabletothoserelativelysimpledensitymeasuresthatareconvexormonotonicwithrespecttotheratioofzonepopulationovertheentirepopulationandtheratioofzone'seventcountovertheentireeventcount.TheLRTframeworkstatesthecomputationcostforsinglestatisticvalueaswellasenumeratingallthespatialregionstobeexpensive[2].Toavoidperformingstatisticalcomputationsforeveryregion,itprovidesapruningstrategybasedonclassicallikelihoodteststatistic.Inthisarticle,weextendittobeapplicabletopersistentandemergingoutlierdetectionscenarios.3.Background
3.1.Thelikelihoodratiotest(LRT)
Weprovideabriefbutself-containedintroductionforfindingthemostanomalousregion(rectangle)inaspatialsetting.Theregionsarerectanglesmappedontoaspatialgrid.Wealsoexplainapruningstrategywhichcancutdownthenumberofrectanglesthatneedtobechecked.Thebasictooltofindtheanomalousregionisthelikelihoodratiotest(LRT).
GivenadatasetX,themodeldistributionf(X,θ),anullhypothesisH0:θ∈Θ0andanalternatehypothesisH1:θ∈Θ−Θ0,LRTistheratio
λ¼
supΘ0fLðθjXÞjH0gsupΘfLðθjXÞjH1gwhereL()isthelikelihoodfunction.θisasetofparameterscomingfromcompleteparameterspaceΘandnullparameterspaceΘ0.Seedetailin[2,31].Inaspatialsetting,thenullhypothesisisthatthestatisticalaspectofthephenomenonofinterestinaregionR(thatiscurrentlybeingtested)isnodifferentfromrestofthespatialarea(denotedasR).ThusifaregionRisanomalousthenthealternatehypothesiswillmostlikelybeabetterfitandthedenominatorofλwillhaveahighervalueforthemaximumlikelihoodestimatorofθ.Aremarkablefactaboutλisthatundermildregularityconditions,theasymptoticdistributionofΛ≡−2logλfollows
2aχ2kdistributionwithkdegreesoffreedom,wherekisthenumberoffreeparameters.(seeFig.2).ThusregionswhoseΛvaluedropsinthetailofχ2distributionarelikelytobeanomalous.3.2.Constrainedmaximumlikelihoodestimation
Constrainedmaximumlikelihoodestimationisasetofproceduresfortheestimationoftheparametersofmodelsviathemaximumlikelihoodmethodwithgeneralconstraintsontheparameters,alongwithanadditionalsetofproceduresforstatisticalinference[32].Barlowetal.[33,34]solvedthemaximumlikelihoodestimationbasedonareliabilitygrowthmodel,whichisapplicabletoouremergingscenario.ItassumesthatasystemisbeingmodifiedduringKstagesofdevelopment(inourapplication,weassumetaxicountsinaregiontobevaryingduringKtimesteps).Dataconsistsofxisuccessesinnitrialsinstagei,i=1,…,k.Letpibethesystemreliabilityatthei-thstage(inourcase,piistheincreasingrateoftaxicountati-thtimestep).Barlowetal.,obtainedthemaximumlikelihoodestimatesofp1,p2,…,pk,undertherestrictionthatp1≤p2,≤…,≤pk.Toobtainthemaximumlikelihoodestimatesofp1,p2,…,pksubjecttotherestrictionthatp1≤p2,…,≤pk,firstformtheratiosx1/n1,x2/
^iofpi.Ifforsomej(j=1,…,K−1),xj/nj,combinethen2,…,xk/nk.Ifx1/n1≤x2/n2,≤…,≤xk/nk,thenxi/xnistheMLEp
observationsinthej-thand(j+1)-ststagesandexaminetheratios:x1/n1,…,xj−1/nj−1,xj+xj+1/nj+nj+1,xj+2/nj+2,…,xk/nk,forthe(k−1)stagesthusformed.Iftheseratiosareinnon-decreasingorder,theyconstitutetheMLEsof
ÀÁÀÁ
^j¼p^jþ1¼xjþxjþ1=njþnjþ1.Ifnot,continuetheprocessofcombiningstagesuntiltheratiosareinp1≤p2,≤…,≤pkwithp
non-decreasingorder.Thisprocessneedstoberepeatedatmost(k−1)times,andtheresultisindependentoftheorderinwhichstagesarecombinedtoeliminatereversalsinthesequenceofratios.
Tosimplify,weget:
\"#kkXX
^i¼Maxk≥iMins≤ixi=nip
i¼s
i¼s
,wherei=1,…,k.3.3.MonteCarlosimulation
Thelikelihoodratio,orequivalentlyitslogarithm,canbeusedtocomputeap-value,orcomparedtoacriticalvaluetodecide
whethertorejectnullhypothesis(i.e.thereisnoanomalyinourcase)infavorofthealternativehypothesis(i.e.thereisanomaly).Theprobabilitydistributionoflikelihoodratio,assumingthatthenullhypothesisistrue,canbeapproximatedbychi-squaredistribution
2Iftheχ2distributionisnotapplicablethenMonteCarlosimulationcanbeusedtoascertainthep-value.
L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373361
Fig.2.Chi-squaredistributionwithdegreeoffreedom=3.
[35].ButwecannotexpecttofindthedistributionofthelikelihoodratioteststatisticinclosedanalyticalformandthusMonteCarlosimulationcanbeperformedtoobtainp-valueinsomecases.Therefore,oncewehavediscoveredtheregionwithmaximumlikelihoodratiovalue,thestatisticalsignificanceofthisregioncanbederivedfromthechi-squaredistributionorbyconductingMonteCarlosimulations.Torunthesimulationtest,alargenumberofreplicationsofdatasetsaregeneratedunderthenullhypothesis,forinstance,9999suchreplicasarecreatedtoperformlikelihoodratiotest.Thetestissignificantatthe5%levelifthelikelihoodratiovalueisamongthe500highestvaluesoftheteststatisticcomingfromthereplications.3.4.Upper-boundingmethodology
Theupper-boundingstrategyforLRTforanomalydetectionwasintroducedbyWuandJermaine[2].ThebasicobservationisthatthelikelihoodvalueofanygivenregionRundercompleteparameterspaceisnotgreaterthanthemultiplicationofthelikelihoodvalueofallitsnon-overlappingsub-regionsundernullparameterspace.Therefore,theloglikelihoodofanygivenregionRcanbeupper-bounded.Forinstance,ifaregionRiscomposedoftwonon-overlappingsub-regionsR1andR2,then
LðθRjXRÞ≤LðθR1jXR1ÞÂLðθR2jXR2Þ:Itisequivalentto
logLðθRjXRÞ≤logLðθR1jXR1ÞþlogLðθR2jXR2Þ:
HereθR;θ′R1andθ′R2arethemaximumlikelihoodestimatorsundercompleteandnullparameterspacesseparately.SeeFig.3.
′
′
′
′
Fig.3.TheloglikelihoodofregionRundercompleteparameterspaceisupperboundedbythesumoftwonon-overlappingsubregionsR1andR2undernullparameterspace.
362L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373
Fig.4.Anexampleof(4×4)gridtoillustratetheLRTcalculationandtheupper-boundingpruningmethodology.
Theupper-boundingstrategyisusedtoprunenon-outliers:IfwereplacethelikelihoodofaregionRbytheproductofthelikelihoodsofitssub-regionsandthenewLRTisbelowtheanomalousthreshold(i.e.confidencelevelα),thenRcannotbeanomalous.3.5.Examples
3.5.1.Example1
Usingasimplebutconcreteexample,wewillnowexplainhowtofindanomalousregionusingtraditionalLRTcomputationandtheupper-boundingpruningstrategy.Considerthe4×4grid(G)inFig.4.Thenumberofsuccesses(mi)independentlygeneratedbyPoissonmodelPo(bip)isdisplayedineachcellci.Thebaselinebiineachcellciissetto10.Thesuccessratepis0.5fortheregionRand0.1fortherestofcells.Thesignificancelevelissettoα=0.05.Wereferthesuccessratepasthetestparameter.3.5.1.1.Procedures.ForagivenregionR,traditionalLRTcalculationinvolvesseveralsteps:maximumlikelihoodestimatorfortestparameterofR,RandG;likelihoodcalculationofR,RandG;theratiocalculationfromtheprevioustwosteps.AlthoughthecalculationforPoissondistributeddatacanbesimplifiedinto1EXPstatisticalmodel,inordertoillustratethetraditionalLRTcomputation,theoriginalstepsarecarriedoutasthefollowing:1.Thelikelihoodfunctionofeachcelliis:
ðbipÞkie−ðbipÞ
:fðpjciÞ¼
ki!2.ThelikelihoodofanygivenregionR,whichiscomposedofcellc1,c2,…,ci,…,ctis:
ðbipÞkie−bip
:LðpjRÞ¼Πci∈R
ki!ð
Þ
ð1Þ
ð2Þ
^)iscalculatedas:3.TheMLE0ofpforaregionR(denotedasp
^¼∑c∈Rki=∑c∈Rbipii
ð7þ8Þ^^^^^R¼ð10Thuspþ10Þ¼0:75.Similarly,pR,pR1,pR2andpGareobtainedas:0.14,0.7,0.8and0.21.
4.TheΛofregionRisgivenby
ð3Þ
ÀÁ19þ15−0:21Â160
ÂeΛR¼−2logLðpjGÞþ2logLðpjRÞþ2logLðpjRÞ¼−2log0:21
15−0:75Â20
þ2log0:75Âe
19−0:14Â140
þ2log0:14Âe¼20:76:
ð4Þ
Fromtheabovesteps,wegettheexactloglikelihoodvalueofregionR:logL(p|R)=−19.31;andtheexactloglikelihoodofR1andR2:logL(p|R1)=−9.49,andlogL(p|R2)=−9.78separately.
Weknowthecriticalvalueofχ2(α)=3.84.Obviously,20.76isgreaterthan3.84.ThereforeregionRistreatedasapotentialoutlier.
L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373363
3.5.1.2.Upper-boundingpruning.Wecanalsoverifytheupper-boundofregionR:ThesumofloglikelihoodofR1andR2is−19.27,whichisgreaterthantheexactloglikelihoodvalueofRwhichwascomputedas−19.31.
Similarly,forregionR′,logLpR′1¼−1:31,andlogLpR′2¼−1,thesumofloglikelihoodofR′1andR′2is−2.31.ItisgreaterthantheloglikelihoodvalueofR′,whichis−2.66.Furthermore,ΛR′¼2:14,whichissmallerthanχ2(α)=3.84.Thisshowsthattheupper-boundedLRTvalueissmallerthanthecriticalvalue.Usingthepruningstrategy,theactualvalueofregionR′neednotbecalculatedandcanbepruned.
3.5.2.Example2
Thisexampleillustrateshowtoestimatemaximumlikelihoodofsystemreliabilityinreliabilitygrowthmodel.Todetectemergingoutlier,weusethiswaytocalculatemaximumlikelihoodestimator.
(1).Thenumberofsuccesses,populationandsuccessrateineachtimestepareshowninTable1.
^i,theprocesstogetasequenceofnon-decreasingratiosissummarizedbelow(2).ToestimatethemaximumlikelihoodofP
accordingtotheBarlowtheorem(Table2):^1¼P^2¼P^3¼P^4¼0:385;P^5¼0:833.Fromtheaboveprocedures,weobtainthemaximumlikelihoodestimates:P4.Proposedstatisticalmodels
Definition1.KP:Itrefersto“keyparameter”,denotedasKP{θ1,θ2,…,θi,…,θn}.θiisaparametercomingfromthekeyparameter
set.Forinstance,inepidemiology,ifweareconcernedaboutthetrendofthediseaserateinaspatio-temporalview,thediseaserateisKP.Inourapplication,thevariationoftaxicountwithinaperiodisKP.Forsimplicity,weonlyconsideroneparameterfromthekeyparametersetinourwork(denotedasKP).4.1.PSTOmodel(PersistentSpatio-TemporalOutliermodel)
Itisusedtodetectpersistentspatio-temporaloutliers.ThenullhypothesisH0assumesthattheKPisconsistentforallregionsovertime.ThealternativehypothesisH1assumesthatKPhasahighervalueinregionri∈Rthanthevalueoutsideofregionrj∈G–R(i.e.R),butthevalueinregionri∈Risconsistentovertime.Wecalculatethelikelihoodratiotestasfollows:
8><Πr∈RLðθrjXRÞΠ
i
DðRÞ¼
>:
ri∈RLðθrjXrÞ1
Πri∈GLðθGjXGÞforθr≥θr;otherwise:
ThisformulaistheclassicalLRTstatistic.WefirstcalculatetheMLEofθrandθrtomaximizethenumeratorandtheMLEofθGto
maximizethedenominator.Thentheratioisthescoreweusetoevaluatethe“anomalousness”ofagivenspatio-temporalregion.4.2.ESTOmodel(EmergingSpatio-TemporalOutliermodel)
Thismodelisusedtodetectemergingspatio-temporaloutliers.ThenullhypothesisH0assumesthattheKPisconsistentforallregionsovertime.ThealternativehypothesisH1assumesthatKPisnon-decreasingwitheverytimestepoverregionri∈Randhigherthanrj∈R.Wecalculatethelikelihoodratiotestasfollows:
>>>>ttt>> Table1 Reliabilitygrowthprocedure.Timestep(i)01234 Numberofsuccesses(xi)2030302050 Population(ni)5070806060 Successrate(xi/ni)0.4000.4290.3750.3330.833 3 Table2 Themaximumlikelihoodestimatorprocedure.i12345 xi2030302050 ni5070806060 L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373 xi/ni0.4000.4290.3750.3330.833 Firstcalculation0.400 60/150=0.4000.3330.833 Secondcalculation80/210=0.381 Thirdcalculation100/260=0.385 0.8330.833 ThisformulaisderivedfromtheclassicalLRTstatisticanddesignedfortheemergingscenario.UserneedstofindasolutiontomaximizethenumeratorwiththeincreasingKP.Forinstance,Barlow[34]providesanapproachtosolvetheconstrainedmaximumlikelihoodestimationonthereliabilitygrowthmodelinwhichtherelativeriskisnon-decreasingovertime.OrEMalgorithmcanbeperformedtoestimatethekeyparameter. 5.Upper-boundingstrategyandpruningmechanismforproposedframework5.1.Upper-boundingstrategy (1).InPSTOmodel,theupper-boundingstrategyexplainedinSection3.4canbeextendeddirectlytospatio-temporal dimension. (2).InESTOmodel,KPisassumedtovaryatdifferenttimesteps;weshowbelowthattheupper-boundingstrategyisstill applicabletothismodel. Theorem1.LetregionR=Rt1∪Rt2,fornon-overlappingtimeintervalst1andt2,wehave: LðθRjXRÞ≤LðθRt1jXRt1ÞÂLðθRt2jXRt2Þ ′ ′ ð5Þ ,whereθR¼θRt1∪θRt2andXR¼XRt1∪XRt2. ÀÀÁÀÁ Proof.WeknowLθRjXRÞ¼LθRt1XRt1ÂLθRt2XRt2.UsingtheLRTupper-boundingbasicconcepts,weknowthatθRt1ischosenundermorestrictcompleteparameterspaceandθ′Rt1ischosenunderloosennullparameterspace.ThismeansthatperformingMLE0onasub-intervalofRhasloosentheconstraintscomparingwithperformingMLE1onR.Thus,wehaveLðθRt1jXRt1Þ≤Lðθ′Rt1jXRt1ÞandLðθRt2jXRt2Þ≤LðθR′t2jXRt2Þ.Therefore,LðθRjXRÞ≤Lðθ′Rt1jXRt1ÞÂLðθ′Rt2jXRt2Þ. Theorem2.LetregionR=R1∪R2,fornon-overlappingspatialregionsR1andR2,wehave: LðθR1;θR2jXR1;XR2Þ≤LðθR1t1;θR1t2jXR1t1;XR1t2ÞÂLðθR2t1;θR2t2jXR2t1;XR2t2Þ ′ ′ ′ ′ ð6Þ ,whereR,R1,andR2arecomposedof(t1,t2)timestepsrespectively.Herewejustusetwotimestepstoillustrate.Itisapplicabletoanyttimesteps. Proof.Foreachtimestepi,wehave:LðθRtijXRtiÞ≤Lðθ′R1tijXR1tiÞÂLðθ′R2tijXR2tiÞ LðθR1;θR2jXR1;XR2Þ¼LðθR1jXR1ÞÂLðθR2jXR2Þ LðθR1t1;θR1t2jXR1t1;XR1t2Þ¼LðθR1t1jXR1t1ÞÂLðθR1t2jXR1t2ÞLðθR2t1;θR2t2jXR2t1;XR2t2Þ¼LðθR2t1jXR2t2ÞÂLðθR2t2jXR2t2Þ Thereforeweget LðθR1;θR2jXR1;XR2Þ≤LðθR1t1;θR1t2jXR1t1;XR1t2ÞÂLðθR2t1;θR2t2jXR2t1;XR2t2Þ: ′ ′ ′ ′ FromTheorems1and2,weknowthattheupper-boundingstrategyisapplicabletoemergingmodel(ESTO). L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373365 (a) R(b) R ¯(c)R Fig.5.Pre-computationofanygivenspatial–temporalregionRandtilingofR.Sub-figure(a)showsan8×8×8spatial–temporalgrid;sub-figure(b)shows:one ofthecuboidsfromspatialpre-computedsetissplitfromtemporaldimensionandresultsin15smallercuboids.Sub-figure(c)istheradialmethodtotileR. 5.2.Pre-computationandpruningmechanism 5.2.1.Pre-computationforregionR Werecursivelysplittheregionintotwosub-regionsofthesamesize,startingfromthebiggestcuboidenclosedbytwoplanesfromtimeview,endingatthelowestresolutionofthespatial–temporalgrid.Fig.5bshowsthesplitapproachesforasub-cuboidhighlightedasbluefromthetemporaldimensioninan8×8×8grid(Fig.5a).Thelikelihoodofanygivenregioncanbeupper-boundedbythispre-computedsetviathetilingofLRT. 5.2.2.Pre-computationforthecomplementofregionR(i.e.R)Byconsideringalloftheintersectionpoints,weconnecteachintersectionpointonthe3-dimensionalgridwiththeeightcornersofthegrid.Thisproduceseightdiagonals,eachofwhichcreatesonecuboidinthepre-computedset.SincethereareO(n4)intersectionpoints,thereareO(n4)cuboidsinthepre-computedset.Afterwegetthepre-computedset,foranygivenregionR,weusetheradialandsandwichmethodsinLRTtogettheupper-boundedlikelihoodvalueofR.Thesetwomethodsproducesixnon-overlappingsub-cuboidregionsforRseparately.RadialmethodisperformedbyelongatingthesidesofaregionRuntilthesideshitthegridbordersusingclock-wisecounterclock-wideorder.SandwichmethodisperformedbyelongatingtwoparallelsidesofRinbothdirectionsuntiltheyreachthebordersofthegrid.Seedetailofthesemethodsin2-dimensionalgrid[2].In3-dimensionalview,twelvetimestilingisinvolved.Fig.5cshowsthetilinginradialway. 5.2.3.Computationalcomplexity Inthebrute-forceapproach,thereareatotalofO(n6)regionsthatneedtobesearched.OurapproachreducesthecostbyprecomputingtwolikelihooddatasetwithsizeofO(n4).Thelikelihoodofeveryregionisupper-boundedandthereallikelihoodiscalculatedonlyforanumberofregions.Furthermore,inourimplementation,wehavealreadyrankedthetop-kregionsaccordingtothelikelihoodratiovalues.Therefore,theperformancewon'tbeaffectednomatterwhichsignificancetestingmethodisbeingapplied. TheprocessofoutlierdetectionisshowninAlgorithm1.Theinputtingparametersare:datagrid(G),probabilitydensityfunction(f),maximumlikelihoodestimationfunctionunderdifferentparameterspace(MLE0,MLE1),likelihoodfunction(L),numberoftopregionstobereturned(K)andthesignificancelevel(α).Inthisprocess,steps1and2performpre-computations;step5tostep8obtainstheupper-boundedlikelihoodvalueofcurrentcuboidforeachiteration.Duringeachiteration,thechi-squareddistributionisappliedtoprunenormalregions.Finally,itoutputstop-kanomalousregions.6.Experiments,resultsandanalysis WereportonexperimentsconductedwherewehaveusedAlgorithm1totestforaccuracy,pruningabilityandperformance.Kwassetto1.InSections6.1and6.2allexperimentswerecarriedoutonsyntheticdata.InSection6.3,wedemonstratetheusefulnessofourapproachonarealdataset.6.1.Resultsonsyntheticdata Wetestedfourvariantsoftheoutlierdetection:1.2.3.4. brute-forcepersistentspatio-temporaloutliers(bpsto)brute-forceemergingspatio-temporaloutliers(besto) pruning-basedpersistentspatial–temporaloutliers(ppsto)pruning-basedemergingspatio-temporaloutliers(pesto). 366L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373 Algorithm1.Topkspatio-temporaloutlierdetection Wegenerateddatasetonagridsizevaryingfrom(4×4×4)to(128×16×16).Fiftyseparatetrialswerecarriedoutforeachscenario(seebelow)andwemeasuredthreeaspects:(a)pruningrate(b)accuracy,and(c)runningtime.Thesignificancelevelwassetatα=0.05. 6.1.1.ScenarioI Thenullhypothesisholds.ThebaselinebcisgeneratedrelativelyuniformlybyaNormaldistribution(μ=104,σ=103)andafixedsuccessratepof0.001.ThenumberofsuccesseskcisgeneratedfromPo(bcp).ResultsareshowninTable3. 6.1.2.ScenarioII Thenullhypothesisholds.TheonlydifferencewithScenarioIisthatthedatainarandomselectedcuboidareawithsizeof(5×4×3)isgeneratedbyaNormaldistributionwithdifferentparametersettings(μ=105,σ=5×103).ResultsareshowninTable3. 6.1.3.ScenarioIII Thealternativehypothesisholds.Itissimilartothenullmodelexceptthatthedataofarandomlyselectedcuboidareaofsize(5×4×3)isgeneratedfromaPoissondistributionwithp=3,6,9,18,and36foremergingcaseandp=3forpersistentcase.ThedatanotwithinthecuboidareawasalsofromaPoissondistributionwithp=1.ResultsareshowninTable4. Table3 AveragepruningrateandaccuracyinScenariosIandII.Test (a)ScenarioI4×4×48×8×816×16×16(b)ScenarioII4×4×48×8×816×16×16 Pruning(%)10010099.9 Accuracy(%)NofalsealarmNofalsealarm0.1falsealarm 10099.99100Nofalsealarm0.01falsealarmNofalsealarm L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373 Table4 AveragepruningrateinScenarioIII.Testppsto(%)pesto(%) 16×16×1695.2798.37 32×16×1697.3598.46 ×16×1697.98.69 32×32×3297.4799.11 367 128×16×1696.7499.23 6.1.4.ScenarioIV Thealternativehypothesisholds.ItissimilartoScenarioIIIexceptthatthedataofarandomlyselectedcuboidareaofsize(5×4×3)wasgeneratedfromaPoissondistributionwithp=10,50,250,1250,and6250foremergingcaseandp=10forpersistentcase.ResultsareshowninTable5.6.2.Evaluationsonsyntheticdata 6.2.1.AnalysisonScenariosIandII TheresultsofScenariosIandIIshowthatweachieveahighpruningrateandnofalsealarmisgeneratedevenwhenweperturbthedistributionofoneregion.Thisisasexpectedanddemonstratesthatthealgorithmiswellcalibrated.ByahighpruningratewemeanthatwecanruleouttheoutliersbyjustcheckingtheLRTupperboundderivedfromthetiling.IftheupperboundvalueislessthanthecriticalvaluethenthetrueLRTvalueoftheregioncannotbeanomalous. 6.2.2.AnalysisonScenariosIIIandIV ForScenariosIIIandIVtheanomalousregionswerecorrectlyidentifiedwhilemaintainingahighpruningrate.Alsotherewerenoregionsdeclaredasfalsepositives.6.2.3.Analysisonrunningtime I.ProportionofRunningTime:WeanalyzetherunningtimewithandwithoutpruningforScenariosIIIandIV.Weplotouttheproportionofrunningtimeofpruningapproachrelativetobrute-forceapproachinFig.6,whichisthecomputationofrunningtimeof(brute-force−pruning)/brute-force.Also,theproportionpercentageisdisplayedontheselinegraphs.Itshowsthatasthesizeofthespatialandtemporalregionincreases,theeffectofpruningbecomesprominent.Forthelargestdatatested,thepruningmechanismresultedinasavingsofnearly50%comparedtothebrute-forceapproach. II.SingleLRTCalculationCost:Wehavealsocalculatedthecostofasinglelikelihoodcalculationasthedimensionofthegridsizeincreases.TheresultsareshowninFig.7.Forthe8×8×8dataset,thecostofthelikelihoodcalculationusingthebrute-forceapproachis0.01mswhilewithpruningitincreasesto0.08ms.However,forthelargerdatasets(e.g.,128×16×16)thecostofasinglelikelihoodcalculationgoesfrom0.30msforthebrute-forceapproachtoaround0.16mswithpruning.Anotherobservationisthatthecostofthesinglelikelihoodcalculationisnearlysimilarfordatasetsofthesamesizebutdifferentdimensions,forexample128×16×16and32×32×32. III.ComponentsofRunningCost:Wehavealsoanalyzedandcomparedtherunningofthedifferentcomponentsbothforthe brute-forceandpruningapproaches.TheresultsareshowninFig.8.Thebrute-forceapproachhasthefollowingcomponents: (1).ThecostofthelikelihoodcalculationforeachregionR(RComputation). (2).ThecostofthelikelihoodcalculationforthecomplementofeachregionR,denotedasR(RComputation). Thepruningapproachismorecomplexandinvolvesthefollowingcomponents: (1)ThecostofcomputingthelikelihoodforeachelementofthetilingsetTR.Thiswillbeusedtoupperboundthe likelihoodvalueforanarbitraryspatio-temporalregion.(Rpre-computation) (2).ThecostofcomputingthelikelihoodforeachelementofthetilingsetTR(Rpre-computation). (3).Thecostofupper-boundingthelikelihoodofR.ThisinvolvesfirstexpressingRasaunionofsubregionsandtheneach subregionasaunionoftilesfromTR. (4).Thecostofupper-boundingthelikelihoodofR(RComputation).ThisinvolvesfirstexpressingRasaunionof subregionsandtheneachsubregionasaunionoftilesfromTR.EachRregioncanbeexpressedasaunionofsixsubregionsandtherearetwotypesoftilingmethods:sandwichandradial.Wecalculatethelikelihoodvalueusingbothtilingmethodsandthenselectthetightestupper-bound. Table5 AveragepruningrateinScenarioIV.Testppsto(%)pesto(%) 16×16×1679.2795.57 32×16×1697.5197.40 ×16×1697.7796.78 32×32×3297.2294.70 128×16×1696.65.23 368L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373 (a) Scenario III psto(b) Scenario III esto (c) Scenario IV psto(d) Scenario IV esto Fig.6.Theproportionofrunningtimeofpruningvs.brute-forceapproach.Itshowsthatoutlierpruningsearchingissignificantlyimprovedwhenthedatasetsizestartsfrom32×16×16inthesefourdifferentscenarios. AsisclearinFig.8,Rcomputationisthemostexpensivepartofthecalculation.However,asthedatasetsizeincreases,theoverheadsofthetilinggivewaytoitsmoreefficientreuseresultinginconsiderablesavings.6.3.Casestudies:BeijingtaxiGPSdata WeillustratetheuseofthePestomethodonarealdataset[4,5].ThedatasetconsistsofthreemonthsofGPStrajectoriescollectedfrom33,000taxisinBeijingbetween01/03/2009and31/05/2009.TheroadnetworkofBeijingissplitintogrid.Thetaxicountsofeachcellwhichisidentifiedbycolumnindexandrowindexareprovidedinatextfile.Thetimefrequencyofmonitoringthetaxicountsis15min.Todealwithrealdata,weonlyneedtocalculatethetotaltaxicountsineachcellwithinagivenperiod,thenthispre-processeddatacanbedirectlyloadedintoouralgorithm.Inthissection,wesearchforthemostanomalousemergingregionwithinaspecifiedtimeperiodandthenprovideapossibleexplanationfortheanomaly. 6.3.1.CaseI All(8×8)gridsweretestedbetween9:00:00amand10:00:00amforsixteendays.Wechoose20daysofdatatocalculatethebaselineprobabilities. 6.3.1.1.ResultI.Theperiodfrom01/05/2009to02/05/2009emergedasatopoutlieratthepositionof(0,1)and(1,1)onthegrid.ThisperiodcorrespondstotheLabordaypublicholiday(“GoldenWeek”).Usuallytheholidaydurationissevendays(fromMay1sttoMay7th),butstartingfrom2009,theholidayperiodwasshortentobetweenMay1standMay3rdinclusive.TocelebratetheholidaysitappearsthatmanypeoplevisitedHappyValley,thebiggestamusementparkinBeijing.The3rdInternationalFashionfestivalwasalsoheldinthatlocation.Ourresultscoincidewiththefactthattaxisenjoygoodbusinessonpublicholidays L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373369 (a) Scenario III psto(b) Scenario III esto (c) Scenario IV psto(d) Scenario IV esto Fig.7.Thesinglelikelihoodrationcalculationcostofpruningvs.brute-forceapproach.Itshowsthatoutlierpruningsearchingissignificantlyimprovedwhenthedatasetsizestartsfrom32×16×16inthesefourdifferentscenarios. andthereisusuallyanincreaseinthenumberoftaxisneartouristspots.TheresultsareshowninFigs.9aandb,and10a.Wecanseethatthenumberoftaxisincreasedfrom1stMayto2ndMayandthendecreasedfrom3rdofMayonwards. 6.3.2.CaseII All(8×8)gridsweretestedbetween3:15:00pmto4:30:00pmfor8days.Weuse12daysofdatatocalculatethebaselineprobabilities. 6.3.2.1.ResultII.Theregionhighlightedasblueonthemapwasdetectedasanemergingoutlierfrom16/03/2009to20/03/2009.ItisoneofthecityexpressroadcalledTonghuiheNorthRoad.From01/03/2009to13/03/20093,the11thNationalPeople'sCongress(i.e.NPC)washeldinBeijing,whichistheannualmeetingofthehighestlegislativebodyofthePeople'sRepublicofChina.Nearly3000deputiesfromalloverChinaattendedtheCongress.Duringthisperiod,thetrafficauthoritiesinBeijingimposedtemporaryrestrictionmeasuresonvehiclestocontroltrafficflow.Mostpeoplechoosetotakebusorsubwayinsteadofdrivingortakingtaxitocommutetowork.ThenumberoftaxitravelingonTonghuiheNorthroadincreaseduntilmostofthedeputiesleftBeijing.TheresultsareshowninFigs.9candd,and10b. Toinvestigatemore,wesetk=5inourcasestudies.Wefoundthattheothertop4outliershavebigoverlaponthetop1outlierregion.Theseoutliershavesimilarspatialareaandspanningtimeperiod.Itverifiesthattheemergingoutliercanbecorrectlylocated.7.Conclusion Inthisarticle,anefficientpatternminingapproachwasproposedtocaterforspatio-temporaltrafficdata,whichisabletodetect“persistentoutliers”and“emergingoutliers”.Weproposedtwostatisticalmodels,whichencompassthegenericfeaturesofanomalouspatterns.Wealsoderivedanupper-boundingstrategyforthetwostatisticalmodelssupportingforfastoutlierdetection.Ourcomprehensiveexperimentsshowthattheperformanceofcomputationaltimeisgreatlyimprovedwhenthedatasetsizeislarge,andwecanstillfindthecorrectoutliers.Wealsocarefullyanalyzedthetilingschemeandtheupper-boundingstrategyinthesyntheticexperiments.Inourcasestudies,ourmodelisabletodetectregionswithemergingnumberoftaxisthatcanbevalidatedbyknownmajortrafficevents. 370L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373 (a) Split cost of ESTO with smaller dataset (b) Split cost of ESTO with larger dataset (c) Split cost of PSTO with smaller dataset (d) Split cost of PSTO with larger dataset Fig.8.Therunningtimeofcomparablepartsofbrute-forcevs.pruningapproachinscenarioIII.Itshowsthatthepruningsearchingisfasterwiththelargerdataset.AlthoughthetilingofeveryRtakesthelongesttimeinpruningsearching,thecostissmallcomparedtothelikelihoodcalculationofeveryRinbruteforcesearching. Ourapproachisapplicabletoawidevarietyofcontexts.Forinstance,inweatherforecastmodels,itcanbeusedtodetectemergingweatherpatternwhichraisesthepossibilityofdryandwarmclimates.Theseclimatesmayhavegreatimpactoninfectiousdisease.Investigatingsuchweathervariablesassociatedwithinfectiousdiseasescanhelpinanticipatingfutureepidemics,andearlywarningsystemcanbedevelopedforsurveillanceandinterventions.Ingeneexpressionmodels,discoveringemergingpatternsishelpfulfordiagnosisandunderstandingcorrelationofgeneexpressionprofilestodiseasestatesinasignificantway.Theyareusefulforcapturinginteractionsamonggenes,findingsignaturepatternsfordiseasesubtypes,andgeneratingpotentialdiseasetreatmentplans,etc.Infinancemodels,miningemergingbusinesstrendfromdifferentgeographicalregionscanprovideinsightforidentifyingsomeprofitableinvestments.Itcanalsobeappliedinemergingeventidentification,intrusiondetection,etc.8.Futurework Fromtheabovediscussion,wealreadyobservedthatthecostofpruningapproachhasnearlyabove50%speed-upwithrespecttothenaivealgorithmasthegridsizeincreases.Itallowsthecomputationtobeperformedforlarge-scaleapplications.However,wealsonoticedthatitstillneedsalmostonetotwodaystofindthetopoutlierevenwithpruningapproachinoursyntheticexperimentswithgridsizeof128×16×16or32×32×32.Whenthedatasetsarenotmodestlysized,thescalabilityisstillnotgood.Thisperformancemightnotbeacceptableinrealdetectionapplications. Weseethatthescalabilityofoutlierdetectionhasbecomeimportantastheamountofdataforanalysishasbeenincreasinggreatly.Fordealingwithlargedatasets,itisimportanttobothparallelizethealgorithms,andimplementthemtoexecuteefficiently.Fortunately,theLRTscanningalgorithmishighlyparallelizable,eachsub-gridcomputationisindependentofeachotherandthewholegridcanbepartitionedintoequalpartsanddistributedoverthemultipleprocessors.TheGPUcomputingorGPGPU(i.e.General-PurposecomputationonGraphicsProcessingUnits)hasbecomeanewtrendforresearcherstodogeneralpurposescientificandengineeringcomputationbytheuseofGPU.ItenablesdramaticincreasesincomputingperformancebyharnessingthepoweroftheGPUandisstartingtoplayasignificantroleinlarge-scalemodeling.Duetoourhighlyparallelizable L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373371 (a) The average taxi counts within outlier regions vs. non-outlier regions from01/05/2009 to 02/05/2009 (b) The average taxi counts within outlier regions from 01/05/2009 to 08/05/2009 (c) The average taxi counts within outlier regions vs. non-outlier regions from16/03/2009 to 20/03/2009 (d) The average taxi counts within outlier regions from 14/03/2009 to 21/03/2009 Fig.9.Comparisonofoutlyingandnon-outlyingregionsinan8×8×8grid.Itshows:(a)theaveragetaxicountswithinoutlierregionsarenon-decreasingcomparedtonon-outlierregionswhichsharethesameemergingperiodwithoutlier.(b)Theaveragetaxicountswithinoutlierregionsthroughouttheemergingperiodarenon-decreasingcomparedtotheoutlierregionsfortherestoftheperiod. algorithm,thetechnologyofgraphicsprocessingunit(GPU)andcomputeunifieddevicearchitecture(CUDA),whichareidealformassivedataparallelism,mightbeconsideredinourimplementationtoacceleratethespatio-temporalexploringandanalyzingprocesses.Aspartofourfuturework,parallelizationofpruningapproachwillbepursuedtoachievefasteroutlierdetection. (a) The region highlighted with blue borders on the map is the outlier region of Case I.The icons hows the exact location ofHappy Valley. (b) The region highlighted with blue borders is the outlier of Case II. It is the city expressroad of Bei-jing. (i.e. Tonghuihe North Road) Fig.10.OutlierlocationsfromourtwocasestudiesonBeijingMap. 372L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373 References [1]Y.Zheng,X.F.Zhou,ComputingwithSpatialTrajectories,Springer,2011. [2]M.Wu,X.Song,C.Jermaine,S.Ranka,J.Gums,ALRTframeworkforfastspatialanomalydetection,In:Proceedingsofthe15thACMSIGKDDInternational ConferenceonKnowledgeDiscoveryandDataMining(KDD'09),2009,pp.887–6. [3]J.Yuan,Y.Zheng,X.Xie,G.Sun,Drivingwithknowledgefromthephysicalworld,in:Proceedingsofthe17thACMSIGKDDConferenceonKnowledge DiscoveryandDataMining(KDD'11),2011,pp.316–324. [4]J.Yuan,Y.Zheng,C.Zhang,W.Xie,X.Xie,G.Sun,Y.Huang,T-drive:drivingdirectionsbasedontaxitrajectories,Proceedingsofthe18thACMSIGSPATIAL InternationalConferenceonAdvancesinGeographicInformationSystems(ACMSIGSPATIALGIS2010),2010,pp.99–108. [5]W.Liu,Y.Zheng,S.Chawla,J.Yuan,X.Xie,Discoveringspatio-temporalcausalinteractionsintrafficdatastreams,in:KDD'1117thSIGKDDConferenceon KnowledgeDiscoveryandDataMining,2011,pp.1010–1018. [6]Y.Zheng,Y.Liu,J.Yuan,X.Xie,UrbanComputingwithTaxicabs,UbiComp'11,September17–21,2011,pp.–98,(Beijing,China). [7]S.Chawla,Y.Zheng,J.Hu,Inferringtherootcauseinroadtrafficanomalies,IEEEInternationalConferenceonDataMining(ICDM2012),2012.[8]V.Chandola,A.Banerjee,V.Kumar,Anomalydetection:asurvey,in:AcmComputingSurveys,vol.41,2009,pp.1–58. [9]V.J.Hodge,J.Austin,Asurveyofoutlierdetectionmethodologies,in:ArtificialIntelligenceReview,vol.22,2004,pp.85–126.[10]F.J.Anscombe,I.Guttman,Rejectionofoutliers,in:Technometrics,vol.2,1960,pp.123–147. [11]E.Eskin,Anomalydetectionovernoisydatausinglearnedprobabilitydistributions,in:InProceedingsofthe17thInternationalConferenceonMachine Learning,MorganKaufmannPublishersInc.,2000,pp.255–262. [12]M.Desforges,P.Jacob,J.Cooper,Applicationsofprobabilitydensityestimationtothedetectionofabnormalconditionsinengineering,in:InProceedingsof theInstituteoftheMechanicalEngineers,vol.212,1998,pp.687–703. [13]R.Ng,J.H.Clarans,Amethodforclusteringobjectsforspatialdatamining,IEEETransactionsonKnowledgeandDataEngineering14(2002)1003–1016.[14]J.S.M.Ester,H.-P.Kriegel,X.Xu,Adensity-basedalgorithmfordiscoveringclustersinlargespatialdatabaseswithnoise,DataMiningandKnowledge Discovery2(1998)226–231. [15]W.Wang,J.Yang,R.R.M.Sting,Sting:astatisticalinformationgridapproachtospatialdatamining,VLDB'97,Proceedingsof23rdInternationalConference onVeryLargeDataBases,August25-29,1997,MorganKaufmann,Athens,Greece,1997. [16]D.B.Neill,A.W.Moore,M.Sabhnani,K.Daniel,Detectionofemergingspace–timeclusters,in:Proceedingsofthe11thACMSIGKDDInternationalConference onKnowledgeDiscoveryinDataMining(KDD'05),2005,pp.218–227. [17]D.B.Neill,A.W.Moore,DetectionofEmergingSpace–TimeClusters:PriorWorkandNewDirections,TechnicalReport,CarnegieMellonUniversity,2004.[18]M.Kulldorff,Aspatialscanstatistic,Comm.InStat.:TheoryandMethods,1997.1481–1496. [19]M.Kulldorff,Spatialscanstatistics:models,calculations,andapplications,in:J.Glaz,N.Balakrishnan(Eds.),ScanStatisticsandApplications,Birkhauser,1999.[20]M.Kulldorff,N.Nagarwalla,Spatialdiseaseclusters:detectionandinference,StatisticsinMedicine(1995)799–810. [21]M.Kulldorff,W.Athas,E.Feuer,B.Miller,C.Key,Evaluatingclusteralarms:aspace–timescanstatisticandclusteralarmsinLosAlamos,AmericanJournalof PublicHealth88(1998)1377–1380. [22]L.Huang,M.Kulldorff,D.Gregorio,ASpatialScanStatisticforSurvivalData,InternationalBiometricsSociety,2007.109–118.[23]I.Jung,M.Kulldorff,A.Klassen,Aspatialscanstatisticforordinaldata,StatisticsinMedicine(2007)1594–1607. [24]I.Jung,M.Kulldorff,O.Richard,Aspatialscanstatisticformultinomialdata,StatisticsinMedicine(2010)1910–1918. [25]L.Huang,R.Tiwari,M.Kulldorff,J.Zou,E.Feuer,WeightedNormalSpatialScanStatisticforHeterogenousPopulationData,AmericanStatisticalAssociation,2009.[26]http://www.satscan.org2008. [27]T.Tango,K.Takahashi,K.Kohriyama,Aspace–timescanstatisticfordetectingemergingoutbreaks,InternationalBiometricsSociety(2010)106–115.[28]J.H.Friedman,N.I.Fisher,Bumphuntinginhigh-dimensionaldata,StatisticsandComputing9(1999)123–143. [29]D.B.Neill,A.W.Moore,AFastMulti-resolutionMethodforDetectionofSignificantSpatialDiseaseClusters,NIPS,2003.651–658.[30]V.Chandola,A.Banerjee,V.Kumar,Anomalydetection:asurvey,ACMComputingSurveys41(2009)1–58.[31]http://en.wikipedia.org/wiki/likelihood-ratio-test2011. [32]R.Schoenberg,Maximumlikelihoodestimationandconservativeconfidenceintervalproceduresinreliablitygrowthanddebuggingproblems,Computational Economics10(1997)251–266. [33]R.E.Barlow,F.Proschan,E.M.Scheuer,MaximumLikelihoodEstimationandConservativeConfidenceIntervalProceduresinReliabilityGrowthand DebuggingProblems,1966. [34]R.E.Barlow,E.M.Scheuer,Reliabilitygrowthduringadevelopmenttestingprogram,Technometrics8(1966)53–60.[35]http://en.wikipedia.org2011. LinseyXiaolinPangiscurrentlystudyingPh.D.intheSchoolofInformationTechnologiesatUniversityofSydney.Shecompletedhermaster'sdegreefromtheNationalUniversityofSingapore.Herresearchinterestsincludedatamining,outlierdetectionandparallelcomputing. SanjayChawlaisafullprofessorintheSchoolofInformationTechnologies,UniversityofSydney.HereceivedhisPh.D.in1995fromtheUniversityofTennessee,Knoxville,USA.HisresearchworkhasappearedinleadingdataminingjournalsandconferencesincludingACMTKDD,MachineLearning,IEEETKDE,DMKD,ACMSIGKDD,IEEEICDM,SDM,andPAKDD.HeservesontheeditorialboardofDataMiningandKnowledgeDiscoveryandistheProgramCo-ChairofPAKDD2012.Hehasreceivedfourbestpaperawardsinthelastfiveyears—mostrecentlyattheIEEEInternationalConferenceinDataMining(ICDM)2010. L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373373 WeiLiuisaresearchfellowattheUniversityofMelbourne,Australia.HeobtainedhisPh.D.degreefromtheUniversityofSydneyin2011,intheareaofclassskewnessandadversariallearning.HisworkhasappearedinseveralconferencesandjournalsincludingKDD,SDM,ECML/PKDD,PAKDDandMachineLearningjournal.Hisresearchinterestsincludedatamining,machinelearningandbioinformatics. YuZhengisaleadresearcherfromMicrosoftResearchAsia(MSRA).HeisanIEEEseniormemberandACMseniormember.Hisresearchinterestsincludelocation-basedservices,spatio-temporaldatamining,ubiquitouscomputing,andmobilesocialapplications.Hehaspublishedover50referredpapersasaleadingauthorathigh-qualityinternationalconferencesandjournals,suchasSIGMOD,SIGKDD,ICDE,WWW,AAAI,andIEEETKDE,wherehehasreceived3bestpaperawardsaswellas1bestpapernominee.Thesepapershavealsobeenfeaturedbytop-tierpresseslikeMITTechnologyReviewmultipletimes.Inaddition,hehasbeenservingover30prestigiousinternationalconferencesasachairoraprogramcommitteemember,includingICDE,KDD,Ubicomp,andIJCAI,etc.Sofar,hehasreceived3technicaltransferawardsfromMicrosoftand20granted/filedpatents.In2008,hewasrecognizedastheMicrosoftGoldenStar.HejoinedMSRAin2006afterreceivingaPh.D.degreeinEEfromSouthwestJiaotongUniversity.
因篇幅问题不能全部显示,请点此查看更多更全内容
Copyright © 2019- huatuo0.cn 版权所有 湘ICP备2023017654号-2
违法及侵权请联系:TEL:199 18 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务