您的当前位置：首页 On detection of emerging anomalous traffic patterns using GPS data

On detection of emerging anomalous traffic patterns using GPS data

来源：华佗小知识

Data&KnowledgeEngineering87(2013)357–373

ContentslistsavailableatScienceDirect

Data&KnowledgeEngineering

journalhomepage:www.elsevier.com/locate/datak

OndetectionofemerginganomaloustrafﬁcpatternsusingGPSdata

LinseyXiaolinPanga,d,⁎,SanjayChawlaa,WeiLiub,YuZhengcabcdSchoolofInformationTechnologies,UniversityofSydney,Australia

Dept.ofComputerScienceandSoftwareEngineering,UniversityofMelbourne,AustraliaMicrosoftResearchAsiaNICTA,Sydney,Australia

articleinfoabstract

Theincreasingavailabilityoflarge-scaletrajectorydataprovidesusgreatopportunitytoexplorethemforknowledgediscoveryintransportationsystemsusingadvanceddataminingtechniques.Nowadays,largenumberoftaxicabsinmajormetropolitancitiesareequippedwithaGPSdevice.Sincetaxisareontheroadnearly24haday(withdriverschangingshifts),theycannowactasreliablesensorstomonitorthebehavioroftraffic.Inthisarticle,weuseGPSdatafromtaxistomonitortheemergenceofunexpectedbehaviorintheBeijingmetropolitanarea,whichhasthepotentialtoestimateandimprovetrafficconditionsinadvance.Weadaptlikelihoodratioteststatistic(LRT)whichhavepreviouslybeenmostlyusedinepidemiologicalstudiestodescribetrafficpatterns.TothebestofourknowledgetheuseofLRTintrafficdomainisnotonlynovelbutresultsinaccurateandrapiddetectionofanomalousbehavior.

Availableonline18May2013Keywords:Datamining

MiningmethodsandalgorithmsSpatial/temporaldatabases

1.Introduction

WiththeincreasingavailabilityofhighresolutionGPStracesfromvehiclesinlargemetropolitanareas,thereisanopportunitytoinfersophisticatedpatternsandtrendswhichtillnowhasnotbeenpossible[1].Theinferredtrendscanthenbeusedasinputintopolicyplanningacrossavarietyofdomainsincludingtrafficmanagement,urbanplanningandenvironmentalmonitoring.

Inthisarticle,weapplystatisticalapproachonmassivetaxilocationtraces,whichareexploredtoextracttheoutliertrafficpatternintransportationsystems.WeknowthousandsoftaxisplytheroadsoflargemetropolitancitieslikeNewYork,London,BeijingandTokyoeveryday.Mosttaxisareontheroad24hadaywithdriverschangingshifts.ManyofthesetaxisarenowequippedwithGPSandtheirspatio-temporalcoordinatesareavailable.Thusifacityispartitionedintoagridthenatagiventimewecanestimatethecountofthenumberoftaxisinthegridcells.Overtime,thecellcountswillsettleintoapatternandvaryperiodically.Forexample,duringmorningrushhourmoretaxiswillbeconcentratedinbusinessdistrictsthanatothertimesoftheday.Similarlytaxicountsnearairportswillsynchronizewithaircraftarrivalanddepartureschedules.Occasionallytherewillbeadepartureofthecellcountsfromperiodicbehaviorduetounforeseeneventslikevehiclebreakdownsorone-timeeventslikebigsportingevents,fairsandconventions.

Ourobjectiveistoidentifycontiguoussetofcellsandtimeintervalswhichhavethelargeststatisticallysigniﬁcantdeparturefromexpectedbehavior.

Oncesuchregionsandtimeintervalshavebeendiscoveredthenexpertscanbeginidentifyingeventswhichmayhavecausedtheunexpectedbehavior.Thisinturncanhelpmakeprovisionstomanagefuturetrafficbehavior.Similarproblemsappearinmanyotherdomains.Forexample,governmenthealthcareagenciesareinterestedindetectingemergenceofdiseasepatternswhichdeviatefromexpectedbehavior.

⁎Correspondingauthorat:SchoolofInformationTechnologies,UniversityofSydney,Australia.

E-mailaddresses:qlinsey@it.usyd.edu.au(L.X.Pang),sanjay.chawla@sydney.edu.au(S.Chawla),wei.liu@unimelb.edu.au(W.Liu),yuzheng@microsoft.com(Y.Zheng).

358L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

Thenumberofcontiguousregionsandtimeintervalsisverylarge.Forexample,ifthespatialgridcorrespondstoan×nmatrixandthereareTtimeintervals,thentherearepotentiallyO(n2T)spatio-temporalcellsandO(n4T2)cubicregions.1Thehugeamountofspatio-temporaldata,suchastaxicountacrossdifferentgridregionswithindifferenttimestepsfromminutestohourstodays,requiresanefficientapproachtodetectspatial–temporaloutliersforpredictingabnormaleventsandimplementingtrafficcontrolmeasuresinadvance.Forthismotivation,weapplytheroadnetworkofBeijingandpartitionitintogridtofindoutliers(Fig.1).

Inapaperofparticularrelevancetoourwork,theLRTframework[2]statesthecomputationcostforsinglestatisticvalueaswellasenumeratingallthespatialregionstobeexpensive.Toavoidperformingstatisticalcomputationsforeveryregion,itprovidesapruningstrategybasedonclassicallikelihoodteststatistic.Inthisarticle,weextendtheLRTframeworktodetectabnormaltrafficpattern.Morespecifically,thecontributionsare:••••

Ageneralandefficientpatternminingapproachforspatio-temporaloutlierdetectionisproposed.Persistentandemergingoutlierdetectionstatisticalmodelsareprovided.

Wegiveourproofthattheupper-boundingstrategyofLRTisapplicableto“persistent”and“emerging”outlierdetectionmodels.Experimentsareconductedonsyntheticdatatoverifytheextendedpruningapproachandshowthesignificantimprovementofsearchingwhendatasetsizeislarge;wealsoperformedrealdatavalidationinthedetectionofemergingtaxicounttrendduetosomemajorevents.

Therestofthisarticleisorganizedasfollows.Section2reviewsrelatedwork.Section3illustratesthestatisticalbackgroundandupper-boundingmethodologyforpruning.Section4proposesourapproach,inwhichthestatisticaldetectionmodelsareprovided.Theupper-boundingandpruningmechanisminthisframeworkbasedonourproofarepresentedinSection5.Computationalcomplexityisalsodiscussedinthissection.Section6showstheexperimentsandcasestudies.Finally,Section7concludesthisworkandSection8givesthefuturework.2.Relatedwork2.1.Trafﬁcoutliers

Recently,quiteafewresearchprojectsstartedtofindouttrafficpatternsandanomaliesusingtaxitrajectories[3–6].Forinstance,toprovideauserwiththefastestroutetoagivendestinationatagivendeparturetime,Yuanetal.[4]minesmartdrivingdirectionsfromhistoricalGPStrajectoriesofalargenumberoftaxis.Itproposesatime-dependentlandmarkgraphtomodelthepropertiesofdynamicroadnetworksandappliestwo-stageroutingalgorithmtofindtheefficientdrivingdirections.InanotherworkofYuanetal.[3],itpresentsacloud-basedsystemtoretrievethefastestdrivingroutesbasedontrafficconditionsanddriverbehavior.TheCloudbuildsamodelincorporatingdayoftheweek,timeofday,weatherconditions,andindividualdrivingstrategies.Usingthismodel,thesystempredictsthetrafficconditionsofafuturetimebyagivenrouteandperformsaself-adaptivedrivingdirectionserviceforaparticularuser.

Miningtrafficpatternisanimportantresearchapproach,butdetectingtheoutliersfrommaintrafficflowisalsomeaningful.Forinstance,whenatrafficincidentorjamhappens,trafficflowchangessuddenlyandthiswillbereflectedbyoutliers.Trafficincidentscanbedetectedthroughrecognizingoutliers.Suchunusualtrafficpatternreflectsabnormaltrafficstreamsonroadnetworksandprovidesuseful,importantandvaluableinformation.Unknownbutpotentiallyimportantpatternscanbeforecastbyanalyzingtheseoutliers.Therefore,thedetectionofoutliers/anomaliesfromtrajectorydatacanhelpinsensingabnormaleventsandplanfortheirimpacttoensuresmootherflowoftraffic.InLiuetal.'s[5]work,algorithmsarepresentedfordiscoveringspatio-temporaloutliersandcausalrelationships.Thediscoveryofrelationships,especiallycausalinteractions,amongdetectedtrafficoutliersisinvestigated.ChawlaandZhengetal.[7]furtherdiagnosedetectedtrafficanomaliesbystudyingthetrafficflows(paths)thatleadtoananomaly.InZhengetal.'s[6]work,itdetectsflawedurbanplanningusingtheGPStrajectoriesoftaxicabstravelinginurbanareas.Itfindsanomalouspatternsinacityusingtaxitrajectorieswhichprovidesadeeperunderstandingoftheflawedplanning.Althoughthesetwoapproachesareusedtodetecttrafficoutliers,theyaredifferentfromourworksincewedetecttrafficoutliersusingstatistical-basedapproachandwehaveadifferentdefinitionforabnormaltrafficpatterns.2.2.Outlierdetectionmethods

Tillnow,manyanomalydetectiontechniqueshavebeenspecificallydevelopedforcertainapplicationdomains,whileothersaremoregeneric.Thetechniquescanbecategorizedasclassification-based,distanced-based,clustered-based,andstatistical-based,etc.Inthesesurveys[8,9],theygivecomprehensiveandhigh-leveloverviewofdifferentoutlierdetectiontechniquesandsomeoftheirapplications.

Inthisarticle,wefocusonthestatistic-basedapproachtodetectspatio-temporaloutlier.Theunderlyingprincipleofanystatisticaloutlierdetectiontechniqueis:“Ananomalyisanobservationwhichissuspectedofbeingpartiallyorwhollyirrelevant

Inthiswork,n×nspatialgridandTtimeintervalsaremappedtoathree-dimensionalgrid.Theunitcellisintheshapeofacubeandeverysub-regioninthegridiscalledcubicregion.

1L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373359

(a) Road Network of Beijing(b) Grid Map

Fig.1.AnexampleofthetrafficnetworkofBeijing.Basedonthelongitudeandlatitude,theentirecityispartitionedintoagridmap.Sub-figure(a)ispartitionedintosub-figure(b).

becauseitisnotgeneratedbythestochasticmodelassumed”[10].Itisbasedonthekeyassumption:Normaldatainstancesoccurinhighprobabilityregionsofastochasticmodel,whileanomaliesoccurinthelowprobabilityregionsofthestochasticmodel.Statisticaltechniquesfitastatisticalmodel(usuallyfornormalbehavior)tothegivendataandthenapplyastatisticalinferencetesttodetermineifanunseeninstancebelongstothismodelornot.Instancesthathavealowprobabilityfromtheappliedteststatisticaredeclaredasoutliers.Scoringtechniquesareusedtoassignananomalyscoretoeachinstanceinthetestdatadependingonthedegreetowhichthatinstanceisconsideredananomaly.Usually,theoutputofsuchtechniquesisarankedlistofoutliers.Wemaychoosetoeitheranalyzethetopfewoutliersoruseacut-offthresholdtoselecttheoutliers.Bothparametricaswellasnon-parametrictechniqueshavebeenappliedtofitastatisticalmodel[11,12].Inourwork,weonlyconsidertheparametricanomalydetectiontechniquesbasedontheclassicallikelihoodratioteststatistic(i.e.LRT).

Inthedomainofspatio-temporalapplications,moststatisticoutlierdetectionapproachesproposedsofarareonpurelyspatialsearching[13–15].Evenwhenconsideringtimeaspect,mostoftheexistingworkonspatio-temporaloutlierdetectiontreatsthetimedimensionsimply,eitherbyapplyingpurelyspatialoutlierdetectionmethodsateachtimestep,orbytreatingtimeasanotherspatialdimensionandthusapplyingspatialoutlierdetectioninonemoredimensionalspace(originalspatialdimensionsplustimedimension).Thedisadvantageofthefirstapproachisthatbyonlyexaminingonetimestepofdataatatime,moreslowlyemergingoutliersmaynotbedetected.Thedisadvantageofthesecondapproachisthatlessrelevantoutliersmaybedetected:thoseoutliersthathaveconstantlyexistedforalongtime,ratherthanthosethatarenewlyemerging[16,17].Inourfollowingwork,wedifferentiatetemporalpropertywithspatialproperty.Morespecifically,weinvestigatethespatialregionintwodifferentscenariosinwhichthetemporalpropertyispersistentorotherwiseemerging.Persistenttemporaldatareferstodatawhereitstemporalpropertyisconsistentovertime.Emergingtemporaldatareferstothedatawhereitstemporalpropertyisnon-decreasingovertime.Thesetwoconceptsprovidemoreinformationforpracticaloutlierdetections.

Amongthevariousstatisticalmethodsfordiscoveringoutlier,thespatialandspace–timescanstatistic,introducedbyKulldorff[18–21],hasbeenthemostwidelyadopted.However,itisoriginallydesignedforPoissonandBernoullidata.Lateron,thedifferentvariationsofordinal,exponentialandnormalmodelsareproposed[22–25].Theyhavebeenimplementedinthesoftware(SaTScan)[26].Inthespace–timescanstatisticofKulldorff,thekeyparameterisassumedtobeconsistentovertime.Thetechniquesimplyappliestimeasonemoredimension.Nielletal.[16]pointoutthedistinctfeatureoftimeaspectandproposeamodifiedteststatistictodetectlocalizedandglobalizedemergingcluster.Tangoetal.[27]alsoproposeaspace–timescanstatisticbasedonnegativebinomialmodelbytakingintoaccountthepossibilityofnonnegligibletime-to-timevariationofPoissonmean.Wuetal.[2]proposeagenericframeworkcalledLRTforanyunderlyingstatisticalmodel.Itusestheclassiclikelihoodratiotest(LRT)statisticasascoringfunctiontoevaluatethe“anomalousness”ofagivenspatialregionwithrespecttotherestofthespatialregion.Moreoveragenericpruningstrategywasproposedtogreatlyreducethenumberoflikelihoodratiotests.However,itisusedforspatialanomalydetectionwithoutconsideringthetemporalproperty.Liuetal.[5]proposeanapproachtodiscovercasualrelationshipsamongspatio-temporaloutliers.Here,weonlyfocusondetectingspatial–temporaloutliers.

2.3.Performanceissue

Furthermore,performanceissueisalsoabigprobleminspatialorspatio-temporaloutlierdetection.Thenaivecomputationofspatialoutlierdetectionisverytime-consuming,variousstrategieshavebeenproposedtospeeduptheprocess[18,28–30].

360L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

TheseexistingmethodsintheliteraturearebasedonKulldorff'sspatialscanstatisticandtheyaimtoactuallyavoidconsideringallO(n4)rectangularareas.Alsotheyareonlyapplicabletothoserelativelysimpledensitymeasuresthatareconvexormonotonicwithrespecttotheratioofzonepopulationovertheentirepopulationandtheratioofzone'seventcountovertheentireeventcount.TheLRTframeworkstatesthecomputationcostforsinglestatisticvalueaswellasenumeratingallthespatialregionstobeexpensive[2].Toavoidperformingstatisticalcomputationsforeveryregion,itprovidesapruningstrategybasedonclassicallikelihoodteststatistic.Inthisarticle,weextendittobeapplicabletopersistentandemergingoutlierdetectionscenarios.3.Background

3.1.Thelikelihoodratiotest(LRT)

Weprovideabriefbutself-containedintroductionforfindingthemostanomalousregion(rectangle)inaspatialsetting.Theregionsarerectanglesmappedontoaspatialgrid.Wealsoexplainapruningstrategywhichcancutdownthenumberofrectanglesthatneedtobechecked.Thebasictooltofindtheanomalousregionisthelikelihoodratiotest(LRT).

GivenadatasetX,themodeldistributionf(X,θ),anullhypothesisH0:θ∈Θ0andanalternatehypothesisH1:θ∈Θ−Θ0,LRTistheratio

λ¼

supΘ0fLðθjXÞjH0gsupΘfLðθjXÞjH1gwhereL()isthelikelihoodfunction.θisasetofparameterscomingfromcompleteparameterspaceΘandnullparameterspaceΘ0.Seedetailin[2,31].Inaspatialsetting,thenullhypothesisisthatthestatisticalaspectofthephenomenonofinterestinaregionR(thatiscurrentlybeingtested)isnodifferentfromrestofthespatialarea(denotedasR).ThusifaregionRisanomalousthenthealternatehypothesiswillmostlikelybeabetterfitandthedenominatorofλwillhaveahighervalueforthemaximumlikelihoodestimatorofθ.Aremarkablefactaboutλisthatundermildregularityconditions,theasymptoticdistributionofΛ≡−2logλfollows

2aχ2kdistributionwithkdegreesoffreedom,wherekisthenumberoffreeparameters.(seeFig.2).ThusregionswhoseΛvaluedropsinthetailofχ2distributionarelikelytobeanomalous.3.2.Constrainedmaximumlikelihoodestimation

Constrainedmaximumlikelihoodestimationisasetofproceduresfortheestimationoftheparametersofmodelsviathemaximumlikelihoodmethodwithgeneralconstraintsontheparameters,alongwithanadditionalsetofproceduresforstatisticalinference[32].Barlowetal.[33,34]solvedthemaximumlikelihoodestimationbasedonareliabilitygrowthmodel,whichisapplicabletoouremergingscenario.ItassumesthatasystemisbeingmodifiedduringKstagesofdevelopment(inourapplication,weassumetaxicountsinaregiontobevaryingduringKtimesteps).Dataconsistsofxisuccessesinnitrialsinstagei,i=1,…,k.Letpibethesystemreliabilityatthei-thstage(inourcase,piistheincreasingrateoftaxicountati-thtimestep).Barlowetal.,obtainedthemaximumlikelihoodestimatesofp1,p2,…,pk,undertherestrictionthatp1≤p2,≤…,≤pk.Toobtainthemaximumlikelihoodestimatesofp1,p2,…,pksubjecttotherestrictionthatp1≤p2,…,≤pk,firstformtheratiosx1/n1,x2/

^iofpi.Ifforsomej(j=1,…,K−1),xj/nj,combinethen2,…,xk/nk.Ifx1/n1≤x2/n2,≤…,≤xk/nk,thenxi/xnistheMLEp

observationsinthej-thand(j+1)-ststagesandexaminetheratios:x1/n1,…,xj−1/nj−1,xj+xj+1/nj+nj+1,xj+2/nj+2,…,xk/nk,forthe(k−1)stagesthusformed.Iftheseratiosareinnon-decreasingorder,theyconstitutetheMLEsof

ÀÁÀÁ

^j¼p^jþ1¼xjþxjþ1=njþnjþ1.Ifnot,continuetheprocessofcombiningstagesuntiltheratiosareinp1≤p2,≤…,≤pkwithp

non-decreasingorder.Thisprocessneedstoberepeatedatmost(k−1)times,andtheresultisindependentoftheorderinwhichstagesarecombinedtoeliminatereversalsinthesequenceofratios.

Tosimplify,weget:

\"#kkXX

^i¼Maxk≥iMins≤ixi=nip

i¼s

,wherei=1,…,k.3.3.MonteCarlosimulation

Thelikelihoodratio,orequivalentlyitslogarithm,canbeusedtocomputeap-value,orcomparedtoacriticalvaluetodecide

whethertorejectnullhypothesis(i.e.thereisnoanomalyinourcase)infavorofthealternativehypothesis(i.e.thereisanomaly).Theprobabilitydistributionoflikelihoodratio,assumingthatthenullhypothesisistrue,canbeapproximatedbychi-squaredistribution

2Iftheχ2distributionisnotapplicablethenMonteCarlosimulationcanbeusedtoascertainthep-value.

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373361

Fig.2.Chi-squaredistributionwithdegreeoffreedom=3.

[35].ButwecannotexpecttofindthedistributionofthelikelihoodratioteststatisticinclosedanalyticalformandthusMonteCarlosimulationcanbeperformedtoobtainp-valueinsomecases.Therefore,oncewehavediscoveredtheregionwithmaximumlikelihoodratiovalue,thestatisticalsignificanceofthisregioncanbederivedfromthechi-squaredistributionorbyconductingMonteCarlosimulations.Torunthesimulationtest,alargenumberofreplicationsofdatasetsaregeneratedunderthenullhypothesis,forinstance,9999suchreplicasarecreatedtoperformlikelihoodratiotest.Thetestissignificantatthe5%levelifthelikelihoodratiovalueisamongthe500highestvaluesoftheteststatisticcomingfromthereplications.3.4.Upper-boundingmethodology

Theupper-boundingstrategyforLRTforanomalydetectionwasintroducedbyWuandJermaine[2].ThebasicobservationisthatthelikelihoodvalueofanygivenregionRundercompleteparameterspaceisnotgreaterthanthemultiplicationofthelikelihoodvalueofallitsnon-overlappingsub-regionsundernullparameterspace.Therefore,theloglikelihoodofanygivenregionRcanbeupper-bounded.Forinstance,ifaregionRiscomposedoftwonon-overlappingsub-regionsR1andR2,then

LðθRjXRÞ≤LðθR1jXR1ÞÂLðθR2jXR2Þ:Itisequivalentto

logLðθRjXRÞ≤logLðθR1jXR1ÞþlogLðθR2jXR2Þ:

HereθR;θ′R1andθ′R2arethemaximumlikelihoodestimatorsundercompleteandnullparameterspacesseparately.SeeFig.3.

′

Fig.3.TheloglikelihoodofregionRundercompleteparameterspaceisupperboundedbythesumoftwonon-overlappingsubregionsR1andR2undernullparameterspace.

362L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

Fig.4.Anexampleof(4×4)gridtoillustratetheLRTcalculationandtheupper-boundingpruningmethodology.

Theupper-boundingstrategyisusedtoprunenon-outliers:IfwereplacethelikelihoodofaregionRbytheproductofthelikelihoodsofitssub-regionsandthenewLRTisbelowtheanomalousthreshold(i.e.confidencelevelα),thenRcannotbeanomalous.3.5.Examples

3.5.1.Example1

Usingasimplebutconcreteexample,wewillnowexplainhowtofindanomalousregionusingtraditionalLRTcomputationandtheupper-boundingpruningstrategy.Considerthe4×4grid(G)inFig.4.Thenumberofsuccesses(mi)independentlygeneratedbyPoissonmodelPo(bip)isdisplayedineachcellci.Thebaselinebiineachcellciissetto10.Thesuccessratepis0.5fortheregionRand0.1fortherestofcells.Thesignificancelevelissettoα=0.05.Wereferthesuccessratepasthetestparameter.3.5.1.1.Procedures.ForagivenregionR,traditionalLRTcalculationinvolvesseveralsteps:maximumlikelihoodestimatorfortestparameterofR,RandG;likelihoodcalculationofR,RandG;theratiocalculationfromtheprevioustwosteps.AlthoughthecalculationforPoissondistributeddatacanbesimplifiedinto1EXPstatisticalmodel,inordertoillustratethetraditionalLRTcomputation,theoriginalstepsarecarriedoutasthefollowing:1.Thelikelihoodfunctionofeachcelliis:

ðbipÞkie−ðbipÞ

:fðpjciÞ¼

ki!2.ThelikelihoodofanygivenregionR,whichiscomposedofcellc1,c2,…,ci,…,ctis:

ðbipÞkie−bip

:LðpjRÞ¼Πci∈R

ki!ð

ð1Þ

ð2Þ

^)iscalculatedas:3.TheMLE0ofpforaregionR(denotedasp

󰀄󰀅󰀄󰀅

^¼∑c∈Rki=∑c∈Rbipii

ð7þ8Þ^^^^^R¼ð10Thuspþ10Þ¼0:75.Similarly,pR,pR1,pR2andpGareobtainedas:0.14,0.7,0.8and0.21.

4.TheΛofregionRisgivenby

ð3Þ

󰀄󰀅ÀÁ19þ15−0:21Â160

ÂeΛR¼−2logLðpjGÞþ2logLðpjRÞþ2logLðpjRÞ¼−2log0:21

󰀄󰀅

15−0:75Â20

þ2log0:75Âe󰀄󰀅

19−0:14Â140

þ2log0:14Âe¼20:76:

ð4Þ

Fromtheabovesteps,wegettheexactloglikelihoodvalueofregionR:logL(p|R)=−19.31;andtheexactloglikelihoodofR1andR2:logL(p|R1)=−9.49,andlogL(p|R2)=−9.78separately.

Weknowthecriticalvalueofχ2(α)=3.84.Obviously,20.76isgreaterthan3.84.ThereforeregionRistreatedasapotentialoutlier.

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373363

3.5.1.2.Upper-boundingpruning.Wecanalsoverifytheupper-boundofregionR:ThesumofloglikelihoodofR1andR2is−19.27,whichisgreaterthantheexactloglikelihoodvalueofRwhichwascomputedas−19.31.

󰀄󰀆󰀅󰀄󰀆󰀅󰀆󰀆

Similarly,forregionR′,logLp󰀆R′1¼−1:31,andlogLp󰀆R′2¼−1,thesumofloglikelihoodofR′1andR′2is−2.31.ItisgreaterthantheloglikelihoodvalueofR′,whichis−2.66.Furthermore,ΛR′¼2:14,whichissmallerthanχ2(α)=3.84.Thisshowsthattheupper-boundedLRTvalueissmallerthanthecriticalvalue.Usingthepruningstrategy,theactualvalueofregionR′neednotbecalculatedandcanbepruned.

3.5.2.Example2

Thisexampleillustrateshowtoestimatemaximumlikelihoodofsystemreliabilityinreliabilitygrowthmodel.Todetectemergingoutlier,weusethiswaytocalculatemaximumlikelihoodestimator.

(1).Thenumberofsuccesses,populationandsuccessrateineachtimestepareshowninTable1.

^i,theprocesstogetasequenceofnon-decreasingratiosissummarizedbelow(2).ToestimatethemaximumlikelihoodofP

accordingtotheBarlowtheorem(Table2):^1¼P^2¼P^3¼P^4¼0:385;P^5¼0:833.Fromtheaboveprocedures,weobtainthemaximumlikelihoodestimates:P4.Proposedstatisticalmodels

Deﬁnition1.KP:Itrefersto“keyparameter”,denotedasKP{θ1,θ2,…,θi,…,θn}.θiisaparametercomingfromthekeyparameter

set.Forinstance,inepidemiology,ifweareconcernedaboutthetrendofthediseaserateinaspatio-temporalview,thediseaserateisKP.Inourapplication,thevariationoftaxicountwithinaperiodisKP.Forsimplicity,weonlyconsideroneparameterfromthekeyparametersetinourwork(denotedasKP).4.1.PSTOmodel(PersistentSpatio-TemporalOutliermodel)

Itisusedtodetectpersistentspatio-temporaloutliers.ThenullhypothesisH0assumesthattheKPisconsistentforallregionsovertime.ThealternativehypothesisH1assumesthatKPhasahighervalueinregionri∈Rthanthevalueoutsideofregionrj∈G–R(i.e.R),butthevalueinregionri∈Risconsistentovertime.Wecalculatethelikelihoodratiotestasfollows:

8><Πr∈RLðθrjXRÞΠ

DðRÞ¼

ri∈RLðθrjXrÞ1

Πri∈GLðθGjXGÞforθr≥θr;otherwise:

ThisformulaistheclassicalLRTstatistic.WefirstcalculatetheMLEofθrandθrtomaximizethenumeratorandtheMLEofθGto

maximizethedenominator.Thentheratioisthescoreweusetoevaluatethe“anomalousness”ofagivenspatio-temporalregion.4.2.ESTOmodel(EmergingSpatio-TemporalOutliermodel)

Thismodelisusedtodetectemergingspatio-temporaloutliers.ThenullhypothesisH0assumesthattheKPisconsistentforallregionsovertime.ThealternativehypothesisH1assumesthatKPisnon-decreasingwitheverytimestepoverregionri∈Randhigherthanrj∈R.Wecalculatethelikelihoodratiotestasfollows:

>>>>ttt>>>Πri∈GLðθt>>GjXGÞ>>:;otherwise:1

Table1

Reliabilitygrowthprocedure.Timestep(i)01234

Numberofsuccesses(xi)2030302050

Population(ni)5070806060

Successrate(xi/ni)0.4000.4290.3750.3330.833

Table2

Themaximumlikelihoodestimatorprocedure.i12345

xi2030302050

ni5070806060

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

xi/ni0.4000.4290.3750.3330.833

Firstcalculation0.400

60/150=0.4000.3330.833

Secondcalculation80/210=0.381

Thirdcalculation100/260=0.385

0.8330.833

ThisformulaisderivedfromtheclassicalLRTstatisticanddesignedfortheemergingscenario.UserneedstofindasolutiontomaximizethenumeratorwiththeincreasingKP.Forinstance,Barlow[34]providesanapproachtosolvetheconstrainedmaximumlikelihoodestimationonthereliabilitygrowthmodelinwhichtherelativeriskisnon-decreasingovertime.OrEMalgorithmcanbeperformedtoestimatethekeyparameter.

5.Upper-boundingstrategyandpruningmechanismforproposedframework5.1.Upper-boundingstrategy

(1).InPSTOmodel,theupper-boundingstrategyexplainedinSection3.4canbeextendeddirectlytospatio-temporal

dimension.

(2).InESTOmodel,KPisassumedtovaryatdifferenttimesteps;weshowbelowthattheupper-boundingstrategyisstill

applicabletothismodel.

Theorem1.LetregionR=Rt1∪Rt2,fornon-overlappingtimeintervalst1andt2,wehave:

LðθRjXRÞ≤LðθRt1jXRt1ÞÂLðθRt2jXRt2Þ

′

ð5Þ

,whereθR¼θRt1∪θRt2andXR¼XRt1∪XRt2.

󰀆ÀÀÁÀ󰀆Á

Proof.WeknowLθRjXRÞ¼LθRt1󰀆XRt1ÂLθRt2󰀆XRt2.UsingtheLRTupper-boundingbasicconcepts,weknowthatθRt1ischosenundermorestrictcompleteparameterspaceandθ′Rt1ischosenunderloosennullparameterspace.ThismeansthatperformingMLE0onasub-intervalofRhasloosentheconstraintscomparingwithperformingMLE1onR.Thus,wehaveLðθRt1jXRt1Þ≤Lðθ′Rt1jXRt1ÞandLðθRt2jXRt2Þ≤LðθR′t2jXRt2Þ.Therefore,LðθRjXRÞ≤Lðθ′Rt1jXRt1ÞÂLðθ′Rt2jXRt2Þ.

Theorem2.LetregionR=R1∪R2,fornon-overlappingspatialregionsR1andR2,wehave:

LðθR1;θR2jXR1;XR2Þ≤LðθR1t1;θR1t2jXR1t1;XR1t2ÞÂLðθR2t1;θR2t2jXR2t1;XR2t2Þ

′

ð6Þ

,whereR,R1,andR2arecomposedof(t1,t2)timestepsrespectively.Herewejustusetwotimestepstoillustrate.Itisapplicabletoanyttimesteps.

Proof.Foreachtimestepi,wehave:LðθRtijXRtiÞ≤Lðθ′R1tijXR1tiÞÂLðθ′R2tijXR2tiÞ

LðθR1;θR2jXR1;XR2Þ¼LðθR1jXR1ÞÂLðθR2jXR2Þ

LðθR1t1;θR1t2jXR1t1;XR1t2Þ¼LðθR1t1jXR1t1ÞÂLðθR1t2jXR1t2ÞLðθR2t1;θR2t2jXR2t1;XR2t2Þ¼LðθR2t1jXR2t2ÞÂLðθR2t2jXR2t2Þ

Thereforeweget

LðθR1;θR2jXR1;XR2Þ≤LðθR1t1;θR1t2jXR1t1;XR1t2ÞÂLðθR2t1;θR2t2jXR2t1;XR2t2Þ:

′

FromTheorems1and2,weknowthattheupper-boundingstrategyisapplicabletoemergingmodel(ESTO).

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373365

(a) R(b) R

¯(c)R

Fig.5.Pre-computationofanygivenspatial–temporalregionRandtilingofR.Sub-figure(a)showsan8×8×8spatial–temporalgrid;sub-figure(b)shows:one

ofthecuboidsfromspatialpre-computedsetissplitfromtemporaldimensionandresultsin15smallercuboids.Sub-figure(c)istheradialmethodtotileR.

5.2.Pre-computationandpruningmechanism

5.2.1.Pre-computationforregionR

Werecursivelysplittheregionintotwosub-regionsofthesamesize,startingfromthebiggestcuboidenclosedbytwoplanesfromtimeview,endingatthelowestresolutionofthespatial–temporalgrid.Fig.5bshowsthesplitapproachesforasub-cuboidhighlightedasbluefromthetemporaldimensioninan8×8×8grid(Fig.5a).Thelikelihoodofanygivenregioncanbeupper-boundedbythispre-computedsetviathetilingofLRT.

5.2.2.Pre-computationforthecomplementofregionR(i.e.R)Byconsideringalloftheintersectionpoints,weconnecteachintersectionpointonthe3-dimensionalgridwiththeeightcornersofthegrid.Thisproduceseightdiagonals,eachofwhichcreatesonecuboidinthepre-computedset.SincethereareO(n4)intersectionpoints,thereareO(n4)cuboidsinthepre-computedset.Afterwegetthepre-computedset,foranygivenregionR,weusetheradialandsandwichmethodsinLRTtogettheupper-boundedlikelihoodvalueofR.Thesetwomethodsproducesixnon-overlappingsub-cuboidregionsforRseparately.RadialmethodisperformedbyelongatingthesidesofaregionRuntilthesideshitthegridbordersusingclock-wisecounterclock-wideorder.SandwichmethodisperformedbyelongatingtwoparallelsidesofRinbothdirectionsuntiltheyreachthebordersofthegrid.Seedetailofthesemethodsin2-dimensionalgrid[2].In3-dimensionalview,twelvetimestilingisinvolved.Fig.5cshowsthetilinginradialway.

5.2.3.Computationalcomplexity

Inthebrute-forceapproach,thereareatotalofO(n6)regionsthatneedtobesearched.OurapproachreducesthecostbyprecomputingtwolikelihooddatasetwithsizeofO(n4).Thelikelihoodofeveryregionisupper-boundedandthereallikelihoodiscalculatedonlyforanumberofregions.Furthermore,inourimplementation,wehavealreadyrankedthetop-kregionsaccordingtothelikelihoodratiovalues.Therefore,theperformancewon'tbeaffectednomatterwhichsignificancetestingmethodisbeingapplied.

TheprocessofoutlierdetectionisshowninAlgorithm1.Theinputtingparametersare:datagrid(G),probabilitydensityfunction(f),maximumlikelihoodestimationfunctionunderdifferentparameterspace(MLE0,MLE1),likelihoodfunction(L),numberoftopregionstobereturned(K)andthesignificancelevel(α).Inthisprocess,steps1and2performpre-computations;step5tostep8obtainstheupper-boundedlikelihoodvalueofcurrentcuboidforeachiteration.Duringeachiteration,thechi-squareddistributionisappliedtoprunenormalregions.Finally,itoutputstop-kanomalousregions.6.Experiments,resultsandanalysis

WereportonexperimentsconductedwherewehaveusedAlgorithm1totestforaccuracy,pruningabilityandperformance.Kwassetto1.InSections6.1and6.2allexperimentswerecarriedoutonsyntheticdata.InSection6.3,wedemonstratetheusefulnessofourapproachonarealdataset.6.1.Resultsonsyntheticdata

Wetestedfourvariantsoftheoutlierdetection:1.2.3.4.

brute-forcepersistentspatio-temporaloutliers(bpsto)brute-forceemergingspatio-temporaloutliers(besto)

pruning-basedpersistentspatial–temporaloutliers(ppsto)pruning-basedemergingspatio-temporaloutliers(pesto).

366L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

Algorithm1.Topkspatio-temporaloutlierdetection

Wegenerateddatasetonagridsizevaryingfrom(4×4×4)to(128×16×16).Fiftyseparatetrialswerecarriedoutforeachscenario(seebelow)andwemeasuredthreeaspects:(a)pruningrate(b)accuracy,and(c)runningtime.Thesignificancelevelwassetatα=0.05.

6.1.1.ScenarioI

Thenullhypothesisholds.ThebaselinebcisgeneratedrelativelyuniformlybyaNormaldistribution(μ=104,σ=103)andafixedsuccessratepof0.001.ThenumberofsuccesseskcisgeneratedfromPo(bcp).ResultsareshowninTable3.

6.1.2.ScenarioII

Thenullhypothesisholds.TheonlydifferencewithScenarioIisthatthedatainarandomselectedcuboidareawithsizeof(5×4×3)isgeneratedbyaNormaldistributionwithdifferentparametersettings(μ=105,σ=5×103).ResultsareshowninTable3.

6.1.3.ScenarioIII

Thealternativehypothesisholds.Itissimilartothenullmodelexceptthatthedataofarandomlyselectedcuboidareaofsize(5×4×3)isgeneratedfromaPoissondistributionwithp=3,6,9,18,and36foremergingcaseandp=3forpersistentcase.ThedatanotwithinthecuboidareawasalsofromaPoissondistributionwithp=1.ResultsareshowninTable4.

Table3

AveragepruningrateandaccuracyinScenariosIandII.Test

(a)ScenarioI4×4×48×8×816×16×16(b)ScenarioII4×4×48×8×816×16×16

Pruning(%)10010099.9

Accuracy(%)NofalsealarmNofalsealarm0.1falsealarm

10099.99100Nofalsealarm0.01falsealarmNofalsealarm

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

Table4

AveragepruningrateinScenarioIII.Testppsto(%)pesto(%)

16×16×1695.2798.37

32×16×1697.3598.46

×16×1697.98.69

32×32×3297.4799.11

367

128×16×1696.7499.23

6.1.4.ScenarioIV

Thealternativehypothesisholds.ItissimilartoScenarioIIIexceptthatthedataofarandomlyselectedcuboidareaofsize(5×4×3)wasgeneratedfromaPoissondistributionwithp=10,50,250,1250,and6250foremergingcaseandp=10forpersistentcase.ResultsareshowninTable5.6.2.Evaluationsonsyntheticdata

6.2.1.AnalysisonScenariosIandII

TheresultsofScenariosIandIIshowthatweachieveahighpruningrateandnofalsealarmisgeneratedevenwhenweperturbthedistributionofoneregion.Thisisasexpectedanddemonstratesthatthealgorithmiswellcalibrated.ByahighpruningratewemeanthatwecanruleouttheoutliersbyjustcheckingtheLRTupperboundderivedfromthetiling.IftheupperboundvalueislessthanthecriticalvaluethenthetrueLRTvalueoftheregioncannotbeanomalous.

6.2.2.AnalysisonScenariosIIIandIV

ForScenariosIIIandIVtheanomalousregionswerecorrectlyidentifiedwhilemaintainingahighpruningrate.Alsotherewerenoregionsdeclaredasfalsepositives.6.2.3.Analysisonrunningtime

I.ProportionofRunningTime:WeanalyzetherunningtimewithandwithoutpruningforScenariosIIIandIV.Weplotouttheproportionofrunningtimeofpruningapproachrelativetobrute-forceapproachinFig.6,whichisthecomputationofrunningtimeof(brute-force−pruning)/brute-force.Also,theproportionpercentageisdisplayedontheselinegraphs.Itshowsthatasthesizeofthespatialandtemporalregionincreases,theeffectofpruningbecomesprominent.Forthelargestdatatested,thepruningmechanismresultedinasavingsofnearly50%comparedtothebrute-forceapproach.

II.SingleLRTCalculationCost:Wehavealsocalculatedthecostofasinglelikelihoodcalculationasthedimensionofthegridsizeincreases.TheresultsareshowninFig.7.Forthe8×8×8dataset,thecostofthelikelihoodcalculationusingthebrute-forceapproachis0.01mswhilewithpruningitincreasesto0.08ms.However,forthelargerdatasets(e.g.,128×16×16)thecostofasinglelikelihoodcalculationgoesfrom0.30msforthebrute-forceapproachtoaround0.16mswithpruning.Anotherobservationisthatthecostofthesinglelikelihoodcalculationisnearlysimilarfordatasetsofthesamesizebutdifferentdimensions,forexample128×16×16and32×32×32.

III.ComponentsofRunningCost:Wehavealsoanalyzedandcomparedtherunningofthedifferentcomponentsbothforthe

brute-forceandpruningapproaches.TheresultsareshowninFig.8.Thebrute-forceapproachhasthefollowingcomponents:

(1).ThecostofthelikelihoodcalculationforeachregionR(RComputation).

(2).ThecostofthelikelihoodcalculationforthecomplementofeachregionR,denotedasR(RComputation).

Thepruningapproachismorecomplexandinvolvesthefollowingcomponents:

(1)ThecostofcomputingthelikelihoodforeachelementofthetilingsetTR.Thiswillbeusedtoupperboundthe

likelihoodvalueforanarbitraryspatio-temporalregion.(Rpre-computation)

(2).ThecostofcomputingthelikelihoodforeachelementofthetilingsetTR(Rpre-computation).

(3).Thecostofupper-boundingthelikelihoodofR.ThisinvolvesfirstexpressingRasaunionofsubregionsandtheneach

subregionasaunionoftilesfromTR.

(4).Thecostofupper-boundingthelikelihoodofR(RComputation).ThisinvolvesfirstexpressingRasaunionof

subregionsandtheneachsubregionasaunionoftilesfromTR.EachRregioncanbeexpressedasaunionofsixsubregionsandtherearetwotypesoftilingmethods:sandwichandradial.Wecalculatethelikelihoodvalueusingbothtilingmethodsandthenselectthetightestupper-bound.

Table5

AveragepruningrateinScenarioIV.Testppsto(%)pesto(%)

16×16×1679.2795.57

32×16×1697.5197.40

×16×1697.7796.78

32×32×3297.2294.70

128×16×1696.65.23

368L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

(a) Scenario III psto(b) Scenario III esto

Fig.6.Theproportionofrunningtimeofpruningvs.brute-forceapproach.Itshowsthatoutlierpruningsearchingissignificantlyimprovedwhenthedatasetsizestartsfrom32×16×16inthesefourdifferentscenarios.

AsisclearinFig.8,Rcomputationisthemostexpensivepartofthecalculation.However,asthedatasetsizeincreases,theoverheadsofthetilinggivewaytoitsmoreefficientreuseresultinginconsiderablesavings.6.3.Casestudies:BeijingtaxiGPSdata

WeillustratetheuseofthePestomethodonarealdataset[4,5].ThedatasetconsistsofthreemonthsofGPStrajectoriescollectedfrom33,000taxisinBeijingbetween01/03/2009and31/05/2009.TheroadnetworkofBeijingissplitintogrid.Thetaxicountsofeachcellwhichisidentifiedbycolumnindexandrowindexareprovidedinatextfile.Thetimefrequencyofmonitoringthetaxicountsis15min.Todealwithrealdata,weonlyneedtocalculatethetotaltaxicountsineachcellwithinagivenperiod,thenthispre-processeddatacanbedirectlyloadedintoouralgorithm.Inthissection,wesearchforthemostanomalousemergingregionwithinaspecifiedtimeperiodandthenprovideapossibleexplanationfortheanomaly.

6.3.1.CaseI

All(8×8)gridsweretestedbetween9:00:00amand10:00:00amforsixteendays.Wechoose20daysofdatatocalculatethebaselineprobabilities.

6.3.1.1.ResultI.Theperiodfrom01/05/2009to02/05/2009emergedasatopoutlieratthepositionof(0,1)and(1,1)onthegrid.ThisperiodcorrespondstotheLabordaypublicholiday(“GoldenWeek”).Usuallytheholidaydurationissevendays(fromMay1sttoMay7th),butstartingfrom2009,theholidayperiodwasshortentobetweenMay1standMay3rdinclusive.TocelebratetheholidaysitappearsthatmanypeoplevisitedHappyValley,thebiggestamusementparkinBeijing.The3rdInternationalFashionfestivalwasalsoheldinthatlocation.Ourresultscoincidewiththefactthattaxisenjoygoodbusinessonpublicholidays

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373369

(a) Scenario III psto(b) Scenario III esto

Fig.7.Thesinglelikelihoodrationcalculationcostofpruningvs.brute-forceapproach.Itshowsthatoutlierpruningsearchingissignificantlyimprovedwhenthedatasetsizestartsfrom32×16×16inthesefourdifferentscenarios.

andthereisusuallyanincreaseinthenumberoftaxisneartouristspots.TheresultsareshowninFigs.9aandb,and10a.Wecanseethatthenumberoftaxisincreasedfrom1stMayto2ndMayandthendecreasedfrom3rdofMayonwards.

6.3.2.CaseII

All(8×8)gridsweretestedbetween3:15:00pmto4:30:00pmfor8days.Weuse12daysofdatatocalculatethebaselineprobabilities.

6.3.2.1.ResultII.Theregionhighlightedasblueonthemapwasdetectedasanemergingoutlierfrom16/03/2009to20/03/2009.ItisoneofthecityexpressroadcalledTonghuiheNorthRoad.From01/03/2009to13/03/20093,the11thNationalPeople'sCongress(i.e.NPC)washeldinBeijing,whichistheannualmeetingofthehighestlegislativebodyofthePeople'sRepublicofChina.Nearly3000deputiesfromalloverChinaattendedtheCongress.Duringthisperiod,thetrafficauthoritiesinBeijingimposedtemporaryrestrictionmeasuresonvehiclestocontroltrafficflow.Mostpeoplechoosetotakebusorsubwayinsteadofdrivingortakingtaxitocommutetowork.ThenumberoftaxitravelingonTonghuiheNorthroadincreaseduntilmostofthedeputiesleftBeijing.TheresultsareshowninFigs.9candd,and10b.

Toinvestigatemore,wesetk=5inourcasestudies.Wefoundthattheothertop4outliershavebigoverlaponthetop1outlierregion.Theseoutliershavesimilarspatialareaandspanningtimeperiod.Itverifiesthattheemergingoutliercanbecorrectlylocated.7.Conclusion

Inthisarticle,anefficientpatternminingapproachwasproposedtocaterforspatio-temporaltrafficdata,whichisabletodetect“persistentoutliers”and“emergingoutliers”.Weproposedtwostatisticalmodels,whichencompassthegenericfeaturesofanomalouspatterns.Wealsoderivedanupper-boundingstrategyforthetwostatisticalmodelssupportingforfastoutlierdetection.Ourcomprehensiveexperimentsshowthattheperformanceofcomputationaltimeisgreatlyimprovedwhenthedatasetsizeislarge,andwecanstillfindthecorrectoutliers.Wealsocarefullyanalyzedthetilingschemeandtheupper-boundingstrategyinthesyntheticexperiments.Inourcasestudies,ourmodelisabletodetectregionswithemergingnumberoftaxisthatcanbevalidatedbyknownmajortrafficevents.

370L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

(a) Split cost of ESTO with

smaller dataset

(b) Split cost of ESTO with

larger dataset

smaller dataset

(d) Split cost of PSTO with

larger dataset

Fig.8.Therunningtimeofcomparablepartsofbrute-forcevs.pruningapproachinscenarioIII.Itshowsthatthepruningsearchingisfasterwiththelargerdataset.AlthoughthetilingofeveryRtakesthelongesttimeinpruningsearching,thecostissmallcomparedtothelikelihoodcalculationofeveryRinbruteforcesearching.

Ourapproachisapplicabletoawidevarietyofcontexts.Forinstance,inweatherforecastmodels,itcanbeusedtodetectemergingweatherpatternwhichraisesthepossibilityofdryandwarmclimates.Theseclimatesmayhavegreatimpactoninfectiousdisease.Investigatingsuchweathervariablesassociatedwithinfectiousdiseasescanhelpinanticipatingfutureepidemics,andearlywarningsystemcanbedevelopedforsurveillanceandinterventions.Ingeneexpressionmodels,discoveringemergingpatternsishelpfulfordiagnosisandunderstandingcorrelationofgeneexpressionprofilestodiseasestatesinasignificantway.Theyareusefulforcapturinginteractionsamonggenes,findingsignaturepatternsfordiseasesubtypes,andgeneratingpotentialdiseasetreatmentplans,etc.Infinancemodels,miningemergingbusinesstrendfromdifferentgeographicalregionscanprovideinsightforidentifyingsomeprofitableinvestments.Itcanalsobeappliedinemergingeventidentification,intrusiondetection,etc.8.Futurework

Fromtheabovediscussion,wealreadyobservedthatthecostofpruningapproachhasnearlyabove50%speed-upwithrespecttothenaivealgorithmasthegridsizeincreases.Itallowsthecomputationtobeperformedforlarge-scaleapplications.However,wealsonoticedthatitstillneedsalmostonetotwodaystofindthetopoutlierevenwithpruningapproachinoursyntheticexperimentswithgridsizeof128×16×16or32×32×32.Whenthedatasetsarenotmodestlysized,thescalabilityisstillnotgood.Thisperformancemightnotbeacceptableinrealdetectionapplications.

Weseethatthescalabilityofoutlierdetectionhasbecomeimportantastheamountofdataforanalysishasbeenincreasinggreatly.Fordealingwithlargedatasets,itisimportanttobothparallelizethealgorithms,andimplementthemtoexecuteefficiently.Fortunately,theLRTscanningalgorithmishighlyparallelizable,eachsub-gridcomputationisindependentofeachotherandthewholegridcanbepartitionedintoequalpartsanddistributedoverthemultipleprocessors.TheGPUcomputingorGPGPU(i.e.General-PurposecomputationonGraphicsProcessingUnits)hasbecomeanewtrendforresearcherstodogeneralpurposescientificandengineeringcomputationbytheuseofGPU.ItenablesdramaticincreasesincomputingperformancebyharnessingthepoweroftheGPUandisstartingtoplayasignificantroleinlarge-scalemodeling.Duetoourhighlyparallelizable

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373371

(a) The average taxi counts within outlier

regions vs. non-outlier regions from01/05/2009 to 02/05/2009

(b) The average taxi counts within outlier

regions from 01/05/2009 to 08/05/2009

regions vs. non-outlier regions from16/03/2009 to 20/03/2009

(d) The average taxi counts within outlier

regions from 14/03/2009 to 21/03/2009

Fig.9.Comparisonofoutlyingandnon-outlyingregionsinan8×8×8grid.Itshows:(a)theaveragetaxicountswithinoutlierregionsarenon-decreasingcomparedtonon-outlierregionswhichsharethesameemergingperiodwithoutlier.(b)Theaveragetaxicountswithinoutlierregionsthroughouttheemergingperiodarenon-decreasingcomparedtotheoutlierregionsfortherestoftheperiod.

algorithm,thetechnologyofgraphicsprocessingunit(GPU)andcomputeunifieddevicearchitecture(CUDA),whichareidealformassivedataparallelism,mightbeconsideredinourimplementationtoacceleratethespatio-temporalexploringandanalyzingprocesses.Aspartofourfuturework,parallelizationofpruningapproachwillbepursuedtoachievefasteroutlierdetection.

(a) The region highlighted with blue borders

on the map is the outlier region of Case I.The icons hows the exact location ofHappy Valley.

(b) The region highlighted with blue borders

is the outlier of Case II. It is the city expressroad of Bei-jing. (i.e. Tonghuihe North Road)

Fig.10.OutlierlocationsfromourtwocasestudiesonBeijingMap.

372L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373

References

[1]Y.Zheng,X.F.Zhou,ComputingwithSpatialTrajectories,Springer,2011.

[2]M.Wu,X.Song,C.Jermaine,S.Ranka,J.Gums,ALRTframeworkforfastspatialanomalydetection,In:Proceedingsofthe15thACMSIGKDDInternational

ConferenceonKnowledgeDiscoveryandDataMining(KDD'09),2009,pp.887–6.

[3]J.Yuan,Y.Zheng,X.Xie,G.Sun,Drivingwithknowledgefromthephysicalworld,in:Proceedingsofthe17thACMSIGKDDConferenceonKnowledge

DiscoveryandDataMining(KDD'11),2011,pp.316–324.

[4]J.Yuan,Y.Zheng,C.Zhang,W.Xie,X.Xie,G.Sun,Y.Huang,T-drive:drivingdirectionsbasedontaxitrajectories,Proceedingsofthe18thACMSIGSPATIAL

InternationalConferenceonAdvancesinGeographicInformationSystems(ACMSIGSPATIALGIS2010),2010,pp.99–108.

[5]W.Liu,Y.Zheng,S.Chawla,J.Yuan,X.Xie,Discoveringspatio-temporalcausalinteractionsintrafficdatastreams,in:KDD'1117thSIGKDDConferenceon

KnowledgeDiscoveryandDataMining,2011,pp.1010–1018.

[6]Y.Zheng,Y.Liu,J.Yuan,X.Xie,UrbanComputingwithTaxicabs,UbiComp'11,September17–21,2011,pp.–98,(Beijing,China).

[7]S.Chawla,Y.Zheng,J.Hu,Inferringtherootcauseinroadtrafficanomalies,IEEEInternationalConferenceonDataMining(ICDM2012),2012.[8]V.Chandola,A.Banerjee,V.Kumar,Anomalydetection:asurvey,in:AcmComputingSurveys,vol.41,2009,pp.1–58.

[9]V.J.Hodge,J.Austin,Asurveyofoutlierdetectionmethodologies,in:ArtificialIntelligenceReview,vol.22,2004,pp.85–126.[10]F.J.Anscombe,I.Guttman,Rejectionofoutliers,in:Technometrics,vol.2,1960,pp.123–147.

[11]E.Eskin,Anomalydetectionovernoisydatausinglearnedprobabilitydistributions,in:InProceedingsofthe17thInternationalConferenceonMachine

Learning,MorganKaufmannPublishersInc.,2000,pp.255–262.

[12]M.Desforges,P.Jacob,J.Cooper,Applicationsofprobabilitydensityestimationtothedetectionofabnormalconditionsinengineering,in:InProceedingsof

theInstituteoftheMechanicalEngineers,vol.212,1998,pp.687–703.

[13]R.Ng,J.H.Clarans,Amethodforclusteringobjectsforspatialdatamining,IEEETransactionsonKnowledgeandDataEngineering14(2002)1003–1016.[14]J.S.M.Ester,H.-P.Kriegel,X.Xu,Adensity-basedalgorithmfordiscoveringclustersinlargespatialdatabaseswithnoise,DataMiningandKnowledge

Discovery2(1998)226–231.

[15]W.Wang,J.Yang,R.R.M.Sting,Sting:astatisticalinformationgridapproachtospatialdatamining,VLDB'97,Proceedingsof23rdInternationalConference

onVeryLargeDataBases,August25-29,1997,MorganKaufmann,Athens,Greece,1997.

[16]D.B.Neill,A.W.Moore,M.Sabhnani,K.Daniel,Detectionofemergingspace–timeclusters,in:Proceedingsofthe11thACMSIGKDDInternationalConference

onKnowledgeDiscoveryinDataMining(KDD'05),2005,pp.218–227.

[17]D.B.Neill,A.W.Moore,DetectionofEmergingSpace–TimeClusters:PriorWorkandNewDirections,TechnicalReport,CarnegieMellonUniversity,2004.[18]M.Kulldorff,Aspatialscanstatistic,Comm.InStat.:TheoryandMethods,1997.1481–1496.

[19]M.Kulldorff,Spatialscanstatistics:models,calculations,andapplications,in:J.Glaz,N.Balakrishnan(Eds.),ScanStatisticsandApplications,Birkhauser,1999.[20]M.Kulldorff,N.Nagarwalla,Spatialdiseaseclusters:detectionandinference,StatisticsinMedicine(1995)799–810.

[21]M.Kulldorff,W.Athas,E.Feuer,B.Miller,C.Key,Evaluatingclusteralarms:aspace–timescanstatisticandclusteralarmsinLosAlamos,AmericanJournalof

PublicHealth88(1998)1377–1380.

[22]L.Huang,M.Kulldorff,D.Gregorio,ASpatialScanStatisticforSurvivalData,InternationalBiometricsSociety,2007.109–118.[23]I.Jung,M.Kulldorff,A.Klassen,Aspatialscanstatisticforordinaldata,StatisticsinMedicine(2007)1594–1607.

[24]I.Jung,M.Kulldorff,O.Richard,Aspatialscanstatisticformultinomialdata,StatisticsinMedicine(2010)1910–1918.

[25]L.Huang,R.Tiwari,M.Kulldorff,J.Zou,E.Feuer,WeightedNormalSpatialScanStatisticforHeterogenousPopulationData,AmericanStatisticalAssociation,2009.[26]http://www.satscan.org2008.

[27]T.Tango,K.Takahashi,K.Kohriyama,Aspace–timescanstatisticfordetectingemergingoutbreaks,InternationalBiometricsSociety(2010)106–115.[28]J.H.Friedman,N.I.Fisher,Bumphuntinginhigh-dimensionaldata,StatisticsandComputing9(1999)123–143.

[29]D.B.Neill,A.W.Moore,AFastMulti-resolutionMethodforDetectionofSignificantSpatialDiseaseClusters,NIPS,2003.651–658.[30]V.Chandola,A.Banerjee,V.Kumar,Anomalydetection:asurvey,ACMComputingSurveys41(2009)1–58.[31]http://en.wikipedia.org/wiki/likelihood-ratio-test2011.

[32]R.Schoenberg,Maximumlikelihoodestimationandconservativeconfidenceintervalproceduresinreliablitygrowthanddebuggingproblems,Computational

Economics10(1997)251–266.

[33]R.E.Barlow,F.Proschan,E.M.Scheuer,MaximumLikelihoodEstimationandConservativeConfidenceIntervalProceduresinReliabilityGrowthand

DebuggingProblems,1966.

[34]R.E.Barlow,E.M.Scheuer,Reliabilitygrowthduringadevelopmenttestingprogram,Technometrics8(1966)53–60.[35]http://en.wikipedia.org2011.

LinseyXiaolinPangiscurrentlystudyingPh.D.intheSchoolofInformationTechnologiesatUniversityofSydney.Shecompletedhermaster'sdegreefromtheNationalUniversityofSingapore.Herresearchinterestsincludedatamining,outlierdetectionandparallelcomputing.

SanjayChawlaisafullprofessorintheSchoolofInformationTechnologies,UniversityofSydney.HereceivedhisPh.D.in1995fromtheUniversityofTennessee,Knoxville,USA.HisresearchworkhasappearedinleadingdataminingjournalsandconferencesincludingACMTKDD,MachineLearning,IEEETKDE,DMKD,ACMSIGKDD,IEEEICDM,SDM,andPAKDD.HeservesontheeditorialboardofDataMiningandKnowledgeDiscoveryandistheProgramCo-ChairofPAKDD2012.Hehasreceivedfourbestpaperawardsinthelastﬁveyears—mostrecentlyattheIEEEInternationalConferenceinDataMining(ICDM)2010.

L.X.Pangetal./Data&KnowledgeEngineering87(2013)357–373373

WeiLiuisaresearchfellowattheUniversityofMelbourne,Australia.HeobtainedhisPh.D.degreefromtheUniversityofSydneyin2011,intheareaofclassskewnessandadversariallearning.HisworkhasappearedinseveralconferencesandjournalsincludingKDD,SDM,ECML/PKDD,PAKDDandMachineLearningjournal.Hisresearchinterestsincludedatamining,machinelearningandbioinformatics.

YuZhengisaleadresearcherfromMicrosoftResearchAsia(MSRA).HeisanIEEEseniormemberandACMseniormember.Hisresearchinterestsincludelocation-basedservices,spatio-temporaldatamining,ubiquitouscomputing,andmobilesocialapplications.Hehaspublishedover50referredpapersasaleadingauthorathigh-qualityinternationalconferencesandjournals,suchasSIGMOD,SIGKDD,ICDE,WWW,AAAI,andIEEETKDE,wherehehasreceived3bestpaperawardsaswellas1bestpapernominee.Thesepapershavealsobeenfeaturedbytop-tierpresseslikeMITTechnologyReviewmultipletimes.Inaddition,hehasbeenservingover30prestigiousinternationalconferencesasachairoraprogramcommitteemember,includingICDE,KDD,Ubicomp,andIJCAI,etc.Sofar,hehasreceived3technicaltransferawardsfromMicrosoftand20granted/ﬁledpatents.In2008,hewasrecognizedastheMicrosoftGoldenStar.HejoinedMSRAin2006afterreceivingaPh.D.degreeinEEfromSouthwestJiaotongUniversity.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文