The Expectation-maximization binary clustering (EMbC) is a general purpose, unsupervised, multi-variate, clustering algorithm (Garriga et al. 2016), driven by two main motivations: (i) it looks for a good compromise between statistical soundness and ease and generality of use - by minimizing prior assumptions and favouring the semantic interpretation of the final clustering- and, (ii) it allows taking into account the uncertainty in the data. These two features make it specially suitable for the behavioral annotation of animal’s movement trajectories.
The method is a variant of the well sounded Expectation-Maximization Clustering (EMC) algorithm - i.e. under the assumption of an underlying Gaussian Mixture Model (GMM) describing the distribution of the data-set - but constrained to generate a binary partition of the input space. This is achieved by means of the delimiters, a set of parameters that discretize the input features into high and low values and define the binary regions of the input space. As a result, each final cluster includes a unique combination of either low or high values of the input variables. Splitting the input features into low and high values is what favours the semantic interpretation of the final clustering.
The initial assumptions implemented in the EMbC algorithm aim at minimizing biases and sensitivity to initial conditions: (i) each data point is assigned a uniform probability of belonging to each cluster, (ii) the prior mixture distribution is uniform (each cluster starts with the same number of data points), (iii) the starting partition, (i.e. initial delimiters position), is selected based on a global maximum variance criterion, thus conveying the minimum information possible.
The number of output clusters is 2m determined by the number of input features m. This number is only an upper bound as some of the clusters can vanish along the likelihood optimization process. The EMbC algorithm is intended to be used with not more than 5 or 6 input features, yielding a maximum of 32 or 64 clusters. This limitation in the number of clusters is consistent with the main motivation of the algorithm of favouring the semantic interpretation of the results.
The algorithm deals very intuitively with data reliability: the larger the uncertainty associated with a data point, the smaller the leverage of that data point in the clustering.
With respect to close related methods like EMC and Hidden Markov Models (HMM), the EMbC is specially useful when: (i) we can expect bi-modality, to some extent, in the conditional distribution of the input features or, at least, we can assume that a binary partition of the input space can provide useful information, and (ii) a first order temporal dependence assumption, a necessary condition in HMM, can not be guaranteed.
The EMbC algorithm is of general purpose and can deal with any type of data sets or time series. However, the EMbC R-package is mainly intended for the behavioral annotation of animals’ movement trajectories where an easy interpretation of the final clustering and the reliability of the data constitute two key issues, and the conditions of bi-modality and unfair temporal dependence usually hold. In particular, the temporal dependence condition is easily violated in animal’s movement trajectories because of the heterogeneity in empirical time series due to large gaps, or prefixed sampling scheduling.
Input movement trajectories are given either as a data.frame or a Move object from the move R-package. The package deals also with stacks of trajectories for population level analysis. Segmentation is based on local estimates of velocity and turning angle, eventually including a solar position covariate as a daytime indicator.
The core clustering method is complemented with a set of functions to easily visualize and analyse the output:
Also, some functions are provided to further refine the output, either by pre-processing (smoothing) the input data or by post-processing (smoothing, relabeling, merging) the output labeling.
The results obtained for different empirical datasets suggest that the EMbC algorithm behaves reasonably well for a wide range of tracking technologies, species, and ecological contexts (e.g. migration, foraging).
The EMbC package has dependencies with the following packages:
We also suggest the package rgl, 3D visualization device system (OpenGL) (Adler, Murdoch, et al. 2015), to allow for dynamic 3D scatter-plots in multivariate analyses.
For researchers who are familiar with the MoveBank framework, we include a special link with the package move, Visualizing and analizing animal track data (Kranstauber and Smolla 2015), to allow users to make use of Move objects as input trajectories.
Basically, the package consists of a hierarchy of classes:
Instances of these classes are build by means of two main constructors:
embc(), the main core of the package, implementing the EMbC algorithm itself; this constructor takes as input a matrix of data-points and returns an object of class binClst with the multivariate binary clustering of the input data;
stbc(), a specific constructor for the behavioral annotation of movement trajectories; the input to this constructor is a trajectory (given either as a data.frame, a Move object or a list of them) and returns an object of class binClstPath (or any of its child classes) with the bivariate (velocity/turn) clustering of the trajectory; eventually it can compute a trivariate clustering by including a parameter indicating a solar covariate (either height or azimuth) to be used as a daytime indicator.
The behavior of the constructors can be modified by means of different parameters (e.g. maximum number of iterations, information shown at each step, pre-smoothing of the data).
The output objects have several slots containing all information related to the binary clustering (input data, intermediate computations and output data). All slots are accessible and can be used with any R function external to the package or even modified. However, we recommend not to manually change the values in the slots in order to keep the internal consistency.
Let’s load the package;
This is the core class that implements the multivariate binary clustering algorithm. The input data-set is given as a matrix with data points given as rows and input features as columns. No more than 5 (6 at most) variables should be used in order to get a meaningful set of clusters.
Let’s use the object x2d included in the package. This object contains a set of data points generated from a bivariate GMM with four components (slot x2d@D), and a labeling indicating which component generated each data point (slot x2d@L);
We can cluster a general dataset by calling the embc() constructor and passing in the input data, in matrix form, and storing the result in an output variable (e.g. mybc);
## ... computing starting delimiters
## [1] 0 -0.0000e+00 4 400
## [1] 1 -5.4847e+00 4 304
## [1] 2 -5.3257e+00 4 14
## [1] 3 -5.1640e+00 4 12
## [1] 4 -5.0542e+00 4 10
## [1] 5 -4.9906e+00 4 7
## [1] 6 -4.9597e+00 4 9
## [1] 7 -4.9448e+00 4 5
## [1] 8 -4.9360e+00 4 5
## [1] 9 -4.9308e+00 4 6
## [1] 10 -4.9282e+00 4 3
## [1] 11 -4.9271e+00 4 3
## [1] 12 -4.9265e+00 4 3
## [1] 13 -4.9270e+00 4 2
## [1] 14 -4.9281e+00 4 4
## [1] 15 -4.9282e+00 4 1
## [1] 16 -4.9278e+00 4 4
## [1] 17 -4.9279e+00 4 3
## [1] 18 -4.9274e+00 4 4
## [1] 19 -4.9300e+00 4 6
## [1] 20 -4.9291e+00 4 3
## [1] 21 -4.9292e+00 4 0
## [1] 22 -4.9295e+00 4 0
## [1] 23 -4.9293e+00 4 2
## [1] 24 -4.9291e+00 4 0
## [1] 25 -4.9288e+00 4 2
## [1] 26 -4.9287e+00 4 1
## [1] 27 -4.9293e+00 4 0
## [1] 28 -4.9289e+00 4 2
## [1] 29 -4.9296e+00 4 0
## [1] 30 -4.9296e+00 4 0
## [1] 31 -4.9294e+00 4 0
## [1] 32 -4.9294e+00 4 0
## [1] 33 -4.9294e+00 4 0
## [1] ... Stable clustering
At each iteration, the algorithm shows the iteration number, the current likelihood value, the number of effective clusters and the number of labels that have changed with respect to the previous iteration.
mybc is an instance of class binClst. Any slot of a binClst object is accessible and can be used by (passed to) any R function. The most basic slots of a binClst object are:
## [1] "X" "U" "stdv" "m" "k" "n" "R" "P" "W" "A"
## [11] "L" "C"
The likelihood plot allows a visual assessment of the convergence of the algorithm;
# the lkhp() function allows an offset parameter;
lkhp(mybc) # left panel
lkhp(mybc, 10) # right panel
The last iterations may show some decrease in likelihood. This is due to a slight discrepancy between binary and optimal likelihood clusterings that can appear at the last steps of the algorithm, normally involving just a few data-points at the boundaries of the binary regions (note the low number of data-points changing their labels beyond iteration number 12 where the likelihood starts decreasing).
The function stts() shows the statistics of the clustering. The columns Xi.mn and Xi.sd show the mean and standard deviation of the input features. The last two columns show the marginal distribution of the clustering in absolute (number of data-points) and relative (percentage) values;
## X1.mn X1.sd X2.mn X2.sd kn kn(%)
## 1 LL 2.60 1.72 0.94 1.51 69 17.25
## 2 LH 1.78 1.76 6.19 2.30 107 26.75
## 3 HL 8.76 2.37 0.94 1.11 121 30.25
## 4 HH 7.41 1.96 8.02 2.92 103 25.75
The complete set of parameters of the Gaussian mixture is accessible through the slot mybc@P. This slot is a list of inner named-lists (M for mean and S for the covariance matrix) for each cluster. For instance, the parameters for cluster 1 (LL) are;
## $M
## [1] 2.6000747 0.9403607
##
## $S
## [,1] [,2]
## [1,] 2.963896 1.924727
## [2,] 1.924727 2.265230
The delimiters are accessible through the slot mybc@R where we have the min() and max() values that delimit each binary region;
## X1.min X2.min X1.max X2.max
## 1.LL -1.994262 -2.162228 4.699946 2.553833
## 2.LH -1.994262 2.553833 4.129379 15.386923
## 3.HL 4.699946 -2.162228 13.762928 3.323976
## 4.HH 4.129379 3.323976 13.762928 15.386923
The function sctr() makes a scatter-plot of the data-points, showing the clusters in different colours, and depicting the binary delimiters (light grey lines) to show the binary regions;
The NC in the legend stands for not classified points. Not classified points may appear only when performing the behavioral annotation of movement trajectories (explained later in this document) and correspond to outliers due to errors or gaps in the trajectory or, typically, to the last track of the trajectory.
In a supervised case, that is in case that an expert’s labeling is available, we can use this labeling to validate the results of the clustering. The expert labeling must be numerically coded and translated to the range of the number of clusters, and must be given as a numeric vector with one numeric label for each location.
We can make a visual validation of the clustering versus the expert labeling by means of the sctr() function passing in the expert labeling vector as a second parameter;
We can perform a numeric assessment of the clustering in terms of a confusion matrix by means of the function cnfm();
## cls.01 cls.02 cls.03 cls.04 mrg. Prc.
## 1.LL 57 9 2 1 0.17 0.83
## 2.LH 13 86 0 8 0.27 0.80
## 3.HL 18 1 97 5 0.30 0.80
## 4.HH 12 4 1 86 0.26 0.83
## ------ ------ ------ ------ ------ ------
## mrg 0.25 0.25 0.25 0.25 0.82 0.81
## Rcl 0.57 0.86 0.97 0.86 0.81 NaN
## Fms 0.68 0.83 0.88 0.84 NaN 0.81
The confusion matrix shows values of row precision and row F-measure, and values of column recall and column F-measure. The 3x2 subset of cells at the bottom-right show respectively: the overall accuracy, the average recall, the average precision, NaN, NaN and the Macro-F-measure.
The binClstPath is a binClst subclass intended to automatically perform the bivariate clustering of a movement trajectory, based on estimated local values of velocity and turn. It can also perform a trivariate clustering by incorporating a daytime covariate (i.e. solar height or solar azimuth).
The input data-set is a trajectory given as a data.frame with timestamps, longitudes and latitudes in columns 1:3 respectively (column headers are user free). Timestamps must be given as.POSIXct() with the specific format “%Y-%m-%d %H:%M:%S”.
As an example, the package includes an object named expth. This is a synthetically generated trajectory stored as a data.frame;
## dTm lon lat lbl
## 1 2014-11-14 16:05:57 70.25000 -49.26000 1
## 2 2014-11-14 16:07:40 70.24925 -49.25862 1
## 3 2014-11-14 16:21:34 70.22671 -49.25500 1
## 4 2014-11-14 16:36:13 70.21570 -49.25763 1
## 5 2014-11-14 17:44:05 70.12802 -49.26653 3
## 6 2014-11-14 18:02:20 70.01982 -49.25872 3
expth is a synthetically generated trajectory with expert labeling (note the column expth$lbl). Further columns of data can be included in the input data.frame as long as the first three columns respect the required format.
Tip: by including an expert labeling with a column labeled lbl all validation functions will make use of it by default.
To perform the bivariate velocity/turn clustering of this trajectory we simply call the stbc() constructor passing in the data.frame with the time/space coordinates of the trajectory and storing the output binClstPath object in a variable (e.g. mybcp);
## [1] 0 -0.0000e+00 4 600
## [1] ... Stable clustering
The output object mybcp is a binClstPath instance with the following slots;
## [1] "pth" "spn" "dst" "hdg" "bursted" "tracks"
## [7] "midPoints" "X" "U" "stdv" "m" "k"
## [13] "n" "R" "P" "W" "A" "L"
## [19] "C"
As a child class, a binClstPath object inherits and extends the set of slots of the binClst class. The basic slot differences with respect to the binClst class are:
Slots tracks, midpoints and bursted are related to the bursted visualization of the trajectory (covered later in this document) and should not be manipulated.
Because of class inheritance, all functionality described for the binClst class (e.g. likelihood plot, clustering parameters, scatter-plot, validation) holds for a binClstPath instance.
## X1.mn X1.sd X2.mn X2.sd kn kn(%)
## 1 LL 0.91 0.59 0.57 0.35 95 15.83
## 2 LH 1.07 0.61 2.25 0.56 316 52.67
## 3 HL 7.65 2.61 0.34 0.21 133 22.17
## 4 HH 9.99 2.39 1.93 0.67 55 9.17
## cls.01 cls.02 cls.03 cls.04 mrg. Prc.
## 1.LL 95 0 0 0 0.16 1
## 2.LH 0 316 0 0 0.53 1
## 3.HL 0 0 133 0 0.22 1
## 4.HH 0 0 0 55 0.09 1
## ------ ------ ------ ------ ------ ------
## mrg 0.16 0.53 0.22 0.09 1 1
## Rcl 1.00 1.00 1.00 1.00 1 NaN
## Fms 1.00 1.00 1.00 1.00 NaN 1
Nonetheless, the binClstPath class has some particular functionalities of special interest for the case of behavioral annotation of movement trajectories. These functionalities are described in the following.
The function lblp() plots the temporal series of data and the temporal profile of the behavioral labeling;
The function view() shows the annotated trajectory and a top panel with the temporal sequence of behaviors;
We can generate kml or html documents for a detailed inspection of the output by means of google-earth or the user’s system browser. The package allows two types of visualization of the annotated trajectory: a point-wise visualization (functions pkml() or pmap()) or a burst-wise visualization (functions bkml() or bmap()) (Garriga et al. 2016);
# point-wise kml doc generation;
# display=TRUE launches google-earth from within R;
pkml(bc, display=TRUE)
By default, the kml or html documents are named with a Sys.time() based name and saved in a folder embcdocs automatically generated in the user’s home directory. This can be modified by means of the corresponding parameters.
The burst-wise visualization requires the computation of burst segments and midpoints. This is computed only the first time that a burst-visualization of a trajectory is requested. In case of long trajectories, this process can take some time.
Intermediate data computed by the stbc() constructor and stored in the binClstPath object can be easily plotted with automatic formatting and labeling of axes;
# plotting time-spans, distances and heading directions;
# this is the default behavior when we just pass the binClstPath instance;
varp(mybcp)
Indeed, the function varp() is a wrapper for the R plot() function. The purpose of this function is simply to ease the visualization of intermediate variables by formatting and labeling the axes accordingly to each one.
Note: the dependency with respect to the move R-package has been dropped, and the use of the old binClstMove objects is now deprecated. Nonetheless Move objects can still be passed directly to the stbc() function.
This is intended for users having trajectories in Movebank (https://www.movebank.org/) and familiarized with the move R-package. Let’s use the leroy data in the Move R-package.
leroy is a GPS trajectory of an urban Fisher (Martes pennati) with 919 tracks, spanning from 2009-02-11 12:16:45 to 2009-03-04 09:16:59.998, with a mean time interval between tracks of 32.7 minutes. Move objects can be passed directly to the stbc() constructor;
## [1] 0 -0.0000e+00 4 919
## [1] ... Stable clustering
Daytime covariates refer to the solar position. This can be given as solar height in degrees above the horizon (night/day distinction), or by solar azimuth in degrees from north (sunrise/sunset distinction).
Including daytime covariates is the natural way of incorporating time information in the clustering of an animal’s movement trajectory, with the potential advantage of increasing the maximum number of output clusters to 23 = 8, i.e. the number of movement behaviors that can potentially be distinguish.
A trivariate clustering including a daytime covariate is done by means of the parameter scv with possible values ‘height’ or ‘azimuth’;
## [1] 0 -0.0000e+00 8 919
## [1] ... Stable clustering
The output of the stbc() constructor is still a binClstPath (the binClstMove object of previous versions is deprecated). As we included a covariate, leroybc3 corresponds now to a trivariate binary clustering and therefore its functionality presents some particularities.
Let’s see the statistics of the clustering;
## X1.mn X1.sd X2.mn X2.sd X3.mn X3.sd kn kn(%)
## 1 LLL -54.05 5.00 0.04 0.10 1.09 0.80 117 12.73
## 2 LLH -51.57 5.44 0.03 0.10 2.89 0.20 85 9.25
## 3 LHL -27.80 18.93 0.56 0.21 0.38 0.27 79 8.60
## 4 LHH -33.94 19.46 0.49 0.17 1.89 0.61 62 6.75
## 5 HLL -0.28 25.43 0.04 0.10 1.26 0.79 281 30.58
## 6 HLH 2.24 23.99 0.03 0.10 2.87 0.19 182 19.80
## 7 HHL 29.03 6.36 0.55 0.22 0.65 0.42 96 10.45
## 8 HHH 32.89 5.00 0.52 0.26 2.53 0.50 16 1.74
Features are ordered as X1:daytime, X2:velocity and X3:turn. Note that highs and lows for daytime (the solar height in degrees above the horizon) do not necessarily correspond to daytime or night-time clusters (note the negative mean for X1 in HXX clusters). This is so because almost all of the activity of this animal happens during the night and it is more likely to discern different behaviors along night-time.
By default, the sctr() function of a trivariate clustering depicts a double scatter-plot corresponding to low and high values of the covariate respectively. This can be changed by means of the parameter showVars=c().
# showVars=c(1,2,3) is the default option and it is only shown for illustrative purposes
# by default the background colour is set to light-grey to enhance visibility
# the "bg"" parameter allows changing this default behavior
If the R-package rgl is installed one can use the function sct3() to get a dynamic 3D (i.e. can be zoomed and rotated) plot, more useful for a visual understanding of the clusters.
sct3(leroybc3, showClst=c(5, 6, 7, 8))
# with showClst=c() we can restrict the plot to a particular subset of clusters
The sct3() function is defined for and inherited from the binClst class, and therefore intended for a general multivariate clustering. If the number of input features is greater than 3 and showVars=c() is not specified, the first three variables are used by default.
When clustering a time series the EMbC disregards the temporal information. As a result, the output labeling may reveal small (possibly irrelevant) changes in behavior framed in a broader temporal context (e.g. a long-term predominant behavioral mode).
The package includes two possibilities to account for the temporal information in the time series and smooth out the fine grain locality of the output labeling.
The smth() function applies a post-smoothing procedure (Garriga et al. 2016) to the output labeling and returns a smoothed copy of the input instance;
# dlta is the maximum likelihood difference to accept a relabeling
# dlta=1 (accept all changes) is the default behavior
postbc3 <- smth(leroybc3, dlta=0.9)
Alternatively, a pre-smoothing of the input data is also possible by means of the parameter smth of the stbc() constructor.
# smth sets the smoothing time window length in hours
prebc3 <- stbc(leroy, smth=1, scv='height', info=-1)
## [1] 0 -0.0000e+00 8 919
## [1] ... break: optimization cycle
The lblp() function allows comparing two output labelings, adding a bottom line indicating the differences;
Note that by pre-smoothing the input data, cluster 5 (HLL) has been merged into cluster 6 (HLH) and we get a final clustering with only 7 different behaviors. When merging occurs, the semantics of the final labeling is somewhat misleading because the final labeling is only a result of how the algorithm evolved until reaching the merging point. In any case, the label should be read as HLX, that is, by taking into account that the last feature (in this case the turn) is meaningless given the values of the rest (i.e turn can be either H or L given H values of daytime and L values of velocity).
Using the pkml() function we can visualize which locations correspond to cluster HLH;
By combining the spatially clustered distribution of locations HLX (Figure ) with the semantics of the cluster (high daytime, low velocity), we could tell that these locations are most probably indicating the nests.
Obviously, the package does not deal with labels like HLX. However, one can change labels as desired (even to manually force the merging of two clusters). In this case, we would probably feel more comfortable by relabeling the cluster HLH (cluster number 6) as HLL (cluster number 5) to suggest a more clear semantics of resting behavior;
Note that the function rlbl() does not return a relabeled copy of the input instance, instead it relabels the self instance. Nonetheless, the parameters of the clustering remain unchanged. The relabeling is effective only for visualization purposes and can be easily reversed by means of the parameter “reset”.
The chkp() function is similar to lblp() but plots the labeling profile versus a control variable (e.g. environmental information). The control variable must be given as a numeric vector that is depicted as coloured background bars (with specific parameters to control the colouring and legend labels);
The binClstStck is an extension (not a child class) of the binClstPath class particularly designed to work with multiple trajectories. This is intended for population level analysis from trajectories of several individuals, or period level analysis by splitting long trajectories of several years.
To illustrate this let’s figure out two trajectories from our example path, simulating two different individuals;
tmp <- runif(nrow(expth))
# simulated trajectory of individual 1
expth1 <- expth[which(tmp<=0.5), ]
# simulated trajectory of individual 2
expth2 <- expth[which(tmp>=0.5), ]
To perform the clustering of a stack of trajectories we pass the individual trajectories to the stbc() constructor as a list (either of data.frame trajectories, Move objects, or a mixture of them);
# we can combine data.fame trajectories and move objects
# only for illustrative purposes !!!
mystck <- stbc(list(expth1, expth2, leroy), info=-1)
## [1] 0 -0.0000e+00 4 1519
## [1] ... Stable clustering
In this case, the stbc() constructor returns an instance of the binClstStck class. In general, all the functionality described for a binClst class will work for a binClstStck instance;
## X1.mn X1.sd X2.mn X2.sd kn kn(%)
## 1 LL 0.34 0.34 0.79 0.76 765 50.36
## 2 LH 0.04 0.10 2.69 0.43 423 27.85
## 3 HL 6.71 3.34 0.35 0.23 144 9.48
## 4 HH 3.91 3.71 2.05 0.72 184 12.11
The exception is the cnfm() function. This function will work only if expert’s labeling is supplied for all trajectories in the stack (in our example, leroy does not have expert’s labeling);
## Error: no reference labels for obj
It is worth noting that a binClstStck instance is not a binary clustering object itself. Instead, it is an object with two slots:
## [1] "bCS" "bC"
## [1] "binClst"
## attr(,"package")
## [1] "EMbC"
## [1] "list"
Each element in mystck@bCS is a binClstPath instance corresponding to each individual path given in the input data list;
## [[1]]
## [1] "binClstPath"
## attr(,"package")
## [1] "EMbC"
##
## [[2]]
## [1] "binClstPath"
## attr(,"package")
## [1] "EMbC"
##
## [[3]]
## [1] "binClstPath"
## attr(,"package")
## [1] "EMbC"
It is important to keep this in mind when applying the above functions to either the population (mystck@bC, a binClst instance) or the individual (mystck@bCS[[i]], a binClstPath instance) levels.
For ease of use, the function slct() allows selecting an individual’s clustering out of the population level;
As usual, it is not necessary to instantiate each individual;
We can use all the functionality of a binClstPath object that allows comparisons (e.g. sctr(), lblp(), cnfm()) to make numeric assessments or visualizations of diferences among individuals or among individuals and population:
## [1] 0 -0.0000e+00 4 303
## [1] ... Stable clustering
## cls.01 cls.02 cls.03 cls.04 mrg. Prc.
## 1.LL 32 0 15 0 0.16 0.68
## 2.LH 80 20 0 50 0.50 0.13
## 3.HL 0 0 61 16 0.25 0.79
## 4.HH 0 0 0 28 0.09 1.00
## ------ ------ ------ ------ ------ ------
## mrg 0.37 0.07 0.25 0.31 0.47 0.65
## Rcl 0.29 1.00 0.80 0.30 0.60 NaN
## Fms 0.41 0.23 0.79 0.46 NaN 0.47
# stbc(expth1, info=-1) is the individual level clustering corresponding to individual 1;
# slct(mystck, 1) is the population level clustering corresponding to individual 1;
mnormt
: The Multivariate Normal and t Distributions (Version
1.5-3). http://azzalini.stat.unipd.it/SW/Pkg-mnormt.