Title: | Big Data Mapping |
---|---|
Description: | Unsupervised clustering protocol for large scale structured data, based on a low dimensional representation of the data. Dimensionality reduction is performed using a parallelized implementation of the t-Stochastic Neighboring Embedding algorithm (Garriga J. and Bartumeus F. (2018), <arXiv:1812.09869>). |
Authors: | Joan Garriga [aut, cre], Frederic Bartumeus [aut] |
Maintainer: | Joan Garriga <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 4.6.2 |
Built: | 2025-03-07 05:20:14 UTC |
Source: | https://github.com/jgarriga65/bigmap |
Clustering statistics box-plot.
bdm.boxp(data, bdm, byVars = F, merged = T, clusters = NULL, layer = 1)
data |
A matrix of data to be plotted (either the input data matrix or any covariate matrix as well). |
bdm |
A bdm instance as generated by |
byVars |
A logical value. By default ( |
merged |
A logical value. If TRUE (default value) and the |
clusters |
A vector with a subset of cluster ids. (Default value is |
layer |
The number of a layer (1 by default). |
If the number of clusters is large, only the first 25 clusters will be plotted. Note that the WTT algorithm numbers the clusters based on density value at the peak cell of the cluster. Thus, the numbering of the clusters is highly correlated with their relevance in terms of partial density. Therefore, in case of more than 25 clusters, the most relevant should always be included in the plot.
None.
bdm.example()
bdm.boxp(ex$map, data = ex$data[, 1:4])
bdm.boxp(ex$map, data = ex$data[, 1:4], byVars = TRUE)
ptSNE cost & size plot.
bdm.cost(bdm, x.lim = NULL)
bdm |
A bdm instance as generated by |
x.lim |
A vector with upper and lower bounds to limit the number of epochs in the x-axis (NULL by default). |
None.
bdm.example()
bdm.cost(ex$map)
Computes the class density maps of a set of classes on the embedding grid. This function returns a fuzzy mapping of the set of classes on the grid cells. The classes can be any set of classes of interest and must be given as a vector of point-wise discrete labels (either numeric, string or factor).
bdm.dMap(bdm, data = NULL, threads = 2, mpi.cl = NULL, layer = 1)
bdm |
A bdm instance as generated by |
data |
A vector of discrete covariates or class labels. The covariate values can be of any factorizable type. By default ( |
threads |
Number of parallel threads (according to data size and hardware resources, |
mpi.cl |
MPI (inter-node parallelization) cluster as generated by |
layer |
The number of the t-SNE layer (1 by default). |
bdm.dMap() computes the joint distribution P(V = v_i, C = c_j), where V is the discrete covariate and the c_j are the grid cells of the paKDE raster. That is, this function recomputes the paKDE but keeps track of the covariate (or class) label of each data-point. This results in a fuzzy distribution of the covariate (class) at each cell.
Usually, figuring out the joint distribution entails an intensive computation. Thus bdm.dMap() performs the computation and stores the result in a dedicated element named $dMap. Afterwards the class density maps can be visualized with the bdm.dMap.plot() function.
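As a rough illustration of the quantity involved, the following base R sketch builds a count-based joint distribution of a class label and a grid cell. It is a simplification under stated assumptions: bdm.dMap() derives the mapping from the paKDE kernels, so each data-point contributes to many cells, whereas here each point is assigned to exactly one cell. The names lbl, cell and joint are illustrative and not part of the package.

```r
# Hypothetical toy data: 8 points, each with a class label and the id of the
# grid cell it falls in (hard assignment, unlike the kernel-based paKDE).
lbl  <- factor(c("a", "a", "a", "b", "b", "a", "b", "a"))
cell <- factor(c(1, 1, 2, 2, 3, 3, 3, 1))

# Joint distribution P(V = v_i, C = c_j): class-by-cell counts over the total.
joint <- table(V = lbl, C = cell) / length(lbl)

# Marginals: rows sum to the class frequencies, columns to the cell masses.
p.class <- rowSums(joint)
p.cell  <- colSums(joint)
print(joint)
```

The row marginal p.class is where any imbalance between classes shows up, which is what the conditional view used by bdm.dMap.plot() removes.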
A copy of the input bdm instance with element $dMap, a matrix with a soft clustering of the grid cells.
# --- load example dataset
bdm.example()
## Not run:
m <- bdm.dMap(ex$map, threads = 4)
## End(Not run)
Class density maps plot.
bdm.dMap.plot( bdm, classes = NULL, join = FALSE, class.pltt = NULL, pakde.pltt = NULL, pakde.lvls = 16, wtt.lwd = 1, plot.peaks = T, labels.cex = 1, layer = 1 )
bdm |
A bdm instance as generated by |
classes |
A vector with a subset of class names or covariate values. Default value is |
join |
Logical value. If FALSE (default value), class mapping is based on the class conditional distributions. If TRUE, class mapping is based on the joint distribution of all classes. |
class.pltt |
A colour palette to show class labels in the hard mapping. By default ( |
pakde.pltt |
A palette of colours to indicate the levels of the class density maps. The length of the colour palette should be at least the number of levels specified in pakde.lvls. |
pakde.lvls |
The number of levels of the heat-map when plotting class density maps (16 by default). |
wtt.lwd |
The width of the watertrack lines (as set in |
plot.peaks |
Logical value (TRUE by default). If set to TRUE and the up-stream step |
labels.cex |
If plot.peaks is TRUE, the size of the labels of the clusters (as set in |
layer |
The number of the layer from which the class density maps are computed (1 by default). |
bdm.dMap.plot() yields a multi-plot layout where the first plot shows the dominating value of the covariate (or dominating class) in each cell, and the rest of the plots show the density map of each covariate value (or class).
The joint distribution is prone to the bias in the marginal distribution of the covariate. Therefore, the joint distribution P(V = v_i, C = c_j) is transformed, by default, into a conditional distribution P(C = c_j | V = v_i) (where the c_j are the grid cells of the embedding and V is the covariate (or class)). Thus, the first plot shows a hard classification of grid cells (cells are coloured based on the dominating value of the covariate (or dominating class), i.e. the v_i for which P(C = c_j | V = v_i) is maximum), and the rest of the plots show the conditional distributions P(C = c_j | V = v_i). This makes the plots of the different classes not directly comparable, but the dominant areas of each class can be more easily identified.
However, the same plots can be depicted based on the joint distribution by setting join = TRUE. This makes sense when the bias in the covariate values (or classes) is not significant. In this case the hard clustering shows the real dominance of each covariate value (or class) over the embedding area and the density maps are comparable to one another (although, individually, they are not real density functions as they do not add up to one).
The multi-plot layout can be limited to a subset of the values of the covariate (or subset of classes) specified in parameter classes.
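The difference between the two views can be sketched in base R on a small, made-up joint distribution. This is only an illustration of the arithmetic described above, not the package's plotting code; the matrix values and the names joint, cond, hard.cond and hard.joint are assumptions chosen for the example.

```r
# Hypothetical joint distribution P(V, C): 2 classes (rows) x 4 grid cells
# (columns). Class "a" is nine times more frequent than class "b".
joint <- matrix(c(0.45, 0.27, 0.135, 0.045,
                  0.00, 0.01, 0.030, 0.060),
                nrow = 2, byrow = TRUE,
                dimnames = list(V = c("a", "b"), C = paste0("c", 1:4)))

# Conditional distribution P(C | V): divide each row by its class marginal,
# so every class density map adds up to one.
cond <- joint / rowSums(joint)

# Hard classification of cells: the dominating class in each column.
hard.cond  <- rownames(cond)[apply(cond, 2, which.max)]
hard.joint <- rownames(joint)[apply(joint, 2, which.max)]
print(rbind(conditional = hard.cond, joint = hard.joint))
```

In this toy case the third cell is assigned to class "b" under the conditional view but to class "a" under the joint view, because the joint view also reflects how much more frequent class "a" is.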
None.
# --- load example dataset
bdm.example()
## Not run:
m <- bdm.dMap(ex$map, threads = 4)
bdm.dMap.plot(m)
## End(Not run)
Loads a mapping example.
bdm.example()
The object ex is a list with elements: ex$data, a matrix with raw data; ex$labels, a vector of datapoint labels; ex$map, a bdm data mapping instance. A bdm instance is the basic object of the mapping protocol, i.e. a list to which new elements are added at each step of the mapping protocol.
This example is based on a small synthetic dataset with n = 5000 observations drawn from a 4-variate Gaussian Mixture Model (GMM) with 16 Gaussian components with random means and variances.
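For readers who want a comparable dataset of their own, a base R sketch of such a mixture follows. It is not the generator used to build ex: the seed, the ranges of the means and standard deviations, the diagonal covariances and the equal component weights are all assumptions made for this illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 5000; d <- 4; k <- 16        # sizes quoted in the description above

# Random component means and (diagonal) standard deviations. The ranges are
# arbitrary choices for this sketch.
mu    <- matrix(runif(k * d, min = -10, max = 10), nrow = k)
sigma <- matrix(runif(k * d, min = 0.5, max = 2), nrow = k)

# Draw a component for each observation, then sample from that Gaussian.
lbls <- sample(k, n, replace = TRUE)
X <- mu[lbls, ] + sigma[lbls, ] * matrix(rnorm(n * d), nrow = n)

str(list(data = X, labels = lbls))
```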
An example dataset named ex
# --- load example dataset
bdm.example()
str(ex)
Pair-wise distance correlation between HD and LD neighborhoods.
bdm.hlCorr(data, bdm, zSampleSize = 1000, threads = 4, mpi.cl = NULL)
data |
Input data (a matrix, a big.matrix or a .csv file name). |
bdm |
A bdm instance as generated by |
zSampleSize |
Number of data points to check by thread. (Default value is |
threads |
The number of parallel threads (according to data size and hardware resources, |
mpi.cl |
An MPI (inter-node parallelization) cluster as generated by |
A copy of the input bdm instance with new element bdm$knP.
# --- load example dataset
## Not run:
bdm.example()
m <- bdm.hlCorr(ex$data[, 1:4], ex$map, threads = 4)
## End(Not run)
Computes the precision parameters for the given perplexity (i.e. the local bandwidths for the input affinity kernels) and returns them as a bdm data mapping instance. A bdm data mapping instance is the starting object of the mapping protocol, a list to which new elements are added at each step of the mapping protocol.
bdm.init( data, is.distance = F, is.sparse = F, ppx = 100, mpi.cl = NULL, threads = 4, labels = NULL )
data |
A data.frame or matrix with raw input-data. The dataset must not have duplicated rows. |
is.distance |
Default value is is.distance = FALSE. TRUE indicates that raw data is a distance matrix. |
is.sparse |
Default value is is.sparse = FALSE. TRUE indicates that the raw data is a sparse matrix. |
ppx |
The value of perplexity to compute similarities. |
mpi.cl |
An MPI (inter-node parallelization) cluster as returned by bdm.mpi.start(). By default mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is used. |
threads |
Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory. Default value is threads = 4). |
labels |
If available, a length nrow(data) vector of class labels. Label values are factorized as as.numeric(as.factor(labels)). |
A bdm data mapping instance. This bdm instance is the starting object of the mapping protocol: a list to which new elements are added at each step of the mapping protocol.
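To make the link between perplexity and precision concrete, here is a minimal base R sketch for a single data-point, assuming the usual t-SNE definition (Gaussian affinities, perplexity equal to the exponential of the Shannon entropy). It is not the package's implementation, which is parallelized and may differ in detail; perplexity() and find.beta() are hypothetical helper names.

```r
# Squared distances from one point to its neighbours (toy values).
set.seed(1)
d2 <- sort(rexp(200))

# Perplexity of the Gaussian affinity kernel with precision beta:
# p_j is proportional to exp(-beta * d2_j), perplexity = exp(Shannon entropy).
perplexity <- function(beta, d2) {
  p <- exp(-beta * (d2 - min(d2)))   # shift for numerical stability
  p <- p / sum(p)
  exp(-sum(p[p > 0] * log(p[p > 0])))
}

# Perplexity decreases monotonically as beta grows, so a root finder on
# log(beta) recovers the precision that matches the target perplexity.
find.beta <- function(d2, ppx) {
  f <- function(lb) perplexity(exp(lb), d2) - ppx
  exp(uniroot(f, interval = c(-20, 20), tol = 1e-10)$root)
}

beta <- find.beta(d2, ppx = 50)
perplexity(beta, d2)
```

A larger ppx yields a smaller beta, i.e. a wider kernel that spreads the affinity over more neighbours.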
# --- load example dataset
bdm.example()
## Not run:
m <- bdm.init(ex$data, ppx = 250, labels = ex$labels)
## End(Not run)
A measure of matching between HD and LD neighborhoods ('Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure', Lee et al., 2015).
bdm.knp(data, bdm, k.max = NULL, sampling = 0.9, threads = 4, mpi.cl = NULL)
data |
Input data (a matrix, a big.matrix or a .csv file name). |
bdm |
A bdm instance as generated by |
k.max |
Maximum neighborhood size to check. (By default |
sampling |
Fraction of data points to check for each neighborhood size. (Default value is |
threads |
The number of parallel threads (according to data size and hardware resources, |
mpi.cl |
An MPI (inter-node parallelization) cluster as generated by |
A copy of the input bdm instance with new element bdm$knP.
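The idea behind the measure can be sketched with a brute-force base R function for a single neighborhood size k. This is a plain illustration assuming Euclidean distances and no sampling; it is not the parallel, sampled computation performed by bdm.knp(), and knp() is a hypothetical helper name.

```r
# k-ary neighbourhood preservation for a single k: the average fraction of
# each point's k nearest HD neighbours that are also among its k nearest
# LD neighbours (1 = perfectly preserved neighbourhoods).
knp <- function(X, Y, k) {
  dX <- as.matrix(dist(X))
  dY <- as.matrix(dist(Y))
  overlap <- vapply(seq_len(nrow(X)), function(i) {
    nnX <- order(dX[i, ])[2:(k + 1)]   # drop position 1 (the point itself)
    nnY <- order(dY[i, ])[2:(k + 1)]
    length(intersect(nnX, nnY)) / k
  }, numeric(1))
  mean(overlap)
}

set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)   # toy "high dimensional" data
Y <- X[, 1:2]                           # crude 2D projection as "embedding"

knp(X, X, k = 10)   # identical spaces preserve every neighbourhood
knp(X, Y, k = 10)   # the projection preserves only part of them
```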
# --- load example dataset
bdm.example()
## Not run:
# --- compute the kNP
m <- bdm.knp(ex$data, ex$map, threads = 4)
# --- plot the kNP
bdm.knp.plot(m)
## End(Not run)
Log and linear plots of the k-ary Neighborhood Preservation
bdm.knp.plot(bdm, ppxfrmt = 0)
bdm |
A bdm data mapping instance, or a list of them to make a comparative plot. |
ppxfrmt |
Format of ppx in the legend. If |
# --- load example dataset
bdm.example()
## Not run:
# --- compute the kNP
m <- bdm.knp(ex$data, ex$map, threads = 4)
# --- plot the kNP
bdm.knp.plot(m, ppxfrmt = 0)
## End(Not run)
Given that clusters are computed at grid-cell level, this function returns the clustering label for each data-point.
bdm.labels(bdm, merged = F, layer = 1)
bdm |
A bdm data mapping instance. |
merged |
Default value is merged = FALSE. If merged = TRUE and the clustering has been merged, the labels are the ids of the clusters after merging. If merged = FALSE or the clustering has not been merged, the labels indicate the ids of the top-level clustering. |
layer |
The ptSNE output layer. Default value is layer = 1. |
A vector of data-point clustering labels.
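The lookup from grid cells to data-points can be pictured with the following base R sketch. It assumes a regular grid over the unit square and a made-up matrix of cell cluster ids; the actual grid geometry and storage inside a bdm instance are not shown here, and all names are illustrative.

```r
# Toy 2D embedding and a g x g grid of cluster ids (here g = 4, with the
# two left columns of cells in cluster 1 and the two right ones in cluster 2).
set.seed(1)
Y <- cbind(runif(10, 0, 1), runif(10, 0, 1))
g <- 4
cell.cluster <- matrix(rep(c(1L, 1L, 2L, 2L), each = g), nrow = g)

# Locate the cell of each point along both axes, then read its cluster id.
breaks <- seq(0, 1, length.out = g + 1)
ix <- findInterval(Y[, 1], breaks, rightmost.closed = TRUE)
iy <- findInterval(Y[, 2], breaks, rightmost.closed = TRUE)
point.labels <- cell.cluster[cbind(iy, ix)]
print(point.labels)
```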
bdm.example()
m.labels <- bdm.labels(ex$map)
Performs a recursive merging of clusters based on minimum loss of signal-to-noise-ratio (S2NR) until reaching the desired number of clusters. The S2NR is the explained/unexplained variance ratio measured in the high dimensional space based on the given low dimensional clustering.
bdm.merge.s2nr( data, bdm, k = 10, plot.merge = T, ret.merge = F, info = T, layer = 1, ... )
data |
Input data (a matrix, a big.matrix or a .csv file name). |
bdm |
A bdm instance as generated by |
k |
The number of desired clusters. The clustering will be recursively merged until reaching this number of clusters (default value is |
plot.merge |
Logical value. If TRUE, the merged clustering is plotted (default value is |
ret.merge |
Logical value. If TRUE, the function returns a copy of the input bdm instance with the merged clustering attached as bdm$merge (default value is |
info |
Logical value. If TRUE, all merging steps are shown (default value is |
layer |
The bdm$ptsne layer to be used (default value is |
... |
If plot.merge is TRUE, you can set the |
See details in bdm.optk.s2nr().
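A minimal base R sketch of the S2NR, read here as the between-cluster over within-cluster sum of squares, shows why some merges cost more than others. This reading and the helper name s2nr() are assumptions for illustration; the package may normalize the ratio differently.

```r
# S2NR of a clustering: between-cluster (explained) sum of squares over
# within-cluster (unexplained) sum of squares, measured on the raw data.
s2nr <- function(X, labels) {
  total <- sum(scale(X, scale = FALSE)^2)
  within <- sum(vapply(split(seq_len(nrow(X)), labels), function(idx) {
    sum(scale(X[idx, , drop = FALSE], scale = FALSE)^2)
  }, numeric(1)))
  (total - within) / within
}

set.seed(1)
# Three toy clusters in 4 dimensions: 1 and 2 are close, 3 is far away.
centres <- rbind(c(0, 0, 0, 0), c(1, 0, 0, 0), c(10, 10, 10, 10))
labels <- rep(1:3, each = 50)
X <- centres[labels, ] + matrix(rnorm(150 * 4, sd = 0.3), ncol = 4)

s2nr.before <- s2nr(X, labels)
# Merging the two nearby clusters (1 and 2) loses comparatively little S2NR,
s2nr.near <- s2nr(X, ifelse(labels == 2, 1, labels))
# whereas merging two distant clusters (1 and 3) loses much more.
s2nr.far <- s2nr(X, ifelse(labels == 3, 1, labels))
print(c(before = s2nr.before, near = s2nr.near, far = s2nr.far))
```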
None if ret.merge = FALSE. Else, a copy of the input bdm instance with new element bdm$merge.
bdm.example()
m.labels <- bdm.labels(ex$map)
Initialize parallel computing environment.
bdm.mpi.start(threads)
threads |
The number of parallel threads (in principle only limited by hardware resources, |
cl A cluster instance (as created by the snow::makeCluster() function).
Stops MPI parallel computing environment.
bdm.mpi.stop(cl)
cl |
A cluster instance (as created by the bdm.mpi.start() function). |
Starts the multi-core t-SNE (mtSNE) algorithm.
bdm.mtsne( data, is.distance = F, is.sparse = F, ppx = 100, theta = 0.5, iters = 250, mpi.cl = NULL, threads = 4, infoRate = 25 )
data |
A data.frame or matrix with raw input-data. The dataset must not have duplicated rows. |
is.distance |
Default value is is.distance = FALSE. TRUE indicates that raw data is a distance matrix. |
is.sparse |
Default value is is.sparse = FALSE. TRUE indicates that the raw data is a sparse matrix. |
ppx |
The value of perplexity to compute similarities. |
theta |
Accuracy/speed trade-off factor, a value between 0.33 and 0.8. Default value is theta = 0.5. If theta < 0.33 the algorithm uses the exact computation of the gradient. The closer this value is to 1, the faster the computation and the coarser the approximation of the gradient. |
iters |
Number of iters/epoch. Default value is iters = 250. |
mpi.cl |
An MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). By default mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is generated. |
threads |
Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4. |
infoRate |
Number of epochs to show output information. Default value is infoRate = 25. |
A bdm data mapping instance.
# --- load example dataset
bdm.example()
## Not run:
# --- perform mtSNE
m <- bdm.mtsne(ex$data, ppx = 250, iters = 250, threads = 4)
# --- plot the Cost function
bdm.cost(m)
# --- plot mtSNE output (use bdm.ptsne.plot() function)
bdm.ptsne.plot(m)
## End(Not run)
The function bdm.optk.s2nr() computes the S2NR that results from recursively merging clusters and, by default, makes a plot of these values. For large datasets this computation can take a while, so the result can be saved by setting ret.optk = TRUE. If the result is saved, it can be plotted again at any time using this function.
bdm.optk.plot(bdm)
bdm |
A bdm instance as generated by |
None.
bdm.example()
m <- bdm.optk.s2nr(ex$data, ex$map, ret.optk = TRUE)
bdm.optk.plot(m)
Performs a recursive merging of clusters based on minimum loss of signal-to-noise-ratio (S2NR). The S2NR is the explained/unexplained variance ratio measured in the high dimensional space based on the given low dimensional clustering. Merging is applied recursively until reaching a configuration of only 2 clusters and the S2NR is measured at each step.
bdm.optk.s2nr(data, bdm, info = T, plot.optk = T, ret.optk = F, layer = 1)
data |
Input data (a matrix, a big.matrix or a .csv file name). |
bdm |
A clustered bdm instance (i.e. all up-stream steps performed: |
info |
Logical value. If TRUE, all merging steps are shown (default value is |
plot.optk |
Logical value. If TRUE, this function plots the heuristic measure versus the number of clusters (default value is |
ret.optk |
Logical value. For large datasets this computation can take a while and it might be interesting to save it. If TRUE, the function returns a copy of the bdm instance with the values of S2NR attached as bdm$optk (default value is |
layer |
The bdm$ptsne layer to be used (default value is |
The underlying idea is that neighbouring clusters in the embedding correspond to close clusters in the high dimensional space, i.e. this merging heuristic is based on the spatial distribution of clusters. For each cluster (child cluster) we choose the neighbouring cluster with the steepest gradient along their common border (father cluster). Thus, we get a set of pairs of clusters (child/father) to be potentially merged. Given this set of candidates, the merging is performed recursively choosing, at each step, the pair of child/father clusters that results in a minimum loss of S2NR.
Typically some clusters dominate over all of their neighbouring clusters. These clusters have no father. Thus, once all possible mergings have been performed we reach a blocked state where only the dominant clusters remain. This situation identifies a hierarchy level in the clustering. When this situation is reached, the algorithm starts a new merging round, identifying the child/father relations at that level of the hierarchy. The process stops when only two clusters remain.
Usually, the clustering hierarchy is clearly depicted by singular points in the S2NR function. This is a hint that the low dimensional clustering configuration is an image of a hierarchical configuration in the high dimensional space. See bdm.optk.plot().
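The recursive part of the heuristic can be sketched in base R as a greedy loop that records the S2NR after each merge. This is a deliberately simplified illustration: every pair of clusters is treated as a merge candidate, whereas the procedure described above only considers child/father pairs of neighbouring clusters and proceeds in hierarchy rounds. The S2NR is read here as between-cluster over within-cluster sum of squares, and s2nr() and merge.path() are hypothetical helper names.

```r
# Explained/unexplained variance ratio of a clustering on the raw data.
s2nr <- function(X, labels) {
  total <- sum(scale(X, scale = FALSE)^2)
  within <- sum(vapply(split(seq_len(nrow(X)), labels), function(idx) {
    sum(scale(X[idx, , drop = FALSE], scale = FALSE)^2)
  }, numeric(1)))
  (total - within) / within
}

# Greedy merging down to 2 clusters: at each step try every pair, keep the
# merge with the highest resulting S2NR (minimum loss) and record it.
merge.path <- function(X, labels) {
  k <- length(unique(labels))
  out <- data.frame(k = k, s2nr = s2nr(X, labels))
  while (k > 2) {
    pairs <- combn(sort(unique(labels)), 2)
    scores <- apply(pairs, 2, function(p) {
      s2nr(X, replace(labels, labels == p[2], p[1]))
    })
    best <- pairs[, which.max(scores)]
    labels[labels == best[2]] <- best[1]
    k <- k - 1
    out <- rbind(out, data.frame(k = k, s2nr = max(scores)))
  }
  out
}

set.seed(1)
# Four toy clusters forming two well separated groups of two.
centres <- rbind(c(0, 0, 0, 0), c(1, 0, 0, 0), c(10, 10, 0, 0), c(10, 11, 0, 0))
labels <- rep(1:4, each = 40)
X <- centres[labels, ] + matrix(rnorm(160 * 4, sd = 0.3), ncol = 4)
path <- merge.path(X, labels)
print(path)
```

Plotting path$s2nr against path$k gives a curve of the same kind as the one drawn by bdm.optk.plot(), where abrupt drops hint at hierarchy levels.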
None if ret.optk = FALSE. Else, a copy of the input bdm instance with new element bdm$optk (a matrix).
# --- load mapped dataset
bdm.example()
# --- compute the S2NR for each number of clusters and plot it
bdm.optk.s2nr(ex$map, data = ex$data, plot.optk = TRUE, ret.optk = FALSE)
Starts the paKDE algorithm (second step of the mapping protocol).
bdm.pakde( bdm, ppx = 100, g = 200, g.exp = 3, mpi.cl = NULL, threads = 2, layer = 1 )
bdm |
A bdm data mapping instance. |
ppx |
The value of perplexity to compute similarities in the low-dimensional embedding. Default value is ppx = 100. |
g |
The resolution of the density space grid ( |
g.exp |
A numeric factor to avoid border effects. The grid limits will be expanded so as to enclose the density of the kernel of the most extreme embedded datapoints up to g.exp times |
mpi.cl |
An MPI (inter-node parallelization) cluster as returned by bdm.mpi.start(). Default value is mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is automatically generated. |
threads |
Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 2. |
layer |
The ptSNE output layer. Default value is layer = 1. |
When computing the paKDE, the embedding area is discretized as a grid of g*g cells. In order to avoid border effects, the limits of the grid are expanded by default so as to enclose at least 0.9986 of the cumulative distribution function (cdf) of the kernels of the most extreme mapped points in each direction.
The presence of outliers in the embedding can lead to undesired expansion of the grid limits. We can overcome this using lower values of g.exp. By setting g.exp = 0 the grid limits will be equal to the range of the embedding.
The values g.exp = c(1, 2, 3, 4, 5, 6) enclose cdf values of 0.8413, 0.9772, 0.9986, 0.99996, 0.99999, 1.0 respectively.
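The figures above are consistent with reading g.exp as a number of standard deviations of a Gaussian kernel, which can be checked in base R with pnorm(). The sketch below also shows how grid limits would widen under that reading; the bandwidth sigma, the toy coordinates y and the helper grid.limits() are assumptions for illustration, not package internals.

```r
# Standard normal cdf at 1..6 standard deviations.
g.exp <- 1:6
cdf <- pnorm(g.exp)
print(signif(cdf, 6))

# Expanding grid limits: the range of a toy 1D embedding widened by
# g.exp times a hypothetical kernel bandwidth sigma on each side.
y <- c(-12.3, -4.0, 0.5, 7.8, 15.1)
sigma <- 1.5
grid.limits <- function(y, sigma, g.exp) range(y) + c(-1, 1) * g.exp * sigma

grid.limits(y, sigma, g.exp = 3)
grid.limits(y, sigma, g.exp = 0)   # equals range(y)
```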
A copy of the input bdm instance with new element bdm$pakde (paKDE output). bdm$pakde[[layer]]$layer = 'NC' stands for not computed layers.
# --- load mapped dataset
bdm.example()
# --- run paKDE
## Not run:
m <- bdm.pakde(ex$map, ppx = 200, g = 200, g.exp = 3, threads = 4)
# --- plot paKDE output
bdm.pakde.plot(m)
## End(Not run)
Plot paKDE (density landscape)
bdm.pakde.plot(bdm, pakde.pltt = NULL, pakde.lvls = 16, layer = 1)
bdm |
A bdm instance as generated by |
pakde.pltt |
A colour palette to show levels in the paKDE plot. By default ( |
pakde.lvls |
The number of levels of the density heat-map (16 by default). |
layer |
The bdm$ptsne layer to be used (default value is |
None.
bdm.example()
m <- bdm.pakde.plot(ex$map)
Precision map (quantile map of betas)
bdm.pMap( bdm, pMap.levels = 8, pMap.cex = 0.1, pMap.bg = "#000000", colorbar = T )
bdm |
A bdm instance as generated by |
pMap.levels |
The number of levels of the quantile-map (8 by default). |
pMap.cex |
The size of the data-points (as in |
pMap.bg |
The background colour of the qMap plot. Default value is |
colorbar |
A logical value (TRUE by default). FALSE hides the side colorbar. |
None.
bdm.example()
bdm.pMap(ex$map)
Starts the parallelized t-SNE algorithm (pt-SNE). This is the first step of the mapping protocol.
bdm.ptsne( data, bdm, theta = 0.5, Y.init = NULL, mpi.cl = NULL, threads = 4, layers = 2, info = 0 )
data |
Input data (a matrix, a big.matrix or a .csv file name). |
bdm |
A bdm data mapping instance. |
theta |
Accuracy/speed trade-off factor, a value between 0.33 and 0.8. (Default value is theta = 0.5). If theta < 0.33 the algorithm uses the exact computation of the gradient. The closer this value is to 1, the faster the computation but the coarser the approximation of the gradient. |
Y.init |
An n * 2 * layers matrix with initial mapping positions. (By default, Y.init = NULL uses random initial positions). |
mpi.cl |
MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). (By default mpi.cl = NULL a 'SOCK' (intra-node parallelization) cluster is generated). |
threads |
Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory. Default value is threads = 4). |
layers |
Number of layers (minimum 2, maximum the number of threads). Default value is layers = 2. |
info |
Output information: 1 yields inter-round results, 0 disables intermediate results. Default value is info = 0. |
A bdm data mapping instance.
# --- load example dataset
bdm.example()
# --- perform ptSNE
## Not run:
# --- run ptSNE
m <- bdm.ptsne(ex$data, ex$map, threads = 10, layers = 2)
# --- plot the Cost function
bdm.cost(m)
# --- plot ptSNE output
bdm.ptsne.plot(m, class.lbls = ex$labels)
## End(Not run)
Plot ptSNE (low-dimensional embedding)
bdm.ptsne.plot( bdm, ptsne.cex = 0.5, ptsne.bg = "#FFFFFF", class.lbls = NULL, class.pltt = NULL, layer = 1 )
bdm |
A bdm instance as generated by |
ptsne.cex |
The size of the mapped data-points in the ptSNE plot. Default value is |
ptsne.bg |
The background colour of the ptSNE plot. Default value is |
class.lbls |
A vector of numeric class labels. If |
class.pltt |
A colour palette to show the class labels in the ptSNE plot. If |
layer |
The bdm$ptsne layer to be used (default value is |
None.
bdm.example()
m <- bdm.ptsne.plot(ex$map, class.lbls = ex$labels)
Maps quantitative variables onto the embedding space.
bdm.qMap( bdm, data, labels = NULL, subset = NULL, qMap.levels = 8, qMap.cex = 0.3, qMap.bg = "#FFFFFF", class.pltt = NULL, qtitle = NULL, cex.main = 1, colorbar = T, layer = 1 )
bdm |
A bdm instance as generated by |
data |
A |
labels |
A length |
subset |
A numeric vector with the indexes of a subset of data. Data-points in the subset are heat-mapped and the rest are shown in light grey. By default all data-points are heat-mapped. |
qMap.levels |
The number of levels of the quantile-map (8 by default). |
qMap.cex |
The size of the data-points (as in |
qMap.bg |
The background colour of the qMap plot. Default value is |
class.pltt |
If |
qtitle |
A vector of strings with titles for the plots. Default value is |
cex.main |
The font size of the title (as in |
colorbar |
A logical value (TRUE by default). FALSE hides the side colorbar. |
layer |
The number of a layer (1 by default). |
This is not a heat-map but a quantile-map plot. This function splits the range of each variable into as many quantiles as specified by qMap.levels, so the color gradient will hardly ever correspond to a constant numeric gradient. Thus, the mapping shows more evenly distributed colors, though at the expense of possibly exaggerating artifacts. For variables with very extreme distributions, it is impossible to find as many quantiles as desired and the distribution of colors is less homogeneous.
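The contrast between quantile levels and equal-width levels can be reproduced in base R with quantile() and cut(). This sketch only illustrates the binning idea on made-up variables; it is not the code used by bdm.qMap(), and the variable names are arbitrary.

```r
set.seed(1)
x <- rexp(1000)                 # a skewed toy variable
n.levels <- 8

# Quantile breaks: each colour level receives about the same number of points.
q.breaks <- quantile(x, probs = seq(0, 1, length.out = n.levels + 1))
q.level  <- cut(x, breaks = q.breaks, include.lowest = TRUE, labels = FALSE)
print(table(q.level))

# Equal-width breaks (an ordinary heat-map scale), for comparison: most
# points of a skewed variable pile up in the lowest levels.
w.level <- cut(x, breaks = n.levels, labels = FALSE)
print(table(w.level))

# A variable with a very extreme distribution yields repeated quantiles, so
# fewer distinct levels than requested are available.
z <- c(rep(0, 950), rexp(50))
z.breaks <- unique(quantile(z, probs = seq(0, 1, length.out = n.levels + 1)))
length(z.breaks) - 1
```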
None.
bdm.example()
bdm.qMap(ex$map, ex$data)
# --- show only components (1, 4, 8, 16) of the GMM
bdm.qMap(ex$map, ex$data, subset = which(ex$map$lbls %in% c(1, 4, 8, 16)))
Restarts the ptSNE algorithm (runs more epochs).
bdm.restart( data, bdm, epochs = NULL, iters = NULL, mpi.cl = NULL, threads = NULL, layers = NULL, info = 0 )
data |
Input data (a matrix, a big.matrix or a .csv file name). |
bdm |
A bdm data mapping instance. |
epochs |
Number of epochs to run. Default value epochs = NULL runs 4 * log(n) epochs. |
iters |
Number of iters per epoch. Default value iters = NULL runs 4 * log(thread_size) iters/epoch. |
mpi.cl |
An MPI (inter-node parallelization) cluster as returned by bdm.mpi.start(). Default value is mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is automatically generated. |
threads |
Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = NULL. |
layers |
Number of layers (minimum 2, maximum the number of threads). Default value is layers = NULL. |
info |
Output information: 1 yields inter-round results, 0 disables intermediate results. Default value is 0. |
A bdm data mapping instance.
# --- load example dataset
bdm.example()
## Not run:
# --- restart ptSNE
m <- bdm.restart(ex$data, ex$map, epochs = 50)
## End(Not run)
Starts the WTT algorithm (third step of the mapping protocol).
bdm.wtt(bdm, layer = 1)
bdm |
A bdm data mapping instance. |
layer |
The ptSNE output layer. Default value is layer = 1. |
This function requires the up-stream step bdm.pakde().
A bdm data mapping instance.
# --- load mapped dataset
bdm.example()
# --- perform WTT
m <- bdm.wtt(ex$map)
# --- plot WTT output
bdm.wtt.plot(m)
Plot WTT (clustering)
bdm.wtt.plot( bdm, pakde.pltt = NULL, pakde.lvls = 16, wtt.lwd = 1, plot.peaks = T, labels.cex = 1, layer = 1 )
bdm |
A bdm instance as generated by |
pakde.pltt |
A colour palette to show levels in the paKDE plot. By default ( |
pakde.lvls |
The number of levels of the density heat-map (16 by default). |
wtt.lwd |
The width of the watertrack lines (as set in |
plot.peaks |
Logical value (TRUE by default). If set to TRUE and the up-stream step |
labels.cex |
If plot.peaks is TRUE, the size of the labels of the clusters (as set in |
layer |
The bdm$ptsne layer to be used (default value is |
None.
bdm.example()
m <- bdm.wtt.plot(ex$map)