Package 'bigMap'

Title: Big Data Mapping
Description: Unsupervised clustering protocol for large scale structured data, based on a low dimensional representation of the data. Dimensionality reduction is performed using a parallelized implementation of the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm (Garriga J. and Bartumeus F. (2018), <arXiv:1812.09869>).
Authors: Joan Garriga [aut, cre], Frederic Bartumeus [aut]
Maintainer: Joan Garriga <[email protected]>
License: GPL-3 | file LICENSE
Version: 4.6.2
Built: 2025-03-07 05:20:14 UTC
Source: https://github.com/jgarriga65/bigmap

Help Index


Clustering statistics box-plot.

Description

Clustering statistics box-plot.

Usage

bdm.boxp(data, bdm, byVars = F, merged = T, clusters = NULL, layer = 1)

Arguments

data

A matrix of data to be plotted (either the input data matrix or any covariate matrix as well).

bdm

A bdm instance as generated by bdm.init().

byVars

A logical value. By default (byVars = FALSE) box-plots are grouped by cluster. With byVars = TRUE box-plots are grouped by input feature.

merged

A logical value. If TRUE (default value) and !is.null(bdm$merge), the box-plots depict the clusters after merging. If merged = FALSE or is.null(bdm$merge), the box-plots correspond to the top-level clustering.

clusters

A vector with a subset of cluster ids. (Default value is clusters=NULL to plot all clusters, with a maximum of 25).

layer

The number of a layer (1 by default).

Details

If the number of clusters is large, only the first 25 clusters will be plotted. Note that the WTT algorithm numbers the clusters based on density value at the peak cell of the cluster. Thus, the numbering of the clusters is highly correlated with their relevance in terms of partial density. Therefore, in case of more than 25 clusters, the most relevant should always be included in the plot.

Value

None.

Examples

bdm.example()
bdm.boxp(ex$map, data = ex$data[, 1:4])
bdm.boxp(ex$map, data = ex$data[, 1:4], byVars = TRUE)

ptSNE cost & size plot.

Description

ptSNE cost & size plot.

Usage

bdm.cost(bdm, x.lim = NULL)

Arguments

bdm

A bdm instance as generated by bdm.init() or a list of them to make a comparative plot.

x.lim

A vector with lower and upper bounds to limit the range of epochs shown on the x-axis (NULL by default).

Value

None.

Examples

bdm.example()
bdm.cost(ex$map)

Class density maps

Description

Compute the class density maps of a set of classes on the embedding grid. This function returns a fuzzy mapping of the set of classes on the grid cells. The classes can be any set of classes of interest and must be given as a vector of point-wise discrete labels (either numeric, string or factor).

Usage

bdm.dMap(bdm, data = NULL, threads = 2, mpi.cl = NULL, layer = 1)

Arguments

bdm

A bdm instance as generated by bdm.init().

data

A vector of discrete covariates or class labels. The covariate values can be of any factorizable type. By default (data = NULL) the function computes the density maps based on the clustering labels (i.e. equivalent to data = bdm.labels(bdm)).

threads

Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 2.

mpi.cl

MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). By default mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is generated.

layer

The number of the t-SNE layer (1 by default).

Details

bdm.dMap() computes the joint distribution P(V = v_i, C = c_j), where V = {v_1, ..., v_l} is the discrete covariate and C = {c_1, ..., c_g} are the grid cells of the paKDE raster. That is, this function recomputes the paKDE but keeps track of the covariate (or class) label of each data-point. This results in a fuzzy distribution of the covariate (class) at each cell.

Usually, computing the joint distribution P(V = v_i, C = c_j) is computationally intensive. Thus bdm.dMap() performs the computation and stores the result in a dedicated element named $dMap. Afterwards the class density maps can be visualized with the bdm.dMap.plot() function.
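As a rough illustration of the distribution described above, the sketch below replaces the paKDE kernels with plain counts: each point is assigned to a single grid cell and the table of labels by cells is normalised to sum to one. The names labels, cell and joint and the toy data are illustrative assumptions, not part of bigMap, and the real computation spreads each point over several cells through its kernel.

```r
# Toy stand-in for P(V = v_i, C = c_j): hard counts instead of kernel densities.
set.seed(1)
n      <- 200
labels <- sample(c("a", "b", "c"), n, replace = TRUE)  # discrete covariate V
cell   <- sample(1:16, n, replace = TRUE)               # hypothetical grid-cell index C

# Rows are covariate values, columns are grid cells; all entries sum to 1.
joint <- table(factor(labels), factor(cell, levels = 1:16)) / n

sum(joint)       # total probability mass
rowSums(joint)   # marginal distribution of the covariate, P(V = v_i)
```

Each column of joint is the (unnormalised) fuzzy membership of one grid cell across the covariate values.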

Value

A copy of the input bdm instance with element $dMap, a matrix with a soft clustering of the grid cells.

Examples

# --- load example dataset
bdm.example()
## Not run: 
m <- bdm.dMap(ex$map, threads = 4)

## End(Not run)

Class density maps plot.

Description

Class density maps plot.

Usage

bdm.dMap.plot(
  bdm,
  classes = NULL,
  join = FALSE,
  class.pltt = NULL,
  pakde.pltt = NULL,
  pakde.lvls = 16,
  wtt.lwd = 1,
  plot.peaks = T,
  labels.cex = 1,
  layer = 1
)

Arguments

bdm

A bdm instance as generated by bdm.init().

classes

A vector with a subset of class names or covariate values. Default value is classes=NULL. If no classes are specified (default value) all classes are plotted.

join

Logical value. If FALSE (default value), class mapping is based on the class conditional distributions. If TRUE, class mapping is based on the joint distribution over all classes.

class.pltt

A colour palette to show class labels in the hard mapping. By default (class.pltt = NULL) the default palette is used.

pakde.pltt

A palette of colours to indicate the levels of the class density maps. The length of the colour palette should be at least the number of levels specified in pakde.lvls.

pakde.lvls

The number of levels of the heat-map when plotting class density maps (16 by default).

wtt.lwd

The width of the watertrack lines (as set in par()).

plot.peaks

Logical value (TRUE by default). If set to TRUE and the up-stream step bdm.wtt() has been computed, the peak of each cluster is depicted.

labels.cex

If plot.peaks is TRUE, the size of the labels of the clusters (as set in par()). Default value is labels.cex = 1; with labels.cex = 0 the labels of the clusters are not depicted.

layer

The number of the layer from which the class density maps are computed (1 by default).

Details

bdm.dMap.plot() yields a multi-plot layout where the first plot shows the dominating value of the covariate (or dominating class) in each cell, and the rest of the plots show the density map of each covariate value (or class).

The joint distribution P(V = v_i, C = c_j) is prone to the bias in the marginal distribution of the covariate. Therefore, the joint distribution is transformed, by default, into a conditional distribution P(C = c_j | V = v_i) (where the c_j are the grid cells of the embedding and V is the covariate (or class)). Thus, the first plot shows a hard classification of grid cells (cells are coloured based on the dominating value of the covariate (or dominating class), i.e. the v_i for which P(C = c_j | V = v_i) is maximum), and the rest of the plots show the conditional distributions P(C = c_j | V = v_i). This makes the plots of the different classes not directly comparable, but the dominant areas of each class can be more easily identified.

However, the same plots can be depicted based on the joint distribution by setting join = TRUE. This makes sense when the bias in the covariate values (or classes) is not significant. In this case the hard clustering shows the real dominance of each covariate value (or class) over the embedding area and the density maps are comparable with each other (although, individually, they are not real density functions as they do not add up to one).

The multi-plot layout can be limited to a subset of the values of the covariate (or subset of classes) specified in parameter classes.
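The difference between the two mappings can be seen on a small hand-made table. The sketch below is only an illustration under assumed numbers: joint is a hypothetical distribution of two classes over four grid cells, with class "a" nine times more frequent than class "b". It is not output of bdm.dMap().

```r
# Hypothetical joint distribution: 2 classes (rows) x 4 grid cells (columns).
joint <- rbind(a = c(0.40, 0.30, 0.15, 0.05),
               b = c(0.00, 0.01, 0.03, 0.06))

# Conditional distributions P(C = c_j | V = v_i): divide each row by its marginal.
cond <- joint / rowSums(joint)

# Hard classification of each cell: the class with the largest value in that column.
hard.joint <- rownames(joint)[apply(joint, 2, which.max)]
hard.cond  <- rownames(cond)[apply(cond, 2, which.max)]

hard.joint   # dominance under the joint distribution (as with join = TRUE)
hard.cond    # dominance under the conditional distributions (as with join = FALSE)
```

In this toy table the third cell is assigned to the frequent class under the joint distribution but to the rare class under the conditional distributions, which is the bias effect the text describes.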

Value

None.

Examples

# --- load example dataset
bdm.example()
## Not run: 
m <- bdm.dMap(ex$map, threads = 4)
bdm.dMap.plot(m)

## End(Not run)

Example dataset

Description

Loads a mapping example.

Usage

bdm.example()

Details

The object ex is a list with elements: ex$data, a matrix with raw data; ex$labels, a vector of datapoint labels; ex$map, a bdm data mapping instance. A bdm instance is the basic object of the mapping protocol, i.e. a list to which new elements are added at each step of the mapping protocol.

This example is based on a small synthetic dataset with n = 5000 observations drawn from a 4-variate Gaussian Mixture Model (GMM) with 16 Gaussian components with random means and variances.

Value

An example dataset named ex.

Examples

# --- load example dataset
bdm.example()
str(ex)

HD/LD correlation.

Description

Pair-wise distance correlation between HD and LD neighborhoods.

Usage

bdm.hlCorr(data, bdm, zSampleSize = 1000, threads = 4, mpi.cl = NULL)

Arguments

data

Input data (a matrix, a big.matrix or a .csv file name).

bdm

A bdm instance as generated by bdm.ptsne().

zSampleSize

Number of data points to check per thread. (Default value is zSampleSize = 1000).

threads

The number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4.

mpi.cl

An MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). (By default mpi.cl = NULL a 'SOCK' (intra-node parallelization) cluster is generated).

Value

A copy of the input bdm instance with new element bdm$knP.

Examples

# --- load example dataset
## Not run: 
bdm.example()
m <- bdm.hlCorr(ex$data[, 1:4], ex$map, threads = 4)

## End(Not run)
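For intuition about what an HD/LD distance correlation measures, the following base R sketch correlates, for a sample of reference points, the distances to all other points in the original space with the distances in a 2-D embedding. It is a brute-force illustration under stated assumptions: the embedding is a PCA projection used as a stand-in for ptSNE output, and the names X, Y, z and hl.corr are hypothetical. It does not reproduce the sampling scheme or the output format of bdm.hlCorr().

```r
set.seed(42)
X <- matrix(rnorm(300 * 4), ncol = 4)   # toy high-dimensional data
Y <- prcomp(X)$x[, 1:2]                 # stand-in 2-D embedding (PCA, not ptSNE)

# Full pair-wise distance matrices (only feasible for small data sets).
dX <- as.matrix(dist(X))
dY <- as.matrix(dist(Y))

# For each sampled reference point, correlate its HD and LD distances
# to every other point.
z <- sample(nrow(X), 50)
hl.corr <- sapply(z, function(i) cor(dX[i, -i], dY[i, -i]))

summary(hl.corr)
```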

Embedding initialization.

Description

Computes the precision parameters for the given perplexity (i.e. the local bandwidths for the input affinity kernels) and returns them as a bdm data mapping instance. A bdm data mapping instance is the starting object of the mapping protocol, a list to which new elements are added at each step of the mapping protocol.

Usage

bdm.init(
  data,
  is.distance = F,
  is.sparse = F,
  ppx = 100,
  mpi.cl = NULL,
  threads = 4,
  labels = NULL
)

Arguments

data

A data.frame or matrix with raw input-data. The dataset must not have duplicated rows.

is.distance

Default value is is.distance = FALSE. TRUE indicates that raw data is a distance matrix.

is.sparse

Default value is is.sparse = FALSE. TRUE indicates that the raw data is a sparse matrix.

ppx

The value of perplexity to compute similarities.

mpi.cl

An MPI (inter-node parallelization) cluster as returned by bdm.mpi.start(). By default mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is used.

threads

Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4.

labels

If available, a length nrow(data) vector of class labels. Label values are factorized as as.numeric(as.factor(labels)).

Value

A bdm data mapping instance. This bdm instance is the starting object of the mapping protocol: a list to which new elements are added at each step of the mapping protocol.

Examples

# --- load example dataset
bdm.example()
## Not run: 
m <- bdm.init(ex$data, ppx = 250, labels = ex$labels)

## End(Not run)
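The precision parameters computed by bdm.init() can be pictured with the textbook t-SNE calibration: for one data point, find the precision beta of a Gaussian affinity kernel such that the perplexity of the resulting conditional distribution matches ppx. The sketch below is a minimal single-point version in base R. The helper names perplexity() and find.beta() are hypothetical, and the use of uniroot() and of natural-log entropy are assumptions of this sketch, not a description of the bigMap internals.

```r
# Perplexity of the conditional distribution p_{j|i} proportional to
# exp(-beta * d2), where d2 holds the squared distances from point i
# to every other point.
perplexity <- function(beta, d2) {
  p <- exp(-beta * (d2 - min(d2)))   # shift by min(d2) for numerical stability
  p <- p / sum(p)
  exp(-sum(p[p > 0] * log(p[p > 0])))
}

# Solve perplexity(beta) = ppx for one point by root finding on log(beta).
find.beta <- function(d2, ppx) {
  f <- function(log.beta) perplexity(exp(log.beta), d2) - ppx
  exp(uniroot(f, lower = -20, upper = 20, tol = 1e-10)$root)
}

set.seed(1)
X  <- matrix(rnorm(500 * 4), ncol = 4)
d2 <- as.matrix(dist(X))[1, -1]^2    # squared distances from point 1
beta1 <- find.beta(d2, ppx = 50)
perplexity(beta1, d2)                # close to the requested perplexity
```

Points in dense regions end up with a large beta (narrow kernel) and points in sparse regions with a small one, which is why the result is described as a set of local bandwidths.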

k-ary Neighborhood Preservation

Description

A measure of matching between HD and LD neighborhoods ('Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure', Lee et al., 2015).

Usage

bdm.knp(data, bdm, k.max = NULL, sampling = 0.9, threads = 4, mpi.cl = NULL)

Arguments

data

Input data (a matrix, a big.matrix or a .csv file name).

bdm

A bdm instance as generated by bdm.ptsne().

k.max

Maximum neighborhood size to check. (By default k.max=NULL neighborhood sizes are checked up to n-1).

sampling

Fraction of data points to check for each neighborhood size. (Default value is sampling=0.9).

threads

The number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4.

mpi.cl

An MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). (By default mpi.cl = NULL a 'SOCK' (intra-node parallelization) cluster is generated).

Value

A copy of the input bdm instance with new element bdm$knP.

Examples

# --- load example dataset
bdm.example()
## Not run: 
# --- compute the kNP
m <- bdm.knp(ex$data, ex$map, threads = 4)
# --- plot the kNP
bdm.knp.plot(m)

## End(Not run)
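A brute-force version of a k-ary neighborhood preservation measure can be written in a few lines of base R: for each point, count how many of its k nearest neighbours in the original space are also among its k nearest neighbours in the embedding, then average. This is a sketch for small data only. The function name knp() is hypothetical, the embedding is a PCA stand-in, and the normalisation and sampling used by bdm.knp() may differ.

```r
# Average fraction of shared k nearest neighbours between HD (X) and LD (Y).
knp <- function(X, Y, k) {
  dX <- as.matrix(dist(X))
  dY <- as.matrix(dist(Y))
  # k nearest neighbours of point i, excluding i itself
  knn <- function(D, i) setdiff(order(D[i, ]), i)[seq_len(k)]
  shared <- sapply(seq_len(nrow(dX)), function(i) {
    length(intersect(knn(dX, i), knn(dY, i)))
  })
  mean(shared) / k
}

set.seed(7)
X <- matrix(rnorm(200 * 4), ncol = 4)
Y <- prcomp(X)$x[, 1:2]              # stand-in embedding (PCA, not ptSNE)
sapply(c(5, 20, 100, 199), function(k) knp(X, Y, k))
```

The value always reaches 1 at k = n - 1, because every other point is then a neighbour in both spaces; the informative part of the curve is how quickly it rises for small and medium k.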

k-ary Neighborhood Preservation plot

Description

Log and linear plots of the k-ary Neighborhood Preservation

Usage

bdm.knp.plot(bdm, ppxfrmt = 0)

Arguments

bdm

A bdm data mapping instance, or a list of them to make a comparative plot.

ppxfrmt

Format of ppx in the legend. If ppxfrmt > 1 then ppx is shown as a fraction of the data set size and ppxfrmt is the number of decimal digits. Default value is ppxfrmt = 0 and then ppx is expressed in absolute value.

Examples

# --- load example dataset
bdm.example()
## Not run: 
# --- compute the kNP
m <- bdm.knp(ex$data, ex$map, threads = 4)
# --- plot the kNP
bdm.knp.plot(m, ppxfrmt = 0)

## End(Not run)

Get data-point clustering labels.

Description

Given that clusters are computed at grid-cell level, this function returns the clustering label for each data-point.

Usage

bdm.labels(bdm, merged = F, layer = 1)

Arguments

bdm

A bdm data mapping instance.

merged

Default value is merged = FALSE. If merged = TRUE and the clustering has been merged, the labels are the ids of the clusters after merging. If merged = FALSE or the clustering has not been merged, the labels are the ids of the top-level clustering.

layer

The ptSNE output layer. Default value is layer = 1.

Value

A vector of data-point clustering labels.

Examples

bdm.example()
m.labels <- bdm.labels(ex$map)
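The idea of deriving data-point labels from a cell-level clustering can be sketched as a simple lookup: locate the grid cell that contains each embedded point and read the cluster id stored for that cell. Everything in the sketch below is an assumption made for illustration (the toy embedding Y, the 4 x 4 grid, the matrix cell.cluster); the internal layout of a bdm instance is not shown here.

```r
set.seed(3)
Y <- matrix(runif(2 * 100), ncol = 2)   # toy 2-D embedding in [0, 1]^2
g <- 4                                   # hypothetical grid resolution (g * g cells)
breaks <- seq(0, 1, length.out = g + 1)

# Index (along x and along y) of the cell that contains each point.
cx <- findInterval(Y[, 1], breaks, rightmost.closed = TRUE)
cy <- findInterval(Y[, 2], breaks, rightmost.closed = TRUE)

# Hypothetical cell-level clustering, indexed as [x-index, y-index]:
# cells in the left half of the grid are cluster 1, the right half is cluster 2.
cell.cluster <- matrix(rep(c(1, 1, 2, 2), times = g), nrow = g, ncol = g)

# Data-point labels are looked up from the cell each point falls in.
point.labels <- cell.cluster[cbind(cx, cy)]
table(point.labels)
```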

Merging of clusters based on signal-to-noise-ratio.

Description

Performs a recursive merging of clusters based on minimum loss of signal-to-noise-ratio (S2NR) until reaching the desired number of clusters. The S2NR is the explained/unexplained variance ratio measured in the high dimensional space based on the given low dimensional clustering.

Usage

bdm.merge.s2nr(
  data,
  bdm,
  k = 10,
  plot.merge = T,
  ret.merge = F,
  info = T,
  layer = 1,
  ...
)

Arguments

data

Input data (a matrix, a big.matrix or a .csv file name).

bdm

A bdm instance as generated by bdm.init().

k

The number of desired clusters. The clustering will be recursively merged until reaching this number of clusters (default value is k = 10). By setting k < 0 we can specify the number of clusters that we are willing to merge.

plot.merge

Logical value. If TRUE, the merged clustering is plotted (default value is plot.merge = TRUE)

ret.merge

Logical value. If TRUE, the function returns a copy of the input bdm instance with the merged clustering attached as bdm$merge (default value is ret.merge = FALSE)

info

Logical value. If TRUE, all merging steps are shown (default value is info = TRUE).

layer

The bdm$ptsne layer to be used (default value is layer = 1).

...

If plot.merge is TRUE, you can set the bdm.wtt.plot() parameters to control the plot.

Details

See details in bdm.optk.s2nr().

Value

None if ret.merge = FALSE. Else, a copy of the input bdm instance with new element bdm$merge.

Examples

bdm.example()
m.labels <- bdm.labels(ex$map)

Initialize parallel computing environment.

Description

Initialize parallel computing environment.

Usage

bdm.mpi.start(threads)

Arguments

threads

The number of parallel threads (in principle only limited by hardware resources, i.e. number of cores and available memory)

Value

A cluster instance cl (as created by the snow::makeCluster() function).


Stops MPI parallel computing environment.

Description

Stops MPI parallel computing environment.

Usage

bdm.mpi.stop(cl)

Arguments

cl

A cluster instance (as created by the bdm.mpi.start() function).


Multi-core t-SNE (mtSNE)

Description

Starts the multi-core t-SNE (mtSNE) algorithm.

Usage

bdm.mtsne(
  data,
  is.distance = F,
  is.sparse = F,
  ppx = 100,
  theta = 0.5,
  iters = 250,
  mpi.cl = NULL,
  threads = 4,
  infoRate = 25
)

Arguments

data

A data.frame or matrix with raw input-data. The dataset must not have duplicated rows.

is.distance

Default value is is.distance = FALSE. TRUE indicates that raw data is a distance matrix.

is.sparse

Default value is is.sparse = FALSE. TRUE indicates that the raw data is a sparse matrix.

ppx

The value of perplexity to compute similarities.

theta

Accuracy/speed trade-off factor, a value between 0.33 and 0.8. Default value is theta = 0.5. If theta < 0.33 the algorithm uses the exact computation of the gradient. The closer this value is to 1, the faster the computation and the coarser the approximation of the gradient.

iters

Number of iterations per epoch. Default value is iters = 250.

mpi.cl

An MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). By default mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is generated.

threads

Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4.

infoRate

Number of epochs to show output information. Default value is infoRate = 25.

Value

A bdm data mapping instance.

Examples

# --- load example dataset
bdm.example()
## Not run: 
# --- perform mtSNE
m <- bdm.mtsne(ex$data, ppx = 250, iters = 250, threads = 4)
# --- plot the Cost function
bdm.cost(m)
# --- plot mtSNE output (use bdm.ptsne.plot() function)
bdm.ptsne.plot(m)

## End(Not run)

Plots the signal-to-noise-ratio as a function of the number of clusters.

Description

The function bdm.optk.s2nr() computes the S2NR that results from recursively merging clusters and, by default, makes a plot of these values. For large datasets this computation can take a while, so we can save this result by setting ret.optk = TRUE. If this result is saved, we can plot it again at any time using this function.

Usage

bdm.optk.plot(bdm)

Arguments

bdm

A bdm instance as generated by bdm.init().

Value

None.

Examples

bdm.example()
m <- bdm.optk.s2nr(ex$data, ex$map, ret.optk = TRUE)
bdm.optk.plot(m)

Find optimal number of clusters based on signal-to-noise-ratio.

Description

Performs a recursive merging of clusters based on minimum loss of signal-to-noise-ratio (S2NR). The S2NR is the explained/unexplained variance ratio measured in the high dimensional space based on the given low dimensional clustering. Merging is applied recursively until reaching a configuration of only 2 clusters and the S2NR is measured at each step.

Usage

bdm.optk.s2nr(data, bdm, info = T, plot.optk = T, ret.optk = F, layer = 1)

Arguments

data

Input data (a matrix, a big.matrix or a .csv file name).

bdm

A clustered bdm instance (i.e. all up-stream steps performed: bdm.ptsne(), bdm.pakde() and bdm.wtt()).

info

Logical value. If TRUE, all merging steps are shown (default value is info = TRUE).

plot.optk

Logical value. If TRUE, this function plots the heuristic measure versus the number of clusters (default value is plot.optk = TRUE)

ret.optk

Logical value. For large datasets this computation can take a while and it might be interesting to save it. If TRUE, the function returns a copy of the bdm instance with the values of S2NR attached as bdm$optk (default value is ret.optk = FALSE).

layer

The bdm$ptsne layer to be used (default value is layer = 1).

Details

The underlying idea is that neighboring clusters in the embedding correspond to close clusters in the high dimensional space, i.e. this merging heuristic is based on the spatial distribution of clusters. For each cluster (child cluster) we choose the neighboring cluster with the steepest gradient along their common border (father cluster). Thus, we get a set of pairs of clusters (child/father) to be potentially merged. Given this set of candidates, the merging is performed recursively, choosing at each step the pair of child/father clusters that results in a minimum loss of S2NR. Typically some clusters dominate over all of their neighboring clusters. These clusters have no father. Thus, once all possible mergings have been performed we reach a blocked state where only the dominant clusters remain. This situation identifies a hierarchy level in the clustering. When this situation is reached, the algorithm starts a new merging round, identifying the child/father relations at that level of the hierarchy. The process stops when only two clusters remain. Usually, the clustering hierarchy is clearly depicted by singular points in the S2NR function. This is a hint that the low dimensional clustering configuration is an image of a hierarchical configuration in the high dimensional space. See bdm.optk.plot().

Value

None if ret.optk = FALSE. Else, a copy of the input bdm instance with new element bdm$optk (a matrix).

Examples

# --- load mapped dataset
bdm.example()
# --- compute the S2NR for each number of clusters and plot it (nothing is returned)
bdm.optk.s2nr(ex$map, data = ex$data, plot.optk = TRUE, ret.optk = FALSE)
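The merging criterion can be illustrated with a much simplified base R sketch. It assumes that the S2NR is the ratio of between-cluster to within-cluster sum of squares in the original data space, and it tries every pair of clusters at each step instead of only the child/father pairs of neighbouring clusters described in the Details. The helper names s2nr() and merge.step() are hypothetical, and the exact definition used by bigMap may differ.

```r
# Explained / unexplained variance ratio of a clustering cl of the rows of X
# (between-cluster over within-cluster sum of squares; an assumed definition).
s2nr <- function(X, cl) {
  centre <- colMeans(X)
  ss.between <- 0
  ss.within  <- 0
  for (k in unique(cl)) {
    Xk <- X[cl == k, , drop = FALSE]
    mk <- colMeans(Xk)
    ss.between <- ss.between + nrow(Xk) * sum((mk - centre)^2)
    ss.within  <- ss.within + sum(sweep(Xk, 2, mk)^2)
  }
  ss.between / ss.within
}

# One merging step: evaluate every pair of clusters and keep the merge that
# leaves the highest S2NR, i.e. the minimum loss.
merge.step <- function(X, cl) {
  pairs <- combn(sort(unique(cl)), 2)
  score <- apply(pairs, 2, function(p) s2nr(X, ifelse(cl == p[2], p[1], cl)))
  best  <- pairs[, which.max(score)]
  ifelse(cl == best[2], best[1], cl)
}

# Toy data: clusters 1 and 2 are close together, cluster 3 is far away.
set.seed(11)
X  <- rbind(matrix(rnorm(100, 0),   ncol = 2),
            matrix(rnorm(100, 0.5), ncol = 2),
            matrix(rnorm(100, 8),   ncol = 2))
cl <- rep(1:3, each = 50)

s2nr.trace <- s2nr(X, cl)
while (length(unique(cl)) > 2) {
  cl <- merge.step(X, cl)
  s2nr.trace <- c(s2nr.trace, s2nr(X, cl))
}
s2nr.trace   # S2NR before and after each merge
```

In this toy case the two close clusters are merged first, and the recorded S2NR values play the role of the curve that bdm.optk.plot() displays against the number of clusters.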

Perplexity-adaptive kernel density estimation

Description

Starts the paKDE algorithm (second step of the mapping protocol).

Usage

bdm.pakde(
  bdm,
  ppx = 100,
  g = 200,
  g.exp = 3,
  mpi.cl = NULL,
  threads = 2,
  layer = 1
)

Arguments

bdm

A bdm data mapping instance.

ppx

The value of perplexity to compute similarities in the low-dimensional embedding. Default value is ppx = 100.

g

The resolution of the density space grid (g*g cells). Default value is g = 200.

g.exp

A numeric factor to avoid border effects. The grid limits will be expanded so as to enclose the density of the kernel of the most extreme embedded datapoints up to g.exp times σ. Default value is g.exp = 3, i.e. the grid limits are expanded so as to enclose 0.9986 of the probability mass of the most extreme kernels.

mpi.cl

An MPI (inter-node parallelization) cluster as returned by bdm.mpi.start(). Default value is mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is automatically generated.

threads

Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 2.

layer

The ptSNE output layer. Default value is layer = 1.

Details

When computing the paKDE the embedding area is discretized as a grid of size g*g cells. In order to avoid border effects, the limits of the grid are expanded by default so as to enclose at least 0.9986 of the cumulative distribution function (3σ) of the kernels of the most extreme mapped points in each direction.

The presence of outliers in the embedding can lead to undesired expansion of the grid limits. We can overcome this using lower values of g.exp. By setting g.exp = 0 the grid limits will be equal to the range of the embedding.

The values g.exp = c(1, 2, 3, 4, 5, 6) enclose cdf values of 0.8413, 0.9772, 0.9986, 0.99996, 0.99999, 1.0 respectively.
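The quoted cdf values can be checked against the standard normal distribution in base R, assuming (as the wording above suggests) that they are one-sided Gaussian cdf values evaluated at g.exp standard deviations. The variable names are illustrative.

```r
g.exp <- 1:6
cdf <- pnorm(g.exp)    # standard normal cdf at 1, 2, ..., 6 sigma
data.frame(g.exp = g.exp, cdf = round(cdf, 7))
```

The values listed in the text match these to about four decimal places; from g.exp = 5 onwards the enclosed mass is within 1e-6 of one.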

Value

A copy of the input bdm instance with new element bdm$pakde (paKDE output). bdm$pakde[[layer]]$layer = 'NC' stands for not computed layers.

Examples

# --- load mapped dataset
bdm.example()
# --- run paKDE
## Not run: 
m <- bdm.pakde(ex$map, ppx = 200, g = 200, g.exp = 3, threads = 4)
# --- plot paKDE output
bdm.pakde.plot(m)

## End(Not run)

Plot paKDE (density landscape)

Description

Plot paKDE (density landscape)

Usage

bdm.pakde.plot(bdm, pakde.pltt = NULL, pakde.lvls = 16, layer = 1)

Arguments

bdm

A bdm instance as generated by bdm.init() or a list of them to make a comparative plot.

pakde.pltt

A colour palette to show levels in the paKDE plot. By default (pakde.pltt = NULL) the default palette is used.

pakde.lvls

The number of levels of the density heat-map (16 by default).

layer

The bdm$ptsne layer to be used (default value is layer = 1).

Value

None.

Examples

bdm.example()
bdm.pakde.plot(ex$map)

Precision map (quantile map of betas)

Description

Precision map (quantile map of betas)

Usage

bdm.pMap(
  bdm,
  pMap.levels = 8,
  pMap.cex = 0.1,
  pMap.bg = "#000000",
  colorbar = T
)

Arguments

bdm

A bdm instance as generated by bdm.init().

pMap.levels

The number of levels of the quantile-map (8 by default).

pMap.cex

The size of the data-points (as in par()). Default value is pMap.cex = 0.1.

pMap.bg

The background colour of the pMap plot. Default value is pMap.bg = "#000000" (black).

colorbar

A logical value (TRUE by default). FALSE hides the side colorbar.

Value

None.

Examples

bdm.example()
bdm.pMap(ex$map)

Parallelized t-SNE (ptSNE)

Description

Starts the parallelized t-SNE algorithm (pt-SNE). This is the first step of the mapping protocol.

Usage

bdm.ptsne(
  data,
  bdm,
  theta = 0.5,
  Y.init = NULL,
  mpi.cl = NULL,
  threads = 4,
  layers = 2,
  info = 0
)

Arguments

data

Input data (a matrix, a big.matrix or a .csv file name).

bdm

A bdm data mapping instance.

theta

Accuracy/speed trade-off factor, a value between 0.33 and 0.8. Default value is theta = 0.5. If theta < 0.33 the algorithm uses the exact computation of the gradient. The closer this value is to 1, the faster the computation but the coarser the approximation of the gradient.

Y.init

An n * 2 * layers matrix with initial mapping positions. By default (Y.init = NULL) random initial positions are used.

mpi.cl

MPI (inter-node parallelization) cluster as generated by bdm.mpi.start(). (By default mpi.cl = NULL a 'SOCK' (intra-node parallelization) cluster is generated).

threads

Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4.

layers

Number of layers (minimum 2, maximum the number of threads). Default value is layers = 2.

info

Output information: 1 yields inter-round results, 0 disables intermediate results. Default value is info = 0.

Value

A bdm data mapping instance.

Examples

# --- load example dataset
bdm.example()
# --- perform ptSNE
## Not run: 
# --- run ptSNE
m <- bdm.ptsne(ex$data, ex$map, threads = 10, layers = 2)
# --- plot the Cost function
bdm.cost(m)
# --- plot ptSNE output
bdm.ptsne.plot(m, class.lbls = ex$labels)

## End(Not run)

Plot ptSNE (low-dimensional embedding)

Description

Plot ptSNE (low-dimensional embedding)

Usage

bdm.ptsne.plot(
  bdm,
  ptsne.cex = 0.5,
  ptsne.bg = "#FFFFFF",
  class.lbls = NULL,
  class.pltt = NULL,
  layer = 1
)

Arguments

bdm

A bdm instance as generated by bdm.init() or a list of them to make a comparative plot.

ptsne.cex

The size of the mapped data-points in the ptSNE plot. Default value is ptsne.cex = 0.5.

ptsne.bg

The background colour of the ptSNE plot. Default value is ptsne.bg = #FFFFFF (white).

class.lbls

A vector of numeric class labels. If is.null(class.lbls) and !is.null(bdm$wtt) cluster labels are used.

class.pltt

A colour palette to show the class labels in the ptSNE plot. If class.pltt = NULL (default value) the default palette is used.

layer

The bdm$ptsne layer to be used (default value is layer = 1).

Value

None.

Examples

bdm.example()
bdm.ptsne.plot(ex$map, class.lbls = ex$labels)

ptSNE quantile-maps

Description

Maps quantitative variables onto the embedding space.

Usage

bdm.qMap(
  bdm,
  data,
  labels = NULL,
  subset = NULL,
  qMap.levels = 8,
  qMap.cex = 0.3,
  qMap.bg = "#FFFFFF",
  class.pltt = NULL,
  qtitle = NULL,
  cex.main = 1,
  colorbar = T,
  layer = 1
)

Arguments

bdm

A bdm instance as generated by bdm.init().

data

A matrix/data.frame to be mapped.

labels

A length nrow(bdm$data) vector of class labels to overlay onto the embedding. Label values are factorized as as.numeric(as.factor(labels)). Default value is labels = NULL.

subset

A numeric vector with the indexes of a subset of data. Data-points in the subset are heat-mapped and the rest are shown in light grey. By default all data-points are heat-mapped.

qMap.levels

The number of levels of the quantile-map (8 by default).

qMap.cex

The size of the data-points (as in par()).

qMap.bg

The background colour of the qMap plot. Default value is qMap.bg = "#FFFFFF" (white).

class.pltt

If !is.null(labels) or !is.null(bdm$lbls), a colour palette to show class labels with the qMap plots. By default (class.pltt = NULL) the default palette is used.

qtitle

A vector of strings with titles for the plots. Default value is qtitle=NULL.

cex.main

The font size of the title (as in par()).

colorbar

A logical value (TRUE by default). FALSE hides the side colorbar.

layer

The number of a layer (1 by default).

Details

This is not a heat-map but a quantile-map plot. This function splits the range of each variable into as many quantiles as specified by qMap.levels, so the color gradient will hardly ever correspond to a constant numeric gradient. Thus, the mapping will show more evenly distributed colors, though at the expense of possibly exaggerating artifacts. For variables with very extreme distributions, it will be impossible to find as many quantiles as desired and the distribution of colors will not be so homogeneous.
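The difference between a quantile-map and an ordinary heat-map comes down to how the colour breaks are chosen. The base R sketch below bins a skewed toy variable both ways; n.levels plays the role of the number of levels, and all names and data are illustrative assumptions rather than bigMap internals.

```r
set.seed(5)
x <- rexp(1000)      # skewed toy variable
n.levels <- 8        # number of colour levels

# Quantile-map: breaks at equally spaced quantiles, so each colour level
# holds about the same number of points.
q.breaks <- quantile(x, probs = seq(0, 1, length.out = n.levels + 1))
q.bin    <- cut(x, breaks = q.breaks, include.lowest = TRUE, labels = FALSE)

# Heat-map style: breaks equally spaced over the numeric range.
h.breaks <- seq(min(x), max(x), length.out = n.levels + 1)
h.bin    <- cut(x, breaks = h.breaks, include.lowest = TRUE, labels = FALSE)

table(q.bin)   # evenly populated levels
table(h.bin)   # most points fall in the lowest levels
```

With equal-width breaks most of the colour scale is spent on a handful of large values, whereas quantile breaks use every colour level at the cost of a non-constant numeric step between levels.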

Value

None.

Examples

bdm.example()
bdm.qMap(ex$map, ex$data)
# --- show only components (1, 4, 8, 16) of the GMM
bdm.qMap(ex$map, ex$data, subset = which(ex$map$lbls %in% c(1, 4, 8, 16)))

Restart pt-SNE

Description

Restarts the ptSNE algorithm (runs more epochs).

Usage

bdm.restart(
  data,
  bdm,
  epochs = NULL,
  iters = NULL,
  mpi.cl = NULL,
  threads = NULL,
  layers = NULL,
  info = 0
)

Arguments

data

Input data (a matrix, a big.matrix or a .csv file name).

bdm

A bdm data mapping instance.

epochs

Number of epochs to run. The default value epochs = NULL runs 4 * log(n) epochs.

iters

Number of iterations per epoch. The default value iters = NULL runs 4 * log(thread_size) iterations per epoch.

mpi.cl

An MPI (inter-node parallelization) cluster as returned by bdm.mpi.start(). Default value is mpi.cl = NULL, i.e. a 'SOCK' (intra-node parallelization) cluster is automatically generated.

threads

Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is threads = 4.

layers

Number of layers (minimum 2, maximum the number of threads). Default value is layers = 2.

info

Output information: 1 yields inter-round results, 0 disables intermediate results. Default value is 0.

Value

A bdm data mapping instance.

Examples

# --- load example dataset
bdm.example()
## Not run: 
# --- restart ptSNE
m <- bdm.restart(ex$data, ex$map, epochs = 50)

## End(Not run)

Watertrack transform (WTT)

Description

Starts the WTT algorithm (third step of the mapping protocol).

Usage

bdm.wtt(bdm, layer = 1)

Arguments

bdm

A bdm data mapping instance.

layer

The ptSNE output layer. Default value is layer = 1.

Details

This function requires the up-stream step bdm.pakde().

Value

A bdm data mapping instance.

Examples

# --- load mapped dataset
bdm.example()
# --- perform WTT
m <- bdm.wtt(ex$map)
# --- plot WTT output
bdm.wtt.plot(m)

Plot WTT (clustering)

Description

Plot WTT (clustering)

Usage

bdm.wtt.plot(
  bdm,
  pakde.pltt = NULL,
  pakde.lvls = 16,
  wtt.lwd = 1,
  plot.peaks = T,
  labels.cex = 1,
  layer = 1
)

Arguments

bdm

A bdm instance as generated by bdm.init() or a list of them to make a comparative plot.

pakde.pltt

A colour palette to show levels in the paKDE plot. By default (pakde.pltt = NULL) the default palette is used.

pakde.lvls

The number of levels of the density heat-map (16 by default).

wtt.lwd

The width of the watertrack lines (as set in par()).

plot.peaks

Logical value (TRUE by default). If set to TRUE and the up-stream step bdm.wtt() has been computed, the peak of each cluster is marked.

labels.cex

If plot.peaks is TRUE, the size of the labels of the clusters (as set in par()). Default value is labels.cex = 1; with labels.cex = 0 the labels of the clusters are not depicted.

layer

The bdm$ptsne layer to be used (default value is layer = 1).

Value

None.

Examples

bdm.example()
bdm.wtt.plot(ex$map)