Package 'bigMap' reference manual

Title:	Big Data Mapping
Description:	Unsupervised clustering protocol for large scale structured data, based on a low dimensional representation of the data. Dimensionality reduction is performed using a parallelized implementation of the t-Stochastic Neighboring Embedding algorithm (Garriga J. and Bartumeus F. (2018), <arXiv:1812.09869>).
Authors:	Joan Garriga [aut, cre], Frederic Bartumeus [aut]
Maintainer:	Joan Garriga <[email protected]>
License:	GPL-3 | file LICENSE
Version:	4.6.2
Built:	2025-03-07 05:20:14 UTC
Source:	https://github.com/jgarriga65/bigmap

Clustering statistics box-plot.

Description

Clustering statistics box-plot.

Usage

bdm.boxp(data, bdm, byVars = F, merged = T, clusters = NULL, layer = 1)

Arguments

`data`	A matrix of data to be plotted (either the input data matrix or any covariate matrix as well).
`bdm`	A `bdm` instance as generated by `bdm.init()`.
`byVars`	A logical value. By default (`byVars = FALSE`) box-plots are grouped by cluster. With `byVars = TRUE` box-plots are grouped by input feature.
`merged`	A logical value. If TRUE (default value) and `!is.null(bdm$merge)`, the box-plots depict the clusters after merging. If `merged = FALSE` or `is.null(bdm$merge)`, the box-plots correspond to the top-level clustering.
`clusters`	A vector with a subset of cluster ids. (Default value is `clusters=NULL` to plot all clusters, with a maximum of 25).
`layer`	The number of a layer (1 by default).

Details

If the number of clusters is large, only the first 25 clusters will be plotted. Note that the WTT algorithm numbers the clusters based on density value at the peak cell of the cluster. Thus, the numbering of the clusters is highly correlated with their relevance in terms of partial density. Therefore, in case of more than 25 clusters, the most relevant should always be included in the plot.

Value

None.

Examples


bdm.example()
bdm.boxp(ex$map, data = ex$data[, 1:4])
bdm.boxp(ex$map, data = ex$data[, 1:4], byVars = TRUE)

ptSNE cost & size plot.

Description

ptSNE cost & size plot.

Usage

bdm.cost(bdm, x.lim = NULL)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()` or a list of them to make a comparative plot.
`x.lim`	A vector with upper and lower bounds to limit the number of epochs in the x-axis (NULL by default).

Value

None.

Examples


bdm.example()
bdm.cost(ex$map)

Class density maps

Description

Compute the class density maps of a set of classes on the embedding grid. This function returns a fuzzy mapping of the set of classes on the grid cells. The classes can be whatever set of classes of interest and must be given as a vector of point-wise discrete labels (either numeric, string or factor).

Usage

bdm.dMap(bdm, data = NULL, threads = 2, mpi.cl = NULL, layer = 1)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()`.
`data`	A vector of discrete covariates or class labels. The covariate values can be of any factorizable type. By default (`data = NULL`) the function computes the density maps based on the clustering labels (i.e. equivalent to `data = bdm.labels(bdm)`).
`threads`	Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 2`.
`mpi.cl`	MPI (inter-node parallelization) cluster as generated by `bdm.mpi.start()`. (By default `mpi.cl = NULL` a 'SOCK' (intra-node parallelization) cluster is generated).
`layer`	The number of the t-SNE layer (1 by default).

Details

bdm.dMap() computes the joint distribution $P(V=v_{i},C=c_{j})$ where $V=\{v_{1},\dots,v_{l}\}$ is the discrete covariate and $C=\{c_{1},\dots,c_{g}\}$ are the grid cells of the paKDE raster. That is, this function recomputes the paKDE but keeps track of the covariate (or class) label of each data-point. This results in a fuzzy distribution of the covariate (class) at each cell.

Usually, computing the joint distribution $P(V=v_{i},C=c_{j})$ is computationally intensive. Thus bdm.dMap() performs the computation and stores the result in a dedicated element named $dMap. Afterwards the class density maps can be visualized with the bdm.dMap.plot() function.

Value

A copy of the input bdm instance with element $dMap, a matrix with a soft clustering of the grid cells.

Examples


# --- load example dataset
bdm.example()
## Not run: 
m <- bdm.dMap(ex$map, threads = 4)

## End(Not run)

Class density maps plot.

Description

Class density maps plot.

Usage

bdm.dMap.plot(
  bdm,
  classes = NULL,
  join = FALSE,
  class.pltt = NULL,
  pakde.pltt = NULL,
  pakde.lvls = 16,
  wtt.lwd = 1,
  plot.peaks = T,
  labels.cex = 1,
  layer = 1
)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()`.
`classes`	A vector with a subset of class names or covariate values. Default value is `classes=NULL`. If no classes are specified (default value) all classes are plotted.
`join`	Logical value. If FALSE (default value), class mapping is based on the class-conditional distributions. If TRUE, class mapping is based on the joint distribution of classes and grid cells.
`class.pltt`	A colour palette to show class labels in the hard mapping. By default (`class.pltt = NULL`) the default palette is used.
`pakde.pltt`	A palette of colours to indicate the levels of the class density maps. The length of the colour palette should be at least the number of levels specified in `pakde.lvls`.
`pakde.lvls`	The number of levels of the heat-map when plotting class density maps (16 by default).
`wtt.lwd`	The width of the watertrack lines (as set in `par()`).
`plot.peaks`	Logical value (TRUE by default). If TRUE and the up-stream step `bdm.wtt()` has been computed, the peak of each cluster is depicted.
`labels.cex`	If `plot.peaks` is TRUE, the size of the labels of the clusters (as set in `par()`). Default value is `labels.cex = 1`. With `labels.cex = 0` the labels of the clusters are not depicted.
`layer`	The number of the layer from which the class density maps are computed (1 by default).

Details

bdm.dMap.plot() yields a multi-plot layout where the first plot shows the dominating value of the covariate (or dominating class) in each cell, and the rest of the plots show the density map of each covariate value (or class).

The joint distribution $P(V=v_{i},C=c_{j})$ is prone to bias from the marginal distribution of the covariate. Therefore, by default, it is transformed into the conditional distribution $P(C=c_{j}|V=v_{i})$ (where the $c_{j}$ are the grid cells of the embedding and $V$ is the covariate or class). Thus, the first plot shows a hard classification of grid cells (cells are coloured by the dominating value of the covariate or dominating class, i.e. the $v_{i}$ for which $P(C=c_{j}|V=v_{i})$ is maximum), and the rest of the plots show the conditional distributions $P(C=c_{j}|V=v_{i})$. This makes the plots of the different classes not directly comparable, but the dominant areas of each class can be more easily identified.

However, the same plots can be drawn from the joint distribution by setting join = TRUE. This makes sense when the bias in the covariate values (or classes) is not significant. In this case the hard clustering shows the real dominance of each covariate value (or class) over the embedding area and the density maps are comparable to one another (although, individually, they are not real density functions as they do not add up to one).

The multi-plot layout can be limited to a subset of the values of the covariate (or subset of classes) specified in parameter classes.
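
The difference between the two mappings can be illustrated with a minimal base-R sketch. It does not call bigMap: the vectors `v` and `cell` are hypothetical toy data, and plain counts stand in for the kernel-weighted paKDE that bdm.dMap() actually computes.

```r
# Toy data (hypothetical): class label and grid-cell id of 10 points.
# Class "a" is over-represented (7 points versus 3).
v    <- c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b")
cell <- c(1, 1, 1, 1, 2, 2, 3, 2, 3, 3)

# Joint distribution P(V = v_i, C = c_j): the whole table sums to one.
joint <- table(v, cell) / length(v)

# Conditional distribution P(C = c_j | V = v_i): each row sums to one,
# which removes the bias due to unbalanced class sizes.
cond <- prop.table(table(v, cell), margin = 1)

# Hard classification of each cell: the class with the maximum value per column.
hard.joint <- rownames(joint)[apply(joint, 2, which.max)]
hard.cond  <- rownames(cond)[apply(cond, 2, which.max)]

print(hard.joint)
print(hard.cond)
```

In this toy case the two hard classifications disagree on cell 2: the joint distribution assigns it to the larger class, while the conditional distribution assigns it to the class that concentrates a larger share of its own mass there.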

Value

None.

Examples


# --- load example dataset
bdm.example()
## Not run: 
m <- bdm.dMap(ex$map, threads = 4)
bdm.dMap.plot(m)

## End(Not run)

Example dataset

Description

Loads a mapping example.

Usage

bdm.example()

Details

The object ex is a list with elements: ex$data, a matrix with raw data; ex$labels, a vector of datapoint labels; ex$map, a bdm data mapping instance. A bdm instance is the basic object of the mapping protocol, i.e. a list to which new elements are added at each step of the mapping protocol.

This example is based on a small synthetic dataset with n = 5000 observations drawn from a 4-variate Gaussian Mixture Model (GMM) with 16 Gaussian components with random means and variances.

Value

An example dataset named ex

Examples


# --- load example dataset
bdm.example()
str(ex)

HD/LD correlation.

Description

Pair-wise distance correlation between HD and LD neighborhoods.

Usage

bdm.hlCorr(data, bdm, zSampleSize = 1000, threads = 4, mpi.cl = NULL)

Arguments

`data`	Input data (a matrix, a big.matrix or a .csv file name).
`bdm`	A `bdm` instance as generated by `bdm.ptsne()`.
`zSampleSize`	Number of data points to check per thread. (Default value is `zSampleSize = 1000`).
`threads`	The number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 4`.
`mpi.cl`	An MPI (inter-node parallelization) cluster as generated by `bdm.mpi.start()`. (By default `mpi.cl = NULL` a 'SOCK' (intra-node parallelization) cluster is generated).

Value

A copy of the input bdm instance with new element bdm$knP.

Examples


# --- load example dataset
## Not run: 
bdm.example()
m <- bdm.hlCorr(ex$data[, 1:4], ex$map, threads = 4)

## End(Not run)


Embedding initialization.

Description

Computes the precision parameters for the given perplexity (i.e. the local bandwidths for the input affinity kernels) and returns them as a bdm data mapping instance. A bdm data mapping instance is the starting object of the mapping protocol, a list to which new elements are added at each step of the mapping protocol.

Usage

bdm.init(
  data,
  is.distance = F,
  is.sparse = F,
  ppx = 100,
  mpi.cl = NULL,
  threads = 4,
  labels = NULL
)

Arguments

`data`	A `data.frame` or `matrix` with raw input-data. The dataset must not have duplicated rows.
`is.distance`	Default value is `is.distance = FALSE`. TRUE indicates that raw data is a distance matrix.
`is.sparse`	Default value is `is.sparse = FALSE`. TRUE indicates that the raw data is a sparse matrix.
`ppx`	The value of perplexity to compute similarities.
`mpi.cl`	An MPI (inter-node parallelization) cluster as returned by `bdm.mpi.start()`. By default `mpi.cl = NULL`, i.e. a 'SOCK' (intra-node parallelization) cluster is used.
`threads`	Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 4`.
`labels`	If available, a length `nrow(data)` vector of class labels. Label values are factorized as `as.numeric(as.factor(labels))`.
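
As a small illustration of the factorization mentioned for `labels`, the base-R sketch below uses a hypothetical label vector. The numeric codes follow the sorted order of the factor levels, not the order in which the labels first appear.

```r
# Hypothetical class labels given as strings.
labels <- c("setosa", "virginica", "setosa", "versicolor")

# Levels are sorted alphabetically: setosa, versicolor, virginica.
codes <- as.numeric(as.factor(labels))
print(codes)
```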

Value

A bdm data mapping instance. This bdm instance is the starting object of the mapping protocol: a list to which new elements are added at each step of the mapping protocol.

Examples


# --- load example dataset
bdm.example()
## Not run: 
m <- bdm.init(ex$data, ppx = 250, labels = ex$labels)

## End(Not run)

k-ary Neighborhood Preservation

Description

A measure of matching between HD and LD neighborhoods ('Multi-scale similarities in stochastic neighbour embedding: Reducing dimensionality while preserving both local and global structure', Lee et al., 2015).
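
The idea behind this measure can be sketched in base R for a single neighborhood size. This is a brute-force illustration under stated assumptions, not the bigMap implementation: `knp.sketch`, `X` and `Y` are hypothetical names, the overlap is averaged over all points without sampling, and bdm.knp() may normalise the result differently.

```r
# Average fraction of the k nearest neighbours shared between the
# high-dimensional data X and the low-dimensional embedding Y.
knp.sketch <- function(X, Y, k) {
  # k nearest neighbours of every row of Z (brute force, small n only).
  nn <- function(Z) {
    D <- as.matrix(dist(Z))
    diag(D) <- Inf                      # a point is not its own neighbour
    lapply(seq_len(nrow(D)), function(i) order(D[i, ])[seq_len(k)])
  }
  # Size of the overlap between HD and LD neighbourhoods, scaled to [0, 1].
  shared <- mapply(function(a, b) length(intersect(a, b)), nn(X), nn(Y))
  mean(shared) / k
}

set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)   # toy HD data (hypothetical)
Y <- X[, 1:2]                           # toy "embedding": first two coordinates

q.same <- knp.sketch(X, X, k = 10)      # identical spaces: full preservation
q.proj <- knp.sketch(X, Y, k = 10)      # projection: partial preservation
print(c(q.same, q.proj))
```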

Usage

bdm.knp(data, bdm, k.max = NULL, sampling = 0.9, threads = 4, mpi.cl = NULL)

Arguments

`data`	Input data (a matrix, a big.matrix or a .csv file name).
`bdm`	A `bdm` instance as generated by `bdm.ptsne()`.
`k.max`	Maximum neighborhood size to check. (By default `k.max=NULL` neighborhood sizes are checked up to n-1).
`sampling`	Fraction of data points to check for each neighborhood size. (Default value is `sampling=0.9`).
`threads`	The number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 4`.
`mpi.cl`	An MPI (inter-node parallelization) cluster as generated by `bdm.mpi.start()`. (By default `mpi.cl = NULL` a 'SOCK' (intra-node parallelization) cluster is generated).

Value

A copy of the input bdm instance with new element bdm$knP.

Examples


# --- load example dataset
bdm.example()
## Not run: 
# --- compute the kNP
m <- bdm.knp(ex$data, ex$map, threads = 4)
# --- plot the kNP
bdm.knp.plot(m)

## End(Not run)


k-ary Neighborhood Preservation plot

Description

Log and linear plots of the k-ary Neighborhood Preservation

Usage

bdm.knp.plot(bdm, ppxfrmt = 0)

Arguments

`bdm`	A `bdm` data mapping instance, or a list of them to make a comparative plot.
`ppxfrmt`	Format of `ppx` in the legend. If `ppxfrmt > 1` then `ppx` is shown as a fraction of the data set size and `ppxfrmt` is the number of decimal digits. Default value is `ppxfrmt = 0`, in which case `ppx` is expressed in absolute value.

Examples


# --- load example dataset
bdm.example()
## Not run: 
# --- compute the kNP
m <- bdm.knp(ex$data, ex$map, threads = 4)
# --- plot the kNP
bdm.knp.plot(m, ppxfrmt = 0)

## End(Not run)


Get data-point clustering labels.

Description

Given that clusters are computed at grid-cell level, this function returns the clustering label for each data-point.
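
The lookup from grid cells to data-points can be pictured with the base-R sketch below. It is only an illustration under stated assumptions: the embedding `Y`, the grid resolution `g` and the cell clustering `cell.label` are hypothetical toy objects, and the way bigMap stores its grid internally may differ.

```r
set.seed(1)
Y <- matrix(runif(20), ncol = 2)        # toy 2D embedding in the unit square
g <- 4                                  # grid resolution (g x g cells)

# Toy cell-level clustering: cells in columns 1-2 belong to cluster 1,
# cells in columns 3-4 belong to cluster 2.
cell.label <- matrix(rep(1:2, each = g * g / 2), nrow = g)

# Cell index of each point along each axis.
brk <- seq(0, 1, length.out = g + 1)
ix <- findInterval(Y[, 1], brk, all.inside = TRUE)
iy <- findInterval(Y[, 2], brk, all.inside = TRUE)

# The label of a point is the label of the cell where the point falls.
point.label <- cell.label[cbind(ix, iy)]
print(point.label)
```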

Usage

bdm.labels(bdm, merged = F, layer = 1)

Arguments

`bdm`	A `bdm` data mapping instance.
`merged`	Default value is `merged = FALSE`. If `merged = TRUE` and the clustering has been merged, the labels are the ids of the clusters after merging. If `merged = FALSE` or the clustering has not been merged, the labels are the ids of the top-level clustering.
`layer`	The ptSNE output layer. Default value is `layer = 1`.

Value

A vector of data-point clustering labels.

Examples


bdm.example()
m.labels <- bdm.labels(ex$map)

Merging of clusters based on signal-to-noise-ratio.

Description

Performs a recursive merging of clusters based on minimum loss of signal-to-noise-ratio (S2NR) until reaching the desired number of clusters. The S2NR is the explained/unexplained variance ratio measured in the high dimensional space based on the given low dimensional clustering.

Usage

bdm.merge.s2nr(
  data,
  bdm,
  k = 10,
  plot.merge = T,
  ret.merge = F,
  info = T,
  layer = 1,
  ...
)

Arguments

`data`	Input data (a matrix, a big.matrix or a .csv file name).
`bdm`	A `bdm` instance as generated by `bdm.init()`.
`k`	The number of desired clusters. The clustering will be recursively merged until reaching this number of clusters (default value is `k = 10`). By setting `k < 0` we can specify the number of clusters that we are willing to merge.
`plot.merge`	Logical value. If TRUE, the merged clustering is plotted (default value is `plot.merge = TRUE`)
`ret.merge`	Logical value. If TRUE, the function returns a copy of the input `bdm` instance with the merged clustering attached as `bdm$merge` (default value is `ret.merge = FALSE`)
`info`	Logical value. If TRUE, all merging steps are shown (default value is `info = TRUE`).
`layer`	The `bdm$ptsne` layer to be used (default value is `layer = 1`).
`...`	If `plot.merge` is TRUE, you can set the `bdm.wtt.plot()` parameters to control the plot.

Details

See details in bdm.optk.s2nr().

Value

None if ret.merge = FALSE. Else, a copy of the input bdm instance with new element bdm$merge.

Examples


bdm.example()
m.labels <- bdm.labels(ex$map)

Initialize parallel computing environment.

Description

Initialize parallel computing environment.

Usage

bdm.mpi.start(threads)

Arguments

`threads`	The number of parallel threads (in principle only limited by hardware resources, i.e. number of cores and available memory).

Value

A cluster instance `cl` (as created by the snow::makeCluster() function).

Stops MPI parallel computing environment.

Description

Stops MPI parallel computing environment.

Usage

bdm.mpi.stop(cl)

Arguments

`cl`	A cluster instance (as created by the bdm.mpi.start() function).

Multi-core t-SNE (mtSNE)

Description

Starts the multi-core t-SNE (mtSNE) algorithm.

Usage

bdm.mtsne(
  data,
  is.distance = F,
  is.sparse = F,
  ppx = 100,
  theta = 0.5,
  iters = 250,
  mpi.cl = NULL,
  threads = 4,
  infoRate = 25
)

Arguments

`data`	A `data.frame` or `matrix` with raw input-data. The dataset must not have duplicated rows.
`is.distance`	Default value is `is.distance = FALSE`. TRUE indicates that raw data is a distance matrix.
`is.sparse`	Default value is `is.sparse = FALSE`. TRUE indicates that the raw data is a sparse matrix.
`ppx`	The value of perplexity to compute similarities.
`theta`	Accuracy/speed trade-off factor, a value between 0.33 and 0.8. Default value is `theta = 0.5`. If `theta < 0.33` the algorithm uses the exact computation of the gradient. The closer this value is to 1, the faster the computation and the coarser the approximation of the gradient.
`iters`	Number of iterations per epoch. Default value is `iters = 250`.
`mpi.cl`	An MPI (inter-node parallelization) cluster as generated by `bdm.mpi.start()`. By default `mpi.cl = NULL`, i.e. a 'SOCK' (intra-node parallelization) cluster is generated.
`threads`	Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 4`.
`infoRate`	Number of epochs to show output information. Default value is `infoRate = 25`.

Value

A bdm data mapping instance.

Examples


# --- load example dataset
bdm.example()
## Not run: 
# --- perform mtSNE
m <- bdm.mtsne(ex$data, ppx = 250, iters = 250, threads = 4)
# --- plot the Cost function
bdm.cost(m)
# --- plot mtSNE output (use bdm.ptsne.plot() function)
bdm.ptsne.plot(m)

## End(Not run)

Plots the signal-to-noise-ratio as a function of the number of clusters.

Description

The function bdm.optk.s2nr() computes the S2NR that results from recursively merging clusters and, by default, makes a plot of these values. For large datasets this computation can take a while, so we can save this result by setting ret.optk = TRUE. If this result is saved, we can plot it again at any time using this function.

Usage

bdm.optk.plot(bdm)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()`.

Value

None.

Examples


bdm.example()
m <- bdm.optk.s2nr(ex$data, ex$map, ret.optk = TRUE)
bdm.optk.plot(m)

Find optimal number of clusters based on signal-to-noise-ratio.

Description

Performs a recursive merging of clusters based on minimum loss of signal-to-noise-ratio (S2NR). The S2NR is the explained/unexplained variance ratio measured in the high dimensional space based on the given low dimensional clustering. Merging is applied recursively until reaching a configuration of only 2 clusters and the S2NR is measured at each step.

Usage

bdm.optk.s2nr(data, bdm, info = T, plot.optk = T, ret.optk = F, layer = 1)

Arguments

`data`	Input data (a matrix, a big.matrix or a .csv file name).
`bdm`	A clustered `bdm` instance (i.e. all up-stream steps performed: `bdm.ptsne()`, `bdm.pakde()` and `bdm.wtt()`).
`info`	Logical value. If TRUE, all merging steps are shown (default value is `info = TRUE`).
`plot.optk`	Logical value. If TRUE, this function plots the heuristic measure versus the number of clusters (default value is `plot.optk = TRUE`)
`ret.optk`	Logical value. For large datasets this computation can take a while and it might be interesting to save it. If TRUE, the function returns a copy of the `bdm` instance with the values of S2NR attached as `bdm$optk` (default value is `ret.optk = FALSE`).
`layer`	The `bdm$ptsne` layer to be used (default value is `layer = 1`).

Details

The underlying idea is that neighbouring clusters in the embedding correspond to close clusters in the high dimensional space, i.e. this merging heuristic is based on the spatial distribution of clusters. For each cluster (child cluster) we choose the neighbouring cluster with the steepest gradient along their common border (father cluster). Thus, we get a set of pairs of clusters (child/father) to be potentially merged. Given this set of candidates, the merging is performed recursively choosing, at each step, the pair of child/father clusters that results in a minimum loss of S2NR. Typically some clusters dominate over all of their neighbouring clusters. These clusters have no father. Thus, once all possible mergings have been performed we reach a blocked state where only the dominant clusters remain. This situation identifies a hierarchy level in the clustering. When this situation is reached, the algorithm starts a new merging round, identifying the child/father relations at that level of the hierarchy. The process stops when only two clusters remain. Usually, the clustering hierarchy is clearly depicted by singular points in the S2NR function. This is a hint that the low dimensional clustering configuration is an image of a hierarchical configuration in the high dimensional space. See bdm.optk.plot().
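
To make the S2NR criterion more concrete, the base-R sketch below assumes that the explained/unexplained variance ratio is the between-cluster over the within-cluster sum of squares. This is an assumption made for illustration: `s2nr.sketch` is a hypothetical helper and the exact formula used by bigMap may differ. The toy data show that merging two clusters can only lower this ratio, which is why the heuristic looks for the merge with minimum loss.

```r
# Assumed definition: (total SS - within SS) / within SS, measured on the
# high-dimensional data X for a given vector of cluster labels.
s2nr.sketch <- function(X, labels) {
  X <- as.matrix(X)
  tss <- sum(sweep(X, 2, colMeans(X))^2)          # total sum of squares
  wss <- sum(sapply(split(seq_len(nrow(X)), labels), function(idx) {
    Xk <- X[idx, , drop = FALSE]
    sum(sweep(Xk, 2, colMeans(Xk))^2)             # within-cluster sum of squares
  }))
  (tss - wss) / wss
}

set.seed(1)
# Toy data (hypothetical): three well separated bivariate Gaussian clusters.
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))
lbl3 <- rep(1:3, each = 50)
lbl2 <- ifelse(lbl3 == 3, 2, 1)                   # merge clusters 1 and 2

s3 <- s2nr.sketch(X, lbl3)
s2 <- s2nr.sketch(X, lbl2)
print(c(k3 = s3, k2 = s2))                        # the merge loses S2NR
```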

Value

None if ret.optk = FALSE. Else, a copy of the input bdm instance with new element bdm$optk (a matrix).

Examples


# --- load mapped dataset
bdm.example()
# --- compute and plot the S2NR (the result is not attached, ret.optk = FALSE)
bdm.optk.s2nr(ex$map, data = ex$data, plot.optk = TRUE, ret.optk = FALSE)

Perplexity-adaptive kernel density estimation

Description

Starts the paKDE algorithm (second step of the mapping protocol).

Usage

bdm.pakde(
  bdm,
  ppx = 100,
  g = 200,
  g.exp = 3,
  mpi.cl = NULL,
  threads = 2,
  layer = 1
)

Arguments

`bdm`	A `bdm` data mapping instance.
`ppx`	The value of perplexity to compute similarities in the low-dimensional embedding. Default value is `ppx = 100`.
`g`	The resolution of the density space grid ( $g*g$ cells). Default value is `g = 200`.
`g.exp`	A numeric factor to avoid border effects. The grid limits will be expanded so as to enclose the density of the kernel of the most extreme embedded datapoints up to `g.exp` times $\sigma$ . Default value is `g.exp = 3`, `i.e.` the grid limits are expanded so as to enclose the 0.9986 of the probability mass of the most extreme kernels.
`mpi.cl`	An MPI (inter-node parallelization) cluster as returned by `bdm.mpi.start()`. Default value is `mpi.cl = NULL`, i.e. a 'SOCK' (intra-node parallelization) cluster is automatically generated.
`threads`	Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 2`.
`layer`	The ptSNE output layer. Default value is `layer = 1`.

Details

When computing the paKDE the embedding area is discretized as a grid of size g*g cells. In order to avoid border effects, the limits of the grid are expanded by default so as to enclose at least the 0.9986 of the cumulative distribution function ( $3 \sigma$ ) of the kernels of the most extreme mapped points in each direction.

The presence of outliers in the embedding can lead to undesired expansion of the grid limits. We can overcome this using lower values of g.exp. By setting g.exp = 0 the grid limits will be equal to the range of the embedding.

The values g.exp = c(1, 2, 3, 4, 5, 6) enclose cdf values of 0.8413, 0.9772, 0.9986, 0.99996, 0.99999, 1.0 respectively.
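
The cdf values listed above appear to correspond to the standard normal cdf evaluated at `g.exp` standard deviations, truncated to four or five decimals. A quick base-R check, assuming Gaussian kernels as the $\sigma$ notation suggests:

```r
# Probability mass of a Gaussian kernel enclosed up to g.exp standard
# deviations on one side (standard normal cdf).
g.exp <- 1:6
cdf <- pnorm(g.exp)
print(data.frame(g.exp = g.exp, cdf = round(cdf, 6)))
```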

Value

A copy of the input bdm instance with new element bdm$pakde (paKDE output). bdm$pakde[[layer]]$layer = 'NC' stands for not computed layers.

Examples


# --- load mapped dataset
bdm.example()
# --- run paKDE
## Not run: 
m <- bdm.pakde(ex$map, ppx = 200, g = 200, g.exp = 3, threads = 4)
# --- plot paKDE output
bdm.pakde.plot(m)

## End(Not run)

Plot paKDE (density landscape)

Description

Plot paKDE (density landscape)

Usage

bdm.pakde.plot(bdm, pakde.pltt = NULL, pakde.lvls = 16, layer = 1)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()` or a list of them to make a comparative plot.
`pakde.pltt`	A colour palette to show levels in the paKDE plot. By default (`pakde.pltt = NULL`) the default palette is used.
`pakde.lvls`	The number of levels of the density heat-map (16 by default).
`layer`	The `bdm$ptsne` layer to be used (default value is `layer = 1`).

Value

None.

Examples


bdm.example()
bdm.pakde.plot(ex$map)

Precision map (quantile map of betas)

Description

Precision map (quantile map of betas)

Usage

bdm.pMap(
  bdm,
  pMap.levels = 8,
  pMap.cex = 0.1,
  pMap.bg = "#000000",
  colorbar = T
)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()`.
`pMap.levels`	The number of levels of the quantile-map (8 by default).
`pMap.cex`	The size of the data-points (as in `par()`). Default value is `pMap.cex = 0.1`.
`pMap.bg`	The background colour of the pMap plot. Default value is `pMap.bg = "#000000"` (black).
`colorbar`	A logical value (TRUE by default). FALSE hides the side colorbar.

Value

None.

Examples


bdm.example()
bdm.pMap(ex$map)

Parallelized t-SNE (ptSNE)

Description

Starts the parallelized t-SNE algorithm (pt-SNE). This is the first step of the mapping protocol.

Usage

bdm.ptsne(
  data,
  bdm,
  theta = 0.5,
  Y.init = NULL,
  mpi.cl = NULL,
  threads = 4,
  layers = 2,
  info = 0
)

Arguments

`data`	Input data (a matrix, a big.matrix or a .csv file name).
`bdm`	A `bdm` data mapping instance.
`theta`	Accuracy/speed trade-off factor, a value between 0.33 and 0.8. (Default value is `theta = 0.5`). If `theta < 0.33` the algorithm uses the exact computation of the gradient. The closer this value is to 1, the faster the computation but the coarser the approximation of the gradient.
`Y.init`	An `n * 2 * layers` matrix with initial mapping positions. (By default `Y.init = NULL` will use random initial positions).
`mpi.cl`	MPI (inter-node parallelization) cluster as generated by `bdm.mpi.start()`. (By default `mpi.cl = NULL` a 'SOCK' (intra-node parallelization) cluster is generated).
`threads`	Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 4`.
`layers`	Number of layers (`minimum` 2, `maximum` the number of threads). Default value is `layers = 2`.
`info`	Output information: 1 yields inter-round results, 0 disables intermediate results. Default value is `info = 0`.

Value

A bdm data mapping instance.

Examples


# --- load example dataset
bdm.example()
# --- perform ptSNE
## Not run: 
# --- run ptSNE
m <- bdm.ptsne(ex$data, ex$map, threads = 10, layers = 2)
# --- plot the Cost function
bdm.cost(m)
# --- plot ptSNE output
bdm.ptsne.plot(m, class.lbls = ex$labels)

## End(Not run)

Plot ptSNE (low-dimensional embedding)

Description

Plot ptSNE (low-dimensional embedding)

Usage

bdm.ptsne.plot(
  bdm,
  ptsne.cex = 0.5,
  ptsne.bg = "#FFFFFF",
  class.lbls = NULL,
  class.pltt = NULL,
  layer = 1
)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()` or a list of them to make a comparative plot.
`ptsne.cex`	The size of the mapped data-points in the ptSNE plot. Default value is `ptsne.cex = 0.5`.
`ptsne.bg`	The background colour of the ptSNE plot. Default value is `ptsne.bg = #FFFFFF` (white).
`class.lbls`	A vector of numeric class labels. If `is.null(class.lbls)` and `!is.null(bdm$wtt)` cluster labels are used.
`class.pltt`	A colour palette to show the class labels in the ptSNE plot. If `class.pltt = NULL` (default value) the default palette is used.
`layer`	The `bdm$ptsne` layer to be used (default value is `layer = 1`).

Value

None.

Examples


bdm.example()
bdm.ptsne.plot(ex$map, class.lbls = ex$labels)

ptSNE quantile-maps

Description

Maps quantitative variables onto the embedding space.

Usage

bdm.qMap(
  bdm,
  data,
  labels = NULL,
  subset = NULL,
  qMap.levels = 8,
  qMap.cex = 0.3,
  qMap.bg = "#FFFFFF",
  class.pltt = NULL,
  qtitle = NULL,
  cex.main = 1,
  colorbar = T,
  layer = 1
)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()`.
`data`	A `matrix/data.frame` to be mapped.
`labels`	A length `nrow(bdm$data)` vector of class labels to overlay onto the embedding. Label values are factorized as `as.numeric(as.factor(labels))`. Default value is `labels = NULL`.
`subset`	A numeric vector with the indexes of a subset of data. Data-points in the subset are heat-mapped and the rest are shown in light grey. By default all data-points are heat-mapped.
`qMap.levels`	The number of levels of the quantile-map (8 by default).
`qMap.cex`	The size of the data-points (as in `par()`).
`qMap.bg`	The background colour of the qMap plot. Default value is `qMap.bg = "#FFFFFF"` (white).
`class.pltt`	If `!is.null(labels)` or `!is.null(bdm$lbls)`, a colour palette to show class labels with the qMap plots. By default (`class.pltt = NULL`) the default palette is used.
`qtitle`	A vector of strings with titles for the plots. Default value is `qtitle=NULL`.
`cex.main`	The font size of the title (as in `par()`).
`colorbar`	A logical value (TRUE by default). FALSE hides the side colorbar.
`layer`	The number of a layer (1 by default).
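
The `labels` and `subset` arguments can be illustrated with a minimal base-R sketch. The `lbls` vector below is hypothetical and only stands in for user-supplied class labels; nothing here calls the package itself.

```r
# Hypothetical class labels for 8 data-points (illustration only).
lbls <- c("b", "a", "c", "a", "b", "c", "c", "a")

# Factorization as described for the `labels` argument:
# factor levels are sorted, so "a" -> 1, "b" -> 2, "c" -> 3.
num_lbls <- as.numeric(as.factor(lbls))

# Row indexes of the data-points in classes "a" and "c",
# i.e. a numeric index vector of the kind `subset` expects.
sub_idx <- which(lbls %in% c("a", "c"))
```

Note that the numeric codes follow the sorted order of the factor levels, not the order in which the labels first appear.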

Details

This is not a heat-map but a quantile-map plot. The function splits the range of each variable into as many quantiles as specified by `qMap.levels`, so the colour gradient will rarely correspond to a constant numeric gradient. The mapping therefore shows more evenly distributed colours, at the expense of possibly exaggerating artifacts. For variables with very extreme distributions it may be impossible to find as many quantiles as requested, and the distribution of colours will be less homogeneous.
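
The difference between equal-width levels and quantile levels can be sketched in base R. This is only an illustration of the idea on hypothetical data; it assumes nothing about how `bdm.qMap()` computes its levels internally.

```r
set.seed(1)
# A right-skewed variable (hypothetical data).
x <- rexp(1000)
n_levels <- 8

# Equal-width breaks, as a constant numeric gradient would use.
width_breaks <- seq(min(x), max(x), length.out = n_levels + 1)
width_bin <- cut(x, breaks = width_breaks, include.lowest = TRUE, labels = FALSE)

# Quantile breaks: each level holds roughly the same number of points.
q_breaks <- quantile(x, probs = seq(0, 1, length.out = n_levels + 1))
q_bin <- cut(x, breaks = q_breaks, include.lowest = TRUE, labels = FALSE)

table(width_bin)  # most points fall in the lowest levels
table(q_bin)      # points are spread evenly across levels

# With many tied values the quantile breaks collapse,
# so fewer distinct levels are available than requested.
y <- c(rep(0, 900), rexp(100))
y_breaks <- quantile(y, probs = seq(0, 1, length.out = n_levels + 1))
n_distinct_breaks <- length(unique(y_breaks))
```

The last step shows the caveat mentioned above: when a variable is dominated by a single value, the requested number of quantiles cannot be found.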

Value

None.

Examples


bdm.example()
bdm.qMap(ex$map, ex$data)
# --- show only components (1, 4, 8, 16) of the GMM
bdm.qMap(ex$map, ex$data, subset = which(ex$map$lbls %in% c(1, 4, 8, 16)))

Restart pt-SNE

Description

Restarts the ptSNE algorithm (runs more epochs).

Usage

bdm.restart(
  data,
  bdm,
  epochs = NULL,
  iters = NULL,
  mpi.cl = NULL,
  threads = NULL,
  layers = NULL,
  info = 0
)

Arguments

`data`	Input data (a matrix, a big.matrix or a .csv file name).
`bdm`	A `bdm` data mapping instance.
`epochs`	Number of epochs to run. Default value `epochs = NULL` runs `4 * log(n)` epochs.
`iters`	Number of iterations per epoch. Default value `iters = NULL` runs `4 * log(thread_size)` iterations per epoch.
`mpi.cl`	An MPI (inter-node parallelization) cluster as returned by `bdm.mpi.start()`. Default value is `mpi.cl = NULL`, i.e. a 'SOCK' (intra-node parallelization) cluster is automatically generated.
`threads`	Number of parallel threads (according to data size and hardware resources, i.e. number of cores and available memory). Default value is `threads = 4`.
`layers`	Number of layers (minimum 2, maximum the number of threads). Default value is `layers = 2`.
`info`	Output information: 1 yields inter-round results, 0 disables intermediate results. Default value is 0.
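
The documented defaults for `epochs` and `iters` can be worked out by hand. The sketch below assumes `log` is the natural logarithm (as in R) and uses hypothetical values for `n` and `thread_size`; how the package derives `thread_size` and whether it rounds the results are not stated in this section.

```r
# Default number of epochs as documented: 4 * log(n),
# where n is the number of data-points (hypothetical value).
n <- 100000
default_epochs <- 4 * log(n)

# Default iterations per epoch as documented: 4 * log(thread_size).
# thread_size is taken as a given input here (hypothetical value).
thread_size <- 25000
default_iters <- 4 * log(thread_size)

round(c(epochs = default_epochs, iters = default_iters), 1)
```

Because both defaults grow logarithmically, a tenfold increase in data size adds only about nine epochs.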

Value

A bdm data mapping instance.

Examples


# --- load example dataset
bdm.example()
## Not run: 
# --- restart ptSNE
m <- bdm.restart(ex$data, ex$map, epochs = 50)

## End(Not run)

Watertrack transform (WTT)

Description

Starts the WTT algorithm (third step of the mapping protocol).

Usage

bdm.wtt(bdm, layer = 1)

Arguments

`bdm`	A `bdm` data mapping instance.
`layer`	The ptSNE output layer. Default value is `layer = 1`.

Details

This function requires the up-stream step `bdm.pakde()`.
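
A caller may want to check this dependency before starting the WTT. The helper below is a hypothetical sketch, not part of the package: it assumes the paKDE output is stored in a list element named `bdm$pakde`, indexed by layer, which is an assumption about the instance layout.

```r
# Hypothetical helper: TRUE if a density estimate is present for `layer`.
# Assumes the paKDE output lives in `bdm$pakde[[layer]]` (an assumption).
has_pakde <- function(bdm, layer = 1) {
  !is.null(bdm$pakde) &&
    length(bdm$pakde) >= layer &&
    !is.null(bdm$pakde[[layer]])
}

# Stand-in lists used only to exercise the helper.
mapped_only <- list(ptsne = list())
with_density <- list(ptsne = list(), pakde = list(list(z = matrix(0, 2, 2))))

has_pakde(mapped_only)   # FALSE: density step not run yet
has_pakde(with_density)  # TRUE: safe to proceed to the WTT step
```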

Value

A bdm data mapping instance.

Examples


# --- load mapped dataset
bdm.example()
# --- perform WTT
m <- bdm.wtt(ex$map)
# --- plot WTT output
bdm.wtt.plot(m)

Plot WTT (clustering)

Description

Plot WTT (clustering)

Usage

bdm.wtt.plot(
  bdm,
  pakde.pltt = NULL,
  pakde.lvls = 16,
  wtt.lwd = 1,
  plot.peaks = T,
  labels.cex = 1,
  layer = 1
)

Arguments

`bdm`	A `bdm` instance as generated by `bdm.init()` or a list of them to make a comparative plot.
`pakde.pltt`	A colour palette to show levels in the paKDE plot. By default (`pakde.pltt = NULL`) the default palette is used.
`pakde.lvls`	The number of levels of the density heat-map (16 by default).
`wtt.lwd`	The width of the watertrack lines (as set in `par()`).
`plot.peaks`	Logical value (TRUE by default). If set to TRUE and the up-stream step `bdm.wtt()` has been computed, marks the peak of each cluster.
`labels.cex`	If `plot.peaks` is TRUE, the size of the labels of the clusters (as set in `par()`). Default value is `labels.cex = 1`, as in the usage above; with `labels.cex = 0` the labels of the clusters are not depicted.
`layer`	The `bdm$ptsne` layer to be used (default value is `layer = 1`).

Value

None.

Examples


bdm.example()
m <- bdm.wtt.plot(ex$map)
