# This is a demo for the leukemia patient dataset from Golub et al.
#
# This leukemia patient expression dataset (the learning set, available from http://www.ncbi.nlm.nih.gov/pubmed/10521349) contains an expression matrix of 3,051 genes X 38 samples, involving two types of leukemia: 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL). These 27 ALL are further subtyped into 19 B-cell ALL (ALL_B) and 8 T-cell ALL (ALL_T).
###############################################################################

# (I) Load the package and import data
library(supraHex)
data(Golub)
data <- Golub # a matrix of 3,051 genes expressed in 38 samples

# (II) Train the supra-hexagonal map with input data only
sMap <- sPipeline(data)

Start at 2016-06-24 11:45:40
First, define topology of a map grid (2016-06-24 11:45:40)...
Second, initialise the codebook matrix (331 X 38) using 'linear' initialisation, given a topology and input data (2016-06-24 11:45:40)...
Third, get training at the rough stage (2016-06-24 11:45:40)...
1 out of 2 (2016-06-24 11:45:40)
updated (2016-06-24 11:45:40)
2 out of 2 (2016-06-24 11:45:40)
updated (2016-06-24 11:45:40)
Fourth, get training at the finetune stage (2016-06-24 11:45:40)...
1 out of 5 (2016-06-24 11:45:40)
updated (2016-06-24 11:45:40)
2 out of 5 (2016-06-24 11:45:40)
updated (2016-06-24 11:45:40)
3 out of 5 (2016-06-24 11:45:40)
updated (2016-06-24 11:45:41)
4 out of 5 (2016-06-24 11:45:41)
updated (2016-06-24 11:45:41)
5 out of 5 (2016-06-24 11:45:41)
updated (2016-06-24 11:45:41)
Next, identify the best-matching hexagon/rectangle for the input data (2016-06-24 11:45:41)...
Finally, append the response data (hits and mqe) into the sMap object (2016-06-24 11:45:41)...
Below are the summaries of the training results:
dimension of input data: 3051x38
xy-dimension of map grid: xdim=21, ydim=21
grid lattice: hexa
grid shape: suprahex
dimension of grid coord: 331x2
initialisation method: linear
dimension of codebook matrix: 331x38
mean quantization error: 23.6263000610265
Below are the details of trainology:
training algorithm: batch
alpha type: invert
training neighborhood kernel: gaussian
trainlength (x input data length): 2 at rough stage; 5 at finetune stage
radius (at rough stage): from 6 to 1.5
radius (at finetune stage): from 1.5 to 1
End at 2016-06-24 11:45:41
Runtime in total is: 1 secs

visHexMulComp(sMap,title.rotate=10,colormap="darkgreen-lightgreen-lightpink-darkred")
sWriteData(sMap, data, filename="Output_Golub.txt")
## As you have seen, a figure displays the multiple components of the trained map in a sample-specific manner. You also see that a .txt file has been saved to your disk. The output file has the 1st column for your input data ID (an integer; otherwise the row names of the input data matrix), and the 2nd column for the corresponding index of best-matching hexagons (i.e. gene clusters). You can also force the input data to be output; type ?sWriteData for details.

# (III) Visualise the map, including built-in indexes, data hits/distributions, distance between map nodes, and codebook matrix
visHexMapping(sMap, mappingType="indexes")
## As you have seen, the smaller hexagons in the supra-hexagonal map are indexed as follows: start from the center, then expand circularly outwards, and within each circle increase in anticlockwise order.

visHexMapping(sMap, mappingType="hits")
## As you have seen, the number represents how many input data vectors hit each hexagon, the size of which is proportional to the number of hits.

visHexMapping(sMap, mappingType="dist")
## As you have seen, map distance tells how far each hexagon is away from its neighbors, and the size of each hexagon is proportional to this distance.

visHexPattern(sMap, plotType="lines")
## As you have seen, the line plot displays the patterns associated with the codebook matrix. If multiple colors are given, the points are also plotted. When the pattern involves both positive and negative values, the zero horizontal line is also shown.
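
## The grid size reported in the training summary (331 hexagons, xdim=21) and the indexing rule (center first, then circle by circle) can be checked with a little arithmetic. The sketch below uses base R only; the radius r = 11 is an assumption inferred from xdim=21, not a value printed in the output.

```r
# Assumed radius of the supra-hexagonal grid (inferred from xdim = 21)
r <- 11

# Ring 0 is the central hexagon; ring k (k >= 1) holds 6*k hexagons
ring <- 0:(r - 1)
n_per_ring <- ifelse(ring == 0, 1, 6 * ring)

# Index range covered by each ring when counting from the center outwards
last_index <- cumsum(n_per_ring)
first_index <- last_index - n_per_ring + 1
layout <- data.frame(ring, n_per_ring, first_index, last_index)

nHex <- sum(n_per_ring)   # total number of hexagons: 1 + 3*r*(r-1)
xdim <- 2 * r - 1         # width of the bounding grid

print(layout)
print(c(nHex = nHex, xdim = xdim))
```

## Under this assumption the totals agree with the summary (331 hexagons, xdim=21), and the first circle around the center covers indexes 2 to 7.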

visHexPattern(sMap, plotType="bars")
## As you have seen, the bar plot displays the patterns associated with the codebook matrix. When the pattern involves both positive and negative values, the zero horizontal line is in the middle of the hexagon; otherwise it is at the top of the hexagon for all negative values, and at the bottom for all positive values.

# (IV) Perform partitioning operation on the map to obtain continuous clusters (i.e. gene meta-clusters), which are different from gene clusters in an individual map node
sBase <- sDmatCluster(sMap)
visDmatCluster(sMap, sBase)
sWriteData(sMap, data, sBase, filename="Output_base_Golub.txt")
## As you have seen, each cluster is filled with the same continuous color, and the cluster index is marked in the seed node. Because the colors are randomly generated, neighbouring clusters may occasionally receive very similar colors and become visually indiscernible. In that case, you can rerun the command visDmatCluster(sMap, sBase) until neighbouring clusters are filled with clearly different colors. An output .txt file has been saved to your disk. This file has the 1st column for your input data ID (an integer; otherwise the row names of the input data matrix), the 2nd column for the corresponding index of best-matching hexagons (i.e. gene clusters), and the 3rd column for the cluster bases (i.e. gene meta-clusters). You can also force the input data to be output; type ?sWriteData for details.
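
## The relationship between the three output columns described above (input ID, best-matching hexagon, cluster base) can be illustrated with a toy lookup in base R. All values and column names below are hypothetical and are not taken from the Golub run or from the actual file written by sWriteData.

```r
# Hypothetical toy values: 7 genes mapped onto a map of 4 hexagons
bmh  <- c(1, 1, 2, 3, 3, 3, 4)   # best-matching hexagon index per gene (gene cluster)
base <- c(1, 1, 2, 2)            # meta-cluster (base) assigned to each hexagon

# A gene inherits the meta-cluster of its best-matching hexagon
gene_base <- base[bmh]

# Hypothetical column names, chosen for illustration only
out <- data.frame(ID = seq_along(bmh),
                  Hexagon_index = bmh,
                  Cluster_base = gene_base)
print(out)

# Number of genes per meta-cluster
print(table(out$Cluster_base))
```

## Several hexagons can share one base, so a gene meta-cluster is generally larger than any single gene cluster it contains.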

# (V) Reorder the sample-specific components of the map to delineate relationships between samples
sReorder <- sCompReorder(data,metric="pearson") # see Figure 8

Start at 2016-06-24 11:45:47
First, define topology of a map grid (2016-06-24 11:45:47)...
Second, initialise the codebook matrix (117 X 38) using 'linear' initialisation, given a topology and input data (2016-06-24 11:45:47)...
Third, get training at the rough stage (2016-06-24 11:45:47)...
1 out of 1178 (2016-06-24 11:45:47)
118 out of 1178 (2016-06-24 11:45:47)
236 out of 1178 (2016-06-24 11:45:47)
354 out of 1178 (2016-06-24 11:45:47)
472 out of 1178 (2016-06-24 11:45:47)
590 out of 1178 (2016-06-24 11:45:47)
708 out of 1178 (2016-06-24 11:45:47)
826 out of 1178 (2016-06-24 11:45:47)
944 out of 1178 (2016-06-24 11:45:47)
1062 out of 1178 (2016-06-24 11:45:47)
1178 out of 1178 (2016-06-24 11:45:47)
Fourth, get training at the finetune stage (2016-06-24 11:45:47)...
1 out of 4712 (2016-06-24 11:45:47)
472 out of 4712 (2016-06-24 11:45:48)
944 out of 4712 (2016-06-24 11:45:48)
1416 out of 4712 (2016-06-24 11:45:48)
1888 out of 4712 (2016-06-24 11:45:49)
2360 out of 4712 (2016-06-24 11:45:49)
2832 out of 4712 (2016-06-24 11:45:49)
3304 out of 4712 (2016-06-24 11:45:50)
3776 out of 4712 (2016-06-24 11:45:50)
4248 out of 4712 (2016-06-24 11:45:50)
4712 out of 4712 (2016-06-24 11:45:51)
Next, identify the best-matching hexagon/rectangle for the input data (2016-06-24 11:45:51)...
Finally, append the response data (hits and mqe) into the sMap object (2016-06-24 11:45:51)...
Below are the summaries of the training results:
dimension of input data: 38x38
xy-dimension of map grid: xdim=13, ydim=9
grid lattice: rect
grid shape: sheet
dimension of grid coord: 117x2
initialisation method: linear
dimension of codebook matrix: 117x38
mean quantization error: 0.0621784272542971
Below are the details of trainology:
training algorithm: sequential
alpha type: invert
training neighborhood kernel: gaussian
trainlength (x input data length): 31 at rough stage; 124 at finetune stage
radius (at rough stage): from 2 to 1
radius (at finetune stage): from 1 to 1
End at 2016-06-24 11:45:51
Runtime in total is: 4 secs

visCompReorder(sMap,sReorder,title.rotate=15,colormap="darkgreen-lightgreen-lightpink-darkred")
## As you have seen, the reordered components of the trained map are displayed. Each component illustrates a sample-specific map and is placed within a two-dimensional rectangular lattice. Across components/samples, genes with similar expression patterns are mapped onto the same position of the map. Geometric locations of components delineate relationships between components/samples; that is, samples with similar expression profiles are placed closer to each other.

# (VI) Build and visualise the bootstrapped tree
D <- t(data)
rownames(D) <- paste(rownames(D), 1:nrow(D), sep=".") # temporarily make sure the row names are unique
tree_bs <- visTreeBootstrap(D, nodelabels.arg=list(cex=0.7))

Start at 2016-06-24 11:45:51
First, build the tree (using nj algorithm and euclidean distance) from input matrix (38 by 3051)...
Second, perform bootstrap analysis with 100 replicates...
Finally, visualise the bootstrapped tree...
Finish at 2016-06-24 11:45:53
Runtime in total is: 2 secs

## As you have seen, the neighbour-joining tree is constructed from the pairwise euclidean distance matrix between samples. The robustness of the tree branching is evaluated using bootstrapping. In internal nodes (also color-coded), the number represents the percentage of bootstrapped trees that support the observed internal branching. The higher the number, the more robust the tree branching. A value of 100 means that the internal branching is always observed when resampling characters/genes.
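
## The idea behind the bootstrap support values can be sketched in base R on a toy matrix. This is a simplified illustration, not what visTreeBootstrap does internally: instead of rebuilding a neighbour-joining tree, each replicate only checks whether one sample remains the nearest neighbour of another after genes are resampled with replacement. The matrix X, the sample names S1..S4 and the helper s1_s2_closest are all hypothetical.

```r
set.seed(1)

# Hypothetical toy matrix: 4 samples (rows) by 50 genes (columns)
X <- matrix(rnorm(4 * 50), nrow = 4,
            dimnames = list(paste0("S", 1:4), NULL))
X["S2", ] <- X["S1", ] + rnorm(50, sd = 0.3)   # make S1 and S2 similar

# Hypothetical helper: is S2 the nearest neighbour of S1 (euclidean distance)?
s1_s2_closest <- function(M) {
  d <- as.matrix(dist(M))   # pairwise euclidean distances between samples
  diag(d) <- Inf            # ignore the distance of a sample to itself
  names(which.min(d["S1", ])) == "S2"
}

# Resample genes (columns) with replacement and count how often
# the S1-S2 grouping is recovered
B <- 100
hits <- replicate(B, {
  cols <- sample(ncol(X), replace = TRUE)
  s1_s2_closest(X[, cols, drop = FALSE])
})
support <- 100 * mean(hits)   # percentage of replicates supporting the grouping
print(support)
```

## A support close to 100 would indicate that the grouping does not depend on any particular subset of genes.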

# (VII) Visualise the matrix using heatmap
# The samples are ordered according to the neighbour-joining tree
flag <- match(tree_bs$tip.label, rownames(D))
rownames(D) <- sub("\\.\\d+$", "", rownames(D)) # restore the original names
D <- D[flag,]

# prepare colors for the column sidebar of heatmap
# color for AML/ALL types
types <- sub("_.*","",rownames(D))
lvs <- unique(types)
lvs_color <- visColormap(colormap="darkblue-darkorange")(length(lvs))
col_types <- sapply(types, function(x) lvs_color[x==lvs])

# color for ALL subtypes
subtypes <- sub(".*_","",rownames(D))
lvs <- unique(subtypes)
lvs_color <- visColormap(colormap="gray-black")(length(lvs))
col_subtypes <- sapply(subtypes, function(x) lvs_color[x==lvs])

# combine both color vectors
ColSideColors <- cbind(col_subtypes,col_types)
colnames(ColSideColors) <- c("ALL subtypes", "AML/ALL types")

# heatmap embedded with sidebars annotating samples
visHeatmapAdv(t(D), Rowv=TRUE, Colv=FALSE, dendrogram="none", colormap="darkgreen-lightgreen-lightpink-darkred", ColSideColors=ColSideColors, ColSideHeight=0.4, ColSideLabelLocation="left", labRow=NA)
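
## The label parsing and color lookup used for the sidebars can be tried in isolation with base R. In this sketch the sample labels are hypothetical, and colorRampPalette stands in for visColormap so that the example runs without supraHex; match() is shown as an equivalent to the sapply() lookup.

```r
# Hypothetical sample labels in the "type_subtype" style
sample_names <- c("ALL_B", "ALL_T", "AML", "ALL_B")

# Strip everything from the underscore onwards to get the type
types <- sub("_.*", "", sample_names)
# Strip everything up to the underscore to get the subtype
# (labels without an underscore, such as "AML", are left unchanged)
subtypes <- sub(".*_", "", sample_names)

# One color per distinct type (stand-in for visColormap)
lvs <- unique(types)
lvs_color <- colorRampPalette(c("darkblue", "darkorange"))(length(lvs))

# Lookup as written in the demo, and an equivalent vectorised form
col_types_sapply <- sapply(types, function(x) lvs_color[x == lvs])
col_types <- lvs_color[match(types, lvs)]

print(data.frame(sample_names, types, subtypes, col_types))
```

## Both lookups give the same colors; the sapply() version additionally carries the type labels as names.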

Computational Genomics Group, Department of Computer Science, University of Bristol, UK