Ideas not planned in Unity.

From before, more suited for R&D.

1. Continue Gene Ontology evaluation and visualization started by Hong Li.

This will link functional genomics to expression analysis in a visual and analytical manner (cluster comparison, etc.), but it should be done in a way that also serves general “metadata” mining, such as adding descriptions or hierarchical data. Additionally, we want to integrate gene pathway information where it is available.

There are at least 3 levels of knowledge we want integrated into our expression analysis.

MeSH (literature hits)

Gene Ontology (function descriptions, chemical etc.)

Pathway information.

2. N-fold validation of a RadViz classifier

We need a mechanism such as re-laying out RadViz N times (N-fold), defining a classification region automatically (or guided by the operator), and then placing the N percent of points left out to see where they are classified. Since RadViz is a visual classifier, we need a way to estimate its accuracy, as for any standard classifier.
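The hold-out procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions: the "classification region" is approximated by the nearest class centroid in the RadViz plane, the anchor order is fixed rather than re-optimized per fold, and all function names are hypothetical. Values are assumed to be scaled to [0, 1].

```python
import math
import random

def circle_anchors(n_dims):
    """Place one anchor per dimension, evenly spaced on the unit circle."""
    return [(math.cos(2 * math.pi * j / n_dims), math.sin(2 * math.pi * j / n_dims))
            for j in range(n_dims)]

def radviz_project(row, anchors):
    """Standard RadViz: spring-weighted average of the anchor positions."""
    total = sum(row)
    if total == 0:
        return (0.0, 0.0)
    x = sum(v * a[0] for v, a in zip(row, anchors)) / total
    y = sum(v * a[1] for v, a in zip(row, anchors)) / total
    return (x, y)

def nfold_radviz_accuracy(rows, labels, n_folds=5, seed=0):
    """Hold out each fold in turn and classify it in the layout of the rest.

    Assumption: a point is assigned to the class whose training centroid
    is nearest in the RadViz plane (a stand-in for a drawn region).
    """
    anchors = circle_anchors(len(rows[0]))
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        sums = {}  # class -> (sum x, sum y, count) over training points
        for i in order:
            if i in held:
                continue
            px, py = radviz_project(rows[i], anchors)
            sx, sy, n = sums.get(labels[i], (0.0, 0.0, 0))
            sums[labels[i]] = (sx + px, sy + py, n + 1)
        centroids = {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}
        for i in held_out:
            p = radviz_project(rows[i], anchors)
            predicted = min(centroids, key=lambda c: math.dist(p, centroids[c]))
            if predicted == labels[i]:
                correct += 1
    return correct / len(rows)
```

A real run would presumably redo the class layout on each training fold and could replace the centroid rule with operator-drawn regions.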

3. Map RadViz layout to all Classifier outputs

As discussed several times, most classifier outputs can be made "gradual": one can make a classifying decision, but it is also possible to find out how "close" a point is to the other classes. This information can be displayed nicely in the RadViz "pie wedge" classification paradigm.
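One possible reading of this mapping, as a minimal sketch: give each class one anchor at the center of its wedge and place each point at the score-weighted average of those anchors. The function names and the choice of one anchor per class are assumptions for illustration; the scores are taken to be non-negative "closeness" values from any classifier.

```python
import math

def class_anchors(class_names):
    """One anchor per class, evenly spaced on the unit circle (assumed wedge centers)."""
    n = len(class_names)
    return {c: (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k, c in enumerate(class_names)}

def place_by_scores(scores, anchors):
    """scores: dict class -> non-negative closeness score for one point."""
    total = sum(scores.values())
    if total == 0:
        return (0.0, 0.0)
    x = sum(s * anchors[c][0] for c, s in scores.items()) / total
    y = sum(s * anchors[c][1] for c, s in scores.items()) / total
    return (x, y)
```

A point that clearly belongs to one class lands on that class's anchor, while an ambiguous point drifts toward the center.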

4.  CortViz: a method for showing “correlation signatures” of genes within experiments. This should be investigated and studied further. CortViz is essentially the Pearson correlation matrix of all genes against all genes, arranged so that genes that correlate highly with one another are sorted together.
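A small sketch of the idea, assuming each gene is a list of expression values across the experiment. The greedy ordering used here (always append the unplaced gene most correlated with the last one placed) is an assumption chosen for brevity; the memo does not say how the sorting is done, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def correlation_matrix(profiles):
    """All-genes by all-genes Pearson matrix."""
    n = len(profiles)
    return [[pearson(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]

def greedy_order(corr):
    """Chain genes so that each one follows its most correlated unplaced neighbour."""
    order = [0]
    remaining = set(range(1, len(corr)))
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: corr[last][j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def reorder(corr, order):
    """Apply the same permutation to rows and columns."""
    return [[corr[i][j] for j in order] for i in order]
```

The reordered matrix could then be shown as a heat map, with blocks of highly correlated genes appearing along the diagonal.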

5. Enhance RadViz class layout by FuzzViz

FuzzViz draws lines perpendicular to the perimeter representing the t-statistic, significance, etc. This gives a visual representation of the layout statistic; when a class layout is done, each class pie wedge will have perpendicular radial lines sorted from high to low.

6.  Enhance RadViz class layout by t-statistic. That is, the RadViz weights should be modified by the calculated layout statistic, possibly enhancing the class separation.
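As a rough sketch of what such weighting might look like: compute a per-dimension t-statistic for one class against the rest and multiply each spring force by it. The use of Welch's form of the t-statistic, its absolute value as the weight, and the function names are all assumptions, not a description of the existing RadViz code.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples (0.0 if both have zero variance)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if se == 0 else (mean(a) - mean(b)) / se

def t_weights(rows, labels, target):
    """Per-dimension |t| for the target class versus all other points."""
    inside = [r for r, lab in zip(rows, labels) if lab == target]
    outside = [r for r, lab in zip(rows, labels) if lab != target]
    return [abs(welch_t([r[j] for r in inside], [r[j] for r in outside]))
            for j in range(len(rows[0]))]

def weighted_radviz(row, anchors, weights):
    """RadViz projection with each spring force scaled by its dimension weight."""
    pulls = [w * v for w, v in zip(weights, row)]
    total = sum(pulls)
    if total == 0:
        return (0.0, 0.0)
    x = sum(f * a[0] for f, a in zip(pulls, anchors)) / total
    y = sum(f * a[1] for f, a in zip(pulls, anchors)) / total
    return (x, y)
```

Dimensions that do not separate the class get a weight near zero and stop pulling points toward their anchors, which is one way the class separation could be sharpened.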

7. Principal Components RadViz clustering (a new idea mapping PC coefficients to RadViz weights; see other document for details)

One of the problems with PCA is that the coefficients of the “real” dimensions making up the PCs are not usually shown, which makes it somewhat of a “black box” clusterer.

One could do a PCA and then use the coefficients of each PC for each “real” dimension as weights in a RadViz layout. The coefficient weights would be sorted high to low, with negative coefficients placed on the opposite side, or with RadViz enhanced to use negative weights.

Thus if one wanted to “see” 3 PCs, one would have 3 groups of all the real dimensions around the RadViz layout, with spring forces equal to the PC coefficients. This should give clustering similar to a 3D PCA analysis, but with the dimensions laid out and shown in order of “importance”. This should help in understanding PC clustering.

PC layout example: dimensions would be duplicated for each PC and ordered according to the size of their coefficients. Weights would be equal to the coefficients.
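The layout described above could be sketched as follows, assuming the PC coefficients (loadings) have already been computed elsewhere. The sketch takes the "negative weights" option rather than moving negative coefficients to the opposite side, and treats a negative weight as a push away from the anchor; that rule and all names here are assumptions for illustration.

```python
import math

def pc_radviz_layout(loadings, dim_names):
    """Build one anchor per (PC, dimension) pair.

    loadings[p][j] is the coefficient of dimension j in principal component p.
    Within each PC group the dimensions are sorted from high to low coefficient.
    """
    slots = []
    for p, coeffs in enumerate(loadings):
        ranked = sorted(zip(dim_names, coeffs), key=lambda pair: pair[1], reverse=True)
        slots.extend((p, name, coeff) for name, coeff in ranked)
    layout = []
    for k, (p, name, coeff) in enumerate(slots):
        angle = 2 * math.pi * k / len(slots)
        layout.append({"pc": p, "dim": name, "weight": coeff,
                       "anchor": (math.cos(angle), math.sin(angle))})
    return layout

def project(row, layout):
    """row: dict dimension name -> value scaled to [0, 1].

    Assumption: spring force = coefficient * value, and a negative force
    pushes the point away from that anchor instead of pulling it.
    """
    pulls = [(a["weight"] * row[a["dim"]], a["anchor"]) for a in layout]
    total = sum(abs(f) for f, _ in pulls)
    if total == 0:
        return (0.0, 0.0)
    x = sum(f * ax for f, (ax, _) in pulls) / total
    y = sum(f * ay for f, (_, ay) in pulls) / total
    return (x, y)
```

With 3 PCs and D real dimensions this yields 3 x D anchors, so each dimension appears once per PC group.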

8. Dataset chunking or multi-way formatting. Pivoting and flattening are subsets of this general idea. All the grouping operations we did for Hypnion were subsets of this data chunking. Transforming “heterogeneous” data in this manner can greatly enhance the power of PatchGrid and RadViz. Essentially this is reorganizing a bunch of tables into a flat file. This is probably best done by designing a database with the “clean” tables and then re-querying the data into different views for analysis. EZ can help us with this idea.

9. More and better Statistics (see last Micro Array Stats proposal)

10. Gene Correlation Searching/Analysis: based on our three correlation methods (Pearson, Jackknife and Cosine), we could build a database of gene correlations over different experiments. For example, in one experiment we looked at the two genes U11863(1713) and U39400(2011) and found a very high negative correlation (-0.976). A database query system could be built for customers, with different correlation values for different experiments.
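A minimal sketch of such a database, using an in-memory SQLite table. The table layout, function names, and gene identifiers in any usage are hypothetical. The memo does not define its jackknife correlation; here it is assumed to be the smallest-magnitude Pearson value obtained when each single observation is left out in turn.

```python
import math
import sqlite3

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def jackknife(a, b):
    """Assumed definition: most conservative leave-one-out Pearson value."""
    candidates = [pearson(a, b)]
    for k in range(len(a)):
        candidates.append(pearson(a[:k] + a[k + 1:], b[:k] + b[k + 1:]))
    return min(candidates, key=abs)

def build_db(experiments):
    """experiments: dict experiment name -> dict gene id -> list of expression values."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE gene_corr (experiment TEXT, gene_a TEXT, gene_b TEXT, "
               "pearson REAL, jackknife REAL, cosine REAL)")
    for name, genes in experiments.items():
        ids = sorted(genes)
        for i, ga in enumerate(ids):
            for gb in ids[i + 1:]:
                a, b = list(genes[ga]), list(genes[gb])
                db.execute("INSERT INTO gene_corr VALUES (?, ?, ?, ?, ?, ?)",
                           (name, ga, gb, pearson(a, b), jackknife(a, b), cosine(a, b)))
    return db

def strongly_anticorrelated(db, threshold=-0.9):
    """Return (experiment, gene_a, gene_b, pearson) rows at or below the threshold."""
    return db.execute("SELECT experiment, gene_a, gene_b, pearson FROM gene_corr "
                      "WHERE pearson <= ? ORDER BY pearson", (threshold,)).fetchall()
```

A query like `strongly_anticorrelated(db)` would surface pairs such as the strongly negative one mentioned above, per experiment.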

11. Relevance Network
Show gene-to-gene correlations across an experiment. The correlation of all genes to all genes, thresholded in PatchGrid, would show this as a Gene Correlation Signature (see CortViz). An alternative would be a relevance network overlaid in RadViz. A nice option when we are displaying, say, all 7000 genes in RadViz would be to connect every pair of genes with a correlation greater than, say, 0.9. This would essentially be a relevance network. Computing all 7k correlations is time consuming (but we have done it), so we could limit it to the top significant genes (or the Purs genes). This would be a pretty easy enhancement to our RadViz layout. One could also show functional similarity in this same relevance RadViz network. The functional similarity could be based on Gene Ontology; for this we would need the link to that database to show the connections. It seems that Xpogen is using distance metrics such as correlation or Euclidean distance for relevance. Duplicating that will be fairly easy.
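The thresholding step could be sketched like this, assuming each gene has an expression profile and that edges are drawn only for correlations above the cutoff. The function names are hypothetical; the resulting edge list would then be drawn as lines between the genes' existing RadViz positions.

```python
import math
from itertools import combinations

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)

def relevance_edges(profiles, threshold=0.9):
    """profiles: dict gene id -> list of expression values.

    Returns (gene_a, gene_b, correlation) for every pair above the threshold.
    This is quadratic in the number of genes, so in practice the input
    would likely be limited to a pre-selected subset of significant genes.
    """
    edges = []
    for ga, gb in combinations(sorted(profiles), 2):
        r = pearson(profiles[ga], profiles[gb])
        if r > threshold:
            edges.append((ga, gb, r))
    return edges
```

Swapping `pearson` for another similarity or distance function, for example one based on Gene Ontology, would give the functional-similarity variant.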

12. Bi-clustering Relevance Network (Phil's idea)
Cluster genes and patients separately and connect highly over-expressed and under-expressed genes to the appropriate patient clusters. This may help in pathway mapping. This is related to network models for genetic pathways and was touched on by Shamir and Thorsson.

..peh 3-01-02