AnVil Informatics – Application Systems – 6-14-01

This document briefly outlines specific applications that could be implemented today using techniques developed with AnVil technologies. These applications target specific problems in bioinformatics and cheminformatics.

For the most part, the techniques have been tried or implemented using various tools, including AnVil’s Java platform, the new DS, and other commercial or public-domain packages.

Outline of Specific Application Systems

1.  Specific Applied Classifiers - AnVil can evaluate hundreds of possible models or classifiers and produce the best one for a given dataset. For example, we can generate most state-of-the-art classifiers, such as Support Vector Machines, Random Forests, or standard back-propagation Neural Networks. The key point is that “data enhancement” is usually necessary to get the best possible classifier. In many cases a standard back-propagation NN may be the best choice; unfortunately, in other cases it may be impossible to reach the desired accuracy.

2.  A RadViz Gradual Visual Classifier – Training, validation, and testing on any kind of data can be made much more “user friendly” with interaction and the “graduated classification” RadViz system.

3.  Clustering and Specific Cluster Comparison – Which clustering technique best clusters your data, for example against some known functional description? AnVil’s proprietary cluster comparison methods will tell you which is best. A myriad of clustering techniques can be applied, such as Kohonen maps, K-means, hierarchical clustering, PCA, and variations.

4.  Microarray Gene Expression Disease Classifier – Filtering, selection, and multi-disease classification. This system can find the optimum gene set to diagnose two or more disease states using various gene selection mechanisms, such as Absent-Present user selection (dendrogrammatic?), RadViz differential selection, PURS (Principle Uncorrelated Record Selection), and 3-way differential clustering.

5.  Quality Control MetaData Analysis System – Quality control data is analyzed and visualized using full statistical analysis, multi-correlation analysis, and multi-class analysis to give a high-level overview of your data.

6.  Data Cleaning, Missing-Value Imputation, Extraction, and Organizing System - This system takes a customer’s unformatted, dirty, and unorganized data and produces a coherent database that can be easily data mined.

7.  Partial Subset Visual Classifier – Where classification accuracy is not high enough, or a “black box” classifier is not satisfactory, this system uses “association rules” and RadViz to find “subset” areas of a dataset that can be classified with high accuracy.

8.  Gene Correlation Searching/Analysis – Based on our three correlation methods (Pearson, Jackknife, and Cosine), we could build a database of gene correlations across different experiments. For example, in one experiment we looked at the two genes U11863 (1713) and U39400 (2011) and found a very high negative correlation (-0.976).

A database query system could be built so that customers can look up correlation values across different experiments.
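
As a rough illustration of item 8, the sketch below computes the three correlation measures for every gene pair in each experiment and supports a simple range query. It is not AnVil's implementation. The function names, the table layout, and the gene and experiment names are all hypothetical, and the expression values are made-up toy numbers. "Jackknife" is assumed here to mean the leave-one-out Pearson correlation of smallest magnitude, which is one conservative reading; this memo does not define it. An undefined correlation (zero variance) is treated as 0.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0  # assumption: undefined -> 0

def cosine(x, y):
    """Cosine similarity (like Pearson, but without mean-centering)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / denom if denom else 0.0  # assumption: undefined -> 0

def jackknife(x, y):
    """Assumed definition: drop each sample in turn, recompute Pearson,
    and keep the value closest to zero (least flattering to the pair)."""
    loo = [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]
    return min(loo, key=abs)

def correlation_table(experiments):
    """Map (experiment, gene1, gene2) to all three correlation values."""
    table = {}
    for exp, profiles in experiments.items():
        for g1, g2 in combinations(sorted(profiles), 2):
            x, y = profiles[g1], profiles[g2]
            table[(exp, g1, g2)] = {
                "pearson": pearson(x, y),
                "jackknife": jackknife(x, y),
                "cosine": cosine(x, y),
            }
    return table

def query(table, method, lo, hi):
    """Return the gene pairs whose chosen correlation lies in [lo, hi]."""
    return [(key, row[method]) for key, row in table.items()
            if lo <= row[method] <= hi]

# Toy data only: hypothetical genes in one hypothetical experiment.
experiments = {
    "exp1": {
        "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
        "geneB": [9.8, 8.1, 6.2, 3.9, 2.0],
        "geneC": [2.0, 1.0, 4.0, 3.0, 5.0],
    }
}

table = correlation_table(experiments)
for key, value in query(table, "pearson", -1.0, -0.9):
    print(key, round(value, 3))
```

A production version would presumably keep the table in a real database and index it by experiment and gene, but the query shape (pick a method, pick a correlation range) would stay the same.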

Most of the above “systems” are “virtual,” meaning that they can currently be carried out only with a lot of “manual” work, but some or all of them could be built (to varying degrees) into a “turnkey” system.

How much software development is necessary to implement these systems? For each system, it would take at least a week just to come up with a realistic estimate of the effort involved.

Patrick Hoffman