NAmICS Newsletter #9, September 1994

The North American Chapter of the

International Chemometrics Society

NAmICS Newsletter #8, July 1994

ICS

D.L. Massart, President
Free University of Brussels
Institute of Pharmacy
Laarbeeklaan 103
B-1090 Brussels, Belgium

W. Wegscheider, Secretary
Institute for Analyt-, Micro- & Radiochemistry
Graz University of Technology
Technikerstrasse 4
A-8010 Graz, Austria
Wegscheider@rech.tu-graz.ada.at

S.D. Brown, Course Accreditation
Department of Chemistry and Biochemistry
University of Delaware
Newark, DE 19716, USA

B. Vandeginste, Chemometric Abstracts
Unilever Research Laboratory
PO Box 114
3130 AC Vlaardingen, The Netherlands

NAmICS

D.B. Dahlberg, President
Department of Chemistry
Lebanon Valley College
Annville, PA 17003

Barry M. Wise, President-Elect
4154 Laurel Drive
West Richland, WA 99352

D.M. Schnur, Secretary
Monsanto Company - U3E
800 N. Lindbergh Boulevard
St. Louis, MO 63167

Illman and Blackburn, Editors-In-Chief

Deborah L. Illman
4715 N.E. 100th, Seattle, WA 98125

Marlana B. Blackburn
Chemistry Department, Box 4076
College of St. Catherine
St. Paul MN 55105-1794

P.D. Wentzell, Treasurer - Canada
Department of Chemistry
Dalhousie University
Halifax, Nova Scotia B3H 4J3

Charles H. Lochmüller, Treasurer - USA
Department of Chemistry
Duke University
Durham, NC 27708-0346

Software Reviews

Take a spin with Pirouette

Review of Infometrix product Pirouette v 1.2

by
Marlana Blackburn

The developers of Pirouette aim high; they seek to design a powerful yet user-friendly software tool that can tackle the most frequent types of chemometric investigations. Infometrix achieves this ambitious goal by judiciously selecting techniques and then carefully implementing them. Much thought has gone into the development of this high-quality product, and it shows.

Pirouette's three modules each contain two algorithms. The exploration module implements hierarchical clustering and principal component analysis; the classification module offers K-nearest neighbors and SIMCA; the calibration module contains PLS and PCR routines. The underlying chemometric theory is solid. Necessary options are available and pertinent diagnostics are calculated. Several types of preprocessing and many different transformations are furnished. Prediction based on previously developed classification or calibration models is straightforward.

[continued next page]

______

From the Editor's Desk

It's a pleasure to bring you the eighth edition of the newsletter of the North American Chapter of the International Chemometrics Society. Dave Duewer and Dora Schnur roped me into, er, ah, asked me to guest-edit this issue. It's the one you've all been waiting for, yes, the Election Issue! There's quite a line-up of candidates waiting for you to cast your vote, so don't delay in returning your ballot (p. 20). All opinions expressed herein are solely those of contributing individuals; their institutions bear none of the blame.

Deborah Illman, Guest Editor

In this issue:

Candidate's Statements, 18  Ballot, 20  Miss Prim, 3
Seasholtz waxes philosophic, 4  Happy Birthday NAmICS, 5
Education, 8  Letters, 9  Vendor Information on List-Serve, 10
Calendar, 12  Chemometrics On-Line Conference, 13

Software Reviews: Pirouette, cont.

Pirouette's menu-driven interface is uncluttered and intuitively organized. A rudimentary spreadsheet holds the data. Generating and managing data subsets is easy and efficient. One of Pirouette's several strengths is its clever graphic environment. The screen is divided into four windows containing "objects" (i.e., either raw data or computed results) selected by the user. For example, it is possible simultaneously to view eigenvalues, scores, and loadings in a PCA exploration. Changing the format of results is as easy as clicking on a toolbar button. Scores, for instance, can be displayed as a text table, a 2D scatterplot, a lineplot, or a 3D scatterplot. Windows can be zoomed to full screen or resized. The toolbar also provides the means to spin 3D plots, magnify 2D plots, and identify points in plots.

Pirouette provides what it calls "array plots" for certain objects. Consider a SIMCA model containing three classes. The scores window will contain three miniature score plots; each can be successively zoomed to fill the quadrant by double-clicking. These miniatures can be surprisingly informative. Other miniatures (called multiplots) of raw data show up to 231 pairwise variable plots (i.e., 22 variables' worth); linear correlations are immediately visible even in the reduced form. This variety of data views and the advantageous use of color facilitate the tedious, yet necessary, process of examining a large data set.

The installation procedure is automatic. In a few cases, some customization of config.sys and autoexec.bat might be necessary, but these matters are spelled out in the manual. The program accesses at most 16 MB of memory and requires at minimum a 386 computer with 4 MB of memory and 5 MB of hard disk space, with an EGA or VGA adapter and a mouse. A math coprocessor is strongly recommended, as is more memory. I ran Pirouette on a 386sx (20 MHz, 4 MB, math coprocessor) and a 486dx (66 MHz, 16 MB); the times given below correspond to the flashier hardware unless explicitly noted. The program's worksheet can hold up to 8000 samples or variables, with the limit of the combination being determined by the available memory. Extracting 10 principal components from a 75 sample/66 variable data set took less than 15 sec.

Pirouette employs data linking in two imaginative ways. First, in SIMCA, PCR, and PLS, the number of model factors is linked to related objects (plots of modeling power, residuals, predictions, leverage, etc.). Thus, with a plot of eigenvalues vs. number of factors in one window and up to three linked objects in other windows, clicking on the desired number of factors in the eigenvalue plot triggers an immediate update of the results in the remaining windows. The other type of data linking allows a user to select a subset of samples or variables in one view of the data and see those selections highlighted in another view.

This greatly simplifies the inspection and/or deletion of outliers. They can be highlighted in the residuals plot using a rubberband box, examined in the prediction plot, and then, with a single keystroke, excluded to form a new subset. This feature, coupled with the program's computational speed, makes it realistic to investigate and compare many subsets. For example, for my 75x66 data matrix, I could delete a few variables, rerun PCA, and compare the eigenvalue plots in less than 20 sec.

Besides its own data format, Pirouette supports ASCII and WKS formats for both input and output. I had no problem getting data into the program. I occasionally exported data in WKS format, loaded the file into a spreadsheet program, sorted the data, and so forth, and then read the file back into Pirouette. Bundled with Pirouette is MasterKey, a utility which translates data files produced by a variety of commercial instruments into Pirouette, ASCII, and WKS formats. To export only certain results, the user can choose to save the contents of an active window. This is handy for transferring results to a plotting package or a report-generating program. Pirouette's flexibility in this important area of data handling is laudable.

The Pirouette manual is outstanding. Besides THOROUGHLY documenting every feature of the program and presenting tutorials on each technique, it also discusses the theory behind the methods and algorithms in a readable and informative style. The well-organized text includes many explanatory figures, a detailed index, and several excellent appendices.

The one area where I am critical of Pirouette is its printing capabilities. For the record, I have used (or tried to use) the following printers: HP PaintJet (color), HP LaserJet II, Apple LaserWriter IINT, and two other PostScript printers whose names appear in Pirouette's Printer Setup menu. The LaserJet II performed properly. The PaintJet was occasionally flaky. I never succeeded in getting output from any PostScript device. All printing takes place in the foreground. Given the complexity of the images being printed, it is not surprising that the process is rather slow on aging hardware; it took my 386sx about 2 min/page for the LaserJet II. (I can't give a time for the 486dx because only PostScript printers were available on that system.)

It is also distressing that offline devices are not recognized as such. For those working in a PostScript environment, the otherwise fine performance of the software is compromised by printing difficulties. The program offers the option of printing to a file or saving TIFF images, which is a fast way around printing out of Pirouette. Writing a .tif file took less than 15 sec on the 486. Perhaps Windows users imagine saving TIFFs, switching into a Windows program that recognizes the format, printing via the Print Manager, and switching back to Pirouette. Good idea, except for the switching part. Pirouette CAN run inside Windows (but only in standard mode, and some applications don't like this), but it cannot task switch. Infometrix is aware of the PostScript printing problems and is taking steps to address them. A Windows version of the program is due by the end of the year. Its release should make both the print speed and printer driver issues moot. I found the staff at Infometrix EXTREMELY responsive and knowledgeable. My phone calls and email messages were dealt with in a timely and professional fashion.

Overall, Pirouette is a very impressive product. It IS expensive (list $4000 with a 40% discount for academic users) but let's face it, good tools are never cheap! A free, almost fully functional demo is available so interested parties can investigate the program for themselves. Since Infometrix will customize a demo containing your own data, this is a no-risk way to get a real feel for Pirouette and see how it works with your system.

For further information:
Infometrix, Inc.
2200 Sixth Avenue, Suite 833
Seattle, WA 98121
phone: 206-441-4696
fax: 206-441-0841
internet:

______

Ask Miss Prim

Dear Miss Prim,

My name is Eddie an my nayburhood is gonna becum a enterprise zone. I wanna start a kemmometrics kumpany wit my pals on da street. What shud we do?

Signed,

Eddie an da Latin Squares

Gentle (sic) Reader:

Perhaps you should look into a more lucrative business like statistical consulting. There are already gangs of unemployed chemometricians roaming the country looking for jobs. In fact, there is an international crisis, with svante or eighty such gangs throughout the world. These people are mean (not average), sum are squared, and many are in analysis. Do a target transformation on your goals.

[Questions for Miss Prim (Clare Gerlach) may be sent in care of the Editor-in-Chief.]

A rose by any other name…

Mary Beth Seasholtz, (517) 636-3646

The field of chemometrics is fortunate enough to have progressed to the point where there are multiple generations of ideas. As a graduate student in 1989 eager to learn the tools of the trade, I was confronted with two ‘generations’ of equations describing PCR (and who knows how many for PLS, but that is another story!). This newsletter seemed a good place to present a few lines which demonstrate the equivalence of the two approaches, and to touch on the historical context which led to the move. Please forgive the omission of the multitude of appropriate references; Deborah only gave me one page.

The older of the two approaches begins by assuming $R = TP^T$, where $T$ has orthogonal columns and $P$ has orthonormal columns. In the mid-1960s, when chemometrics was born, there were a few methods available for calculating $T$ and $P$. One was the not-so-well-behaved NIPALS algorithm. Alternatively, $T$ could be obtained by solving for the eigenvalues and eigenvectors of $RR^T$ (a square symmetric matrix), and then $P$ could be estimated given $R$ and $T$. The symmetric eigenvalue problem was studied for many years; the most famous book on the subject was published in 1965 by J.H. Wilkinson (The Algebraic Eigenvalue Problem). However, computational algorithms were not in high demand, as computers of the 60s certainly were not what they are today. The calibration problem then was $c = \tilde{T}x$ (~ indicates truncation). Solving for $x$ via the normal equations gives $x = (\tilde{T}^T\tilde{T})^{-1}\tilde{T}^T c$. For prediction, the unknown measurement must first be converted to scores by $t_{unk}^T = r_{unk}^T\tilde{P}$, giving

$$\hat{c}_{unk} = r_{unk}^T \tilde{P} (\tilde{T}^T\tilde{T})^{-1} \tilde{T}^T c \qquad (1)$$

In 1969 Gene Golub made the singular value decomposition (svd) an algorithmic reality. It was long known that an arbitrary matrix could be written as the product of three matrices, $R = USV^T$, where $U$ = eigenvectors of $RR^T$, $V$ = eigenvectors of $R^TR$, and $S$ = square roots of the eigenvalues of $R^TR$ and $RR^T$ (the nonzero eigenvalues are the same). But until Gene and his coworkers came on the scene there was no direct calculation (you had to go through the covariance matrices as described above). With the widespread availability of the svd (and other useful code) through facilities like LINPACK, EISPACK, Numerical Recipes and Matlab, the PCR story has since evolved and a new word is being used: pseudoinverse. The calibration equation now reads $c = Rb$, and $b = R^+ c$, where $R^+ = \tilde{V}\tilde{S}^{-1}\tilde{U}^T$ is the pseudoinverse. Prediction is simply

$$\hat{c}_{unk} = r_{unk}^T R^+ c = r_{unk}^T \tilde{V}\tilde{S}^{-1}\tilde{U}^T c \qquad (2)$$

Equation (2) sure looks different from (1)! Well, recall that $T$ could be calculated from an eigenvector problem of $RR^T$ … in fact $T = US$ and $P = V$. Making these substitutions into (1) gives $\hat{c}_{unk} = r_{unk}^T \tilde{V} (\tilde{S}\tilde{U}^T\tilde{U}\tilde{S})^{-1} \tilde{S}\tilde{U}^T c$, which is equation (2) after reduction using standard linear algebra rules (since $\tilde{U}^T\tilde{U} = I$, the inverse is simply $\tilde{S}^{-2}$).
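
For readers who would rather check the equivalence numerically than on paper, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not code from either generation of PCR software: the data are random, no mean-centering is applied, and the function names, variable names, and the choice of k = 3 retained factors are all hypothetical. One function predicts through scores and loadings (T = US, P = V, then the normal equations); the other predicts through the truncated pseudoinverse.

```python
import numpy as np

def pcr_scores_route(R, c, r_unk, k):
    """Older formulation: regress c on the truncated scores T, then project r_unk onto P."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    T = U[:, :k] * s[:k]          # truncated scores, T = U S
    P = Vt[:k].T                  # truncated loadings, P = V
    x = np.linalg.solve(T.T @ T, T.T @ c)   # normal equations for c = T x
    t_unk = r_unk @ P             # scores of the unknown sample
    return t_unk @ x

def pcr_pinv_route(R, c, r_unk, k):
    """Newer formulation: b = R+ c, with R+ built from the truncated SVD."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    R_pinv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
    b = R_pinv @ c
    return r_unk @ b

# Illustrative random data: 20 calibration samples, 8 variables (assumed sizes).
rng = np.random.default_rng(0)
R = rng.normal(size=(20, 8))
c = rng.normal(size=20)
r_unk = rng.normal(size=8)
k = 3

c_hat_old = pcr_scores_route(R, c, r_unk, k)
c_hat_new = pcr_pinv_route(R, c, r_unk, k)
print(np.isclose(c_hat_old, c_hat_new))
```

The two predictions should agree to within floating-point round-off for any k that keeps only nonzero singular values, which is the same statement the algebra makes.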

As can be seen, it was some relatively new technology in the area of numerical linear algebra that gave rise to the new look for PCR. In addition to all the other things that keep us busy, I think we must be as diligent as we can in continuing to bring into chemometrics new developments from disciplines such as applied mathematics, statistics and numerical analysis.

See what too much tequila can do?

by
Bruce R. Kowalski
Endowed Professor of Chemistry
University of Washington

June 10, 1994

Happy 20th Birthday to the Chemometrics Society! It was on June 10, 1974 that the Laboratory for Chemometrics met with Svante Wold in Seattle over some great Mexican food and too much tequila and formed the Society. Our focus was on improving communication between chemists, statisticians and mathematicians. We also wanted all chemists to be about 10% chemometricians to ensure that experiments would be designed optimally and all information would be extracted from chemical measurements. Well, the Mexican restaurant is no longer in business, but the Society and the field of chemometrics are alive and doing very well.

I was reading Chemometrics Society Newsletter Number 1 (I have a complete set) published in January, 1976, and it reported 101 members worldwide with half of them owning the program ARTHUR which some of you may remember. The newsletter announced that the second FACSS meeting had an attendance of 200 at the "Chemometrics in Analytical Chemistry" session with papers from Wold, Deming, Horlick, Duewer and Kowalski. Also announced was the "Chemometrics: Theory and Applications" session at the Summer 1976 ACS meeting in San Francisco that later produced the first book on chemometrics. The newsletter ended with comments, requests and suggestions from Richard Cramer, Ken Loach and Harold Martens. So much for ancient history.

Two journals, thousands of papers and reviews and dozens of books later we find ourselves today with rich areas of application, powerful chemometrics tools and essentially infinite computer power. We are very busy scientists. Also, scientists, statisticians, and mathematicians and even chemical engineers have discovered chemometrics and the race is on. What will our science be like in the next century, the year 2000?

Allow me to make a few predictions. You can use leave-one-out cross-validation to estimate the PRESS, SEP or RHSCV if you choose. First and foremost we should all see the necessity to have chemometrics permeate the formal education of all chemists, not only with graduate level courses, training courses and workshops but also at the beginning levels of chemical education. The old "scientific method" that relies on a lot of theory and few definite measurements must die. It should be replaced with equal amounts of theory, experimentation, measurements and simulation and emphasize the multivariate nature of the world around us. There is no place for univariate thinking in our multivariate, dynamic world.

Next, chemometrics will no longer be just a collection of our data analysis methods. The tools of chemometrics will spawn new measurement theories that will guide chemists in all areas of research. To this end a young chemometrician, Karl Booksh, and I offer a special report in the August issue of ANALYTICAL CHEMISTRY titled "Theory of Analytical Chemistry." I invite you to read this paper and incorporate it in your research and education activities. I also encourage you to expand this theory and move it into areas of chemistry beyond chemical analysis.

Finally, the tools of chemometrics will move from mathematics and software to firmware so as to be transparent to the user and very easy to use and hard to abuse. Our current software, while a great improvement over ARTHUR and SIMCA, is still too difficult to use. The younger generation of chemists have good backgrounds in linear algebra and have little difficulty with multivariate methods. However, the older generation that will be with us into the next century doesn’t understand our methods and therefore prefers to separate one peak from all the rest or correlate one molecular property at a time to molecular activity, thereby missing the most important part of nature, covariance. We must accept the responsibility to make it easy for all chemists to incorporate multivariate methods into their work. Use this as an analogy. We are all expert users of TVs, VCRs, cellular phones and the like, but how many of us are truly familiar with the complex subsystems of these devices? To really be of use, our methods must be integrated into instruments and experiments to the point of being transparent.