September 30, 2010

Cluster Analysis by B. Everitt et al.

Date: 2010/09/29

One of the most useful things that humans can do when making decisions about life or markets is to form objects into groups. The caveman must have had to decide which animals were dangerous and which foods were edible, and the investor must decide between value stocks and fundamental stocks, between utilities and industrials, the canals versus the railroads, the nifty fifty versus the old favorites, the stocks that the old sage buys versus the ones that the Druck likes to buy after attending a technology conference, the stocks touted by the wildman, those ranked in group 1 of Value Line, the stocks that do well when bonds are up, and so on.

Methods for grouping data are given in the excellent book Cluster Analysis by Brian Everitt et al., of which I have the fourth edition from 2000. It's the very model of a modern book, with a list of software to do all these things for free, enhanced graphs showing what the programs do, and examples from fields ranging from biology, genetics, political science, taxonomy, and astronomy to psychology.

It's a lot easier to visualize a cluster than to find one. The basic idea is to find groups of objects that are fairly homogeneous within themselves but disparate from one another. That's a lot easier said than done.

First you have to measure the distance between the objects. The usual measures are squared Euclidean distance, city-block distance, the generalized Minkowski distance, which is like the usual geometric distance but with the differences raised to the third or fourth power instead of the second, or the Pearson correlation between the values of the variables for just two objects.
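To make the candidates concrete, here is a minimal sketch in Python of the four measures just mentioned; the function names and the pure-standard-library style are my own choices, not the book's.

```python
import math

def squared_distance(x, y):
    """Squared Euclidean distance between two observation vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p=3):
    """Generalized Minkowski distance; p=2 recovers the ordinary
    geometric (Euclidean) distance, while p=3 or 4 weights large
    gaps on any one variable more heavily."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def pearson(x, y):
    """Pearson correlation between the variable values of two objects.
    (Undefined when either vector is constant; a production routine
    would guard against that.)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```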

Then you have to form the groups. The usual method is to start with the nearest two objects and build up from there; that's called an agglomerative method. Compare this with the divisive methods, which start with everything in one large group and then successively split off the objects that are furthest away. One of the thorniest problems is how to handle more than one variable; methods based on scaling by the range between the extremes, and then working with the standardized values, are recommended.
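As an illustration of the agglomerative idea, here is a naive bottom-up routine; I've assumed single linkage (the nearest pair of members defines the distance between clusters), whereas the book treats several linkage rules.

```python
def agglomerate(points, dist, k):
    """Naive agglomerative clustering: start with every object in its
    own cluster, then repeatedly merge the two clusters whose nearest
    members are closest (single linkage) until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Tiny example: two obvious groups in the plane.
pts = [(0, 0), (0, 1), (5, 5), (6, 5)]
d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
print(agglomerate(pts, d2, 2))
# -> [[(0, 0), (0, 1)], [(5, 5), (6, 5)]]
```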

A section that shows how to fit the data with kernel density estimates, using a kernel function for each observation based on a rectangular, triangular, or Gaussian distribution, is particularly helpful. Also interesting are the preliminary methods for discovering groups using one-dimensional and two-dimensional histograms. There is also a nice section on factor analysis and principal components analysis, along with a related method I've never come across in all my years of reading statistics books, called projection pursuit, recommended as a way to reduce the number of variables to a manageable and uncorrelated set.
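For the kernel idea, a minimal one-dimensional sketch, assuming the standard kernel-density formula in which each observation contributes a bump of shape K and width h and the bumps are averaged:

```python
import math

def kernel_density(data, x, h, kernel="gaussian"):
    """Kernel density estimate at point x with bandwidth h:
    f(x) = (1 / (n*h)) * sum over observations xi of K((x - xi) / h)."""
    def K(u):
        if kernel == "rectangular":
            return 0.5 if abs(u) < 1 else 0.0
        if kernel == "triangular":
            return 1 - abs(u) if abs(u) < 1 else 0.0
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)  # Gaussian

    return sum(K((x - xi) / h) for xi in data) / (len(data) * h)

# A bimodal sample should show two humps and a valley between them.
sample = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
for x in (1.0, 3.0, 5.0):
    print(x, round(kernel_density(sample, x, h=0.5), 3))
```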

Indeed the book is filled with everything you could ever want to know about grouping data, including multidimensional scaling, similarity measures, weighting techniques, standardization procedures, missing-value treatments, and mixture models.

Everything is there in a fairly accessible form, except how all these methods relate to the current worthless, non-predictive fad of artificial intelligence, namely neural networks and their extensions. Doubtless the current work in the field has shown how these converge to the usual clustering methods, based on such things as clustering with constraints or fuzzy clustering.

Much of the work in clustering comes from such institutes as the Rotterdam Institute of Agriculture and the Institute of Psychiatry at King's College London, so it's not surprising that no examples are given from our own field, where grouping is so helpful and necessary. How often do we look at a scatter diagram of two variables and note that there are two modes in the data, or that two regression lines would fit the data much better than one? If only we knew which of the two groups the various observations belonged to. And if only we knew how to scale such things as currencies and gold to each other in considering the similarities.

Let's take our own humble attempts to group clusters, where we put the four comoves and countermoves of bonds and stocks into four colors: yellow for stocks up and bonds down, blue for stocks down and bonds up, green for both up, and red for both down. We hadn't seen any reds recently until today. And the colors themselves are just one of the many ways of handling binary splits.
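In code, the color scheme is just a two-way sign split. A sketch follows; the convention for exactly-zero changes is my own assumption, since the scheme above doesn't say how ties are colored.

```python
def color(stock_change, bond_change):
    """Four-color code for the day's co-move of stocks and bonds."""
    if stock_change > 0 and bond_change < 0:
        return "yellow"  # stocks up, bonds down
    if stock_change < 0 and bond_change > 0:
        return "blue"    # stocks down, bonds up
    if stock_change > 0 and bond_change > 0:
        return "green"   # both up
    if stock_change < 0 and bond_change < 0:
        return "red"     # both down
    return "unchanged"   # assumed convention: a zero change gets no color
```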

Here are some real-world data to practice clustering on.


Date     Stocks   Bonds
Sep 28     4.0     0.28
Sep 27    -5.5     1.2
Sep 24    22.8    -1.0
Sep 23    -9.4     0.01
Sep 22    -4.9     0.2
Sep 21    -1.9     1.1
Sep 20    16.0     0.17

The rest of the data for the last nine months is on our site. It's a good exercise to form groups from such data and maybe even to come up with something useful from it.
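As a starting point for that exercise, here is the color coding applied to the seven days above; the color function repeats the earlier sketch, with ties again handled by my own assumed convention.

```python
def color(s, b):
    if s > 0 and b < 0: return "yellow"
    if s < 0 and b > 0: return "blue"
    if s > 0 and b > 0: return "green"
    if s < 0 and b < 0: return "red"
    return "unchanged"

data = [("Sep 28", 4.0, 0.28), ("Sep 27", -5.5, 1.2),
        ("Sep 24", 22.8, -1.0), ("Sep 23", -9.4, 0.01),
        ("Sep 22", -4.9, 0.2), ("Sep 21", -1.9, 1.1),
        ("Sep 20", 16.0, 0.17)]
for day, s, b in data:
    print(f"{day}: {color(s, b)}")
```

On these seven days the coding yields two greens, four blues, one yellow, and no reds, in line with the remark above that no red had appeared until the day of writing.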
