I have a C# program that does some analysis, prints out a full 2D matrix of distance values, and then launches a SciPy Python process (Anaconda, fwiw) to do hierarchical clustering.

Here's the problem: I need to impose some kind of ordering on the items so that they have an ordering along each axis. My code looks more or less like this:

    var set = set.OrderBy(x => x.GetHashCode())
    // save out the distance so that it can be written to a 2d matrix

Here's the trick: when I change the ordering scheme (maybe sort by size, or even use a random number generator), Python calculates different numbers of clusters for the same cutoff.

I know the distance values are the same: I print them all, ordered by size, to a single string and calculate an MD5 hash on that string, and the hash is always the same.

I know that the linkage() function in SciPy doesn't like 2D matrices (some bug I read about), but once the 2D matrix is read in, I convert it to a condensed matrix with squareform(distMatrix) as shown here: Use Distance Matrix in ()?.

Anyway, any ideas? I really have a hard time believing that SciPy has the bug, but I'm running out of options.

Then we compute the distance matrix and the linkage matrix using SciPy libraries. I strongly encourage everyone to check out the SciPy docs for pdist and linkage for details and try different hyperparameters to see what you get! In this case, we have used the default setting (Euclidean distance) for the pdist function. This is equivalent to using the $L^2$-norm as the distance metric between the points.

Since the goal of the clustering is to minimize the distance between points in the same cluster, the purpose of the linkage algorithm here is to compute the distance between clusters. You could choose, for example, the distance between the centroids of the clusters, or the distance between the closest two points in the clusters. I selected "complete", which uses the Voor Hees algorithm:

$$d(u, v) = \max_{i,j} \operatorname{dist}(u[i], v[j])$$

Which is just a fancy way of saying we compute all the distances between points in cluster $u$ and cluster $v$, then find the max distance (in other words, a point in cluster $u$, $u[i]$, that achieves that max distance from a point in cluster $v$, $v[j]$).

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import pdist

    def get_clust_graph(df, numclust, transpose=False, dataname=None, save=False, xticksize=8):
        if transpose:
            aml = df.transpose()
            xl = "x-axis"
        else:
            aml = df
            xl = "y-axis"
        data_dist = pdist(aml.transpose())  # computing the distance
        # metric is ignored here because data_dist is already a condensed distance matrix
        data_link = linkage(data_dist, metric='correlation', method='complete')  # computing the linkage
        B = dendrogram(data_link, labels=list(aml.columns), p=numclust, truncate_mode="lastp",
                       get_leaves=True, count_sort='ascending', show_contracted=True)
        #myInd =, B) if c='g']
        get_cluster_classes(B)  # helper not shown in this excerpt
        ax = plt.gca()  # assumed: the source is cut off after "plt."
        ax.tick_params(axis='x', which='major', labelsize=xticksize)
        ax.tick_params(axis='y', which='major', labelsize=15)
        plt.suptitle(xl + " clustering for " + dataname, fontweight='bold', fontsize=16)
        if save:
            ...  # the source is cut off after "plt." here
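As a quick numerical check of the complete-linkage rule described above, here is a minimal sketch. It builds two small clusters of made-up points, takes the largest pairwise Euclidean distance between them, and compares that with the height of the final merge reported by SciPy. The point arrays and the helper name `complete_linkage_distance` are illustrative assumptions, not part of the original code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist, pdist


def complete_linkage_distance(u, v):
    """Largest Euclidean distance between any point in u and any point in v."""
    return cdist(u, v).max()


# Two made-up clusters on a line (illustrative data only).
u = np.array([[0.0, 0.0], [1.0, 0.0]])
v = np.array([[4.0, 0.0], [6.0, 0.0]])

# The farthest pair is (0, 0) and (6, 0).
d_uv = complete_linkage_distance(u, v)

# Cross-check against SciPy: with these four points, complete linkage first
# merges the two tight pairs, so the height of the last merge is d(u, v).
Z = linkage(pdist(np.vstack([u, v])), method="complete")
final_merge_height = Z[-1, 2]
print(d_uv, final_merge_height)
```

Swapping `method="complete"` for `"single"` or `"centroid"` in the same sketch is an easy way to see how the choice of linkage changes the merge heights.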
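For the ordering question, one way to narrow things down is to reproduce the pipeline entirely in Python and shuffle the item order there. The sketch below is only a stand-in for the C# side: it makes up random points, builds the full square distance matrix, applies one random permutation to both axes, and runs squareform, complete linkage and fcluster on both versions. The random data, the cutoff value and the helper name `count_clusters` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def count_clusters(square_dist, cutoff):
    """Cluster a full square distance matrix and count clusters at the cutoff."""
    # squareform converts the square matrix to condensed form; checks=True
    # verifies that it is symmetric with a zero diagonal.
    condensed = squareform(square_dist, checks=True)
    Z = linkage(condensed, method="complete")
    labels = fcluster(Z, t=cutoff, criterion="distance")
    return len(np.unique(labels))


rng = np.random.default_rng(0)
points = rng.random((30, 3))          # made-up items
dist = squareform(pdist(points))      # full symmetric 2D distance matrix

perm = rng.permutation(len(points))   # a different item ordering
dist_perm = dist[np.ix_(perm, perm)]  # permute rows and columns together

cutoff = 0.5                          # hypothetical cutoff
n_original = count_clusters(dist, cutoff)
n_permuted = count_clusters(dist_perm, cutoff)
print(n_original, n_permuted)
```

When no two distances are tied, the two counts should agree. If a real pipeline disagrees, it may be worth checking that rows and columns are written in the same order and that the matrix passed to squareform is symmetric. Tied distances are one case where the merge order, and so the result, can depend on input order.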