DBSCAN returns partial clusters

python,cluster-analysis,dbscan
I am trying to implement the code for DBSCAN here: http://en.wikipedia.org/wiki/DBSCAN The portion I am confused about is expandCluster(P, NeighborPts, C, eps, MinPts) add P to cluster C for each point P' in NeighborPts if P' is not visited mark P' as visited NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') >=...
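For reference, the cluster-expansion step from that pseudocode can be sketched in plain Python as below. This is a minimal illustration of the standard algorithm, not the asker's code: the names `region_query` and `dbscan`, the brute-force neighbor search, and the use of Euclidean distance are all assumptions made for the example. The point it tries to show is that the neighbor list has to keep growing while it is being walked; if it does not, clusters can come out partial.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN sketch; returns one label per point (NOISE or 0, 1, ...)."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)  # grows while we iterate over it
        k = 0
        while k < len(queue):
            j = queue[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: expand through it
    return labels
```

Calling `dbscan(points, eps, min_pts)` on a list of coordinate tuples returns a list of labels in the same order as the input.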

How can I get the new order of columns and rows in a heatmap after clustering with pheatmap

r,cluster-analysis,pheatmap
I drew a heatmap using the 'pheatmap' package, with clustering on the rows and columns. However, I need to get the row order and the column order as they appear in the heatmap. Is there a convenient way to do that? This is an example of pheatmap. test = matrix(rnorm(200),...

knn predictions with Clustering

r,cluster-analysis,knn
I have a 60,000-observation, 40-variable dataset on which I used Clara, mainly due to memory constraints. library(cluster) library(dplyr) mutate(kddnew, Att=ifelse(Class=="normal","normal", "attack")) ds <- dat[,c(-20,-21,-40)] clus <- clara(ds, 3, samples=500, sampsize=100, pamLike=TRUE) This returned a table with medoids. Now I'm trying to use knn to do a prediction like this:...

R - cluster analysis on binary weblog data

r,cluster-analysis,k-means
I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I wanted to do some clustering of this data. My main goal is to find similar users based...

Find all k-nearest neighbors

algorithm,cluster-analysis,computational-geometry,hamming-distance
Problem: I have N (~100m) strings, each D (e.g. 100) characters long and over a small alphabet (e.g. 4 possible characters). I would like to find the k-nearest neighbors for every one of those N points (k ~ 0.1D). Adjacency between strings is defined via Hamming distance. Solution doesn't have to...

Pairwise matching between 2 sets of points

r,cluster-analysis
I need a method to identify the best combination of pairwise fits between two sets of points such that the overall distance between clustered pairs is minimised. It seems possibly suited to k-means (with 'n' pairs if a max/min cluster size constraint of 2 is possible) but I'm not aware...

Spectral clustering with Similarity matrix constructed by jaccard coefficient

machine-learning,cluster-analysis,pca,eigenvalue,eigenvector
I have a categorical dataset and I am performing spectral clustering on it, but I do not get very good output. I choose the eigenvectors corresponding to the largest eigenvalues as my centroids for k-means. Please find below the process I follow: 1. Create a symmetric similarity matrix (m*m) using...

Removing cycles in weighted directed graph

algorithm,graph,cluster-analysis
This is a follow-up question on my other post, "Algorithm for clustering with size constraints". I'm working on a clustering algorithm. After some reclustering, I now have a set of points, none of which are in their optimal cluster, but which could not be reassigned individually, since it'll violate the...

Running k-medoids algorithm in ELKI

cluster-analysis,data-mining,elki
I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an arff file (using the ARFFParser in ELKI): The dataset is of 7 dimensions, however the clustering results that I obtain show clustering only on the level of one dimension, and does...

Clustering a large, very sparse, binary matrix in R

r,performance,matrix,cluster-analysis,sparse-matrix
I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete: d <- dist(inputMatrix, method="binary") hc <- hclust(d, method="complete") The first step doesn't finish, so I'm...

calculating similarity between two profiles for number of common features

machine-learning,cluster-analysis,similarity,unsupervised-learning
I am working on a clustering problem of social network profiles and each profile document is represented by number of times the 'term of interest occurs' in the profile description. To do clustering effectively, I am trying to find the correct similarity measure (or distance function) between two of the...

Force or boost words in carrot2 clustering labels

solr,cluster-analysis,carrot2
I am using Carrot2 to cluster query results from Solr. Is it possible to force (or at least boost) the occurrence of certain words in the labels, in either Lingo, STC or k-means? With Lingo, this is already possible with the option "Title word boost", which gives more weight to...

Is there a way to access or export the label numbers in an r plot?

r,plot,cluster-analysis,k-means
I have a plot where x is a test a and y is another test b. Each student is tested twice. Each dot represents one student's "post minus pre" score on x and on y. As you can see, I assigned labels to the plot, but I want...

Initial centroids for a 3-Means Clustering algorithm

cluster-analysis,k-means
If you want to cluster point data inside a bounding box with a 3-Means Clustering algorithm, what are 3 good (= result in few iterations) initial centroids in the average case without looking at the point data? (e.g.: what is a good distribution of the 3 centroids inside a box)

Scikit-learn KMeans clustering - fit cluster with X features, predict cluster membership with X-1 features?

python,scikit-learn,cluster-analysis,k-means
I am currently trying to solve some kind of a regression task (predict a value of 'count' field) using a KMeans clustering. The idea is trivial: Fit a cluster on my test dataset: k_means = cluster.KMeans(n_clusters=4, n_init = 20, init='random') k_means.fit(df[['DistanceToMidnight','season','DayType','weather','temp','atemp','humidity','windspeed','count']]) *notice that I do use 'count' in clustering. Then...

Why isn't optimal_count giving the right result?

python,cluster-analysis,igraph
I'm trying to understand python-igraph and specifically the community_walktrap function. I created the following example: import numpy as np import igraph mat = np.zeros((200,200)) + 50 mat[20:30,20:30] = 2 mat[80:90,80:90] = 2 g = igraph.Graph.Weighted_Adjacency(mat.tolist(), mode=igraph.ADJ_DIRECTED) wl = g.community_walktrap(weights=g.es['weight']) I would have assumed the optimal count of communities to be...

Clustering Categorical data using jaccard similarity

python-2.7,machine-learning,cluster-analysis,data-mining,k-means
I am trying to build a clustering algorithm for categorical data. I have read about different algorithms like k-modes, ROCK, LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others. I have (m) training examples and (n=22) features. Approach: My approach is...

Clustering text entities with RapidMiner

cluster-analysis,k-means,rapidminer
I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g ... I want to find good words that separate cloud tags into (mostly) independent clusters. For example: word e is with cloud tags A and B but not C ... so e is a good separator to...

Self organising map visualisation result interpretation

r,machine-learning,cluster-analysis,som,unsupervised-learning
Using the R Kohonen package, I have obtained a "codes" plot which shows the codebook vectors. I would like to ask, shouldn't the codebook vectors of neighbouring nodes be similar? Why are the top 2 nodes on the left so different? Is there a way to organise it in a...

How to cluster large datasets

algorithm,data-structures,cluster-analysis
I have a very large dataset (500 Million) of documents and want to cluster all documents according to their content. What would be the best way to approach this? I tried using k-means but it does not seem suitable because it needs all documents at once in order to do...

Different clustering algorithms to cluster timeseries events

algorithm,cluster-analysis,k-means,hierarchical-clustering
I have a very large input file with the following format: ID \t time \t duration \t Description \t status The status column is limited to contain either lower case a,s,i or upper case A,S,I or a mix of the two (sample element in status col: a,si, I, asi, ASI,...

Matlab clustering toolbox

matlab,cluster-analysis
I have just downloaded and installed the MATLAB clustering toolbox (http://www.mathworks.com/matlabcentral/fileexchange/7486-clustering-toolbox). However, when I run the first demo file, which is the motorcycle clustering example, I get the following error. Undefined function 'isnan' for input arguments of type 'struct'. Error in internal.stats.removenan (line 54) wasnan = wasnan | any(isnan(y),2); Error in statremovenan...

Summarize variable variations in clusters (k-means) using R

r,cluster-analysis,k-means
I have a df that I got after implementing k-means clustering on my original dataset. I have 4 different clusters here and what I would like to know is how much is the variation of the 4 variables (V1 to V4) in each cluster. In other words, what variation in...

Assign class to data frame after clustering

r,cluster-analysis,data-mining,k-means
I used k-means cluster algorithm on a data-frame df1 and the result is shown in the picture below. library(ade4) df1 <- data.frame(x=runif(100), y=runif(100)) plot(df1) km <- kmeans(df1, centers=3) kmeansRes<-factor(km$cluster) s.class(df1,fac=kmeansRes, add.plot=TRUE, col=rainbow(nlevels(kmeansRes))) Is there a possibility to add to the data frame information from which cluster does the observation come...

How to explain a higher percentage of point variability using kmeans clustering? [closed]

r,statistics,cluster-analysis,k-means
I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

Clustering based on pearson correlation

cluster-analysis,data-mining,k-means,hierarchical-clustering,dbscan
I have a use case where I have traffic data for every 15 minutes for 1 month. This data is collected for various resources in a network. Now I need to group resources which are similar (based on traffic usage pattern from 00:00 to 23:45). One way to check if...

Subspace clustering using CLIQUE in ELKI

java,cluster-analysis,elki,clique
I am trying to detect dense subspaces in a high-dimensional dataset. For this I want to use the ELKI library, but there is very little documentation and there are few examples for it. I tried the following: Database db=makeSimpleDatabase("D:/sample.csv", 600); ListParameterization params = new ListParameterization(); params.addParameter(CLIQUE.TAU_ID, "0.1"); params.addParameter(CLIQUE.XSI_ID, 20); // setup algorithm...

Algorithm for clustering with minimum size constraints

algorithm,cluster-analysis
I have a set of data clustered into k groups, where each cluster has a minimum size constraint of m. I've done some reclustering of the data, so now I have a set of points, each of which has one or more better clusters to be in, but cannot be switched...

Cluster centroids on simplekmeans clustering

machine-learning,cluster-analysis,weka
I am currently trying to interpret a set of results gleaned from running SimpleKMeans clustering on the Diabetes.arff data set. http://i.stack.imgur.com/T4eho.jpg - link to clustered instances (figure 1) So far I can understand that the clustered instances (figure 1) show that 500 variables have been classified as tested negative and...

Algorithm for clustering names

algorithm,cluster-analysis,spell-checking,levenshtein-distance
I have people's names (first name, last name and surname) in a db column. The data is incomplete: for example, some rows have only first name, last name or surname; some are in a different order (surname, last name); some are incorrectly spelled. I need an algorithm to display a set of rows in...

Clustering cosine similarity matrix

python,math,scikit-learn,cluster-analysis,data-mining
A few questions on stackoverflow mention this problem, but I haven't found a concrete solution. I have a square matrix which consists of cosine similarities (values between 0 and 1), for example: | A | B | C | D A | 1.0 | 0.1 | 0.6 | 0.4 B...

Got java heap size error when trying to cluster 15980 documents via carrot2workbench

solr,cluster-analysis,k-means,workbench,carrot
My environment: 8GB Ram Notebook with Ubuntu 14.04, Solr 4.3.1, carrot2workbench 3.10.0 My Solr Index: 15980 documents My Problem: Cluster all documents with the kmeans algorithm When I drop off the query in the carrot2workbench (query: :), I always get a Java heap size error when using more than ~1000...

Weka Simple K means handling nominal attributes

cluster-analysis,weka,k-means
I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes. I read that it calculates modes for such attributes. I want to know how the similarity is calculated. Let's take an example: Consider a dataset with 3...

How to cluster a set of strings?

machine-learning,cluster-analysis,k-means,hierarchical-clustering
My dataset looks something like this ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '',...

R: Hierarchical clustering

r,cluster-analysis,hierarchical-clustering,hclust
Let's say we have the following dataset set.seed(144) dat <- matrix(rnorm(100), ncol=5) The following function creates all possible combinations of columns and removes the first (combinations <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,]) # Var1 Var2 Var3 Var4 Var5 # 2 TRUE FALSE FALSE FALSE FALSE # 3 FALSE TRUE FALSE FALSE...

Given a dataset with Normal values and outliers, is there any standard way to find a normalised value of epsilon for implementing DBSCAN.

python-2.7,cluster-analysis,hierarchical-clustering,outliers,dbscan
I am working on my personal implementation of DBSCAN on some data, but I have problems when I have to find epsilon dynamically for every kind of data set I have to use, because average value of epsilon before implementing DBSCAN considers the outliers as well, and hence the resultant...

R clustering- silhouette with observation labels

r,plot,cluster-analysis
I do hierarchical clustering with the cluster package in R. Using the silhouette function, I can get the silhouette plot of my cluster output for any given height (h) cut-off in the dendrogram. # run hierarchical clustering if(!require("cluster")) { install.packages("cluster"); require("cluster") } tmp <- matrix(c( 0, 20, 20, 20, 40,...

Mahout clustering: How to retrieve the name of a named vector

cluster-analysis,mahout
I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster. I read that you can use the option --namedVector when creating the sparse-files but where does it take the ID from and...

My observations are less than the feature vector of each. Any solution to overcome this?

matlab,machine-learning,cluster-analysis,sift,feature-extraction
I'm using GMM to fit my data to 256 Gaussians. I'm using Matlab's fitgmdist to achieve this. gmm{i} = fitgmdist(model_feats, gaussians, 'Options',statset('MaxIter',1000), ... 'CovType','diagonal', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd); I am using RootSIFT to extract the features of each image. This produces a vector of 1x128 for each image. Now I have...

K-Means Clustering a list of US addresses based on drive time

excel,matlab,cluster-analysis,k-means,geo
I have 8 traveling consultants that need to visit 155 groups across the continental United States. Is there a way to find the optimal 8 regions based on drive time using k-means clustering? I see there are some methods implemented already for other data sets, but they are not based...

In DBSCAN, what does eps represent actually?

cluster-analysis,data-mining,dbscan
Suppose that I have already found the eps for every density, applying the methodology from http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf. If you don't mind, please open page 5 and see the Proposed Algorithm section. At step 10.1, the paper tells us to calculate the number of objects in the eps-neighborhood. What does eps...

ELKI: Running DBSCAN on custom Objects in Java

java,cluster-analysis,dbscan,elki
I'm trying to use ELKI from within Java to run DBSCAN. For testing I used a FileBasedDatabaseConnection. Now I would like to run DBSCAN with my custom objects as parameters. My objects have the following structure: public class MyObject { private Long id; private Float param1; private Float param2; //...

How to do column wise intersection with itertools

python-2.7,machine-learning,cluster-analysis,data-mining,k-means
I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix, but I get a different outcome for the matrix than expected. I have identified the source of the problem but do not possess an optimized solution for...
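One possible reading of "column-wise intersection" is to treat each record as a set of (column index, value) pairs, so that two values only count as shared when they sit in the same column. The sketch below builds the (m*m) Jaccard matrix on that assumption, using `itertools.combinations` to visit each pair once. The function names and the tuple-per-record input format are hypothetical, and records are assumed to be non-empty and of equal length.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two records, matching values column by column."""
    sa = set(enumerate(a))  # {(column index, value), ...}
    sb = set(enumerate(b))
    return len(sa & sb) / len(sa | sb)


def similarity_matrix(rows):
    """Symmetric m*m matrix of pairwise Jaccard similarities (diagonal = 1.0)."""
    m = len(rows)
    sim = [[1.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):  # each unordered pair once
        sim[i][j] = sim[j][i] = jaccard(rows[i], rows[j])
    return sim
```

With n features and k matching columns, this definition gives k / (2n - k), so two records that agree on 4 of 6 features score 4 / 8 = 0.5.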

How to create consensus matrix?

matlab,cluster-analysis
I need to create a consensus matrix. Let's say I have a matrix A as follows. 1 1 2 2 3 1 2 2 2 3 1 1 2 3 3 Each row represents one clustering method, and each value represents one specific cluster. For example, A(1,1) means that by...
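A common definition of a consensus (co-association) matrix is the fraction of clustering methods that place two objects in the same cluster. The sketch below assumes that definition, and assumes the flattened matrix above is 3 methods by 5 objects (rows are methods, columns are objects); neither is confirmed by the excerpt, and the function name is made up for the example.

```python
def consensus_matrix(labelings):
    """Entry [i][j] is the share of methods that put objects i and j together.

    labelings: list of rows, one row of cluster labels per clustering method.
    """
    n_methods = len(labelings)
    n_items = len(labelings[0])
    return [
        [sum(row[i] == row[j] for row in labelings) / n_methods
         for j in range(n_items)]
        for i in range(n_items)
    ]


# The matrix from the question, read as 3 methods x 5 objects (an assumption).
A = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 2, 3],
    [1, 1, 2, 3, 3],
]
M = consensus_matrix(A)
```

Under this reading, objects 1 and 2 share a cluster in two of the three rows, so their entry is 2/3, while objects 1 and 5 never share a cluster and get 0.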

kNN classifier in multi-label settings with WEKA

machine-learning,cluster-analysis,knn
WEKA has extensive support for kNN classifiers (many different distances, etc.). Unfortunately, WEKA doesn't support multi-label problems. One solution could be to use the binary relevance approach, but I am not sure whether it's a correct workaround. What do you think?...

Visualization of multi-dimensional data clusters in R

r,plot,cluster-analysis
For a set of documents, I have a feature matrix of size 30 X 32 where rows represent documents and columns = features. So basically 30 documents and 32 features for each of them. After running a PSO Algorithm, I have been able to find some cluster centroids (that I...

Point cloud, cluster, blob detection

matlab,opencv,cluster-analysis,point-clouds
I have a binary image full of noise. I detected the objects circled in red using a median filter: B = medfilt2(A, [m n]) (Matlab) or medianBlur(src, dst, ksize) (OpenCV). Could you suggest other methods to detect those objects in a more "academic" way, e.g. a probabilistic method, clustering, etc.?

Clustering Categorical data-set with distance based approach

python,machine-learning,cluster-analysis,k-means
I want to compare the ROCK clustering algorithm to a distance-based algorithm. Let's say we have (m) training examples and (n) features. ROCK: From what I understand, what ROCK does is: 1. It calculates a similarity matrix (m*m) using Jaccard coefficients. 2. Then a threshold value is provided by...

OMP parallel for reduction

c++,cluster-analysis,openmp
I'm trying to write a k-means clustering class. I want to make my function parallel. void kMeans::findNearestCluster() { short closest; int moves = 0; #pragma omp parallel for reduction(+:moves) for(int i = 0; i < n; i++) { float min_dist=FLT_MAX; for(int k=0; k < clusters; k++) { float dist_sum =...