I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g ... I want to find good words that separate the cloud tags into (mostly) independent clusters. For example: word e is with cloud tags A and B but not C ... so e is a good separator to...

Problem: I have N (~100m) strings, each D (e.g. 100) characters long and over a small alphabet (e.g. 4 possible characters). I would like to find the k nearest neighbors for every one of those N points (k ~ 0.1D). Adjacency between strings is defined via Hamming distance. Solution doesn't have to...
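
As a point of reference for the question above, here is a minimal brute-force sketch of k-nearest-neighbor search under Hamming distance. It is only a baseline for checking results on small inputs: it costs O(N^2 * D) and would not scale to ~100m strings, where some approximate index would be needed. The function names and the sample strings are hypothetical.

```python
from heapq import nsmallest


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def knn_hamming(strings, k):
    """Brute-force k nearest neighbors (as indices) for every string.

    Ties are broken by the lower index. Cost is O(N^2 * D), so this is
    only a correctness baseline, not a solution for very large N.
    """
    result = []
    for i, s in enumerate(strings):
        # (distance, index) pairs for every other string
        others = ((hamming(s, t), j) for j, t in enumerate(strings) if j != i)
        result.append([j for _, j in nsmallest(k, others)])
    return result


# Hypothetical sample data over a 4-letter alphabet
strings = ["ACGT", "ACGA", "TCGA", "TTTT"]
neighbours = knn_hamming(strings, k=2)
```

Any faster (approximate) method can be validated against this baseline on a small random sample of the data.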

I am trying to implement the code for DBSCAN here: http://en.wikipedia.org/wiki/DBSCAN The portion I am confused about is expandCluster(P, NeighborPts, C, eps, MinPts) add P to cluster C for each point P' in NeighborPts if P' is not visited mark P' as visited NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') >=...
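
The pseudocode quoted above can be sketched in Python roughly as follows. This is one reading of the Wikipedia pseudocode, not a reference implementation: it assumes Euclidean distance, a linear-scan `region_query`, and the names `NOISE`, `labels` and `visited`, none of which come from the question. The key point is that the neighbor list grows while it is being iterated, so it is walked by index.

```python
import math

NOISE = -1  # label for points that belong to no cluster (assumed convention)


def region_query(points, p, eps):
    """Indices of all points within eps of points[p], including p itself."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]


def expand_cluster(points, labels, visited, p, neighbor_pts, c, eps, min_pts):
    """Grow cluster c from core point p."""
    labels[p] = c
    queue = list(neighbor_pts)
    i = 0
    while i < len(queue):  # the queue may grow during iteration
        q = queue[i]
        if not visited[q]:
            visited[q] = True
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:
                # q is a core point: join its neighborhood to the queue
                queue.extend(q_neighbors)
        if labels[q] is None or labels[q] == NOISE:
            labels[q] = c  # q is not yet a member of any cluster
        i += 1


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)
    visited = [False] * len(points)
    c = -1
    for p in range(len(points)):
        if visited[p]:
            continue
        visited[p] = True
        neighbor_pts = region_query(points, p, eps)
        if len(neighbor_pts) < min_pts:
            labels[p] = NOISE  # may later be claimed as a border point
        else:
            c += 1
            expand_cluster(points, labels, visited, p, neighbor_pts,
                           c, eps, min_pts)
    return labels


# Hypothetical sample: two tight groups and one outlier
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

The queue may contain duplicates; that is harmless here because visited points are not queried again, but a set-based frontier would use less memory on large data.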

I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an arff file (using the ARFFParser in ELKI): The dataset has 7 dimensions; however, the clustering results that I obtain show clustering only on the level of one dimension, and does...

I am currently trying to interpret a set of results gleaned from running SimpleKMeans clustering on the Diabetes.arff data set. http://i.stack.imgur.com/T4eho.jpg - link to clustered instances (figure 1) So far I can understand that the clustered instances (figure 1) show that 500 variables have been classified as tested negative and...

I have a dataset with 60,000 observations and 40 variables, on which I used Clara, mainly due to memory constraints. library(cluster) library(dplyr) mutate(kddnew, Att=ifelse(Class=="normal","normal", "attack")) ds <- dat[,c(-20,-21,-40)] clus <- clara(ds, 3, samples=500, sampsize=100, pamLike=TRUE) This returned a table with medoids. Now I'm trying to use knn to do a prediction like this:...

I have a very large dataset (500 Million) of documents and want to cluster all documents according to their content. What would be the best way to approach this? I tried using k-means but it does not seem suitable because it needs all documents at once in order to do...

I am currently trying to solve some kind of a regression task (predict a value of 'count' field) using a KMeans clustering. The idea is trivial: Fit a cluster on my test dataset: k_means = cluster.KMeans(n_clusters=4, n_init = 20, init='random') k_means.fit(df[['DistanceToMidnight','season','DayType','weather','temp','atemp','humidity','windspeed','count']]) *notice that I do use 'count' in clustering. Then...

I have a very large input file with the following format: ID \t time \t duration \t Description \t status The status column is limited to contain either lower case a,s,i or upper case A,S,I or a mix of the two (sample element in status col: a,si, I, asi, ASI,...

If you want to cluster point data inside a bounding box with a 3-Means Clustering algorithm, what are 3 good (= result in few iterations) initial centroids in the average case without looking at the point data? (e.g.: what is a good distribution of the 3 centroids inside a box)

WEKA has extensive support for kNN classifiers (many different distances, etc.). Unfortunately, WEKA doesn't support multi-label problems. One solution could be to use the binary relevance approach. I am not sure whether it's a correct workaround. What do you think?...

I'm trying to write a k-means clustering class. I want to make my function parallel. void kMeans::findNearestCluster() { short closest; int moves = 0; #pragma omp parallel for reduction(+:moves) for(int i = 0; i < n; i++) { float min_dist=FLT_MAX; for(int k=0; k < clusters; k++) { float dist_sum =...

I am working on a clustering problem of social network profiles, and each profile document is represented by the number of times the 'term of interest' occurs in the profile description. To do clustering effectively, I am trying to find the correct similarity measure (or distance function) between two of the...

My dataset looks something like this ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '',...

I'm trying to understand python-igraph and specifically the community_walktrap function. I created the following example: import numpy as np import igraph mat = np.zeros((200,200)) + 50 mat[20:30,20:30] = 2 mat[80:90,80:90] = 2 g = igraph.Graph.Weighted_Adjacency(mat.tolist(), mode=igraph.ADJ_DIRECTED) wl = g.community_walktrap(weights=g.es['weight']) I would have assumed the optimal count of communities to be...

I drew a heatmap using the 'pheatmap' package, and clustered the rows and columns. However, for some reason, I need to get the row order and the column order in the heatmap. Is there a convenient way to do that? This is an example of pheatmap. test = matrix(rnorm(200),...

I have a df that I got after implementing k-means clustering on my original dataset. I have 4 different clusters here and what I would like to know is how much is the variation of the 4 variables (V1 to V4) in each cluster. In other words, what variation in...

Using the R Kohonen package, I have obtained a "codes" plot which shows the codebook vectors. I would like to ask, shouldn't the codebook vectors of neighbouring nodes be similar? Why are the top 2 nodes on the left so different? Is there a way to organise it in a...

For a set of documents, I have a feature matrix of size 30 X 32 where rows represent documents and columns = features. So basically 30 documents and 32 features for each of them. After running a PSO Algorithm, I have been able to find some cluster centroids (that I...

I am trying to detect dense subspaces in a high-dimensional dataset. For this I want to use the ELKI library, but there is very little documentation and there are few examples for it. I tried the following: Database db=makeSimpleDatabase("D:/sample.csv", 600); ListParameterization params = new ListParameterization(); params.addParameter(CLIQUE.TAU_ID, "0.1"); params.addParameter(CLIQUE.XSI_ID, 20); // setup algorithm...

I have just downloaded and installed the MATLAB clustering toolbox (http://www.mathworks.com/matlabcentral/fileexchange/7486-clustering-toolbox). However, when I run the first demo file, which is the motorcycle clustering example, I get the following error. Undefined function 'isnan' for input arguments of type 'struct'. Error in internal.stats.removenan (line 54) wasnan = wasnan | any(isnan(y),2); Error in statremovenan...

I have a use case where I have traffic data for every 15 minutes for 1 month. This data is collected for various resources in the network. Now I need to group resources which are similar (based on traffic usage pattern from 00:00 to 23:45). One way to check if...

I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I wanted to do some clustering of this data. My main goal is to find similar users based...

I am trying to understand how SimpleKMeans in Weka handles nominal attributes and why it is not efficient in handling such attributes. I read that it calculates modes for such attributes. I want to know how the similarity is calculated. Let's take an example: Consider a dataset with 3...

I have a categorical dataset on which I am performing spectral clustering, but I do not get very good output. I choose the eigenvectors corresponding to the largest eigenvalues as my centroids for k-means. Please find below the process I follow: 1. Create a symmetric similarity matrix (m*m) using...

A few questions on Stack Overflow mention this problem, but I haven't found a concrete solution. I have a square matrix which consists of cosine similarities (values between 0 and 1), for example: | A | B | C | D A | 1.0 | 0.1 | 0.6 | 0.4 B...

I have a set of data clustered into k groups; each cluster has a minimum size constraint of m. I've done some reclustering of the data. So now I have a set of points, each of which has one or more better clusters to be in, but cannot be switched...

I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster. I read that you can use the option --namedVector when creating the sparse-files but where does it take the ID from and...

I am working on my personal implementation of DBSCAN on some data, but I have problems when I have to find epsilon dynamically for every kind of data set I have to use, because average value of epsilon before implementing DBSCAN considers the outliers as well, and hence the resultant...

I need to create a consensus matrix. Let's say I have a matrix A as follows. 1 1 2 2 3 1 2 2 2 3 1 1 2 3 3 Each row represents one clustering method, and each value represents one specific cluster. For example, A(1,1) means that by...
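
One common way to build a consensus (co-association) matrix from such a label matrix is sketched below. It assumes that the flattened values above form a 3 x 5 matrix (3 clustering methods, 5 items) and that entry (i, j) of the consensus matrix should be the fraction of methods that place items i and j in the same cluster. Both assumptions, and the function name, are mine rather than the question's.

```python
def consensus_matrix(labelings):
    """Co-association matrix for a list of clusterings.

    labelings[m][i] is the cluster label that method m assigns to item i.
    Entry (i, j) of the result is the fraction of methods that put
    items i and j in the same cluster.
    """
    n_methods = len(labelings)
    n_items = len(labelings[0])
    return [
        [sum(row[i] == row[j] for row in labelings) / n_methods
         for j in range(n_items)]
        for i in range(n_items)
    ]


# Assumed 3 x 5 layout of the matrix A from the question
A = [
    [1, 1, 2, 2, 3],
    [1, 2, 2, 2, 3],
    [1, 1, 2, 3, 3],
]
M = consensus_matrix(A)
```

Under these assumptions, items 1 and 2 are grouped together by two of the three methods, so the corresponding entry is 2/3. Cluster labels only need to be consistent within a row, not across rows.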

I want to compare the ROCK clustering algorithm to a distance-based algorithm. Let's say we have (m) training examples and (n) features. ROCK: From what I understand, ROCK works as follows: 1. It calculates a similarity matrix (m*m) using Jaccard coefficients. 2. Then a threshold value is provided by...

I'm using GMM to fit my data to 256 Gaussians. I'm using Matlab's fitgmdist to achieve this. gmm{i} = fitgmdist(model_feats, gaussians, 'Options',statset('MaxIter',1000), ... 'CovType','diagonal', 'SharedCov',false, 'Regularize',0.01, 'Start',cInd); I am using RootSIFT to extract the features of each image. This produces a vector of 1x128 for each image. Now I have...

I'm trying to use ELKI from within Java to run DBSCAN. For testing I used a FileBasedDatabaseConnection. Now I would like to run DBSCAN with my custom objects as parameters. My objects have the following structure: public class MyObject { private Long id; private Float param1; private Float param2; //...

I have a plot where x is a test a and y is another test b. Each student is tested twice. Each dot represents one student's "post minus pre" score on x and on y. As you can see, I assigned labels to the plot, but I want...

I have people's names (first name, last name and surname) in a db column. The data is incomplete: for example, some rows have only a first name, last name or surname; some are in a different order (surname, last name); some are incorrectly spelled. I need an algorithm to display a set of rows in...

I do hierarchical clustering with the cluster package in R. Using the silhouette function, I can get the silhouette plot of my cluster output for any given height (h) cut-off in the dendrogram. # run hierarchical clustering if(!require("cluster")) { install.packages("cluster"); require("cluster") } tmp <- matrix(c( 0, 20, 20, 20, 40,...

This is a follow-up question to my other post, "Algorithm for clustering with size constraints". I'm working on a clustering algorithm. After some reclustering, I now have a set of points, none of which is in its optimal cluster, but which could not be reassigned individually, since it'll violate the...

Let's say we have the following dataset set.seed(144) dat <- matrix(rnorm(100), ncol=5) The following function creates all possible combinations of columns and removes the first (combinations <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,]) # Var1 Var2 Var3 Var4 Var5 # 2 TRUE FALSE FALSE FALSE FALSE # 3 FALSE TRUE FALSE FALSE...

I am using Carrot2 to cluster query results from Solr. Is it possible to force (or at least boost) the occurrence of certain words in the labels, in either Lingo, STC or k-means? With Lingo, this is already possible with the option "Title word boost", which gives more weight to...

Suppose that I have already found the eps for all densities. I applied the methodology from here: http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf If you don't mind, please open page 5 and look at the Proposed Algorithm section. At step 10.1, the paper tells us to calculate the number of objects in the eps-neighborhood. What does eps...

I have a binary image full of noise. I detected the objects circled in red using a median filter, B = medfilt2(A, [m n]) (Matlab) or medianBlur(src, dst, ksize) (OpenCV). Could you suggest other methods to detect those objects in a more "academic" way, e.g. a probabilistic method, clustering, etc.?

When I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix, I get a different outcome for the matrix. I have identified the source of the problem but do not possess an optimized solution for...
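
For reference, one way to compute a Jaccard similarity matrix over categorical records is sketched below. It assumes each record is compared as a set of (feature, value) pairs, so that equal values in different features never count as a match; this is an assumption and may not correspond to the problem the question goes on to describe. The record values are hypothetical.

```python
def jaccard(a: dict, b: dict) -> float:
    """Jaccard similarity of two records, each seen as a set of
    (feature, value) pairs: |intersection| / |union|."""
    sa, sb = set(a.items()), set(b.items())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def similarity_matrix(records):
    """Symmetric (m x m) matrix of pairwise Jaccard similarities."""
    m = len(records)
    sim = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):  # compute each pair once, then mirror
            sim[i][j] = sim[j][i] = jaccard(records[i], records[j])
    return sim


# Hypothetical records using the six features named in the question
records = [
    {"Age": "30-40", "Occupation": "engineer", "Gender": "F",
     "Product_range": "high", "Product_cat": "audio", "Product": "headphones"},
    {"Age": "30-40", "Occupation": "engineer", "Gender": "F",
     "Product_range": "high", "Product_cat": "audio", "Product": "speaker"},
    {"Age": "20-30", "Occupation": "teacher", "Gender": "M",
     "Product_range": "low", "Product_cat": "video", "Product": "tv"},
]
S = similarity_matrix(records)
```

Filling only the upper triangle and mirroring it halves the work and guarantees a symmetric result regardless of iteration order.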

I am trying to build a clustering algorithm for categorical data. I have read about different algorithms like k-modes, ROCK and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others. I have (m) training examples and (n=22) features. Approach: My approach is...

I need a method to identify the best combination of pairwise fits between two sets of points such that the overall distance between clustered pairs is minimised. It seems possibly suited to k-means (with 'n' pairs if a max/min cluster size constraint of 2 is possible) but I'm not aware...
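
Read as stated, finding pairs between two equal-sized point sets that minimise the total distance is the assignment problem rather than a k-means problem. The sketch below is a brute-force baseline over all permutations, assuming Euclidean distance and equally sized sets; it is only practical for a handful of points. For realistic sizes, a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. The function name and sample points are hypothetical.

```python
import math
from itertools import permutations


def best_pairing(set_a, set_b):
    """Exhaustively find the one-to-one pairing of set_a with set_b that
    minimises the summed Euclidean distance. Cost is O(n!), so this is
    a baseline for tiny inputs only."""
    if len(set_a) != len(set_b):
        raise ValueError("both sets must have the same number of points")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(set_b))):
        # perm[i] is the index in set_b paired with point i of set_a
        cost = sum(math.dist(set_a[i], set_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


# Hypothetical sample points
a = [(0, 0), (5, 5), (10, 0)]
b = [(10, 1), (0, 1), (5, 6)]
pairs, total = best_pairing(a, b)
```

Each returned pair is (index in `a`, index in `b`). The brute-force result is useful as a ground truth when testing a faster solver on small cases.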

I used the k-means clustering algorithm on a data frame df1 and the result is shown in the picture below. library(ade4) df1 <- data.frame(x=runif(100), y=runif(100)) plot(df1) km <- kmeans(df1, centers=3) kmeansRes<-factor(km$cluster) s.class(df1,fac=kmeansRes, add.plot=TRUE, col=rainbow(nlevels(kmeansRes))) Is it possible to add to the data frame the information about which cluster each observation comes...

I have 8 traveling consultants that need to visit 155 groups across the continental United States. Is there a way to find the optimal 8 regions based on drive time using k-means clustering? I see there are some methods implemented already for other data sets, but they are not based...

My environment: 8GB Ram Notebook with Ubuntu 14.04, Solr 4.3.1, carrot2workbench 3.10.0 My Solr Index: 15980 documents My Problem: Cluster all documents with the kmeans algorithm When I drop off the query in the carrot2workbench (query: :), I always get a Java heap size error when using more than ~1000...

I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

I have a large, sparse binary matrix (roughly 39,000 x 14,000; most rows have only a single "1" entry). I'd like to cluster similar rows together, but my initial plan takes too long to complete: d <- dist(inputMatrix, method="binary") hc <- hclust(d, method="complete") The first step doesn't finish, so I'm...