How does sklearn.cluster.KMeans handle an init ndarray parameter with missing centroids (available centroids less than n_clusters)?

python,scikit-learn,k-means
In Python sklearn KMeans (see documentation), I was wondering what happens internally when passing an ndarray of shape (n, n_features) to the init parameter when n < n_clusters. Does it drop the given centroids and just start a k-means++ initialization, which is the default choice for the init parameter? (PDF paper...
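
A minimal sketch for checking this on an installed version, assuming scikit-learn and NumPy are available. The data and the choice of 3 clusters are made up for illustration. In the scikit-learn releases I am aware of, an init array whose row count differs from n_clusters is rejected with a ValueError instead of falling back to k-means++, but this is worth confirming locally:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((50, 2))        # 50 made-up points with 2 features
partial_init = X[:2]           # only 2 centroids supplied ...

# ... while 3 clusters are requested (n_init=1 because init is explicit)
km = KMeans(n_clusters=3, init=partial_init, n_init=1)
try:
    km.fit(X)
    outcome = "fit succeeded"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(outcome)
```

If the supplied centroids should be kept, one possible workaround is to fill in the missing rows yourself (for example with randomly chosen data points) so the array has exactly n_clusters rows before it is passed as init.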

K-Means Clustering a list of US addresses based on drive time

excel,matlab,cluster-analysis,k-means,geo
I have 8 traveling consultants that need to visit 155 groups across the continental United States. Is there a way to find the optimal 8 regions based on drive time using k-means clustering? I see there are some methods implemented already for other data sets, but they are not based...

Cannot handle any class attribute! kmeans java

java,weka,k-means
I want to run a k-means algorithm using Weka in Eclipse. I have this code: public class demo { public demo() throws Exception { // TODO Auto-generated constructor stub BufferedReader breader = null; breader = new BufferedReader(new FileReader( "D:/logiciels/weka-3-7-12/weka-3-7-12/data/iris.arff")); Instances Train = new Instances(breader); Train.setClassIndex(Train.numAttributes() - 1); SimpleKMeans...

exception:Not enough training instances (required: 1, provided: 0)! in weka

java,exception,weka,data-mining,k-means
I have a movie dataset in the file moviedata.arff: @relation movie @attribute annee numeric @attribute Action numeric @attribute Adventure numeric @attribute Drama numeric @attribute Romance numeric @attribute Comedy numeric @attribute Documentary numeric @attribute Sci-Fi numeric @attribute Triller numeric @attribute Crime numeric @attribute Musical numeric @attribute Children numeric @attribute Animation...

Assign class to data frame after clustering

r,cluster-analysis,data-mining,k-means
I used the k-means clustering algorithm on a data frame df1, and the result is shown in the picture below. library(ade4) df1 <- data.frame(x=runif(100), y=runif(100)) plot(df1) km <- kmeans(df1, centers=3) kmeansRes<-factor(km$cluster) s.class(df1,fac=kmeansRes, add.plot=TRUE, col=rainbow(nlevels(kmeansRes))) Is it possible to add to the data frame the information about which cluster each observation comes...

How to do column wise intersection with itertools

python-2.7,machine-learning,cluster-analysis,data-mining,k-means
When I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix, I get a different outcome for the matrix. I have identified the source of the problem but do not possess an optimized solution for...

kmeans scatter plot: plot different colors per cluster

python,numpy,matplotlib,scipy,k-means
I am trying to do a scatter plot of a kmeans output which clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a certain color. sentence_list=["Hi how are you", "Good morning" ...] # I have 10 sentences km = KMeans(n_clusters=5,...

How to do prediction with weka

weka,k-means,prediction
I'm using Weka to do some text mining. I'm a little bit confused, so I'm here to ask how I can (with a set of comments that are in some way classified as: notes, status of work, non-conformity, warning) predict whether a new comment belongs to a...

Evaluating K-means accuracy

matlab,k-means
I created a 3-dimensional random data set with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm to the data to see how well K-means can classify my samples based on the 4 created patterns/classes. I need help with the following: What function/code can I use to evaluate how well...
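
When the true classes are known, one common way to score a clustering is purity: the fraction of samples that fall in the majority class of their cluster. The idea is language-agnostic. The sketch below uses plain Python with made-up labels; the function name `purity` and the sample data are illustrative assumptions and are not taken from the question:

```python
from collections import Counter

def purity(true_labels, cluster_ids):
    """Fraction of samples that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cid in zip(true_labels, cluster_ids):
        # One Counter of true classes per cluster id
        clusters.setdefault(cid, Counter())[truth] += 1
    majority_total = sum(counts.most_common(1)[0][1] for counts in clusters.values())
    return majority_total / len(true_labels)

# Made-up example: 4 true classes, 4 clusters, one misplaced sample
true_labels = ["a", "a", "a", "b", "b", "c", "c", "d"]
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 3]
score = purity(true_labels, cluster_ids)
print(score)
```

Purity does not depend on how the cluster numbers are ordered, which matters because k-means cluster indices are arbitrary. It does reward having many small clusters, so it is best compared at a fixed number of clusters.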

Different clustering algorithms to cluster timeseries events

algorithm,cluster-analysis,k-means,hierarchical-clustering
I have a very large input file with the following format: ID \t time \t duration \t Description \t status The status column is limited to contain either lower case a,s,i or upper case A,S,I or a mix of the two (sample element in status col: a,si, I, asi, ASI,...

Returning extra data as part of return type

scala,k-means
Below is an implementation of the k-means algorithm: package com object Functions { def distance(l1: (Array[Double], Array[Double])) = { val t = l1._1.zip(l1._2) t.map(m => Math.abs(m._1 - m._2)).sum } } package com import com.Functions._ case class Point(label : String, points : Array[Double]) object KMeans2 extends Application { val points =...

kmeans gives wrong cluster in R

r,k-means
I have a dataset of 2 columns in R, and am trying to use kmeans to cluster the data set. The command I use is kk <- kmeans(ageincome, center=4, iter.max=500, nstart=100) When I plot the result, what I observe from the plot is that R only clusters the data set...

Implement k-medoids algorithm by supplying the distances between data objects and the medoids

c++,algorithm,k-means
I've been trying to implement k-medoids in C++. So far, I've come up with implementing k-medoids by supplying the number of clusters (or the number of seeds), as described in Wikipedia's k-medoids page. Now, what I'm trying to do is to implement it by supplying the distances, instead of the...

Clustering Categorical data-set with distance based approach

python,machine-learning,cluster-analysis,k-means
I want to compare the ROCK clustering algorithm to a distance-based algorithm. Let's say we have (m) training examples and (n) features. ROCK: From what I understand, what ROCK does is: 1. It calculates a similarity matrix (m*m) using Jaccard coefficients. 2. Then a threshold value is provided by...

Clustering Categorical data using jaccard similarity

python-2.7,machine-learning,cluster-analysis,data-mining,k-means
I am trying to build a clustering algorithm for categorical data. I have read about different algorithms like k-modes, ROCK, LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others. I have (m) training examples and (n=22) features. Approach: My approach is...

Can't run k-means with SPSS Modeler 16

k-means,spss
I'm using IBM SPSS Modeler 16.0 to analyze my data, which has four fields, all of them retrieved from a database as strings and converted to numbers with the replace node using to_number(). When I connect my node to the k-means node to create the clusters using that data...

Initial centroids for a 3-Means Clustering algorithm

cluster-analysis,k-means
If you want to cluster point data inside a bounding box with a 3-Means Clustering algorithm, what are 3 good (= result in few iterations) initial centroids in the average case without looking at the point data? (e.g.: what is a good distribution of the 3 centroids inside a box)

k-means core function for temporal geo data

k-means,core,geo,temporal
I want to cluster geo data (lat,long,timestamp) with k-means. I'm searching for a good core function, but I can't find a good paper or other sources on that. Currently I multiply the time and the space distance: public static double dis(GeoData input1, GeoData input2) { double timeDis = Math.abs( input1.getTime() - input2.getTime()...

How to cluster a set of strings?

machine-learning,cluster-analysis,k-means,hierarchical-clustering
My dataset looks something like this ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '',...

How to get meaningful results of kmeans in scikit-learn

python,machine-learning,scikit-learn,k-means
I have a dataset which looks like this: {'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2'} This has already been converted from CSV to a dict. Then I use DictVectorizer to convert it: from sklearn.feature_extraction import DictVectorizer vec = DictVectorizer() d = vec.fit_transform(data).toarray() Then I try...

Weka Simple K means handling nominal attributes

cluster-analysis,weka,k-means
I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes. I read that it calculates modes for such attributes. I want to know how the similarity is calculated. Let's take an example: Consider a dataset with 3...

Scikit-learn: How to run KMeans on a one-dimensional array?

python,scikit-learn,data-mining,k-means
I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array and not with one-dimensional ones. I guess there is...
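
A short sketch of the usual approach, assuming scikit-learn and NumPy: KMeans expects a 2-D array of shape (n_samples, n_features), so a 1-D vector can be reshaped into a single-feature column. The random data and the choice of 3 clusters are placeholders and do not come from the question:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
values = rng.random(13876)     # stand-in for the 1-D data in [0, 1)

X = values.reshape(-1, 1)      # shape (13876, 1): one feature per sample
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = km.labels_                      # cluster index of each value
centers = km.cluster_centers_.ravel()    # one scalar center per cluster
print(np.sort(centers))
```

The `-1` in `reshape(-1, 1)` lets NumPy infer the number of rows, so the same line works for a vector of any length.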

OpenCV quantization sample code not running

python,opencv,k-means,quantization
I'm running the quantization sample code found in the OpenCV documentation, and it's throwing Traceback (most recent call last): File "QuantizeTest.py", line 13, in <module> ret,label,center=cv2.kmeans(Z,K,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS) TypeError: an integer is required Here's the code itself: import numpy as np import cv2 img = cv2.imread('Sample.jpg') Z = img.reshape((-1,3)) # convert to...

Combining clustering algorithms in MapReduce

java,algorithm,hadoop,k-means
For my college project, I initially thought to implement a combined clustering algorithm on MapReduce. I have finished with KMeans. Now my questions are: Can any other clustering algorithm be combined with Kmeans on MapReduce? If so, which algorithm and what is the procedure? If combining is not possible, how...

Summarize variable variations in clusters (k-means) using R

r,cluster-analysis,k-means
I have a df that I got after implementing k-means clustering on my original dataset. I have 4 different clusters here and what I would like to know is how much is the variation of the 4 variables (V1 to V4) in each cluster. In other words, what variation in...

Octave: Kmeans clustering not working on an image matrix

matlab,image-processing,machine-learning,octave,k-means
I have tried the following code. img=imread("test1.jpg"); gimg=rgb2gray(img); imshow(gimg); bw = gimg < 255; L = bwlabel(bw); imshow(label2rgb(L, @jet, [.7 .7 .7])) s = regionprops(L, 'PixelIdxList', 'PixelList'); s(1).PixelList(1:4, :) idx = s(1).PixelIdxList; sum_region1 = sum(gimg(idx)); x = s(1).PixelList(:, 1); y = s(1).PixelList(:, 2); xbar = sum(x .* double(gimg(idx))) / sum_region1...

R - cluster analysis on binary weblog data

r,cluster-analysis,k-means
I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I wanted to do some clustering of this data. My main goal is to find similar users based...

k-means with selected initial centers

python,numpy,scikit-learn,k-means
I am trying to do k-means clustering with selected initial centroids. It says here that to specify your initial centers: init : {‘k-means++’, ‘random’ or an ndarray} If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. My code in Python: X = np.array([[-19.07480000,...
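
A minimal sketch of passing explicit initial centers, assuming scikit-learn and NumPy. The points and centers below are made up, since the data in the question is cut off. The init array has one row per cluster, and n_init is set to 1 because repeated restarts from the same fixed centers would add nothing:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming three obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 1.0], [9.2, 1.1]])

# Initial centers: shape (n_clusters, n_features) = (3, 2)
init_centers = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])

km = KMeans(n_clusters=3, init=init_centers, n_init=1).fit(X)
print(km.labels_)
print(km.cluster_centers_)
```

With explicit centers, cluster index k in `labels_` corresponds to row k of the init array, which makes the labeling reproducible across runs.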

k means plot with the original data in r

r,plot,k-means
Currently I am exploring the kmeans function. I have a simple text file (test.txt) with the following entries. The data can be split into 2 clusters. 1 2 3 8 9 10 How can I plot the results of the kmeans function (using the plot function) along with the original data? I...

3D SIFT for human activity classification in videos. NOT GETTING GOOD ACCURACY.

matlab,video,svm,k-means,sift
I am trying to classify human activities in videos(six classes and almost 100 videos per class, 6*100=600 videos). I am using 3D SIFT(both xy and t scale=1) from UCF. for f= 1:20 f offset = 0; c=strcat('running',num2str(f),'.mat'); load(c) pix=video3Dm; % Generate descriptors at locations given by subs matrix for i=1:100...

How to explain a higher percentage of point variability using kmeans clustering? [closed]

r,statistics,cluster-analysis,k-means
I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

Is there a way to access or export the label numbers in an r plot?

r,plot,cluster-analysis,k-means
I have a plot where x is a test a and y is another test b. Each student is tested twice. Each dot represents one student's "post minus pre" score on x and on y. As you can see, I assigned labels to the plot, but I want...

Clustering based on pearson correlation

cluster-analysis,data-mining,k-means,hierarchical-clustering,dbscan
I have a use case where I have traffic data for every 15 minutes for 1 month. This data is collected for various resources in the network. Now I need to group resources which are similar (based on traffic usage pattern from 00:00 to 23:45 hrs). One way to check if...

plot a document tfidf 2D graph

python,numpy,scipy,scikit-learn,k-means
I would like to plot a 2D graph with the x-axis as term and the y-axis as TFIDF score (or document id) for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix, but I do not know how to use that matrix to plot the graph....

Got java heap size error when trying to cluster 15980 documents via carrot2workbench

solr,cluster-analysis,k-means,workbench,carrot
My environment: 8GB RAM notebook with Ubuntu 14.04, Solr 4.3.1, carrot2workbench 3.10.0 My Solr index: 15980 documents My problem: Cluster all documents with the kmeans algorithm When I submit the query in the carrot2workbench (query: *:*), I always get a Java heap size error when using more than ~1000...

Displaying kmean result with specific colors to specific clusters

matlab,k-means
I applied k-means clustering to a preprocessed image using the following MATLAB code: %B - input image C=rgb2gray(B); [idx centroids]=kmeans(double(C(:)),4); imseg = zeros(size(C,1),size(C,2)); for i=1:max(idx) imseg(idx==i)=i; end i=mat2gray(imseg); % i - output image Every time I display the output, the colors assigned to the output image change. How can I give...

Scikit-learn KMeans clustering - fit cluster with X features, predict cluster membership with X-1 features?

python,scikit-learn,cluster-analysis,k-means
I am currently trying to solve some kind of a regression task (predict a value of 'count' field) using a KMeans clustering. The idea is trivial: Fit a cluster on my test dataset: k_means = cluster.KMeans(n_clusters=4, n_init = 20, init='random') k_means.fit(df[['DistanceToMidnight','season','DayType','weather','temp','atemp','humidity','windspeed','count']]) *notice that I do use 'count' in clustering. Then...

Is it possible for a K-Means cluster to have no members?

k-means,centroid
I'm currently using K-Means for clustering files. A question occurred to me: is it possible for a cluster to have no members at all? If so, what happens to the centroid of that cluster? Does it keep its previous value? Thanks...
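
A small plain-Python sketch of one k-means iteration on made-up 1-D data, showing that a cluster can indeed end up empty when its centroid starts far from every point. Keeping the previous centroid for an empty cluster is only one possible policy, chosen here for illustration. Implementations differ, and some relocate the centroid to a data point instead. The helper names `assign` and `update` are hypothetical:

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid (1-D data)."""
    return [min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members.

    Assumption for this sketch: an empty cluster keeps its old centroid.
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids

points = [0.0, 0.2, 1.0, 1.2]
centroids = [0.0, 1.0, 50.0]   # third centroid starts far from all the data

labels = assign(points, centroids)
sizes = [labels.count(k) for k in range(len(centroids))]
new_centroids = update(points, labels, centroids)
print(sizes, new_centroids)
```

Without the `if members else old` guard, the mean of an empty cluster would be a division by zero, which is why every implementation needs some explicit rule for this case.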

Clustering text entities with RapidMiner

cluster-analysis,k-means,rapidminer
I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g ... I want to find good words that separate the cloud tags into (mostly) independent clusters. For example: word e occurs with cloud tags A and B but not C ... so e is a good separator to...