For my college project, I initially planned to implement a combined clustering algorithm on MapReduce. I have finished KMeans. Now my questions are: Can any other clustering algorithm be combined with KMeans on MapReduce? If so, which algorithm and what is the procedure? If combining is not possible, how...

I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I want to do some clustering of this data. My main goal is to find similar users based...

If you want to cluster point data inside a bounding box with a 3-Means Clustering algorithm, what are 3 good (= result in few iterations) initial centroids in the average case without looking at the point data? (e.g.: what is a good distribution of the 3 centroids inside a box)

I am trying to classify human activities in videos (six classes and almost 100 videos per class, 6*100=600 videos). I am using 3D SIFT (both xy and t scale=1) from UCF. for f= 1:20 f offset = 0; c=strcat('running',num2str(f),'.mat'); load(c) pix=video3Dm; % Generate descriptors at locations given by subs matrix for i=1:100...

I've been trying to implement k-medoids in C++. So far, I've come up with implementing k-medoids by supplying the number of clusters (or the number of seeds), as described in Wikipedia's k-medoids page. Now, what I'm trying to do is to implement it by supplying the distances, instead of the...

I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array and not with one-dimensional ones. I guess there is...

I want to compare the ROCK clustering algorithm to a distance-based algorithm. Let's say we have (m) training examples and (n) features. ROCK: From what I understand, what ROCK does is: 1. It calculates a similarity matrix (m*m) using Jaccard coefficients. 2. Then a threshold value is provided by...
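The first step described above (an m*m similarity matrix built from Jaccard coefficients) can be sketched in a few lines of Python. This is only an illustration under assumptions: each example is represented as a set of `feature=value` tokens, which the question does not show, and `jaccard` and `similarity_matrix` are hypothetical helper names, not part of ROCK or of any library.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def similarity_matrix(examples: List[Set[str]]) -> List[List[float]]:
    """Build the symmetric m*m matrix, computing each pair only once."""
    m = len(examples)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        sim[i][i] = 1.0
        for j in range(i + 1, m):
            sim[i][j] = sim[j][i] = jaccard(examples[i], examples[j])
    return sim


# Made-up categorical examples, encoded as feature=value tokens (assumption).
examples = [
    {"color=red", "shape=round", "size=small"},
    {"color=red", "shape=round", "size=large"},
    {"color=blue", "shape=square", "size=large"},
]
sim = similarity_matrix(examples)
```

Encoding each value together with its feature name keeps equal values in different features from being counted as a match.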

I applied k-means clustering to a preprocessed image using the following MATLAB code: %B - input image C=rgb2gray(B); [idx centroids]=kmeans(double(C(:)),4); imseg = zeros(size(C,1),size(C,2)); for i=1:max(idx) imseg(idx==i)=i; end i=mat2gray(imseg); % i - output image Every time I display the output, the colors assigned to the output image change. How can I give...
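Cluster numbers from k-means are arbitrary and can differ between runs when the initial centroids are chosen randomly, which would explain colors changing from run to run. One common remedy is to renumber the clusters by sorted centroid value after each run. The sketch below shows that idea in plain Python with 0-based labels; `relabel_by_centroid` is a hypothetical helper, and the label and centroid values are made up for illustration.

```python
from typing import List, Sequence


def relabel_by_centroid(labels: Sequence[int], centroids: Sequence[float]) -> List[int]:
    """Renumber clusters so label 0 has the smallest centroid, 1 the next, and so on."""
    # order[new_label] == old_label, sorted by ascending centroid value
    order = sorted(range(len(centroids)), key=lambda k: centroids[k])
    new_id = {old: new for new, old in enumerate(order)}
    return [new_id[label] for label in labels]


# Two runs that found the same grouping but numbered the clusters differently.
run_a = relabel_by_centroid([0, 0, 1, 2], [10.0, 200.0, 90.0])
run_b = relabel_by_centroid([2, 2, 0, 1], [200.0, 90.0, 10.0])
```

After relabeling, both runs map the darkest cluster to 0 and the brightest to the highest label, so a fixed colormap gives the same colors each time.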

I have a use case where I have traffic data for every 15 minutes for 1 month. This data is collected for various resources in the network. Now I need to group resources that are similar (based on traffic usage pattern from 00:00 to 23:45). One way to check if...

I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix. I get a different outcome for the matrix. I have identified the source of the problem but do not possess an optimized solution for...

I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g ... I want to find good words that separate the cloud tags into (mostly) independent clusters. For example: word e is in cloud tags A and B but not C ... so e is a good separator to...

Currently I am exploring the kmeans function. I have a simple text file (test.txt) with the following entries. The data can be split into 2 clusters. 1 2 3 8 9 10 How do I plot the results of the kmeans function (using the plot function) along with the original data? I...

I have a movie dataset in the file moviedata.arff @relation movie @attribute annee numeric @attribute Action numeric @attribute Adventure numeric @attribute Drama numeric @attribute Romance numeric @attribute Comedy numeric @attribute Documentary numeric @attribute Sci-Fi numeric @attribute Triller numeric @attribute Crime numeric @attribute Musical numeric @attribute Children numeric @attribute Animation...

I would like to plot a 2D graph with the term on the x-axis and the TF-IDF score (or document id) on the y-axis for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix, but I do not know how to use that matrix to plot the graph....

I'm using Weka to do some text mining. I'm a little confused, so I'm here to ask: how can I (with a set of comments that are in some way classified as: notes, status of work, not conformity, warning) predict whether a new comment belongs to a...

I'm currently using k-means for clustering files. A question occurred to me: is it possible for a cluster to have no members at all? If so, what happens to the centroid of that cluster? Does it keep its previous value? Thanks...
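A cluster can indeed end up empty when no point is closest to its centroid, and what happens next depends on the implementation. The sketch below shows one possible policy, keeping the previous centroid, in plain Python on 1-D data. It is an illustration, not the behavior of any particular library (other implementations re-seed the empty cluster or raise an error); `assign` and `update_centroids` are hypothetical helpers and the data is made up.

```python
from typing import List


def assign(points: List[float], centroids: List[float]) -> List[int]:
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points]


def update_centroids(points: List[float], labels: List[int],
                     old: List[float]) -> List[float]:
    """Recompute each centroid as the mean of its members.

    Policy assumed here: an empty cluster keeps its previous centroid.
    """
    new = []
    for k, previous in enumerate(old):
        members = [p for p, label in zip(points, labels) if label == k]
        new.append(sum(members) / len(members) if members else previous)
    return new


points = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
centroids = [2.0, 9.0, 100.0]  # the third centroid is far from every point
labels = assign(points, centroids)
updated = update_centroids(points, labels, centroids)
```

With this policy the third cluster stays empty and its centroid stays at 100.0; re-seeding it at a data point is the usual alternative when every cluster must be populated.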

I am currently trying to solve a kind of regression task (predicting the value of the 'count' field) using KMeans clustering. The idea is trivial: Fit a cluster on my test dataset: k_means = cluster.KMeans(n_clusters=4, n_init = 20, init='random') k_means.fit(df[['DistanceToMidnight','season','DayType','weather','temp','atemp','humidity','windspeed','count']]) *notice that I do use 'count' in clustering. Then...

I have 8 traveling consultants who need to visit 155 groups across the continental United States. Is there a way to find the optimal 8 regions based on drive time using k-means clustering? I see there are some methods implemented already for other data sets, but they are not based...

I have a dataset which looks like this: {'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2'} This is already converted to a dict from CSV. Then I use DictVectorizer to convert it: from sklearn.feature_extraction import DictVectorizer vec = DictVectorizer() d = vec.fit_transform(data).toarray() Then I try...

In Python sklearn KMeans (see documentation), I was wondering what happens internally when passing an ndarray of shape (n, n_features) to the init parameter when n < n_clusters. Does it drop the given centroids and just start a k-means++ initialization, which is the default choice for the init parameter? (PDF paper...

I have a plot where x is test a and y is another test, b. Each student is tested twice. Each dot represents one student's "post minus pre" score on x and on y. As you can see, I assigned labels to the plot, but I want...

I'm using IBM SPSS Modeler 16.0 to analyze my data, which has four fields, all of them retrieved from a database as strings and converted to numbers with the replace node using to_number(). When I connect my node to the k-means node to create the clusters using that data...

I have a df that I got after running k-means clustering on my original dataset. I have 4 different clusters here, and I would like to know how much each of the 4 variables (V1 to V4) varies within each cluster. In other words, what variation in...

I am trying to do a scatter plot of a kmeans output which clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a certain color. sentence_list=["Hi how are you", "Good morning" ...] #i have 10 sentences km = KMeans(n_clusters=5,...

I want to cluster geo data (lat, long, timestamp) with k-means. I'm searching for a good core function, but I can't find a good paper or other sources for that. At the moment I multiply the time distance and the space distance: public static double dis(GeoData input1, GeoData input2) { double timeDis = Math.abs( input1.getTime() - input2.getTime()...
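One property of a product of distances is worth noting: it is zero whenever either factor is zero, so two points recorded at the same instant on opposite sides of the globe would count as identical. A hedged alternative is to convert time into a distance-like quantity with a scale factor and combine the two parts additively. The Python sketch below assumes points are `(lat, lon, unix_seconds)` tuples and uses a hypothetical `km_per_hour` weight that would need tuning for the data; it is one option, not an established formula for this problem.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def space_time_distance(p, q, km_per_hour: float = 1.0) -> float:
    """Combine spatial and temporal separation as sqrt(space^2 + (w * hours)^2).

    p and q are (lat, lon, unix_seconds) tuples; km_per_hour is an assumed
    weight saying how many kilometers one hour of separation is worth.
    """
    space_km = haversine_km(p[0], p[1], q[0], q[1])
    hours = abs(p[2] - q[2]) / 3600.0
    return math.hypot(space_km, km_per_hour * hours)


same_time_far_apart = space_time_distance((0.0, 0.0, 0), (0.0, 90.0, 0))
same_place_hour_apart = space_time_distance((0.0, 0.0, 0), (0.0, 0.0, 3600))
```

Unlike the product, this value is zero only when both the spatial and the temporal separation are zero.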

I have tried the following code. img=imread("test1.jpg"); gimg=rgb2gray(img); imshow(gimg); bw = gimg < 255; L = bwlabel(bw); imshow(label2rgb(L, @jet, [.7 .7 .7])) s = regionprops(L, 'PixelIdxList', 'PixelList'); s(1).PixelList(1:4, :) idx = s(1).PixelIdxList; sum_region1 = sum(gimg(idx)); x = s(1).PixelList(:, 1); y = s(1).PixelList(:, 2); xbar = sum(x .* double(gimg(idx))) / sum_region1...

My environment: notebook with 8 GB RAM, Ubuntu 14.04, Solr 4.3.1, carrot2workbench 3.10.0 My Solr Index: 15980 documents My Problem: Cluster all documents with the kmeans algorithm When I drop off the query in the carrot2workbench (query: :), I always get a Java heap size error when using more than ~1000...

I want to run a k-means algorithm. I use Weka in Eclipse for this, and I have this code: public class demo { public demo() throws Exception { // TODO Auto-generated constructor stub BufferedReader breader = null; breader = new BufferedReader(new FileReader( "D:/logiciels/weka-3-7-12/weka-3-7-12/data/iris.arff")); Instances Train = new Instances(breader); Train.setClassIndex(Train.numAttributes() - 1); SimpleKMeans...

I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient at handling such attributes. I read that it calculates modes for such attributes. I want to know how the similarity is calculated. Let's take an example: Consider a dataset with 3...

I have a very large input file with the following format: ID \t time \t duration \t Description \t status The status column is limited to lowercase a, s, i or uppercase A, S, I or a mix of the two (sample element in status col: a,si, I, asi, ASI,...

Below is an implementation of the k-means algorithm: package com object Functions { def distance(l1: (Array[Double], Array[Double])) = { val t = l1._1.zip(l1._2) t.map(m => Math.abs(m._1 - m._2)).sum } } package com import com.Functions._ case class Point(label : String, points : Array[Double]) object KMeans2 extends Application { val points =...

My dataset looks something like this ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '',...

I am trying to build a clustering algorithm for categorical data. I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build my own and compare its accuracy and cost to the others. I have (m) training examples and (n=22) features. Approach My approach is...

I created a 3-dimensional random data set with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm to the data to see how well K-means can classify my samples based on the 4 created patterns/classes. I need help with the following: What function/code can I use to evaluate how well...
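When the true classes are known, one simple external measure is cluster purity: for each cluster, count the members that belong to its most frequent true class, then divide the total by the number of samples. The sketch below computes it in plain Python as an illustration of the metric. It is not MATLAB code, `purity` is a hypothetical helper, and the labels are made up; other measures, such as the adjusted Rand index, are also commonly used.

```python
from collections import Counter
from typing import Dict, Sequence


def purity(true_classes: Sequence[int], cluster_labels: Sequence[int]) -> float:
    """Fraction of samples that fall in the majority true class of their cluster."""
    per_cluster: Dict[int, Counter] = {}
    for cls, cluster in zip(true_classes, cluster_labels):
        per_cluster.setdefault(cluster, Counter())[cls] += 1
    # For each cluster, take the size of its largest true-class group.
    majority_total = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return majority_total / len(true_classes)


# Made-up example: cluster 1 mixes one sample of class 1 with three of class 2.
true_classes = [1, 1, 1, 2, 2, 2, 3, 3]
cluster_labels = [0, 0, 1, 1, 1, 1, 2, 2]
score = purity(true_classes, cluster_labels)
```

Purity does not depend on how the clusters are numbered, which matters because k-means cluster indices are arbitrary. It does rise as the number of clusters grows, so it is best compared at a fixed k.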

I'm running the quantization sample code found in the OpenCV documentation, and it's throwing Traceback (most recent call last): File "QuantizeTest.py", line 13, in <module> ret,label,center=cv2.kmeans(Z,K,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS) TypeError: an integer is required Here's the code itself: import numpy as np import cv2 img = cv2.imread('Sample.jpg') Z = img.reshape((-1,3)) # convert to...

I have a dataset of 2 columns in R, and am trying to use kmeans to cluster the data set. The command I use is kk <- kmeans(ageincome, center=4, iter.max=500, nstart=100) When I plot the result, what I observe from the plot is that R only clusters the data set...

I used the k-means clustering algorithm on a data frame df1 and the result is shown in the picture below. library(ade4) df1 <- data.frame(x=runif(100), y=runif(100)) plot(df1) km <- kmeans(df1, centers=3) kmeansRes<-factor(km$cluster) s.class(df1,fac=kmeansRes, add.plot=TRUE, col=rainbow(nlevels(kmeansRes))) Is it possible to add to the data frame the cluster that each observation comes...

I am trying to do k-means clustering with selected initial centroids. It says here that to specify your initial centers: init : {‘k-means++’, ‘random’ or an ndarray} If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. My code in Python: X = np.array([[-19.07480000,...