
Combining clustering algorithms in MapReduce

For my college project, I initially planned to implement a combined clustering algorithm on MapReduce. I have finished KMeans. Now my questions are: Can any other clustering algorithm be combined with KMeans on MapReduce? If so, which algorithm and what is the procedure? If combining is not possible, how...

R - cluster analysis on binary weblog data

r,cluster-analysis,k-means
I have web data that looks similar to the sample below. It simply has the user and a binary value for whether that user clicked on a particular link within a website. I wanted to do some clustering of this data. My main goal is to find similar users based...

Initial centroids for a 3-Means Clustering algorithm

cluster-analysis,k-means
If you want to cluster point data inside a bounding box with a 3-Means Clustering algorithm, what are 3 good (= result in few iterations) initial centroids in the average case without looking at the point data? (e.g.: what is a good distribution of the 3 centroids inside a box)

3D SIFT for human activity classification in videos. NOT GETTING GOOD ACCURACY.

matlab,video,svm,k-means,sift
I am trying to classify human activities in videos (six classes and almost 100 videos per class, 6*100=600 videos). I am using 3D SIFT (both xy and t scale=1) from UCF. for f= 1:20 f offset = 0; c=strcat('running',num2str(f),'.mat'); load(c) pix=video3Dm; % Generate descriptors at locations given by subs matrix for i=1:100...

Implement k-medoids algorithm by supplying the distances between data objects and the medoids

c++,algorithm,k-means
I've been trying to implement k-medoids in C++. So far, I've come up with implementing k-medoids by supplying the number of clusters (or the number of seeds), as described in Wikipedia's k-medoids page. Now, what I'm trying to do is to implement it by supplying the distances, instead of the...

Scikit-learn: How to run KMeans on a one-dimensional array?

python,scikit-learn,data-mining,k-means
I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to only this vector to find the different clusters in which the values are grouped. However, it seems KMeans works with a multidimensional array and not with one-dimensional ones. I guess there is...

Clustering Categorical data-set with distance based approach

python,machine-learning,cluster-analysis,k-means
I want to compare the ROCK clustering algorithm to a distance-based algorithm. Let's say we have (m) training examples and (n) features. ROCK: From what I understand, ROCK does the following: 1. It calculates a similarity matrix (m*m) using Jaccard coefficients. 2. Then a threshold value is provided by...
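For readers following step 1 above, here is a minimal sketch of an (m*m) Jaccard similarity matrix over categorical records. It is not the asker's code or the ROCK reference implementation. The helper names `jaccard` and `similarity_matrix`, the "feature=value" token encoding, and the sample rows are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty records count as identical.
        return 1.0
    return len(a & b) / len(union)


def similarity_matrix(rows):
    """Build the full m*m matrix of pairwise Jaccard coefficients."""
    m = len(rows)
    return [[jaccard(rows[i], rows[j]) for j in range(m)] for i in range(m)]


# Hypothetical records: each row is a set of "feature=value" tokens.
rows = [
    {"color=red", "size=L", "shape=round"},
    {"color=red", "size=M", "shape=round"},
    {"color=blue", "size=S", "shape=square"},
]
sim = similarity_matrix(rows)
for line in sim:
    print(line)
```

The matrix is symmetric with ones on the diagonal, so a larger data set would only need the upper triangle computed.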

Displaying kmean result with specific colors to specific clusters

matlab,k-means
I applied k-means clustering to a preprocessed image using the following MATLAB code %B - input image C=rgb2gray(B); [idx centroids]=kmeans(double(C(:)),4); imseg = zeros(size(C,1),size(C,2)); for i=1:max(idx) imseg(idx==i)=i; end i=mat2gray(imseg); % i - output image Every time I display the output, the colors assigned to the output image change. How can I give...
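One possible explanation, offered as an assumption rather than a diagnosis, is that k-means numbers its clusters arbitrarily, so the same region can get a different index on each run. A common workaround is to renumber the clusters by sorted centroid value before mapping labels to colors. The sketch below shows that idea in Python with 0-based labels (MATLAB labels start at 1). The function name `relabel_by_centroid` and the sample values are hypothetical.

```python
def relabel_by_centroid(labels, centroids):
    """Renumber clusters so that label 0 has the smallest centroid,
    label 1 the next smallest, and so on."""
    # order[new_label] = old_label, sorted by centroid value
    order = sorted(range(len(centroids)), key=lambda k: centroids[k])
    remap = {old: new for new, old in enumerate(order)}
    new_labels = [remap[label] for label in labels]
    new_centroids = [centroids[k] for k in order]
    return new_labels, new_centroids


# Hypothetical output of one k-means run on gray levels.
labels = [2, 0, 1, 2]
centroids = [200.0, 50.0, 120.0]
stable_labels, stable_centroids = relabel_by_centroid(labels, centroids)
print(stable_labels)     # [1, 2, 0, 1]
print(stable_centroids)  # [50.0, 120.0, 200.0]
```

After relabeling, a fixed label-to-color table gives the darkest cluster the same color on every run, provided the runs converge to similar centroids.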

Clustering based on pearson correlation

cluster-analysis,data-mining,k-means,hierarchical-clustering,dbscan
I have a use case where I have traffic data sampled every 15 minutes for 1 month. This data is collected for various resources in a network. Now I need to group resources which are similar (based on traffic usage pattern from 00:00 to 23:45). One way to check if...
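As a rough illustration of the correlation-based similarity named in the title, the sketch below computes the Pearson correlation of two usage profiles and turns it into a distance, 1 - r. The function names and the sample profiles are assumptions for this example, not part of the question.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson(x, y)


# Hypothetical traffic profiles (one value per time slot).
resource_a = [1.0, 2.0, 3.0, 4.0]
resource_b = [2.0, 4.0, 6.0, 8.0]   # same shape, different scale
resource_c = [8.0, 6.0, 4.0, 2.0]   # opposite shape
print(correlation_distance(resource_a, resource_b))
print(correlation_distance(resource_a, resource_c))
```

A pairwise distance matrix built this way is a natural input for hierarchical clustering or DBSCAN. Standard k-means works on coordinates with Euclidean distance, so it cannot consume such a matrix directly.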

How to explain a higher percentage of point variability using kmeans clustering? [closed]

r,statistics,cluster-analysis,k-means
I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

How to do column wise intersection with itertools

python-2.7,machine-learning,cluster-analysis,data-mining,k-means
When I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age,Occupation,Gender,Product_range, Product_cat and Product), forming an (m*m) similarity matrix, I get a different outcome for the matrix. I have identified the source of the problem but do not possess an optimized solution for...

cluster-analysis,k-means,rapidminer
I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g ... I want to find good words that separate the cloud tags into (mostly) independent clusters. For example: word e occurs with cloud tags A and B but not C ... so e is a good separator to...

k means plot with the original data in r

r,plot,k-means
Currently I am exploring the kmeans function. I have a simple text file (test.txt) with the following entries. The data can be split into 2 clusters. 1 2 3 8 9 10 How can I plot the results of the kmeans function (using the plot function) along with the original data? I...

exception:Not enough training instances (required: 1, provided: 0)! in weka

java,exception,weka,data-mining,k-means
I have a movie dataset in the file moviedata.arff @relation movie @attribute annee numeric @attribute Action numeric @attribute Adventure numeric @attribute Drama numeric @attribute Romance numeric @attribute Comedy numeric @attribute Documentary numeric @attribute Sci-Fi numeric @attribute Triller numeric @attribute Crime numeric @attribute Musical numeric @attribute Children numeric @attribute Animation...

plot a document tfidf 2D graph

python,numpy,scipy,scikit-learn,k-means
I would like to plot a 2D graph with the x-axis as term and y-axis as TFIDF score (or document id) for my list of sentences. I used scikit-learn's fit_transform() to get the scipy matrix but I do not know how to use that matrix to plot the graph....

How to do prediction with weka

weka,k-means,prediction
I'm using Weka to do some text mining and I'm a little confused. Given a set of comments that are already classified in some way (notes, status of work, non-conformity, warning), how can I predict whether a new comment belongs to a...

Is it possible for a k-means cluster to have no members?

k-means,centroid
I'm currently using k-means for clustering files. A question occurred to me: is it possible for a cluster to have no members at all? If so, what happens to the centroid of that cluster? Does it keep its previous value? Thanks...
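A small sketch may help here. In the plain Lloyd-style iteration a cluster can end up empty when its centroid starts far from all the data, and what happens next depends on the implementation. The code below assumes one simple policy, keeping the old centroid. Other implementations may re-seed the cluster or raise an error. The function names and the 1-D sample data are made up for illustration.

```python
def assign(points, centroids):
    """Return, for each 1-D point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recompute centroids as cluster means.

    Assumed policy: an empty cluster keeps its previous centroid.
    """
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids


points = [1.0, 2.0, 3.0, 10.0, 11.0]
centroids = [2.0, 10.5, 100.0]        # third centroid is far from all data
labels = assign(points, centroids)
new_centroids = update(points, labels, centroids)
print(labels)          # no point is assigned to cluster 2
print(new_centroids)   # cluster 2 keeps its old centroid under this policy
```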

Scikit-learn KMeans clustering - fit cluster with X features, predict cluster membership with X-1 features?

python,scikit-learn,cluster-analysis,k-means
I am currently trying to solve some kind of a regression task (predict a value of 'count' field) using a KMeans clustering. The idea is trivial: Fit a cluster on my test dataset: k_means = cluster.KMeans(n_clusters=4, n_init = 20, init='random') k_means.fit(df[['DistanceToMidnight','season','DayType','weather','temp','atemp','humidity','windspeed','count']]) *notice that I do use 'count' in clustering. Then...

K-Means Clustering a list of US addresses based on drive time

excel,matlab,cluster-analysis,k-means,geo
I have 8 traveling consultants that need to visit 155 groups across the continental United States. Is there a way to find the optimal 8 regions based on drive time using k-means clustering? I see there are some methods implemented already for other data sets, but they are not based...

How to get meaningful results of kmeans in scikit-learn

python,machine-learning,scikit-learn,k-means
I have a dataset which looks like this: {'dns_query_count': '11', 'http_hostnames_count': '7', 'dest_port_count': '3', 'ip_count': '11', 'signature_count': '0', 'src_ip': '10.0.64.42', 'http_user_agent_count': '2'} This is already converted to a dict from CSV. Then I use DictVectorizer to convert it: from sklearn.feature_extraction import DictVectorizer vec = DictVectorizer() d = vec.fit_transform(data).toarray() Then I try...

How does sklearn.cluster.KMeans handle an init ndarray parameter with missing centroids (available centroids less than n_clusters)?

python,scikit-learn,k-means
In Python sklearn KMeans (see documentation), I was wondering what happens internally when passing an ndarray of shape (n, n_features) to the init parameter when n < n_clusters. Does it drop the given centroids and just start a k-means++ initialization, which is the default choice for the init parameter? (PDF paper...

Is there a way to access or export the label numbers in an r plot?

r,plot,cluster-analysis,k-means
I have a plot where x is a test a and y is another test b. Each student is tested twice. Each dot represents one student's "post minus pre" score on x and on y. As you can see, I assigned labels to the plot, but I want...

Can't run k-means with SPSS Modeler 16

k-means,spss
I'm using IBM SPSS Modeler 16.0 to analyze data that has four fields, all of which are retrieved from a database as strings and converted to numbers with a replace node using to_number(). When I connect my node to the k-means node to create the clusters using that data...

Summarize variable variations in clusters (k-means) using R

r,cluster-analysis,k-means
I have a df that I got after running k-means clustering on my original dataset. I have 4 different clusters here and what I would like to know is how much the 4 variables (V1 to V4) vary within each cluster. In other words, what variation in...

kmeans scatter plot: plot different colors per cluster

python,numpy,matplotlib,scipy,k-means
I am trying to do a scatter plot of a kmeans output which clusters sentences of the same topic together. The problem I am facing is plotting the points that belong to each cluster in a specific color. sentence_list=["Hi how are you", "Good morning" ...] #i have 10 sentences km = KMeans(n_clusters=5,...

k-means core function for temporal geo data

k-means,core,geo,temporal
I want to cluster geo data (lat,long,timestamp) with k-means. I'm searching for a good core function, but I can't find a good paper or other sources for that. Currently I multiply the time and the space distance: public static double dis(GeoData input1, GeoData input2) { double timeDis = Math.abs( input1.getTime() - input2.getTime()...

Octave: Kmeans clustering not working on an image matrix

matlab,image-processing,machine-learning,octave,k-means
I have tried the following code. img=imread("test1.jpg"); gimg=rgb2gray(img); imshow(gimg); bw = gimg < 255; L = bwlabel(bw); imshow(label2rgb(L, @jet, [.7 .7 .7])) s = regionprops(L, 'PixelIdxList', 'PixelList'); s(1).PixelList(1:4, :) idx = s(1).PixelIdxList; sum_region1 = sum(gimg(idx)); x = s(1).PixelList(:, 1); y = s(1).PixelList(:, 2); xbar = sum(x .* double(gimg(idx))) / sum_region1...

Got java heap size error when trying to cluster 15980 documents via carrot2workbench

solr,cluster-analysis,k-means,workbench,carrot
My environment: 8 GB RAM notebook with Ubuntu 14.04, Solr 4.3.1, carrot2workbench 3.10.0. My Solr index: 15980 documents. My problem: cluster all documents with the k-means algorithm. When I submit the query in the carrot2workbench (query: *:*), I always get a Java heap size error when using more than ~1000...

Cannot handle any class attribute! kmeans java

java,weka,k-means
I want to run a k-means algorithm using Weka in Eclipse. I have this code: public class demo { public demo() throws Exception { // TODO Auto-generated constructor stub BufferedReader breader = null; breader = new BufferedReader(new FileReader( "D:/logiciels/weka-3-7-12/weka-3-7-12/data/iris.arff")); Instances Train = new Instances(breader); Train.setClassIndex(Train.numAttributes() - 1); SimpleKMeans...

Weka Simple K means handling nominal attributes

cluster-analysis,weka,k-means
I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes. I read that it calculates modes for such attributes. I want to know how the similarity is calculated. Let's take an example: Consider a dataset with 3...

Different clustering algorithms to cluster timeseries events

algorithm,cluster-analysis,k-means,hierarchical-clustering
I have a very large input file with the following format: ID \t time \t duration \t Description \t status The status column is limited to contain either lower case a,s,i or upper case A,S,I or a mix of the two (sample element in status col: a,si, I, asi, ASI,...

Returning extra data as part of return type

scala,k-means
Below is an implementation of the k-means algorithm: package com object Functions { def distance(l1: (Array[Double], Array[Double])) = { val t = l1._1.zip(l1._2) t.map(m => Math.abs(m._1 - m._2)).sum } } package com import com.Functions._ case class Point(label : String, points : Array[Double]) object KMeans2 extends Application { val points =...

How to cluster a set of strings?

machine-learning,cluster-analysis,k-means,hierarchical-clustering
My dataset looks something like this ['', 'ABCDH', '', '', 'H', 'HHIH', '', '', '', '', '', '', '', '', '', '', '', 'FECABDAI', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'FABHJJFFFFEEFGEE', 'FFFF', '', '', '', '', '', '', '',...

Clustering Categorical data using Jaccard similarity

python-2.7,machine-learning,cluster-analysis,data-mining,k-means
I am trying to build a clustering algorithm for categorical data. I have read about different algorithms like k-modes, ROCK, LIMBO, however I would like to build one of my own and compare its accuracy and cost to the others. I have (m) training examples and (n=22) features. Approach: My approach is...

Evaluating K-means accuracy

matlab,k-means
I created a 3-dimensional random data set with 4 defined patterns/classes in MATLAB. I applied the K-means algorithm to the data to see how well K-means can classify my samples based on the 4 created patterns/classes. I need help with the following: What function/code can I use to evaluate how well...
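One common way to score a clustering against known classes, given here as a possible approach and not as the accepted answer, is cluster purity: each cluster votes for its majority class, and the score is the fraction of samples that match their cluster's vote. Cluster indices are arbitrary, so comparing labels element by element would be misleading. The sketch is in Python, not MATLAB, and the function name `purity` and the label vectors are hypothetical.

```python
from collections import Counter


def purity(true_labels, cluster_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    class_counts = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        class_counts.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(c.values()) for c in class_counts.values())
    return majority_total / len(true_labels)


# Hypothetical ground truth and k-means output (cluster ids are arbitrary).
true_labels = [0, 0, 0, 1, 1, 1, 2, 2]
cluster_labels = [2, 2, 1, 1, 1, 1, 0, 0]
score = purity(true_labels, cluster_labels)
print(score)  # 0.875
```

Purity rises as the number of clusters grows, so it is most informative when the number of clusters equals the number of known classes, as in the setup described above.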

OpenCV quantization sample code not running

python,opencv,k-means,quantization
I'm running the quantization sample code found in the OpenCV documentation, and it's throwing Traceback (most recent call last): File "QuantizeTest.py", line 13, in <module> ret,label,center=cv2.kmeans(Z,K,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS) TypeError: an integer is required Here's the code itself: import numpy as np import cv2 img = cv2.imread('Sample.jpg') Z = img.reshape((-1,3)) # convert to...

kmeans gives wrong cluster in R

r,k-means
I have a dataset of 2 columns in R, and am trying to use kmeans to cluster the data set. The command I use is kk <- kmeans(ageincome, center=4, iter.max=500, nstart=100) When I plot the result, what I observe from the plot is that R only clusters the data set...

Assign class to data frame after clustering

r,cluster-analysis,data-mining,k-means
I used the k-means clustering algorithm on a data frame df1 and the result is shown in the picture below. library(ade4) df1 <- data.frame(x=runif(100), y=runif(100)) plot(df1) km <- kmeans(df1, centers=3) kmeansRes<-factor(km$cluster) s.class(df1,fac=kmeansRes, add.plot=TRUE, col=rainbow(nlevels(kmeansRes))) Is there a way to add to the data frame the information about which cluster each observation comes...

k-means with selected initial centers

python,numpy,scikit-learn,k-means
I am trying to run k-means clustering with selected initial centroids. It says here that to specify your initial centers: init : {‘k-means++’, ‘random’ or an ndarray} If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. My code in Python: X = np.array([[-19.07480000,...