How to handle data with names in scikit-learn?

python,python-2.7,scikit-learn
I am about to experiment with clustering algorithms to cluster file attributes (e.g. access time). Does scikit support clustering of named data, i.e., how can I retrieve the file names after the clustering algorithm has run? Is there a way to store metadata with the training data, e.g., the file names?...

NearestNeighbors tradeoff - run faster with less accurate results

python,scikit-learn,smooth,nearest-neighbor
I'm working with a medium-sized dataset (shape=(14013L, 46L)). I want to smooth each sample with its k nearest neighbors. I'm training my model with: NearestNeighbors(n_neighbors, algorithm='ball_tree', metric=sklearn.metrics.pairwise.cosine_distances) The smoothing is done as follows: def smooth(x,nbrs,data,alpha): """ input: alpha: the smoothing factor nbrs: trained NearestNeighbors from sklearn data: the original data (since...

Finding a corresponding leaf node for each data point in a decision tree (scikit-learn)

python,machine-learning,scikit-learn,decision-tree
I'm using decision tree classifier from the scikit-learn package in python 3.4, and I want to get the corresponding leaf node id for each of my input data point. For example, my input might look like this: array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7,...
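
One way to get the leaf id for each sample, shown as a minimal sketch with made-up data (recent scikit-learn versions expose apply() on the classifier itself; older releases only have it on clf.tree_, which expects float32 input):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[5.1, 3.5, 1.4, 0.2],
                  [4.9, 3.0, 1.4, 0.2],
                  [4.7, 3.2, 1.3, 0.2]])
    y = np.array([0, 0, 1])

    clf = DecisionTreeClassifier().fit(X, y)
    leaf_ids = clf.apply(X)                       # one leaf node id per input row
    # older releases: leaf_ids = clf.tree_.apply(X.astype(np.float32))
    print(leaf_ids)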

Scikit learn, fitting a gaussian to a histogram

python,scikit-learn
In scikit-learn, fitting a gaussian peak using GMM seems to work with discrete data points. Is there a way of using GMM with data which has already been binned, or aggregated into a histogram? For example, the following code is a work-around which converts the binned data into discrete...

Python Scikit Decision Tree with variable number of outputs

python,scikit-learn,decision-tree
I'm looking to setup a multi-output decision tree using the Python SciKit library. The problem I'm facing however is that it's not a simple "n_outputs" classification. Some samples will have 3 outputs, some 4, some 5. I'm not sure what the best way is to convey this to the library....

load data from csv into Scikit learn SVM

python,csv,numpy,scikit-learn
I want to train an SVM to perform a classification of samples. I have a csv file with me that has 3 columns with headers: feature 1, feature 2, class label and 20 rows (= number of samples). Now I quote from the Scikit-Learn documentation "As other classifiers, SVC, NuSVC and...
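
A minimal sketch of one way to go from such a CSV to an SVC (the file name 'samples.csv' and the assumption of numeric class labels are illustrative; for mixed types pandas.read_csv is the usual choice):

    import numpy as np
    from sklearn import svm

    # header row: feature 1,feature 2,class label
    data = np.genfromtxt('samples.csv', delimiter=',', skip_header=1)
    X = data[:, :2]     # the two feature columns, shape (20, 2)
    y = data[:, 2]      # the class label column

    clf = svm.SVC()
    clf.fit(X, y)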

Error importing scikit-learn modules

python,scikit-learn
I'm trying to call a function from the cluster module, like so: import sklearn db = sklearn.cluster.DBSCAN() and I get the following error: AttributeError: 'module' object has no attribute 'cluster' Tab-completing in IPython, I seem to have access to the base, clone, externals, re, setup_module, sys, and warning modules. Nothing...
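
The likely cause is that scikit-learn does not load its submodules on a bare import sklearn; a minimal sketch of the explicit import:

    from sklearn.cluster import DBSCAN
    db = DBSCAN()

    # or, equivalently:
    import sklearn.cluster
    db = sklearn.cluster.DBSCAN()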

Average values of precision, recall and fscore for each label

scikit-learn,classification,cross-validation,precision-recall
I'm cross validating a sklearn classifier model and want to quickly obtain average values of precision, recall and f-score. How can I obtain those values? I don't want to code the cross validation by myself, instead I'm using the function cross_validation.cross_val_score. Is it possible to use this function to obtain...
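
A minimal sketch using the scoring argument of cross_val_score (assuming a scikit-learn version that provides the *_weighted scorer names; the data and classifier are illustrative):

    from sklearn import datasets
    from sklearn.svm import SVC
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

    iris = datasets.load_iris()
    clf = SVC(kernel='linear')

    for scoring in ('precision_weighted', 'recall_weighted', 'f1_weighted'):
        scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring=scoring)
        print(scoring, scores.mean())   # average over the folds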

Using Cross-Validation on a Scikit-Learn Classifer

python,scikit-learn,cross-validation
I have a working classifier with a dataset split in a train set (70%) and a test set (30%). However, I'd like to implement a validation set as well (so that: 70% train, 20% validation and 10% test). The sets should be randomly chosen and the results should be averaged...

RandomForestClassifier.fit(): ValueError: could not convert string to float

python,scikit-learn,random-forest
Given is a simple CSV file: A,B,C Hello,Hi,0 Hola,Bueno,1 Obviously the real dataset is far more complex than this, but this one reproduces the error. I'm attempting to build a random forest classifier for it, like so: cols = ['A','B','C'] col_types = {'A': str, 'B': str, 'C': int} test =...
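
Trees in scikit-learn need numeric input, so the string columns have to be encoded first; a minimal sketch using pandas.get_dummies (one-hot encoding) on the toy CSV from the question:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.DataFrame({'A': ['Hello', 'Hola'],
                       'B': ['Hi', 'Bueno'],
                       'C': [0, 1]})

    X = pd.get_dummies(df[['A', 'B']])   # one 0/1 column per string value
    y = df['C']

    clf = RandomForestClassifier(n_estimators=10)
    clf.fit(X, y)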

ImportError: No module named sklearn.cross_validation

python,scikit-learn
I am using python 2.7 in Ubuntu 14.04. I installed scikit-learn, numpy and matplotlib with these commands: sudo apt-get install build-essential python-dev python-numpy \ python-numpy-dev python-scipy libatlas-dev g++ python-matplotlib \ ipython But when I import these packages: from time import time import logging import matplotlib.pyplot as plt from sklearn.cross_validation import...

Add Features to An Sklearn Classifier

python,machine-learning,nlp,scikit-learn
I'm building an SGDClassifier and using a tfidf transformer. Aside from the features created from tfidf, I'd also like to add additional features like document length or other ratings. How can I add these features to the feature set? Here is how the classifier is constructed in a pipeline: data =...
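
One possible approach outside of the pipeline is to stack hand-crafted columns onto the tf-idf matrix with scipy.sparse.hstack (inside a Pipeline the same idea is usually wrapped in a FeatureUnion); a minimal sketch with made-up documents:

    import numpy as np
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    docs = ["first document", "a second, much longer document here"]
    y = [0, 1]

    tfidf = TfidfVectorizer()
    X_text = tfidf.fit_transform(docs)

    extra = np.array([[len(d)] for d in docs])   # e.g. document length as an extra feature
    X = hstack([X_text, extra]).tocsr()

    clf = SGDClassifier().fit(X, y)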

How to properly install sklearn on Eclipse

python-2.7,scikit-learn
I recently came across a blog (http://stronginference.com/ScipySuperpack/) on how to install sklearn. I successfully installed it and it was stored at the path: /usr/local/lib/python2.7/site-packages/sklearn. I then went to the properties of my Eclipse, under Interpreter-Python, and added the path to the PYTHONPATH. I could import sklearn but when I...

Accessing transformer functions in `sklearn` pipelines

python,scikit-learn
According to sklearn.pipeline.Pipeline documentation, The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline. The following example creates...
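
Intermediate transformers in a fitted Pipeline can be reached through named_steps (or the steps list); a minimal sketch with made-up data:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LogisticRegression())])
    pipe.fit(["spam spam offer", "meeting agenda notes"], [1, 0])

    vect = pipe.named_steps['tfidf']       # the fitted transformer, not a copy
    print(vect.get_feature_names())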

Using the predict_proba() function of RandomForestClassifier in the safe and right way

python,machine-learning,scikit-learn,random-forest
I'm using Scikit-learn to apply machine learning algorithms on my datasets. Sometimes I need to have the probabilities of labels/classes instead of the labels/classes themselves. Instead of having Spam/Not Spam as labels of emails, I wish to have, for example, a 0.78 probability that a given email is Spam. For such...
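
A minimal sketch of reading predict_proba output safely: the column order follows clf.classes_, so look the class up there rather than assuming a fixed position (the data below is made up):

    from sklearn.ensemble import RandomForestClassifier

    X = [[0, 0], [1, 1], [0, 1], [1, 0]]
    y = ['NotSpam', 'Spam', 'NotSpam', 'Spam']

    clf = RandomForestClassifier(n_estimators=10).fit(X, y)

    probas = clf.predict_proba(X)                  # shape (n_samples, n_classes)
    spam_col = list(clf.classes_).index('Spam')    # column order is defined by clf.classes_
    print(probas[:, spam_col])                     # P(Spam) for each sample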

How to make a binary decomposition of pandas.Series column

python,pandas,machine-learning,scikit-learn
I want to decompose a pandas.Series into several other columns (number of column = number of values), save that factorization and use it with other DataFrame or Series. Something like pandas.get_dummies which will remember mapping and can handle NaN. Example. Given the following DataFrame: A B 0 a 0 1...

NaiveBayes classifier handling different data types in python

python,scikit-learn,gaussian,naivebayes
I am trying to implement a Naive Bayes classifier in Python. My attributes are of different data types: strings, int, float, Boolean, ordinal. I could use the Gaussian Naive Bayes classifier (sklearn.naive_bayes), but I do not know how the different data types are to be handled. The...

How to specify the prior probability for scikit-learn's Naive Bayes

python,syntax,machine-learning,scikit-learn
I'm using the scikit-learn machine learning library (Python) for a machine learning project. One of the algorithms I'm using is the Gaussian Naive Bayes implementation. One of the attributes of the GaussianNB() function is the following: class_prior_ : array, shape (n_classes,) I want to alter the class prior manually since...

sklearn.cross_validation.StratifiedShuffleSplit - error: “indices are out-of-bounds”

python,pandas,scikit-learn
I was trying to split the sample dataset using Scikit-learn's Stratified Shuffle Split. I followed the example shown on the Scikit-learn documentation here import pandas as pd import numpy as np # UCI's wine dataset wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv") # separate target variable from dataset target = wine['quality'] data = wine.drop('quality',axis...

Find the tf-idf score of specific words in documents using sklearn

python,scikit-learn,tf-idf
I have code that runs basic TF-IDF vectorizer on a collection of documents, returning a sparse matrix of D X F where D is the number of documents and F is the number of terms. No problem. But how do I find the TF-IDF score of a specific term in...
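
A minimal sketch: the vectorizer's vocabulary_ maps a term to its column in the D x F matrix, so the per-document scores of one word are just that column (the documents below are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]
    vect = TfidfVectorizer()
    tfidf = vect.fit_transform(docs)             # D x F sparse matrix

    col = vect.vocabulary_['cat']                # column index of the term "cat"
    print(tfidf[:, col].toarray().ravel())       # tf-idf of "cat" in each document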

Calling scikit-learn functions from C++

python,c++,opencv,boost,scikit-learn
Is there a way to call scikit-learn's functions from C++? Most of the rest of my code is in C++ with OpenCV. I would like to be able to use the classifiers scikit-learn provides. As far as I understand, there's no easy way - I need to use boost::python or SWIG. I...

Plotting a ROC curve in scikit yields only 3 points

python,validation,machine-learning,scikit-learn,roc
TLDR: scikit's roc_curve function is only returning 3 points for a certain dataset. Why could this be, and how do we control how many points to get back? I'm trying to draw a ROC curve, but consistently get a "ROC triangle". lr = LogisticRegression(multi_class = 'multinomial', solver = 'newton-cg') y...
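
The usual cause is passing hard 0/1 predictions to roc_curve instead of continuous scores, which leaves only one threshold; a minimal binary sketch with generated data (the question itself is multinomial, so this is only illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve

    X, y = make_classification(n_samples=200, random_state=0)
    lr = LogisticRegression().fit(X, y)

    scores = lr.predict_proba(X)[:, 1]            # continuous scores, not lr.predict(X)
    fpr, tpr, thresholds = roc_curve(y, scores)
    print(len(thresholds))                        # many points, not just 3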

Can one train estimators in a scikit-learn pipeline simultaneously?

python,machine-learning,scipy,scikit-learn,pipeline
Is it possible to do the following in scikit-learn? We train an estimator A using the given mapping from features to targets, then we use the same data (or mapping) to train another estimator B, then we use outputs of the two trained estimators (A and B) as inputs for...

Differences between the F1-score values in sklearn.metrics.classification_report and sklearn.metrics.f1_score with a binary confusion matrix

python,scikit-learn,confusion-matrix
I have (true) boolean values and predicted boolean values like: y_true = np.array([True, True, False, False, False, True, False, True, True, False, True, False, False, False, False, False, True, False, True, True, True, True, False, False, False, True, False, True, False, False, False, False, True, True, False, False, False, True,...

Scikit learn SGDClassifier: precision and recall change the values each time

scikit-learn,classification,precision-recall
I have a question about the precision and recall values in scikit-learn. I am using SGDClassifier to classify my data. To evaluate the performance, I am using the precision and recall function precision_recall_fscore_support, but each time I run the program I get different values in the...

What's the most pythonic way to load a matrix in ijv/coo/triplet format?

python,pandas,scipy,scikit-learn
My input file is in ijv/coo/triplet format with string column names, eg: Apple,Google,1 Apple,Banana,5 Microsoft,Orange,2 Should result in this 2x3 matrix: [[1,5,0], [0,0,2]] I can read it manually by putting the column names to dictionaries and create a scipy sparse coo_matrix with that dict mapping to IDs. I would like...

scikit-learn pipeline

python,scikit-learn,pipeline,feature-selection
Each sample in my (iid) dataset looks like this: x = [a_1,a_2...a_N,b_1,b_2...b_M] I also have the label of each sample (this is supervised learning). The a features are very sparse (namely a bag-of-words representation), while the b features are dense (integers, there are ~45 of those). I am using scikit-learn, and...

Only ignore stop words for ngram_range=1

python,nlp,scikit-learn
I am using CountVectorizer from sklearn...looking to provide a list of stop words and apply the count vectorizer for ngram_range of (1,3). From what I can tell, if a word - say "me" - is in the list of stop words, then it doesn't get seen for higher ngrams i.e.,...

Problems obtaining most informative features with scikit learn?

python,pandas,machine-learning,nlp,scikit-learn
I'm trying to obtain the most informative features from a textual corpus. From this well-answered question I know that this task could be done as follows: def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10): labelid = list(classifier.classes_).index(classlabel) feature_names = vectorizer.get_feature_names() topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:] for coef, feat in topn: print classlabel, feat,...

using OneHotEncoder with sklearn_pandas DataFrameMapper

scikit-learn
I am trying to use sklearn_pandas DataFrameMapper. This takes in the column names along with the preprocessing Transformation function that is required for that column. Like so, mapper = sklearn_pandas.DataFrameMapper([ ('hour',None), ('season',sklearn.preprocessing.OneHotEncoder()), ('holiday',None) ]) season is an int64 col in my pandas DataFrame. This gives me the following error -...

How to use meshgrid with large arrays in Matplotlib?

python,arrays,matplotlib,scipy,scikit-learn
I have trained a machine learning binary classifier on a 100x85 array in sklearn. I would like to be able to vary 2 of the features in the array, say column 0 and column 1, and generate contour or surface graph, showing how the predicted probability of falling in one...

sklearn Imputer() returned features does not fit in fit function

python,machine-learning,scikit-learn
I have a feature matrix with missing values (NaNs), so I need to impute those missing values first. However, the last line complains and throws the following error: Expected sequence or array-like, got Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0). I checked, and it seems the reason is that train_fea_imputed...

Scikit Learn Logistic Regression confusion

python-3.x,scikit-learn,logistic-regression
I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression # Create a sample dataframe data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]] columns=data.pop(0)...

How to get the number of components needed in PCA with all extreme variance?

scikit-learn,pca
I am trying to get the number of components needed to be used for classification. I have read a similar question Finding the dimension with highest variance using scikit-learn PCA and the scikit documents about this: http://scikit-learn.org/dev/tutorial/statistical_inference/unsupervised_learning.html#principal-component-analysis-pca However, this still did not answer my question. All of my PCA components...

Skimage Python33 Canny

python-3.x,numpy,scikit-learn,python-3.3
Long story short, I'm just simply trying to get a canny edged image of image.jpg. The documentation is very spotty so I'm getting very confused. If anyone can help that'd be greatly appreciated. from scipy import misc import numpy as np from skimage import data from skimage import feature from...

scikit-learn: Random forest class_weight and sample_weight parameters

python,scikit-learn
I have a class imbalance problem and been experimenting with a weighted Random Forest using the implementation in scikit-learn (>= 0.16). I have noticed that the implementation takes a class_weight parameter in the tree constructor and sample_weight parameter in the fit method to help solve class imbalance. Those two seem...

Undefined symbols in Scipy and Scikit-learn on RedHat

python,scipy,scikit-learn,atlas
I'm trying to install Scikit-Learn on a 64-bit Red Hat Enterprise 6.6 server on which I don't have root privileges. I've done a fresh installation of Python 2.7.9, Numpy 1.9.2, Scipy 0.15.1, and Scikit-Learn 0.16.1. The Atlas BLAS installation on the server is 3.8.4. I can install scikit-learn, but when...

Python LSA with Sklearn

python,scikit-learn,lsa
I'm currently trying to implement LSA with Sklearn to find synonyms in multiple Documents. Here is my Code: #import the essential tools for lsa from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from sklearn.decomposition import TruncatedSVD from sklearn.metrics.pairwise import cosine_similarity #other imports from os import listdir #load data datafolder =...

Problems vectorizing specific columns with scikit learn DictVectorizer?

python,python-2.7,pandas,machine-learning,scikit-learn
I would like to understand how to do a simple prediction task. I am playing with this dataset, which is also available here in a different format and is about students' performance in some course. I would like to vectorize some columns of the dataset in order to not use all...

How to calculate tf-idf for a list of dict?

python,scipy,scikit-learn
I have a list of texts where each text is stored as a dict with its id as key and texts data as its value. How can I calculate tf-idf for this data. E.g.: {1: 'This is cat', 2: 'Is this the first document?', 3: 'And the third one.'} ...

Why is TfidfVectorizer in scikit-learn showing this behavior?

python-2.7,scikit-learn,tf-idf
While creating a TfidfVectorizer object, if I explicitly pass even the default value for the token_pattern argument, it throws an error when I do fit_transform. Following is the error: ValueError: empty vocabulary; perhaps the documents only contain stop words I am doing this because eventually I want to pass a different value for...

I'm not sure how to interpret accuracy of this classification with Scikit Learn

python,machine-learning,scikit-learn,classification,text-classification
I am trying to classify text data, with Scikit Learn, with the method shown here. (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) except I am loading my own dataset. I'm getting results, but I want to find the accuracy of the classification results. from sklearn.datasets import load_files text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore',...

load_files in scikit-learn not loading all files in directory

python,machine-learning,dataset,scikit-learn,classification
I have a folder called 'emails' with two subfolders named after the label corresponding to the classification of files they have (spam or notspam emails, all are .txt files). There are 3000 files across the two subfolders. Using load_files: data = load_files('emails', shuffle='False') print len(data) print len(data.target) This prints '5'...

SciKit-learn for data driven regression of oscillating data

python,time-series,scikit-learn,regression,prediction
Long time lurker first time poster. I have data that roughly follows a y=sin(time) distribution, but also depends on other variables than time. In terms of correlations, since the target y-variable oscillates there is almost zero statistical correlation with time, but y obviously depends very strongly on time. The goal...

Identifying a sklearn-model's classes

python,scikit-learn,svm
The documentation on SVMs implies that an attribute called 'classes_' exists, which hopefully reveals how the model represents classes internally. I would like to get that information in order to interpret the output from functions like 'predict_proba', which generate probabilities of classes for a number of samples. Hopefully, knowing that...

Recover named features from L1 regularized logistic regression

python,machine-learning,scikit-learn
I have the following pipeline: sg = Pipeline([('tfidf', TfidfVectorizer()), ('normalize', Normalizer()), ('l1', LogisticRegression(penalty="l1", dual=False))]) and after performing the fitting, I want to extract the tokens that correspond to the non-zero weights. How can I do this?...
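
A minimal sketch of one way to do it: take the token names from the fitted TfidfVectorizer inside the pipeline and index them with the non-zero entries of coef_ (the training texts are made up, and solver='liblinear' is added explicitly since it was only the implicit default in older releases):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Normalizer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    sg = Pipeline([('tfidf', TfidfVectorizer()),
                   ('normalize', Normalizer()),
                   ('l1', LogisticRegression(penalty="l1", dual=False, solver='liblinear'))])
    sg.fit(["good movie", "bad movie", "good film", "bad film"], [1, 0, 1, 0])

    names = np.array(sg.named_steps['tfidf'].get_feature_names())
    coefs = sg.named_steps['l1'].coef_[0]          # one row per class; [0] for the binary case
    print(names[np.flatnonzero(coefs)])            # tokens with non-zero weights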

Why doesn't GridSearchCV give the best score? - Scikit Learn

python,r,machine-learning,scikit-learn,regression
I have a dataset with 158 rows and 10 columns. I am trying to build a multiple linear regression model and predict future values. I used GridSearchCV for tuning parameters. Here is my GridSearchCV and regression function: def GridSearch(data): X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, ground_truth_data, test_size=0.3, random_state =...

python sklearn cross_validation /number of labels does not match number of samples

python,scikit-learn,cross-validation
Doing a course on machine learning, and I want to split the data into train and test sets. I want to split it up, use Decisiontree on it for training, and then print out the score of my test set. The cross validation parameters in my code were given. Does...

Approach where features are combination of text(labels) and numerical

python,scikit-learn
I'm trying to figure out a good approach for a data set that includes text, which are really more like labels and numeric data. For example, in the data set, I have city, state, lat/lon and I want to classify. This is supervised, I have labels (y) for the data....

Could I re-initialize the sklearn library?

python,scikit-learn
http://screencloud.net/v/cPBi I had a problem importing the sklearn neighbors library (called "LSHForest"). The online example here does exactly the same thing I did when importing LSHForest, but mine is not working. Not really sure what could possibly be wrong. Do I have to reinstall Ubuntu (because I heard that reinstall...

How do I use the ML sklearn pipeline to predict?

scikit-learn
I have created an ML pipeline using sklearn_pandas and sklearn. It looks like this. features = ['ColA','ColB','ColC'] labels = 'ColD' mapper = sklearn_pandas.DataFrameMapper([ ('ColB',sklearn.preprocessing.StandardScaler()), ('ColC',sklearn.preprocessing.StandardScaler()) ]) pipe = sklearn.pipeline.Pipeline([ ('featurize',mapper), ('imputer',imputer), ('logreg',sklearn.linear_model.LogisticRegression()) ]) cross_val_score = sklearn_pandas.cross_val_score(pipe,traindf[features],traindf[labels],'log_loss') I like the model and the log loss...

One Hot Encoding for representing corpus sentences in python

python,machine-learning,nlp,scikit-learn,one-hot
I am a beginner with Python and the Scikit-learn library. I currently need to work on an NLP project which first needs to represent a large corpus by one-hot encoding. I have read Scikit-learn's documentation about preprocessing.OneHotEncoder; however, it does not seem to match my understanding of the term. Basically,...

Controlling the posterior probabilty threshold for LDA and QDA in scikit-learn

scikit-learn
Consider the following use case (this is cribbed completely from An Introduction to Statistical Learning by James, et al). You're attempting to predict whether or not a credit card owner will default based on various personal data. You're using Linear Discriminant Analysis (or, for purpose of this question, Quadratic Discriminant...

How to get feature names corresponding to scores for chi square feature selection in scikit

python,scikit-learn,chi-squared
I am using Scikit for feature selection, but I want to get the score values for all the unigrams in the text. I get the scores, but how do I map these to actual feature names? from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_selection import SelectKBest, chi2 Texts=["should schools have uniform","schools...
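
A minimal sketch: the CountVectorizer's feature names line up with SelectKBest's scores_, and get_support() marks the selected ones (the texts and labels below are made up):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    texts = ["should schools have uniform", "schools should not have uniform"]
    y = [1, 0]

    vect = CountVectorizer()
    X = vect.fit_transform(texts)

    selector = SelectKBest(chi2, k=3).fit(X, y)

    for name, score in zip(vect.get_feature_names(), selector.scores_):
        print(name, score)                                   # score of every unigram

    print(np.array(vect.get_feature_names())[selector.get_support()])   # the k kept names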

python: How to get real feature name from feature_importances

python,scikit-learn,classification,feature-selection
I am using Python's sklearn random forest (ensemble.RandomForestClassifier) to do classification and am using feature_importances_ to find significant features for the classifier. Now my code is: for trip in database: venue_feature_start.append(Counter(trip['POI'])) # Counter(trip['POI']) is like Counter({'school':1, 'hospital':1, 'bus station':2}),actually key is the feature feat_loc_vectorizer = DictVectorizer() feat_loc_vectorizer.fit(venue_feature_start) feat_loc_orig_mat = feat_loc_vectorizer.transform(venue_feature_start)...
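
A minimal sketch: DictVectorizer.get_feature_names() is in the same order as feature_importances_, so they can simply be zipped (the trips and labels below are made up):

    import numpy as np
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.ensemble import RandomForestClassifier

    venue_feature_start = [Counter({'school': 1, 'hospital': 1}),
                           Counter({'bus station': 2})]
    y = [0, 1]

    vect = DictVectorizer()
    X = vect.fit_transform(venue_feature_start)
    clf = RandomForestClassifier(n_estimators=10).fit(X.toarray(), y)

    names = np.array(vect.get_feature_names())
    order = np.argsort(clf.feature_importances_)[::-1]     # most important first
    for name, imp in zip(names[order], clf.feature_importances_[order]):
        print(name, imp)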

Can't import GMM function from sckits.learn

scikit-learn
I'm getting the error ImportError: No module named gmm when I use from scikits.learn.gmm import GMM. I installed scikits using the Windows installer with no errors. How can I fix it?

Output the subset of instances used to train each base_estimator of a BaggingClassifier

python,pandas,machine-learning,scikit-learn
I am using decision stumps with a BaggingClassifier to classify some data: def fit_ensemble(attributes,class_val,n_estimators): # max depth is 1 decisionStump = DecisionTreeClassifier(criterion = 'entropy', max_depth = 1) ensemble = BaggingClassifier(base_estimator = decisionStump, n_estimators = n_estimators, verbose = 3) return ensemble.fit(attributes,class_val) def predict_all(fitted_classifier, instances): for i, instance in enumerate(instances): instances[i] =...

Multi-Class Classification in WEKA

machine-learning,scikit-learn,classification,weka,libsvm
I am trying to implement multiclass classification in WEKA. I have a lot of rows, say bank transactions, and each one is tagged as Food, Medicine, Rent, etc. I want to develop a classifier which can be trained with the previous data I have and predict the class it can belong to for future transactions....

Can I use scikit-learn with Django framework?

django,data,scikit-learn,analysis
I would like to make a web app in which some data is analyzed. I read that Django is a good option for building web apps and that scikit-learn is used for machine learning. Therefore, before starting, does anyone know if it is possible to use that...

Lasso Generalized linear model in Python

python,statistics,scikit-learn,statsmodels,cvxopt
I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in python. Matlab provides the nice function: lassoglm(X,y, distr) where distr can be poisson, binomial etc. I had a look at both statsmodels and scikit-learn but I did not find any...

Enforcing that inputs sum to 1 and are contained in the unit interval in scikit-learn

python,numpy,encoding,machine-learning,scikit-learn
I have three inputs: x=(A, B, C); and an output y. It needs to be the case that A+B+C=1 and 0<=A<=1, 0<=B<=1, 0<=C<=1. I want to find the x that maximizes y. My approach is to use a regression routine in scikit-learn to train a model f on my inputs...

Normalization in Sklearn KNN

python-2.7,scikit-learn,classification,knn
I want to use the KNN algorithm in Sklearn. In KNN it's standard to normalize the data to remove the larger effect that features with a wider range have on the distance. What I wanted to know is whether this is done automatically in Sklearn or whether I should normalize the data...

Does scikit-learn perform “real” multivariate regression (multiple dependent variables)?

python,machine-learning,scikit-learn,linear-regression,multivariate-testing
I would like to predict multiple dependent variables using multiple predictors. If I understood correctly, in principle one could make a bunch of linear regression models that each predict one dependent variable, but if the dependent variables are correlated, it makes more sense to use multivariate regression. I would like...

Clustering cosine similarity matrix

python,math,scikit-learn,cluster-analysis,data-mining
A few questions on stackoverflow mention this problem, but I haven't found a concrete solution. I have a square matrix which consists of cosine similarities (values between 0 and 1), for example: | A | B | C | D A | 1.0 | 0.1 | 0.6 | 0.4 B...

Bag of words representation using sklearn plus Snowballstemmer

python,python-3.x,scikit-learn,nltk
I have a list with songs, something like list2 = ["first song", "second song", "third song"...] Here is my code: from sklearn.feature_extraction.text import CountVectorizer from nltk.corpus import stopwords vectorizer = CountVectorizer(stop_words=stopwords.words('english')) bagOfWords = vectorizer.fit(list2) bagOfWords = vectorizer.transform(list2) And it's working, but I want to stem a list of my words....

Scikit: Remove feature row if present in all documents

python,machine-learning,scikit-learn
I am doing text classification. I have around 32K (spam & ham ) files. import numpy as np import pandas as pd import sklearn.datasets as dataset from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from sklearn.naive_bayes import BernoulliNB from sklearn.preprocessing import LabelEncoder import re from sklearn.feature_selection import SelectKBest from sklearn.feature_selection...

Differences in sklearn's RidgeCV options

scikit-learn
I'm a bit confused about what appear to be bigger-than-expected differences under certain arguments for RidgeCV. The variations that are confusing to me are below: from sklearn.linear_model import RidgeCV from sklearn.datasets import load_boston from sklearn.preprocessing import scale boston = scale(load_boston().data) target = load_boston().target alphas = np.linspace(0,200) fit0 = RidgeCV(alphas=alphas, store_cv_values=True,...

How to change the function a random forest uses to make decisions from individual trees?

scikit-learn,classification,random-forest,ensemble-learning
Random Forests use 'a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) of the individual trees'. Is there a way to, instead of using the class that is the mode, run another random forest on the outputs produced by...

Classification with scikit-learn KNN using multi-dimensional features (input dimension error)

python,machine-learning,scikit-learn
I am using sklearn's nearest neighbor for a classification problem. My features are patches of the shape (3600, 2, 5). For example: a = [[5,5,5,5,5], [5,5,5,5,5]] b = [[5,5,5,5,5], [5,5,5,5,5]] features = [] for i in xrange(len(a)): features.append([a[i], b[i]]) #I have 3600 of these in reality. neigh = KNeighborsClassifier() neigh.fit(train_features,...

Understand SciKit Learn CV Validation Scores

python,machine-learning,scikit-learn
I'm trying to understand the output of cv_validation_scores, when running a GridSearchCV. The documentation does not adequately explain this. When I print grid_search.grid_scores_, I get a list with items, like this: [mean: 0.60000, std: 0.18002, params: {'tfidf__binary': True, tfidf__ngram_range': (1, 1).... which makes sense. However, when I try to unpack...

How to pass float argument in predict function of scikit linear regression?

python,numpy,scikit-learn,linear-regression
I am using scikit linear regression - single variable to predict y from x. The argument is a float. How can I transform the float into a numpy array to predict the output? import matplotlib.pyplot as plt import pandas import numpy as np from sklearn import linear_model import sys...
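
A minimal sketch: predict() expects a 2-D array of shape (n_samples, n_features), so a single float just needs to be wrapped (the training data below is made up):

    import numpy as np
    from sklearn import linear_model

    regr = linear_model.LinearRegression()
    regr.fit(np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0]))

    value = 2.5                                  # the float to predict for
    print(regr.predict(np.array([[value]])))     # 1 sample, 1 feature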

How to list all scikit-learn classifiers that support predict_proba()

python,scikit-learn
I need a list of all scikit-learn classifiers that support the predict_proba() method. Since the documentation provides no easy way of getting that information, how can I get this programmatically?
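
One possible approach, assuming the all_estimators helper (it lives in sklearn.utils.testing in older releases and sklearn.utils in newer ones):

    from sklearn.utils.testing import all_estimators   # or: from sklearn.utils import all_estimators

    classifiers = [name for name, cls in all_estimators(type_filter='classifier')
                   if hasattr(cls, 'predict_proba')]
    print(classifiers)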

Efficient element-wise function computation in Python

python,numpy,scikit-learn,vectorization
I have the following optimization problem. Given two np.arrays X, Y and a function K, I would like to compute as fast as possible the gram_matrix whose (i,j)-th element is K(X[i],Y[j]). Here is an implementation using nested for-loops, which are acknowledged to be the slowest to...

Scikit-Learn: How to retrieve prediction probabilities for a KFold CV?

python,scikit-learn,classification
I have a dataset that consists of images and associated descriptions. I've split these into two separate datasets with their own classifiers (visual and textual) and now I want to combine the predictions of these two classifiers to form a final prediction. However, my classes are binary, either 1 or...

Provide Starting Positions to t-distributed Stochastic Neighbor Embedding (TSNE) in scikit-learn

python,scikit-learn
I've been looking at using scikit-learn's TSNE method to visualize high dimensional data in 2D. However, I have some idea of where the starting positions should be in 2D space but I don't see any way of specifying this information. Any ideas how I might be able to provide...

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

python,scikit-learn,random-forest,cross-validation
I'm running GridSearchCV to optimize the parameters of a classifier in scikit. Once I'm done, I'd like to know which parameters were chosen as the best. Whenever I do so I get an AttributeError: 'RandomForestClassifier' object has no attribute 'best_estimator_', and can't tell why, as it seems to be...
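
best_estimator_ is an attribute of the fitted GridSearchCV object itself (available only after fit), not of the RandomForestClassifier passed into it; a minimal sketch with made-up data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer versions

    X, y = make_classification(n_samples=100, random_state=0)

    grid = GridSearchCV(RandomForestClassifier(),
                        param_grid={'n_estimators': [5, 10]})
    grid.fit(X, y)

    print(grid.best_params_)
    print(grid.best_estimator_)     # the refit classifier with the best parameters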

Scikit : How to resolve this usecase

python,machine-learning,scikit-learn
I am very new to scikit and have a usecase which I am trying to solve through scikit python library. I have CSV file like this: Label , userId , message , user_like,user_dislike 1 , 1, "this is good message", 4,5 0, 1, "This is bad message",3,4 1, 2, "this...

Port Python Code to Android

android,python,numpy,scikit-learn
I have some Python code that relies heavily on numpy/scipy and scikit-learn. What would be the best way to get it running on an Android device? I have read about a few ways to get Python code running on Android, mostly Pygame and Kivy, but I am not sure...

How does sklearn.cluster.KMeans handle an init ndarray parameter with missing centroids (available centroids less than n_clusters)?

python,scikit-learn,k-means
In Python sklearn KMeans (see documentation), I was wondering what happens internally when passing an ndarray of shape (n, n_features) to the init parameter when n < n_clusters. Does it drop the given centroids and just start a k-means++ initialization, which is the default choice for the init parameter? (PDF paper...

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

python,scikit-learn,cross-validation
I have a matrix with 20 columns. The last column are 0/1 labels. The link to the data is: https://www.dropbox.com/s/8v4lomociw1xz0d/data_so.csv?dl=0 I am trying to run random forest on the dataset, using cross validation. I use two methods of doing this: 1) using sklearn.cross_validation.cross_val_score 2) using sklearn.cross_validation.train_test_split I am getting different...

TypeError: get_params() missing 1 required positional argument: 'self'

python,scikit-learn
I was trying to use scikit-learn package with python-3.4 to do a grid search, from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.linear_model.logistic import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.grid_search import GridSearchCV import pandas as pd from sklearn.cross_validation import train_test_split from sklearn.metrics import precision_score, recall_score, accuracy_score from sklearn.preprocessing import LabelBinarizer import numpy...

Scikit learn cross validation split

python,scikit-learn,classification,cross-validation
I'm currently using cross_validation.cross_val_predict to obtain the predictions made by a LogisticRegression classifier. My question is: what percentage of the data makes up the training set and what percentage makes up the test set? Is it an 80%-20% split? I checked the website and other questions on stackoverflow but did...

How to interpret scikit's learn confusion matrix and classification report?

machine-learning,nlp,scikit-learn,svm,confusion-matrix
I have a sentiment analysis task; for this I'm using this corpus. The opinions have 5 classes (very neg, neg, neu, pos, very pos), from 1 to 5. So I do the classification as follows: from sklearn.feature_extraction.text import TfidfVectorizer import numpy as np tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False, ngram_range=(2,2)) from sklearn.cross_validation...

How to do LabelEncoding of categorical values in Apache Spark

apache-spark,scikit-learn
I have a dataset that contains string columns. How can I encode the string-based columns, like we do with scikit-learn's LabelEncoder?

'verbose' argument in scikit-learn

python,arguments,scikit-learn,verbosity,verbose
Many scikit-learn functions have a verbose argument that, according to their documentation, "[c]ontrols the verbosity: the higher, the more messages" (e.g., GridSearchCV). Unfortunately, no guidance is provided on which integers are allowed (e.g., can a user set verbosity to 100?) and what level of verbosity corresponds to which integers. I...

Create classes in a loop

python,class,scikit-learn
I want to define a class and then make a dynamic number of copies of that class. Right now, I have this: class xyz(object): def __init__(self): self.model_type = ensemble.RandomForestClassifier() self.model_types = {} self.model = {} for x in range(0,5): self.model_types[x] = self.model_type def fit_model(): for x in range(0,5): self.model[x] =...

Memory issue sklearn pairwise_distances calculation

python,out-of-memory,fork,scikit-learn,cosine-similarity
I have a large data frame whose index is movie_id and whose column headers represent tag_id. Each row represents movie-to-tag relevance: 639755209030196 691838465332800 \ 46126718359 0.042 0.245 46130382440 0.403 0.3 46151724544 0.032 0.04 Then I do the following: data = df.values similarity_matrix = 1 - pairwise_distances(data, data, 'cosine',...

Generating Difficult Classification Data Sets using scikit-learn

scikit-learn
I am trying to generate a range of synthetic data sets using make_classification in scikit-learn, with varying sample sizes, prevalences (i.e., proportions of the positive class), and accuracies. Varying the sample size and prevalence is fairly straightforward, but I am having difficulty generating any data sets that have less than...

Save and reuse TfidfVectorizer in scikit learn

python,nlp,scikit-learn,pickle,text-mining
I am using TfidfVectorizer in scikit learn to create a matrix from text data. Now I need to save this object for reusing it later. I tried to use pickle, but it gave the following error. loc=open('vectorizer.obj','w') pickle.dump(self.vectorizer,loc) *** TypeError: can't pickle instancemethod objects I tried using joblib in sklearn.externals,...
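
One workaround is joblib, which writes to a binary file and handles sklearn objects well (the text mode 'w' used in the question is also a problem for pickle, which needs 'wb'); a minimal sketch:

    from sklearn.externals import joblib            # plain "import joblib" in newer setups
    from sklearn.feature_extraction.text import TfidfVectorizer

    vect = TfidfVectorizer()
    vect.fit(["some text", "more text here"])

    joblib.dump(vect, 'vectorizer.pkl')
    vect_loaded = joblib.load('vectorizer.pkl')
    print(vect_loaded.transform(["some new text"]).shape)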

How does the class_weight parameter in scikit-learn work?

python,scikit-learn
I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates. The Situation I want to use logistic regression to do binary classification on a very unbalanced data set. The classes are labelled 0 (negative) and 1 (positive) and the observed data is in...

scikit learn documentation in PDF

python-2.7,pdf,scikit-learn,html-to-pdf
Does anyone have any idea how I can get the scikit learn documentation (http://scikit-learn.org/stable/documentation.html) specifically the user guide, tutorials and examples, in PDF format? If they are not readily available, is there a way to convert them to PDF programmatically? I looked around for html to pdf conversion api services...

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

scikit-learn
Just trying to do a simple linear regression but I'm baffled by this error for: regr = LinearRegression() regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values) which produces: ValueError: Found arrays with inconsistent numbers of samples: [ 1 999] These selections must have the same dimensions, and they should be numpy arrays, so what...
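
The likely cause is that a single pandas column gives a 1-D array, while fit() expects X of shape (n_samples, n_features); a minimal sketch of the reshape fix with made-up numbers:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = np.array([1.0, 2.0, 3.0, 4.0])     # what a single DataFrame column looks like (1-D)
    y = np.array([2.0, 4.0, 6.0, 8.0])

    regr = LinearRegression()
    regr.fit(x.reshape(-1, 1), y)           # reshape to (n_samples, 1)
    print(regr.coef_)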

scikit : Wrong prediction for this case

python,machine-learning,scikit-learn
I have written a sample code below import numpy as np import pandas as pd import csv from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from sklearn.naive_bayes import MultinomialNB text = ["this is dog" , "this is bull dog" , "this is jack"] countVector = CountVectorizer() countmatrix = countVector.fit_transform(text) print...

How can I resample “roc_curve” (fpr, tpr)?

python,scikit-learn
I'm looking to resample the "roc_curve" (sklearn) output. When I plot fpr, tpr in IPython it is fine, but sometimes I want to export it (mostly for a client), and it's hard to understand because it's not linear. For example fpr = [0, 0.1, 0.4, 0.9, 1] tpr = [0, 0.3, 0.4, 0.5, 1] How can I resample fpr to be linear every...
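
One possible approach is plain linear interpolation onto an evenly spaced fpr grid with numpy; a minimal sketch using the values from the question:

    import numpy as np

    fpr = np.array([0, 0.1, 0.4, 0.9, 1])
    tpr = np.array([0, 0.3, 0.4, 0.5, 1])

    fpr_grid = np.linspace(0, 1, 11)            # evenly spaced fpr values
    tpr_grid = np.interp(fpr_grid, fpr, tpr)    # interpolated tpr at those points
    print(list(zip(fpr_grid, tpr_grid)))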

Principal component analysis using sklearn and panda

python,pandas,scikit-learn,pca,principal-components
I have tried to reproduce the results from the PCA tutorial on here (PCA-tutorial) but I've got some problems. From what I understand I am following the steps to apply PCA as they should be. But my results are not similar with the ones in the tutorial (or maybe they...

Error with cross validation on a multilabel classification

python,scikit-learn,svm,cross-validation,multilabel-classification
I'm using "multiclass.OneVsRestClassifier" and "cross_validation.StratifiedKFold". When I do cross-validation on a multi-label problem, it fails. Is it possible to perform cross-validation on a multilabel problem in scikit-learn? I think the problem is in the tuples of class label lists, e.g. ([1], [2], [2], [1], [1,2], [3], [1,2,3] ...) code in...

Python thread locking/class variable initialisation confusion

python,multithreading,class,locking,scikit-learn
I have a class which behaves strangely if accessed by multiple threads. The threads are started during sklearn's GridSearch training (with jobs=3), so I don't know exactly how they are called. My class itself looks roughly like this: from sklearn.base import BaseEstimator, TransformerMixin import threading class FeatureExtractorBase(BaseEstimator, TransformerMixin): expensive_dependency =...

In sklearn, does a fitted pipeline reapply every transform?

python,scikit-learn,pipeline,feature-selection
Apologies if this is obvious but I couldn't find a clear answer to this: Say I've used a pretty typical pipeline: feat_sel = RandomizedLogisticRegression() clf = RandomForestClassifier() pl = Pipeline([ ('preprocessing', preprocessing.StandardScaler()), ('feature_selection', feat_sel), ('classification', clf)]) pl.fit(X,y) Now when I apply pl on a new set, pl.predict(X_classify); is RandomizedLogisticRegression going...

series object not callable with linear regression in python

python,scikit-learn,linear-regression,statsmodels
I am new to Python and I am trying to build a simple linear regression model. I am able to build the model and see the results, but when I try to look at the parameters I get an error and I am not sure where I am going wrong....

Why does scikit-learn cause core dumped?

python,scikit-learn,coredump
I try to run a simple linear fit in scikit-learn: from sklearn import linear_model clf = linear_model.LinearRegression() clf.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2]) As a result I get: Illegal instruction (core dumped) Does anybody know what is the reason of this problem and how the problem...