How to use Apache Spark ALS (alternating-least-squares) algorithm with limited Rating values

apache-spark,collaborative-filtering,mllib
I am trying to use ALS, but currently my data is limited to information about what users bought. So I was trying to fill ALS from Apache Spark with Ratings equal to 1 (one) when user X bought item Y (and that is the only information I provided to the algorithm). I was...
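
Since the data is implicit (purchase events, not graded ratings), MLlib's implicit-feedback variant of ALS is usually a better fit than literal 1.0 ratings. A minimal Scala sketch, assuming a hypothetical purchases RDD of (userId, itemId) pairs:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // 1.0 here acts as a confidence weight, not a rating
    val ratings = purchases.map { case (user, item) => Rating(user, item, 1.0) }

    // trainImplicit optimises for implicit feedback; alpha scales confidence
    val model = ALS.trainImplicit(ratings, /* rank = */ 10, /* iterations = */ 10,
      /* lambda = */ 0.01, /* alpha = */ 1.0)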

spark-mllib: Error “reassignment to val” in source code

apache-spark,mllib
I'm using an IDEA SBT project to test spark-mllib code. Here is build.sbt: name := "SparkTest" version := "1.0" scalaVersion := "2.11.6" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.2.0", "org.apache.spark" %% "spark-mllib" % "1.2.0" ) After all the import and compile work was done, I found some errors in lib...
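
One common cause of spurious in-source errors like this is a Scala version mismatch between the project and the Spark binaries on the classpath. A hedged build.sbt sketch, assuming that is the culprit here, pinning Scala to the 2.10 line that Spark 1.2.0's prebuilt distributions target:

    name := "SparkTest"
    version := "1.0"
    // align scalaVersion with the Spark artifacts to avoid mixed-version jars
    scalaVersion := "2.10.4"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.2.0",
      "org.apache.spark" %% "spark-mllib" % "1.2.0"
    )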

Field “item” does not exist using Spark MLlib pipeline for ALS

scala,apache-spark,mllib
I am training a recommender system with ALS (Spark version: 1.3.1). Now I want to use a Pipeline for model selection via cross-validation. As a first step, I tried to adapt the example code and came up with this: val conf = new SparkConf().setAppName("ALS").setMaster("local") val sc = new SparkContext(conf) val...
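
The ml-pipeline ALS in 1.3.x defaults its column names to "user", "item", and "rating", so this error typically means the input DataFrame uses different names. A sketch, assuming hypothetical column names userId/movieId/rating:

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setUserCol("userId")   // override the defaults ("user", "item", "rating");
      .setItemCol("movieId")  // otherwise ALS looks for a column literally named "item"
      .setRatingCol("rating")
    val model = als.fit(training) // training: DataFrame with those columns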

Spark saving RDD[(Int, Array[Double])] to text file got strange result

apache-spark,mllib
I am trying to save the userFeatures of a MatrixFactorizationModel to a text file, which according to the doc is an RDD of type (Int, Array[Double]). So I just called model.userFeatures.saveAsTextFile("feature") However, the results I got are something like: (1,[D@...) (5,[D@...) (9,[D@...) (13,[D@...) (17,[D@...) (21,[D@...) (25,[D@...) (29,[D@...) (33,[D@...) (37,[D@...) (41,[D@...) (45,[D@...) (49,[D@...)...
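
Scala's Array inherits Java's default toString, which prints a reference like [D@... rather than the contents. Formatting each row explicitly before saving avoids that; a minimal sketch:

    // turn each (id, factors) pair into readable CSV before writing
    model.userFeatures
      .map { case (id, factors) => s"$id,${factors.mkString(",")}" }
      .saveAsTextFile("feature")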

PySpark & MLLib: Class Probabilities of Random Forest Predictions

apache-spark,random-forest,mllib,pyspark
I'm trying to extract the class probabilities of a random forest object I have trained using PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract class probabilities from a RandomForestModel classifier in PySpark?...
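
MLlib's 1.x RandomForestModel exposes only the majority vote, but the individual trees are reachable, so a vote fraction can stand in for a class probability. A Scala sketch of the idea (the question is about PySpark, where the same trick requires going through the underlying Java model; the helper name is hypothetical):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // binary case: fraction of trees voting for class 1.0
    def classProbability(model: RandomForestModel, features: Vector): Double = {
      val votes = model.trees.map(_.predict(features))
      votes.count(_ == 1.0).toDouble / votes.length
    }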

Apache Spark ALS Recommendation

machine-learning,apache-spark,collaborative-filtering,mllib
I've run a little ALS recommender system program, as found on the Apache Spark website, which utilises MLlib. When using a dataset with ratings of 1-5 (I've used the MovieLens dataset) it gives recommendations with predicted ratings of over 5! The highest I've found in my small testing is 7.4....
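
ALS does unconstrained matrix factorisation, so nothing bounds the reconstructed ratings to the original 1-5 scale. A common workaround is simply clamping; a sketch, assuming a hypothetical userProducts RDD of (user, product) pairs:

    // clamp predicted ratings back into the original [1, 5] range
    val clamped = model.predict(userProducts)
      .map(r => r.copy(rating = math.min(5.0, math.max(1.0, r.rating))))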

Addition of two RDD[mllib.linalg.Vector]'s

scala,apache-spark,mllib
I need to add two matrices that are stored in two files. The contents of latest1.txt and latest2.txt look like 1 2 3 4 5 6 7 8 9 I am reading those files like scala> val rows = sc.textFile("latest1.txt").map { line => val values = line.split(' ').map(_.toDouble) Vectors.sparse(values.length, values.zipWithIndex.map(e => (e._2, e._1)).filter(_._2...
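
If both files are read the same way, the two RDDs can be zipped row by row and the paired vectors summed element-wise. A sketch, assuming rows1 and rows2 are the RDD[Vector]s read from the two files with matching partitioning and element counts (a requirement of zip):

    import org.apache.spark.mllib.linalg.Vectors

    // zip pairs the i-th row of each matrix
    val summed = rows1.zip(rows2).map { case (v1, v2) =>
      Vectors.dense(v1.toArray.zip(v2.toArray).map { case (a, b) => a + b })
    }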

spark mllib predict error with map

scala,apache-spark,mllib
I have a linear regression model, model, and a set of LabeledPoints, regPoints. I am able to predict the first sample: scala> model.predict(regPoints.first.features) 15/02/12 16:17:56 INFO SparkContext: Starting job: first at <console>:61 15/02/12 16:17:56 INFO DAGScheduler: Got job 154 (first at <console>:61) with 1 output partitions (allowLocal=true) 15/02/12 16:17:56 INFO...
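
When this fails inside a map, a frequent cause is calling predict in a closure that has to ship the model to the executors. The batch form of predict sidesteps that; a sketch, assuming that is the problem here:

    // predict over the whole RDD of feature vectors at once
    val predictions = model.predict(regPoints.map(_.features))
    // pair predictions with the true labels for evaluation
    val labelsAndPreds = regPoints.map(_.label).zip(predictions)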

Use of foreachActive for spark Vector in Java

java,apache-spark,mllib
How do I write simple code in Java that iterates over the active elements in a sparse vector? Let's say we have such a Vector: Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); I was trying with a lambda or Function2 (from three different imports) but always failed. If you use...
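
foreachActive takes a scala.Function2 that returns BoxedUnit, which is why plain Java lambdas and the usual Spark Function2 imports do not line up. For reference, the Scala form shows the expected callback shape of (index, value):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

    // visits only the explicitly stored entries of the sparse vector
    sv.foreachActive { (index, value) =>
      println(s"index $index -> value $value")
    }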

Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

scala,apache-spark,mllib
I'm writing a Spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm. For example, there is a LogisticRegression in org.apache.spark.ml.classification and also a LogisticRegressionWithSGD in org.apache.spark.mllib.classification. The only difference I can find is that the one in...
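
In short, org.apache.spark.mllib is the original RDD-based API and org.apache.spark.ml is the newer DataFrame-based Pipeline API. A side-by-side sketch (labeledPoints and trainingDF are hypothetical inputs):

    // RDD-based API: works on RDD[LabeledPoint]
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    val mllibModel = LogisticRegressionWithSGD.train(labeledPoints, 100)

    // DataFrame-based Pipeline API: works on a DataFrame with label/features columns
    import org.apache.spark.ml.classification.LogisticRegression
    val mlModel = new LogisticRegression().setMaxIter(100).fit(trainingDF)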

Spark: value reduceByKey is not a member

vector,apache-spark,reduce,mllib
After clustering some sparse vectors I need to find the intersection vector in every cluster. To achieve this I am trying to reduce MLlib vectors, as in the following example: import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.mllib.clustering.KMeans import org.apache.spark.mllib.linalg.Vectors //For Sparse Vector import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils import org.apache.spark.rdd.RDD import org.apache.spark.mllib.linalg.{Vector, Vectors} object Recommend...
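
reduceByKey exists only on RDDs of key-value pairs, and before Spark 1.3 the implicit conversion providing it had to be imported explicitly. A sketch, keying each vector by its cluster and taking the element-wise minimum as one reading of "intersection" (vectors and the trained KMeans model are assumed from the surrounding code):

    import org.apache.spark.SparkContext._ // rddToPairRDDFunctions (needed pre-1.3)

    val byCluster = vectors.map(v => (kmeansModel.predict(v), v.toArray))
    val intersections = byCluster.reduceByKey { (a, b) =>
      a.zip(b).map { case (x, y) => math.min(x, y) } // keep only shared mass
    }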

spark mllib apply function to all the elements of a rowMatrix

scala,apache-spark,mllib
I have a RowMatrix xw scala> xw res109: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@... and I would like to apply a function to each of its elements: f(x)=exp(-x*x) The element type of the matrix can be seen from: scala> xw.rows.first res110: org.apache.spark.mllib.linalg.Vector =...
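
A RowMatrix is a thin wrapper over RDD[Vector], so an element-wise function can be applied by mapping over the rows and re-wrapping. A minimal sketch:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // apply f(x) = exp(-x*x) to every entry, row by row
    val transformed = new RowMatrix(
      xw.rows.map(v => Vectors.dense(v.toArray.map(x => math.exp(-x * x))))
    )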

Converting string features to numeric features: algorithm efficiency

scala,apache-spark,mllib
I'm converting several columns of strings to numeric features I can use in a LabeledPoint. I'm considering two approaches: create a mapping of strings to doubles, iterate through the RDD, and look up each string to assign the appropriate value; or sort the RDD by the column and iterate through the RDD with...
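
The first approach is usually expressed as distinct + zipWithIndex to build the dictionary once on the cluster, then a broadcast so each row does a cheap local lookup. A sketch, with column standing in for the hypothetical RDD[String] being encoded:

    // build the string -> index dictionary once
    val dict = column.distinct().zipWithIndex().collectAsMap()
    val bcDict = sc.broadcast(dict)

    // each row then does a local map lookup
    val encoded = column.map(s => bcDict.value(s).toDouble)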

How to set cutoff while training the data in Random Forest in Spark

apache-spark,random-forest,mllib
I am using Spark MLlib to train data for classification with the Random Forest algorithm. MLlib provides a RandomForest class with a trainClassifier method that does what is required. Can I set a threshold value while training the data set, similar to the cutoff option provided in R's randomForest package?...
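
trainClassifier exposes no cutoff parameter, so an equivalent has to be applied after training by scoring with the per-tree votes and thresholding yourself. A hedged sketch (cutoff and testData are hypothetical):

    val cutoff = 0.7 // stand-in for R's cutoff; plain majority vote would be 0.5
    val predictions = testData.map { point =>
      val votes = model.trees.map(_.predict(point.features))
      val score = votes.count(_ == 1.0).toDouble / votes.length
      if (score >= cutoff) 1.0 else 0.0
    }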

How to configure kernel selection and loss function for Support Vector Machines in Spark MLLib

amazon-web-services,machine-learning,apache-spark,svm,mllib
I have installed Spark on AWS Elastic MapReduce (EMR) and have been running SVM using the packages in MLlib. But there are no options to choose parameters for building the model, like kernel selection and cost of misclassification (as in the e1071 package of R). Can someone please tell me how...
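
MLlib's SVMWithSGD implements only a linear kernel with hinge loss; there is no kernel-selection knob. What can be tuned lives on its optimizer; a sketch, assuming trainingData: RDD[LabeledPoint]:

    import org.apache.spark.mllib.classification.SVMWithSGD

    val svm = new SVMWithSGD()
    svm.optimizer
      .setNumIterations(200)
      .setRegParam(0.1) // loosely plays the role of e1071's misclassification cost
      .setStepSize(1.0)
    val model = svm.run(trainingData)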

How to do text analysis in Spark

hadoop,mapreduce,apache-spark,mllib
I'm quite familiar with Hadoop but totally new to Apache Spark. Currently I'm using the LDA (Latent Dirichlet Allocation) algorithm implemented in Mahout to do topic discovery. However, as I need to make the process faster, I'd like to use Spark, but the LDA (or CVB) algorithm is not implemented in...
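
Spark gained an MLlib LDA implementation in 1.3, so depending on the version in use this may no longer require Mahout. A sketch, assuming corpus is a hypothetical RDD[(Long, Vector)] of (document id, term-count vector):

    import org.apache.spark.mllib.clustering.LDA

    val ldaModel = new LDA().setK(10).setMaxIterations(50).run(corpus)
    val topics = ldaModel.topicsMatrix // terms x topics weight matrix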

What is rank in ALS machine Learning Algorithm in Apache Spark Mllib

algorithm,machine-learning,apache-spark,mllib
I wanted to try an example of the ALS machine learning algorithm. My code works fine; however, I do not understand the rank parameter used in the algorithm. I have the following code in Java // Build the recommendation model using ALS int rank = 10; int numIterations = 10; MatrixFactorizationModel model =...
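
rank is the number of latent factors: every user and every product is summarised by a vector of exactly rank doubles, and a predicted rating is the dot product of the two. A Scala sketch making that visible (ratings is assumed from the surrounding code):

    import org.apache.spark.mllib.recommendation.ALS

    val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 10, 0.01)
    val (id, factors) = model.userFeatures.first()
    println(s"user $id is described by ${factors.length} latent factors") // prints 10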

Apache Spark does not see all the ram of my machines

apache-spark,google-compute-engine,mllib
I have created a Spark cluster of 8 machines. Each machine has 104 GB of RAM and 16 virtual cores. It seems that Spark only sees 42 GB of RAM per machine, which is not correct. Do you know why Spark does not see all the RAM of the machines?...
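
What the web UI reports is executor storage memory, roughly spark.executor.memory times spark.storage.memoryFraction (0.6 by default in 1.x), not the machine's physical RAM, which may explain the 42 GB figure. A sketch of raising the executor allocation (the value is illustrative; leave headroom for the OS and other daemons):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "90g") // hypothetical; the default is far lower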

How do I create an RDD from input directory containing text files?

machine-learning,apache-spark,bigdata,analysis,mllib
I am working with the 20 newsgroups dataset. Basically, I have a folder and n text files. The files in the folder belong to the topic the folder is named after. I have 20 such folders. How do I load all this data into Spark and make an RDD out of...
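
sc.wholeTextFiles returns (path, content) pairs, so the folder name, which is the topic label, can be recovered from each path. A sketch, with a hypothetical dataset location:

    // one (path, fileContent) pair per file, across all 20 folders
    val docs = sc.wholeTextFiles("/path/to/20news/*")
    val labeled = docs.map { case (path, text) =>
      val topic = path.split("/").dropRight(1).last // enclosing folder name
      (topic, text)
    }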

Apache-Spark library content

java,apache-spark,mllib
I am trying to run a Java test program using the MLlib library from Apache Spark. I downloaded the latest Spark version from their website and followed the O'Reilly book "Learning Spark, Lightning-Fast Big Data Analysis" to find useful examples and tips, but when it comes to importing the right libraries,...
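
For builds outside the book's setup, declaring spark-mllib is typically enough, since it depends on spark-core transitively. A hedged sbt sketch (the version is illustrative; for a pure Java/Maven build the coordinates are the same, e.g. org.apache.spark:spark-mllib_2.10):

    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.3.1"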

Why do I get a type error in model.predictOnValues when I try the official example of Streaming Kmeans Clustering of Apache Spark?

scala,apache-spark,mllib
I'm trying the Streaming Clustering example code at the end of the official guide, but I get a type error. Here is my code: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.clustering.StreamingKMeans object Kmeans { def main(args:...
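
predictOnValues expects a DStream of (key, Vector) pairs rather than the LabeledPoint stream itself, and skipping that mapping is a frequent source of this type error. A hedged guess at the fix, mirroring the shape used in the official example (model and testData are assumed from the surrounding code):

    // map LabeledPoint -> (label, features) before predicting
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()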