In Spark, what is left in the memory after a Job is done?

memory,apache-spark,rdd
I used ./bin/spark-shell to run some experiments and found the following. When running jobs (transformation + action), I watch the memory usage in top. For example, for a 5 GB text file, I ran a simple filter() and count(). After the job is done, there are 7 GB marked as...

OutOfMemoryError creating fat jar with sbt assembly

jar,cassandra,apache-spark,sbt
We are trying to make a fat jar file containing one small scala source file and a ton of dependencies (simple mapreduce example using spark and cassandra): import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import com.datastax.spark.connector._ import org.apache.spark.SparkConf object VMProcessProject { def main(args: Array[String]) { val conf = new SparkConf() .set("spark.cassandra.connection.host", "127.0.0.1") .set("spark.executor.extraClassPath",...

org.apache.spark.SparkException: Task not serializable - JavaSparkContext

java,serialization,apache-spark
I'm trying to run the following simple Spark code: Gson gson = new Gson(); JavaRDD<String> stringRdd = jsc.textFile("src/main/resources/META-INF/data/supplier.json"); JavaRDD<SupplierDTO> rdd = stringRdd.map(new Function<String, SupplierDTO>() { private static final long serialVersionUID = -78238876849074973L; @Override public SupplierDTO call(String str) throws Exception { return gson.fromJson(str, SupplierDTO.class); } }); But it's throwing the following...

Apache Spark: Error while starting PySpark

python,hadoop,apache-spark,pyspark
On a CentOS machine with Python v2.6.6 and Apache Spark v1.2.1, I get the following error when trying to run ./pyspark. It seems to be a Python issue, but I'm not able to figure it out. 15/06/18 08:11:16 INFO spark.SparkContext: Successfully stopped SparkContext Traceback (most recent call last): File "/usr/lib/spark_1.2.1/spark-1.2.1-bin-hadoop2.4/python/pyspark/shell.py", line 45, in <module> sc =...

takeOrdered descending in PySpark

python,apache-spark
I would like to sort K/V pairs by value and then take the five biggest values. I managed to do this by swapping K/V with a first map, sorting in descending order with False, then swapping key/value back to the original order with a second map and taking the first 5 that are...

Shuffled vs non-shuffled coalesce in Apache Spark

scala,apache-spark,bigdata,distributed-computing
What is the difference between the following transformations when they are executed right before writing RDD to a file? coalesce(1, shuffle = true) coalesce(1, shuffle = false) Code example: val input = sc.textFile(inputFile) val filtered = input.filter(doSomeFiltering) val mapped = filtered.map(doSomeMapping) mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile) vs mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)...

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanSetDropBehind issue in Eclipse

maven,hadoop,apache-spark,word-count
I have the Spark word count program below: package com.sample.spark; import java.util.Arrays; import java.util.List; import java.util.Map; import org.apache.spark.SparkConf; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.Function2; import org.apache.spark.api.java.function.PairFlatMapFunction; import org.apache.spark.api.java.function.PairFunction; import...

SparkR and Packages

r,apache-spark,sparkr
How does one call packages from Spark to be used for data operations with R? For example, I am trying to access my test.csv in HDFS as below: Sys.setenv(SPARK_HOME="/opt/spark14") library(SparkR) sc <- sparkR.init(master="local") sqlContext <- sparkRSQL.init(sc) flights <- read.df(sqlContext,"hdfs://sandbox.hortonWorks.com:8020 /user/root/test.csv","com.databricks.spark.csv", header="true") but I am getting the error below: Caused by: java.lang.RuntimeException: Failed to...

Retrieving TriangleCount

scala,apache-spark,spark-graphx
I'm trying to retrieve the number of triangles from a graph using GraphX. As I'm new to both Scala and GraphX, I'm currently quite stuck. I'm creating a graph from an edge file: 1 2 1 3 2 3 This should be 1 triangle. Next I'm using the built-in function...
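
A minimal GraphX sketch of the usual approach (the file path is a placeholder). Note that triangleCount() reports, per vertex, the triangles that vertex belongs to, so each triangle is counted three times overall:

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    // Load an edge list like "1 2\n1 3\n2 3"; triangleCount expects canonical orientation and a partitioned graph
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt", canonicalOrientation = true)
      .partitionBy(PartitionStrategy.RandomVertexCut)

    // Sum the per-vertex counts and divide by 3 to get the number of distinct triangles
    val triangles = graph.triangleCount().vertices.map(_._2).sum() / 3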

Can't start local instance of Spark-Jobserver

apache-spark,spark-jobserver
So I'm trying to create a local instance of Spark Jobserver to test jobs on, and I can't even get it to run. The first thing I do when I get into my Vagrant instance is start Spark. I know this works because I submit jobs to Spark...

What is rank in the ALS machine learning algorithm in Apache Spark MLlib?

algorithm,machine-learning,apache-spark,mllib
I wanted to try an example of the ALS machine learning algorithm. My code works fine; however, I do not understand the rank parameter used in the algorithm. I have the following code in Java // Build the recommendation model using ALS int rank = 10; int numIterations = 10; MatrixFactorizationModel model =...
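
For context, rank is the number of latent factors: each user and each product is represented by a feature vector of that length, and a predicted rating is roughly the dot product of the two vectors. A hedged Scala sketch (the ratings RDD and the lambda value are assumptions, not the asker's code):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val rank = 10          // length of the latent feature vector learned per user and per product
    val numIterations = 10
    val lambda = 0.01      // regularization; a larger rank can fit more detail but risks overfitting
    // 'ratings: RDD[Rating]' is assumed to be loaded elsewhere
    val model = ALS.train(ratings, rank, numIterations, lambda)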

Can I change SparkContext.appName on the fly?

apache-spark,pyspark
I know I can use SparkConf.set('spark.app.name',...) to set appName before creating the SparkContext. However, I want to change the name of the application as it progresses, i.e., after SparkContext has been created. Alas, setting sc.appName does not change how the job is shown by yarn application -list. Is there a...

Call Distinct on 'pyspark.resultiterable.ResultIterable'

python,apache-spark,pyspark
I am writing some Spark code and I have an RDD which looks like [(4, <pyspark.resultiterable.ResultIterable at 0x9d32a4c>), (1, <pyspark.resultiterable.ResultIterable at 0x9d32cac>), (5, <pyspark.resultiterable.ResultIterable at 0x9d32bac>), (2, <pyspark.resultiterable.ResultIterable at 0x9d32acc>)] What I need to do is call distinct on the pyspark.resultiterable.ResultIterable. I tried this def distinctHost(a, b): p...

Is it possible to run Spark (specifically PySpark) in-process?

apache-spark,pyspark
When running a PySpark job there's a significant launch overhead. Is it possible to run 'lightweight' jobs that don't use an external daemon? (mainly for testing with small data sets)

Profiling a scala spark application

scala,apache-spark
I would like to profile my Spark Scala applications to figure out the parts of the code I have to optimize. I enabled -Xprof in --driver-java-options, but this is not of much help to me as it gives a lot of granular detail. I am just interested to know how...

Does the DStream returned by the updateStateByKey function contain only one RDD?

apache-spark,spark-streaming,apache-spark-sql,pyspark
Does the DStream returned by the updateStateByKey function contain only one RDD? If not, under what circumstances will the DStream contain more than one RDD?

In sbt, how can we specify the version of Hadoop on which Spark depends?

apache-spark,sbt
I have an sbt project which uses Spark and Spark SQL, but my cluster uses Hadoop 1.0.4 and Spark 1.2 with Spark SQL 1.2. Currently my build.sbt looks like this: libraryDependencies ++= Seq( "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.5", "com.datastax.cassandra" % "cassandra-driver-mapping" % "2.1.5", "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.2.1", "org.apache.spark" %...
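
One common pattern (a sketch, not the asker's exact build) is to exclude Spark's transitive hadoop-client and pin the cluster's Hadoop version explicitly; the versions below mirror the question and are assumptions:

    // build.sbt sketch
    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "1.2.1").exclude("org.apache.hadoop", "hadoop-client"),
      "org.apache.spark" %% "spark-sql" % "1.2.1",
      "org.apache.hadoop" % "hadoop-client" % "1.0.4"   // the Hadoop version the cluster actually runs
    )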

Conditionally Combining/Reducing key-pairs

python,apache-spark,pyspark
I've had this issue for some time now, and I think it has to do with my lack of understanding of how to use combineByKey and reduceByKey, so hopefully somebody can clear this up. I am working with DNA sequences, so I have a procedure to produce a bunch of...

Spark RDD repeated reduce operations yielding inconsistent results

scala,mapreduce,apache-spark,reduce,rdd
Consider the following code in Spark that should return the sum of the sqrt's of a sequence of integers: // Create an RDD of a sequence of integers val data = sc.parallelize(Range(0,100)) // Transform RDD to sequence of Doubles val x = data.map(_.toDouble) // Reduce the sequence as the sum...
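
A common cause of this symptom is a reduce function that is not associative and commutative, so the result depends on how partitions are combined. A hedged sketch of an order-insensitive formulation:

    val data = sc.parallelize(Range(0, 100))
    // Order-dependent anti-pattern (illustrative only): sqrt applied inside reduce
    //   data.map(_.toDouble).reduce((acc, v) => acc + math.sqrt(v))
    // Order-insensitive: apply sqrt in map, keep reduce as a plain associative sum
    val sumOfSqrts = data.map(x => math.sqrt(x.toDouble)).reduce(_ + _)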

Key value mapping in spark

scala,apache-spark
I have two files. The 1st file contains (assume apple is the key and fruit is the value): apple fruit banana fruit tomato vegetable mango fruit potato vegetable The 2nd file contains: apple banana mango potato tomato I need to loop through the 2nd file and find the matching value in file 1. I need the final...
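
A minimal join-based sketch (the file paths and the whitespace delimiter are assumptions about the input format):

    val kv = sc.textFile("file1.txt").map { line =>
      val Array(k, v) = line.split("\\s+", 2)   // e.g. "apple fruit" -> ("apple", "fruit")
      (k, v)
    }
    val wanted  = sc.textFile("file2.txt").map(k => (k.trim, ()))
    // For each key in file 2, look up its value from file 1
    val matched = wanted.join(kv).map { case (k, (_, v)) => (k, v) }   // ("apple","fruit"), ...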

PySpark No suitable driver found for jdbc:mysql://dbhost

apache-spark,apache-spark-sql,pyspark
I am trying to write my dataframe to a MySQL table. I am getting No suitable driver found for jdbc:mysql://dbhost when I try to write. As part of the preprocessing I read from other tables in the same DB and have no issues doing that. I can do the full run...

Omit input data of map function in Scala

scala,apache-spark,scala-collections,scala-2.10
I am learning the Spark source code, and am confused by the following code: /** * Return a new RDD containing the distinct elements in this RDD. */ def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1) What is the input data for...
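
For readers puzzling over the same line, the chained version can be written out step by step; a sketch assuming some rdd: RDD[T] and the numPartitions argument from the snippet:

    val withDummy     = rdd.map(x => (x, null))                           // pair every element with a dummy value
    val onePerKey     = withDummy.reduceByKey((x, y) => x, numPartitions) // keep a single (elem, null) per distinct elem
    val distinctElems = onePerKey.map(_._1)                               // drop the dummy value again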

What to set `SPARK_HOME` to?

python,apache-spark,pythonpath,pyspark,apache-zeppelin
Installed apache-maven-3.3.3, scala 2.11.6, then ran: $ git clone git://github.com/apache/spark.git -b branch-1.4 $ cd spark $ build/mvn -DskipTests clean package Finally: $ git clone https://github.com/apache/incubator-zeppelin $ cd incubator-zeppelin/ $ mvn install -DskipTests Then ran the server: $ bin/zeppelin-daemon.sh start Running a simple notebook beginning with %pyspark, I got an error...

Spark-submit class not found exception

scala,apache-spark
I am trying to run this simple Spark application with the spark-submit command using this quick start tutorial: http://spark.apache.org/docs/1.2.0/quick-start.html#self-contained-applications. When I try to run it using spark-1.4.0-bin-hadoop2.6\bin>spark-submit --class " SimpleApp" --master local[4] C:/.../Documents/Sparkapp/target/scala- 2.10/simple-project_2.10-1.0.jar I get the following exception: java.lang.ClassNotFoundException: SimpleApp at java.net.URLClassLoader$1.run(Unknown Source) at java.net.URLClassLoader$1.run(Unknown...

Convert RDD[Map[String,Double]] to RDD[(String,Double)]

scala,apache-spark,rdd
I did some calculations and returned my values in an RDD containing a Scala Map, and now I want to remove this Map and collect all key/value pairs into an RDD. Any help will be appreciated....
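
A minimal sketch, assuming rddOfMaps: RDD[Map[String, Double]] as described:

    import org.apache.spark.rdd.RDD

    // Each Map is an Iterable of (key, value) pairs, so flatMap unrolls it directly
    val pairs: RDD[(String, Double)] = rddOfMaps.flatMap(m => m)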

Spark streaming on YARN executor's logs not available

logging,apache-spark,yarn,spark-streaming
I'm running the following code .map{x => Logger.fatal("Hello World") x._2 } It's a Spark Streaming application that runs on YARN. I updated log4j and provided it with spark-submit (using --files). My log4j configuration was loaded, which I see from the logs, and applied to the driver's logs (I see my log level only and...

Spark: use reduceByKey instead of groupByKey and mapByValues

python,apache-spark,pyspark
I have an RDD with duplicate values in the following format: [ {key1: A}, {key1: A}, {key1: B}, {key1: C}, {key2: B}, {key2: B}, {key2: D}, ..] I would like the new RDD to have the following output and to get rid of duplicates. [ {key1: [A,B,C]}, {key2: [B,D]}, ..]...

ActorNotFound exception trying to run Spark 1.3.1 on Windows 7

apache-spark,akka
We are at a road block trying to understand why Spark 1.3.1 doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty much the same setup and everything works fine for me. I searched for the error message but still didn't find a resolution. Here...

File Processing with Spark and Cassandra

cassandra,apache-spark
Right now I'm working on loading a table from a Cassandra cluster into a Spark cluster with the Datastax Cassandra Spark Connector. Right now the spark program performs a simple mapreduce job that counts the number of rows in the Cassandra table. Everything is set up and run locally. The...

When should the groupByKey API be used in Spark programming?

apache-spark
groupByKey suffers from shuffling the data, and its functionality can be achieved by using either combineByKey or reduceByKey. So when should this API be used? Is there any use case?

How to map a log file in Java using Spark?

java,apache-spark
I have to monitor a log file in which the utilization history of an app is written. The log file is formatted in this way: <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> ... about 800000 rows AppId is always the same, because it refers to only one app; date is expressed...

Results of reduce and count differ in pyspark

python,hadoop,apache-spark
For my spark trials, I have downloaded the NY taxi csv files and merged them into a single file, nytaxi.csv . I then saved this in hadoop fs. I am using spark on yarn with 7 nodemanagers. I am connecting to spark over Ipython notebook. Here is a sample python...

How to create a keyspace in Cassandra?

java,eclipse,cassandra,apache-spark
I'm using a snippet to understand Cassandra and its syntax: import com.datastax.driver.core.Cluster; import com.datastax.driver.core.ResultSet; import com.datastax.driver.core.Row; import com.datastax.driver.core.Session; public class App { public static void main(String[] args) { Cluster cluster; Session session; // Connect to the cluster and key space "demo" cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); session = cluster.connect("demo"); // Insert one record...

“remoteContext object has no attribute”

amazon-s3,apache-spark,pyspark
I'm running Spark 1.4 in Databrick's Cloud. I loaded a file into my S3 instance and mounted it. Mounting worked. But I'm having trouble creating an RDD: dbutils.fs.mount("s3n://%s:%s@%s" % (ACCESS_KEY, SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME) Any ideas? sc.parallelize([1,2,3]) rdd = sc.textFiles("/mnt/GDELT_2014_EVENTS/GDELT_2014.csv") ...

What is the proper syntax for using broadcast variables in Spark with Scala?

scala,apache-spark,broadcast
I want to use a broadcast variable in Spark with Scala. But I can't find enough help on how to use them. Say, I have an object of class A, which I normally would declare as follows in Scala. val a = new A() What would be the syntax of...
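
A minimal sketch of the broadcast pattern (someRdd and A's doSomething method are placeholders; A would also need to be serializable):

    val a = new A()
    val aBroadcast = sc.broadcast(a)        // shipped once to each executor
    val result = someRdd.map { x =>
      aBroadcast.value.doSomething(x)       // read-only access inside tasks; doSomething is hypothetical
    }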

In Spark, does the filter function turn the data into tuples?

mapreduce,apache-spark,cloud
Just wondering, does filter turn the data into tuples? For example val filesLines = sc.textFile("file.txt") val split_lines = filesLines.map(_.split(";")) val filteredData = split_lines.filter(x => x(4)=="Blue") //from here if we wanted to map the data would it be using tuple format ie. x._3 OR x(3) val blueRecords = filteredData.map(x =>...
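
filter never changes the element type; it only drops elements. A sketch of the question's pipeline with the types spelled out:

    val filesLines   = sc.textFile("file.txt")
    val split_lines  = filesLines.map(_.split(";"))            // RDD[Array[String]]
    val filteredData = split_lines.filter(x => x(4) == "Blue") // still RDD[Array[String]]
    val blueRecords  = filteredData.map(x => x(3))             // so indexing stays x(3), not x._3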

Connecting from Spark/pyspark to PostgreSQL

postgresql,jdbc,jar,apache-spark,pyspark
I've installed Spark on a Windows machine and want to use it via Spyder. After some troubleshooting the basics seems to work: import os os.environ["SPARK_HOME"] = "D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6" from pyspark import SparkContext, SparkConf from pyspark.sql import SQLContext spark_config = SparkConf().setMaster("local[8]") sc = SparkContext(conf=spark_config) sqlContext = SQLContext(sc) textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md") textFile.count() textFile.filter(lambda...

Pyspark: using filter for feature selection

python,apache-spark,pyspark
I have an array of dimensions 500 x 26. Using the filter operation in pyspark, I'd like to pick out the columns which are listed in another array at row i. Ex: if a[i]= [1 2 3] Then pick out columns 1, 2 and 3 and all rows. Can this...

Map a table of a cassandra database using spark and RDD

java,mapreduce,apache-spark,rdd
I have to map a table in which the utilization history of an app is written. The table has these tuples: <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> AppId is always different, because it refers to many apps; date is expressed in this format dd/mm/yyyy hh/mm; cpuUsage and memoryUsage are...

how to parse a custom log file in scala to extract some key value pairs using patterns

java,regex,scala,apache-spark
I am building a spark streaming app that takes in logs coming out of a server. A log line looks something like this. 2015-06-18T13:53:46.606-0400 CustomLog v4 INFO: source="ABCD" type="type1" <xml some xml here attr1='value1' attr2='value2' > </xml> <some more xml></> time ="232" I am trying to follow the sample app...
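
A hedged sketch of one way to pull key="value" pairs out of such a line with a regex (the pattern only covers double-quoted values, and the sample line and names are illustrative):

    val line = "2015-06-18T13:53:46.606-0400 CustomLog v4 INFO: source=\"ABCD\" type=\"type1\" time =\"232\""
    val kvPattern = "(\\w+)\\s*=\\s*\"([^\"]*)\"".r
    val pairs: Map[String, String] =
      kvPattern.findAllMatchIn(line).map(m => m.group(1) -> m.group(2)).toMap
    // pairs: Map(source -> ABCD, type -> type1, time -> 232)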

Access key from mapValues or flatMapValues?

scala,apache-spark
In Spark 1.3, is there a way to access the key from mapValues? Specifically, if I have val y = x.groupBy(someKey) val z = y.mapValues(someFun) can someFun know which key of y it is currently operating on? Or do I have to do val y = x.map(r => (someKey(r), r)).groupBy(_._1)...
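
mapValues intentionally hides the key, so one sketch is to map over the whole pair instead (x, someKey and someFun are the question's placeholders, with someFun assumed to be adapted to accept the key):

    val y = x.groupBy(someKey)
    val z = y.map { case (key, values) => (key, someFun(key, values)) }  // someFun now receives the key too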

Graphx EdgeRDD count taking long time to compute

apache-spark,spark-graphx
I am running standalone Spark, and I have this code below related to EdgeRDD. These are graph edges loaded from a text file. There are around 67 million records. val edges: RDD[Edge[Int]] = edge_file.map(line => {val x = line.split("\\s+") Edge(x(0).toLong, x(1).toLong, x(2).toInt); }) val edges1: EdgeRDD[Int] = EdgeRDD.fromEdges(edges) println(edges1.count) The...

Is it possible to rerun a command in spark-shell via its history number?

scala,apache-spark
I'd like to be able to run a spark-shell command via its history number. When I type :history or :h?, I then cut and paste the command - even though the history command gives it an ID number. I'd like to be able to type :61 or something to just...

Zeppelin SqlContext registerTempTable issue

scala,apache-spark,apache-spark-sql,apache-zeppelin
I am trying to access some json data using sqlContext.jsonFile in Zeppelin... the following code executes without any error: import sys.process._ val sqlCon = new org.apache.spark.sql.SQLContext(sc) val jfile = sqlCon.jsonFile(s"file:///usr/local/src/knoldus/projects/scaladay_data/scalaDays2015Amsterdam_tweets.json") import sqlContext.implicits._ jfile.registerTempTable("jTable01") output : import sys.process._ sqlCon: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@... jfile: org.apache.spark.sql.DataFrame = [id: struct,...

Where are the API docs for org.apache.spark.sql.cassandra for Spark 1.3.x?

cassandra,apache-spark,apache-spark-sql
I'm writing a Spark job that uses Spark-Cassandra connector to connect to Cassandra from spark and then runs queries on Spark/Cassandra using Spark SQL. I was wondering where I could find the API docs for this? Looking at the api here https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package It would seem like the package doesn't even...

When using textFile to create an RDD in Spark, what is the index that is displayed in the result?

apache-spark,rdd
When I create an RDD using sc.textFile in Spark, I get a result like: org.apache.spark.rdd.RDD[String] = file:///home/cloudera/data MapPartitionsRDD[133] at textFile at <console>:23 What does the [133] represent? I see that it increases, so feels like some kind of ID....

Why does partition.size turn to zero in foreachPartition?

scala,apache-spark
In the following code (written in Scala), I print partition.size twice but get two different results. Code: val a = Array(1,2,3,4,5,6,7,8,9) val rdd = sc.parallelize(a) rdd.foreachPartition { partition => println("1. partition.size: " + partition.size) //1. partition.size: 2 println("2. partition.size: " + partition.size) //2. partition.size: 0 } Results: 1. partition.size: 2 2....
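
The partition argument is an Iterator, and Iterator.size consumes it, so the second call sees an exhausted iterator. A sketch that buffers it once (rdd is the question's parallelized array):

    rdd.foreachPartition { partition =>
      val elems = partition.toList                   // materialize the iterator once
      println("1. partition.size: " + elems.size)
      println("2. partition.size: " + elems.size)    // same value both times
    }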

Spark saving RDD[(Int, Array[Double])] to text file got strange result

apache-spark,mllib
I am trying to save the userFeatures of a MatrixFactorizationModel to a text file, which according to the doc is an RDD of type [(Int, Array[Double])]. So I just called model.userFeatures.saveAsTextFile("feature") However, the results I got are something like: (1,[D@...) (5,[D@...) (9,[D@...) (13,[D@...) (17,[D@...) (21,[D@...) (25,[D@...) (29,[D@...) (33,[D@...) (37,[D@...) (41,[D@...) (45,[D@...) (49,[D@...)...
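
That output is just Array[Double]'s default toString. A hedged sketch that formats the vectors before saving (the comma separator is an arbitrary choice):

    model.userFeatures
      .map { case (id, features) => id + "," + features.mkString(",") }   // e.g. "1,0.12,0.98,..."
      .saveAsTextFile("feature")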

Spark's takeSample() results in two stages

apache-spark,sample
I've observed interesting behavior in Spark 1.3.1, the reason for which is not clear. Doing something as simple as sc.textFile("...").takeSample(...) always results in two stages: ...

Error when running job that queries against Cassandra via Spark SQL through Spark Jobserver

cassandra,apache-spark,apache-spark-sql,spark-jobserver,spark-cassandra-connector
So I'm trying to run a job that simply runs a query against Cassandra using Spark SQL; the job is submitted fine and starts fine. This code works when it is not being run through Spark Jobserver (when simply using spark-submit). Could someone tell me what is wrong with...

Accessing csv file placed in hdfs using spark

csv,hadoop,apache-spark,pyspark
I have placed a CSV file into the HDFS filesystem using the hadoop -put command. I now need to access the CSV file using pyspark csv. Its format is something like `plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')` I am a newbie to HDFS. How do I find the address to be placed in hdfs://x.x.x.x?...

HashMap as a Broadcast Variable in Spark Streaming?

java,apache-spark,spark-streaming
I have some data that needs to be classified in spark streaming. The classification key-values are loaded at the beginning of the program in a HashMap. Hence each incoming data packet needs to be compared against these keys and tagged accordingly. I realize that spark has variables called broadcast variables...

Word count using Spark and Scala

scala,apache-spark
I have to write a program in Scala, using Spark, which counts how many times a word occurs in a text, but using the RDD my count variable always displays 0 at the end. Can you help me please? This is my code import scala.io.Source import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import...
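
For reference, a minimal word-count sketch (the input path and the word looked up are placeholders); a count that stays 0 is usually the result of mutating a driver-side var inside foreach instead of using an action that returns a value:

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Count of one specific word, computed as a returned value rather than a mutated var
    val occurrences = counts.lookup("spark").headOption.getOrElse(0)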

groupByKey not properly working in spark

scala,mapreduce,apache-spark
So, I have an RDD which has key-value pairs like the following. (Key1, Val1) (Key1, Val2) (Key1, Val3) (Key2, Val4) (Key2, Val5) After groupByKey, I expect to get something like this Key1, (Val1, Val2, Val3) Key2, (Val4, Val5) However, I see that the same keys are being repeated even after doing groupByKey()....

Removing duplicates from Spark RDDPair values

python,apache-spark,pyspark
I am new to Python and also Spark. I have a pair RDD containing (key, List), but some of the values are duplicates. The RDD is of the form (zipCode, streets). I want a pair RDD which does not contain duplicates. I am trying to achieve it using Python. Can anyone please help...

Task had a not serializable result in spark

apache-spark,spark-cassandra-connector
I am trying to read cassandra table using the cassandra driver to the spark. Here is the code. val x = 1 to 2 val rdd = sc.parallelize(x) val query = "Select data from testkeyspace.testtable where id=%d" val cc = CassandraConnector(sc.getConf) val res1 = rdd.map{ it => cc.withSessionDo{ session =>...

Is there a way to mimic R's higher order (binary) function shorthand syntax within spark or pyspark?

r,apache-spark,pyspark
In R, I can write the following: ## Explicit Reduce(function(x,y) x*y, c(1, 2, 3)) # returns 6 However, I can also do this less explicitly with the following: ## Less explicit Reduce(`*`, c(1, 2, 3)) # also returns 6 In pyspark, I could do the following: rdd = sc.parallelize([1, 2,...

scala spark word2vec try and catch exception

scala,apache-spark,try-catch
I am trying to create a dictionary using spark's word2vec. In the process, I create an Array of around 200 words, and apply the findSynonyms function to each of them. However, out of the 200 words, there will be a few words that will not return any Synonyms (due to...

Spark: Filtering out aggregated data?

scala,hash,apache-spark,filtering
There is a table with two columns, books and readers of these books, where books and readers are book and reader IDs, respectively. I need to remove from this table readers who read more than 10 books. First I group books by reader and get the sizes of these groups: val...
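
A minimal sketch, assuming an RDD of (reader, book) ID pairs under the hypothetical name readerBook:

    val keptPairs = readerBook                // readerBook: RDD[(Int, Int)] is an assumed name
      .groupByKey()                           // reader -> all books they read
      .filter { case (_, books) => books.size <= 10 }
      .flatMap { case (reader, books) => books.map(book => (reader, book)) }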

Error running spark app using spark-cassandra connector

cassandra,apache-spark,spark-cassandra-connector
I have written a basic spark app that reads and writes to Cassandra following this guide (https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md) This is what the .sbt for this app looks like: name := "test Project" version := "1.0" scalaVersion := "2.10.5" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.2.1", "com.google.guava" % "guava" % "14.0.1",...

Find mutual edges with Spark and GraphX

graph,apache-spark,vertices,edges,spark-graphx
I'm really new to Spark and GraphX. My question is: if I have a graph with some nodes that have mutual (reciprocal) edges between them, I want to select those edges with good performance. An example: Source Dst. 1 2 1 3 1 4 1 5 2 1 2...
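
One sketch that avoids GraphX entirely: treat the edge list as an RDD of (src, dst) pairs and join it with its own reverse (edges is an assumed name):

    val forward  = edges.map { case (src, dst) => ((src, dst), ()) }   // edges: RDD[(Long, Long)]
    val backward = edges.map { case (src, dst) => ((dst, src), ()) }
    val mutual   = forward.join(backward).keys   // pairs present in both directions, e.g. (1,2) and (2,1)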

Install SparkR that comes with Spark 1.4

r,apache-spark,sparkr
The newest version of Spark (1.4) now comes with SparkR. Does anyone know how to go about installing the SparkR implementation on Windows? The sparkR.R script is currently located in C:/spark-1.4.0/R/pkgs/R/ This appears to be a step in the right direction, but the instructions don't work for Windows as there...

Does the Scala ! subprocess command wait for the subprocess to finish?

scala,apache-spark
So, I have the following kind of code running in each map tasks on Spark. @volatile var res = (someProgram + fileName) ! var cmdRes = ("rm " + fileName) !; The filenames for each map tasks are unique. The basic idea is that once the first command finishes, the...

Include package in Spark local mode

python,apache-spark,py.test,pyspark
I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script. I'm using pytest to run my tests with Spark in local mode: conf = SparkConf().setAppName('myapp').setMaster('local[1]') sc = SparkContext(conf=conf) My question is, since...

Convert Apache Spark Scala code to Python

python,scala,apache-spark
Can anyone convert this very simple Scala code to Python? val words = Array("one", "two", "two", "three", "three", "three") val wordPairsRDD = sc.parallelize(words).map(word => (word, 1)) val wordCountsWithGroup = wordPairsRDD .groupByKey() .map(t => (t._1, t._2.sum)) .collect() ...

spark-1.4 with zeppelin installation

installation,apache-spark,apache-zeppelin
After successfully installing Spark 1.4, I tried to install Apache Zeppelin for its notebook-like utility. Following some other online resources, I downloaded and unzipped the Zeppelin source and started to compile with Maven via $ mvn clean install -Pspark-1.4 -DskipTests (spark home exported in the environment) I got nice INFO...

Spark UTF-8 error, non-English data becomes `??????????`

scala,hadoop,apache-spark
One of the fields in our data is in a non-English language (Thai). We can load the data into HDFS and the system displays the non-English field correctly when we run: hadoop fs -cat /datafile.txt However, when we use Spark to load and display the data, all the non-English data...

Does Spark from DSE load all data into an RDD before running a SQL query?

cassandra,apache-spark,datastax
Running DSE 4.7 So say I have a 4 node DSE Cassandra/Spark cluster... I have a Cassandra table with say 4,000,000 records in it. On Spark running the following Spark SQL "select * from table where email = ? or mobile = ?" Will Spark load all the data into...

Spark resources not fully allocated on Amazon EMR

apache-spark,yarn,emr
I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I get one of...

Passing a function for each key of an Array

scala,apache-spark,scala-collections,spark-graphx
I have an array like this: val pairs: Array[(Int, ((VertexId, Seq[Int]), Int))] which generates this output: (11,((11,ArraySeq(2, 5, 4, 5)),1)) (11,((12,ArraySeq(7, 7, 8, 2)),1)) (11,((13,ArraySeq(5, 9, 8, 7)),1)) (1,((1,ArraySeq(1, 2, 3, 4)),1)) (1,((4,ArraySeq(1, 5, 1, 1)),1)) I want to build a Graph for each pairs._1. That means for...

Spark executors with different amounts of memory on Mesos

apache-spark,mesos
Is it possible to have executors with different amounts of memory on a Mesos cluster? Or am I bounded by the machine with the least memory? (Assuming I want to use all available cpus).

Dataframe is not saved into Cassandra

java,cassandra,apache-spark,apache-spark-sql,spark-cassandra-connector
I have one application with Spark (version 1.4.0) and Spark-Cassandra-connector (version 1.3.0-M1). In it, I am trying to store one dataframe into a Cassandra table which has two columns (total, message), and I have already created the table in Cassandra with these two columns. Here is my code, scoredTweet.foreachRDD(new Function2<JavaRDD<Message>,Time,Void>(){ @Override public Void...

Scala - sort RDD partitions

scala,sorting,apache-spark,rdd
Assume I have an RDD of Integers from 1 to 1,000,000,000 and I want to print them in order using foreachPartition. There might be a situation where the partition 5-6-7-8 is printed before 1-2-3-4. How can I prevent this? Thanks, Maya...
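
A hedged sketch: sort first (which range-partitions the data in ascending order), then pull partitions to the driver one at a time so printing happens sequentially:

    val sorted = rdd.sortBy(identity)          // rdd: RDD[Int] is the question's RDD
    sorted.toLocalIterator.foreach(println)    // iterates partition 0, then 1, ... on the driver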

ReduceByKey with a byte array as the key

apache-spark,rdd
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are considered to be different values because their reference values are different. I didn't see any way to pass in a custom comparer. I could convert the byte[] into a String with an explicit charset,...
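
A common workaround sketch: wrap the byte[] in a Scala Seq, which has content-based equals/hashCode, reduce, then unwrap (pairs and combine are assumed names):

    val reduced = pairs                               // pairs: RDD[(Array[Byte], V)]
      .map { case (bytes, v) => (bytes.toSeq, v) }    // Seq[Byte] compares by contents, not by reference
      .reduceByKey(combine)                           // combine: (V, V) => V supplied by the caller
      .map { case (key, v) => (key.toArray, v) }      // back to Array[Byte] keys if needed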

More than expected jobs running in apache spark

apache-spark,bigdata,pyspark
I am trying to learn Apache Spark. This is the code I am trying to run, using the PySpark API. data = xrange(1, 10000) xrangeRDD = sc.parallelize(data, 8) def ten(value): """Return whether value is below ten. Args: value (int): A number. Returns: bool: Whether `value` is less than ten....

Which Spark MLlib algorithm to use?

machine-learning,apache-spark
I'm a newbie to machine learning and would like to understand which algorithm (a classification algorithm or a correlation algorithm?) to use in order to understand the relationship between one or more attributes. For example, consider the following set of attributes: Bill No, Bill Amount, Tip amount, Waiter Name and...

Pyspark StructType is not defined

python,apache-spark,pyspark
I'm trying to construct a schema for DB testing, and StructType apparently isn't working for some reason. I'm following a tutorial, and it doesn't import any extra module. <type 'exceptions.NameError'>, NameError("name 'StructType' is not defined",), <traceback object at 0x2b555f0>) I'm on Spark 1.4.0, and Ubuntu 12 if that has anything...

Spark on yarn jar upload problems

java,hadoop,mapreduce,apache-spark
I am trying to run a simple Map/Reduce java program using spark over yarn (Cloudera Hadoop 5.2 on CentOS). I have tried this 2 different ways. The first way is the following: YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/; /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --jars /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar simplemr.jar This method gives the following error: diagnostics: Application application_1434177111261_0007...

How do I flatMap a row of arrays into multiple rows?

apache-spark,apache-spark-sql
After parsing some jsons I have a one-column DataFrame of arrays scala> val jj =sqlContext.jsonFile("/home/aahu/jj2.json") res68: org.apache.spark.sql.DataFrame = [r: array<bigint>] scala> jj.first() res69: org.apache.spark.sql.Row = [List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)] I'd like to explode each row out into several rows. How? edit: Original json file:...

How to extract application ID from the PySpark context

apache-spark,yarn,pyspark
A previous question recommends sc.applicationId, but it is not present in PySpark, only in scala. So, how do I figure out the application id (for yarn) of my PySpark process?...

Populate csv with Scala

python,scala,csv,apache-spark
I have a csv file which is more or less "semi-structured" rowNumber;ColumnA;ColumnB;ColumnC; 1;START; b; c; 2;;;; 4;;;; 6;END;;; 7;START;q;x; 10;;;; 11;END;;; Now I would like to get the data of this row --> 1;START; b; c; populated until it finds an 'END' in ColumnA. Then it should take this row -->...

Spark stream unable to read files created from flume in hdfs

hadoop,apache-spark,hdfs,spark-streaming,flume-ng
I have created a real-time application in which I am writing data streams to HDFS from weblogs using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I am...

spark-mllib: Error “reassignment to val” in source code

apache-spark,mllib
I'm using IDEA SBT project to test spark-mllib code. Here is build.sbt: name := "SparkTest" version := "1.0" scalaVersion := "2.11.6" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.2.0", "org.apache.spark" %% "spark-mllib" % "1.2.0" ) After all the import and compile work has done, I found some errors in lib...

Subtracting an RDD from another RDD doesn't work correctly

scala,apache-spark,spark-graphx
I want to subtract an RDD from another RDD. I looked into the documentation and I found that subtract can do that. Actually, when I tested subtract, the final RDD remains the same and the values are not removed ! Is there any other function to do that ? Or...

Error calling `JValue.extract` from distributed operations in spark-shell

apache-spark,json4s
I am trying to use the case class extraction feature of json4s in Spark, ie calling jvalue.extract[MyCaseClass]. It works fine if I bring the JValue objects into the master and do the extraction there, but the same calls fail in the workers: import org.json4s._ import org.json4s.jackson.JsonMethods._ import scala.util.{Try, Success, Failure}...

Can't access Ganglia on EC2 Spark cluster

amazon-ec2,apache-spark,ganglia
Launching using spark-ec2 script results in: Setting up ganglia RSYNC'ing /etc/ganglia to slaves... <...> Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond: [ OK ] Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond: [ OK ] Connection to <...> closed. <...> Stopping httpd: [FAILED] Starting httpd: httpd: Syntax error on...

How to find spark master URL on Amazon EMR

apache-spark,spark-streaming,amazon-emr
I am new to Spark and trying to install Spark 1.3.1 on an Amazon cluster. When I do SparkConf sparkConfig = new SparkConf().setAppName("SparkSQLTest").setMaster("local[2]"); it does work for me; however, I came to know that this is for testing purposes and I can set local[2]. When I tried to use...

reduceByKey with two columns in Spark

python,apache-spark,reduce,pyspark
I am trying to group by two columns in Spark and am using reduceByKey as follows: pairsWithOnes = (rdd.map(lambda input: (input.column1,input.column2, 1))) print pairsWithOnes.take(20) The above map command works fine and produces three columns with the third one being all ones. I tried summing the third by the first...

Count of second dimension in two dimension data in Spark

apache-spark
I have data in this format (apple, laptop) (apple, laptop) (apple, ipad) (dell, laptop) I want the output to be (apple, laptop, 2) (apple, ipad, 1) (dell, laptop, 1) I wanted to do this using groupBy and then count, but groupBy does not allow grouping based on two columns....
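
A minimal sketch: use the whole (brand, product) pair as the key and count with reduceByKey:

    val data = sc.parallelize(Seq(
      ("apple", "laptop"), ("apple", "laptop"), ("apple", "ipad"), ("dell", "laptop")))
    val counts = data.map(pair => (pair, 1)).reduceByKey(_ + _)
      .map { case ((brand, product), n) => (brand, product, n) }
    // (apple,laptop,2), (apple,ipad,1), (dell,laptop,1)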

How to transform a tabular data into transactions in spark(scala)?

scala,apache-spark
I have an order transaction dataset, which looks like the following table 1,John,iPhone Cover,9.99 2,Jack,iPhone Cover,9.99 4,Jill,Samsung Galaxy Cover,9.95 3,John,Headphones,5.49 5,Bob,iPad Cover,5.45 I am considering grouping data within certain differences into different transactions. For example, I would group product 1,2,4 into transaction list List(1,2,4) for their absolute differences in price...

scala.MatchError: in Dataframes

java,scala,apache-spark,spark-streaming,apache-spark-sql
I have one Spark (version 1.3.1) application. In it, I am trying to convert one Java bean RDD JavaRDD<Message> into a Dataframe; it has many fields with different data types (Integer, String, List, Map, Double). But when I execute my code, messages.foreachRDD(new Function2<JavaRDD<Message>,Time,Void>(){ @Override public Void call(JavaRDD<Message> arg0, Time arg1)...

Spark streaming transform function

java,apache-spark,spark-streaming
I am having compilation errors in the transform function for Spark Streaming. Specifically, I seem to be missing finalizing the DStream variable or something similar. I have copied from the AMPLab tutorials, so I'm slightly confused... Here is the code; the problem is in the transform function towards the end. Here is...

How to add any new library like spark-csv in Apache Spark prebuilt version

python,apache-spark,apache-spark-sql
I have built spark-csv and am able to use it from the pyspark shell using the following command bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 but I get an error: >>> df_cat.save("k.csv","com.databricks.spark.csv") Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/dataframe.py", line 209, in save self._jdf.save(source, jmode, joptions) File...

How to use spark for map-reduce flow to select N columns, top M rows of all csv files under a folder?

hadoop,mapreduce,apache-spark,spark-streaming,pyspark
To be concrete, say we have a folder with 10k of tab-delimited csv files with following attributes format (each csv file is about 10GB): id name address city... 1 Matt add1 LA... 2 Will add2 LA... 3 Lucy add3 SF... ... And we have a lookup table based on "name"...

How to get rid of “Spark assembly has been built with Hive, including Datanucleus jars on classpath” message?

cron,apache-spark
I am running an Apache Spark app as a cron job, but I keep getting emails with the following message: Spark assembly has been built with Hive, including Datanucleus jars on classpath My cron entry is something like the following ...home/sparkJob.sh > /home/SaprkJobOperation-`date +\%Y\%m\%d\%H`-cron.log I understand that there are...

Spark .mapValues setup with multiple values

scala,apache-spark
I am trying to set up mapValues so I can do something. I have created the following RDD: res10: Array[(Int, (Double, Double, Double))] = Array((1,(9.1383276E7,1.868480924818E12,4488.0)), (22,(107667.11999999922,2582934.208799982,4488.0)), (2,(2.15141303E8,1.0585204549689E13,4488.0)), (3,(4488.0,4488.0,4488.0)), (44,(0.0,0.0,4488.0)), (18,(1348501.0,4.06652001E8,4488.0)), (9,(4488.0,4488.0,4488.0))) I am trying to implement the following code but something is off in my syntax: val dataStatsVals = dataStatsRDD.mapValues(x =>...

How to un-nest a spark rdd that has the following type ((String, scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Int]]))

scala,cassandra,apache-spark
It's a nested map with contents like this when I print it to the screen (5, Map ( "ABCD" -> Map("3200" -> 3, "3350.800" -> 4, "200.300" -> 3) (1, Map ( "DEF" -> Map("1200" -> 32, "1320.800" -> 4, "2100" -> 3) I need to get something like this Case...

Join files using Apache Spark / Spark SQL

java,apache-spark,apache-spark-sql
I am trying to use Apache Spark for comparing two different files based on some common field, and get the values from both files and write it as output file. I am using Spark SQL for joining both files (after storing the RDD as table). Is this the correct approach?...

Apache Spark, add a "CASE WHEN … ELSE …" calculated column to an existing DataFrame

scala,apache-spark,dataframes,apache-spark-sql
I'm trying to add an "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame, using Scala APIs. Starting dataframe: color Red Green Blue Desired dataframe (SQL syntax: CASE WHEN color == Green THEN 1 ELSE 0 END AS bool): color bool Red 0 Green 1 Blue 0 How...
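
A minimal sketch with the Spark 1.4-era DataFrame functions (df stands for the starting DataFrame from the question):

    import org.apache.spark.sql.functions.{when, col, lit}

    // CASE WHEN color = 'Green' THEN 1 ELSE 0 END AS bool
    val withBool = df.withColumn("bool", when(col("color") === "Green", lit(1)).otherwise(lit(0)))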

Issue with UDF on a column of Vectors in PySpark DataFrame

apache-spark,apache-spark-sql,pyspark,spark-sql
I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from pyspark.sql import Row from pyspark.sql.types import DoubleType from pyspark.sql.functions import udf from pyspark.mllib.linalg import Vectors FeatureRow = Row('id', 'features') data = sc.parallelize([(0, Vectors.dense([9.7, 1.0, -3.2])), (1,...