

PySpark No suitable driver found for jdbc:mysql://dbhost

apache-spark,apache-spark-sql,pyspark
I am trying to write my dataframe to a MySQL table. I am getting No suitable driver found for jdbc:mysql://dbhost when I try to write. As part of the preprocessing I read from other tables in the same DB and have no issues doing that. I can do the full run...

How to use Spark SQL DataFrame with flatMap?

scala,apache-spark,apache-spark-sql
I am using the Spark Scala API. I have a Spark SQL DataFrame (read from an Avro file) with the following schema: root |-- ids: array (nullable = true) | |-- element: map (containsNull = true) | | |-- key: integer | | |-- value: string (valueContainsNull = true) |--...

Spark sql: query with case and thousands of columns

mysql,apache-spark,cloudera-cdh,apache-spark-sql,spark-sql
I have a table with two thousand columns. I need to modify a few columns' data based on a flag column. tableSchemaRDD.registerAsTable("customer") var results = sqlContext.sql("select *,case when flag1 = 'A' then null else charges end as charges, flag2 = 'B' then null then else stax end as stax from customer") flag1,flag2,...
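A minimal sketch of the intended query, reusing the table and column names from the excerpt; each conditional column needs its own complete CASE WHEN ... THEN ... ELSE ... END expression:
    val results = sqlContext.sql("""
      SELECT *,
             CASE WHEN flag1 = 'A' THEN NULL ELSE charges END AS charges,
             CASE WHEN flag2 = 'B' THEN NULL ELSE stax END AS stax
      FROM customer""")
    // note: SELECT * plus the recomputed columns yields duplicate column names,
    // so listing the wanted columns explicitly may be preferable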

scala.MatchError: in Dataframes

java,scala,apache-spark,spark-streaming,apache-spark-sql
I have one Spark (version 1.3.1) application, in which I am trying to convert a Java bean RDD JavaRDD<Message> into a DataFrame; it has many fields with different data types (Integer, String, List, Map, Double). But when I execute my code: messages.foreachRDD(new Function2<JavaRDD<Message>,Time,Void>(){ @Override public Void call(JavaRDD<Message> arg0, Time arg1)...

Calculating duration by subtracting two datetime columns in string format

apache-spark,apache-spark-sql,pyspark
I have a Spark DataFrame that consists of a series of dates: from pyspark.sql import SQLContext from pyspark.sql import Row from pyspark.sql.types import * sqlContext = SQLContext(sc) import pandas as pd rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'), ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'), ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),...

Why does a Spark application fail with “executor.CoarseGrainedExecutorBackend: Driver Disassociated”?

apache-spark,apache-spark-sql
When I execute a SQL query via spark-submit or spark-sql, the corresponding Spark application always fails with an error like the following: 15/03/10 18:50:52 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://…@…:60697/user/HeartbeatReceiver 15/03/10 18:52:08 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://…@…:35643] -> [akka.tcp://…@…:60697] disassociated! Shutting down. and the above is just one of the errors; I used "yarn logs -application...

Error when running job that queries against Cassandra via Spark SQL through Spark Jobserver

cassandra,apache-spark,apache-spark-sql,spark-jobserver,spark-cassandra-connector
So I'm trying to run a job that simply runs a query against Cassandra using Spark SQL; the job is submitted fine and starts fine. This code works when it is not being run through Spark Jobserver (when simply using spark-submit). Could someone tell me what is wrong with...

How do I flatMap a row of arrays into multiple rows?

apache-spark,apache-spark-sql
After parsing some JSON I have a one-column DataFrame of arrays scala> val jj = sqlContext.jsonFile("/home/aahu/jj2.json") res68: org.apache.spark.sql.DataFrame = [r: array<bigint>] scala> jj.first() res69: org.apache.spark.sql.Row = [List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)] I'd like to explode each row out into several rows. How? edit: Original json file:...
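A hedged sketch, assuming Spark 1.3's DataFrame.explode and the array column named r from the excerpt:
    // one output row per element of the array column "r"
    val exploded = jj.explode("r", "num") { r: Seq[Long] => r }
    // alternatively, drop to the RDD level:
    // val nums = jj.flatMap(row => row.getAs[Seq[Long]](0))   // RDD[Long]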

How to load DataFrame directly to Hive in Spark

hive,apache-spark,apache-spark-sql
Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file and then loading it into Hive. But I am wondering if I can save the dataframe directly to Hive...
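A minimal sketch, assuming Spark 1.3 with a HiveContext (the table names are made up); persistent Hive tables need a HiveContext rather than a plain SQLContext:
    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    // df must have been created through hiveContext for this to land in the metastore
    df.saveAsTable("my_hive_table")
    // or go through HiveQL explicitly:
    df.registerTempTable("df_tmp")
    hiveContext.sql("CREATE TABLE my_hive_table_2 AS SELECT * FROM df_tmp")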

Where are the API docs for org.apache.spark.sql.cassandra for Spark 1.3.x?

cassandra,apache-spark,apache-spark-sql
I'm writing a Spark job that uses the Spark-Cassandra connector to connect to Cassandra from Spark and then runs queries on Spark/Cassandra using Spark SQL. I was wondering where I could find the API docs for this. Looking at the API here https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package it would seem like the package doesn't even...

Is there better way to display entire Spark SQL DataFrame?

scala,apache-spark,apache-spark-sql
I would like to display the entire Apache Spark SQL DataFrame with the Scala API. I can use the show() method: myDataFrame.show(Int.MaxValue) Is there a better way to display an entire DataFrame than using Int.MaxValue?...
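One hedged alternative is to bypass show() and print the rows yourself; this only makes sense if the DataFrame fits on the driver:
    // materialise everything on the driver and print it
    myDataFrame.collect().foreach(println)
    // or stream partitions to the driver one at a time
    myDataFrame.rdd.toLocalIterator.foreach(println)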

Spark Cassandra SQL can't perform DataFrame methods on query results

scala,cassandra,apache-spark-sql,spark-cassandra-connector
So I have a Spark-Cassandra cluster that I am trying to execute sql queries on. I build a jar with sbt assembly then I submit it with spark-submit. This works fine when I am not using spark-sql. When I am using spark sql I get an error, below is the...

Apache Spark, add a “CASE WHEN … ELSE …” calculated column to an existing DataFrame

scala,apache-spark,dataframes,apache-spark-sql
I'm trying to add an "CASE WHEN ... ELSE ..." calculated column to an existing DataFrame, using Scala APIs. Starting dataframe: color Red Green Blue Desired dataframe (SQL syntax: CASE WHEN color == Green THEN 1 ELSE 0 END AS bool): color bool Red 0 Green 1 Blue 0 How...

Inserting Analytic data from Spark to Postgres

java,postgresql,cassandra,apache-spark,apache-spark-sql
I have a Cassandra database from which I analyzed the data using Spark SQL through Apache Spark. Now I want to insert that analyzed data into PostgreSQL. Is there any way to achieve this directly, apart from using the PostgreSQL driver (I achieved it using postREST and the driver; I want to...

Filter out rows with NaN values for certain column

scala,apache-spark,apache-spark-sql
I have a dataset and in some of the rows an attribute value is NaN. This data is loaded into a dataframe and I would like to use only the rows where all attributes have values. I tried doing it via SQL: val df_data = sqlContext.sql("SELECT...
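Besides SQL, a hedged DSL sketch: keep only rows where the column is not NaN via a small UDF (the column name "attr" is made up, and the column is assumed to be a double):
    import org.apache.spark.sql.functions.udf
    val notNaN = udf { d: Double => !d.isNaN }
    val cleaned = df_data.filter(notNaN(df_data("attr")))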

Spark: How to translate count(distinct(value)) in Dataframe API's

count,apache-spark,distinct,dataframes,apache-spark-sql
I'm trying to compare different ways to aggregate my data. This is my input data with 2 elements (page, visitor): (PAG1,V1) (PAG1,V1) (PAG2,V1) (PAG2,V2) (PAG2,V1) (PAG1,V1) (PAG1,V2) (PAG1,V1) (PAG1,V2) (PAG1,V1) (PAG2,V2) (PAG1,V3) Working with a SQL command in Spark SQL with this code: import sqlContext.implicits._ case class Log(page: String, visitor: String)...
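For the DataFrame side of the comparison, a minimal sketch using countDistinct from org.apache.spark.sql.functions (the DataFrame name logs, built from the Log case class in the excerpt, is an assumption):
    import org.apache.spark.sql.functions.countDistinct
    val distinctVisitors = logs.groupBy("page").agg(countDistinct("visitor"))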

Defining a dictionary with a very large amount of columns

python,dictionary,apache-spark,dataframes,apache-spark-sql
I have a dataset that I want to move to Spark SQL. This dataset has about 200 columns. The best way I have found to do this is to map the data to a dictionary and then move that dictionary to a Spark SQL table. The problem is that if I...

Spark SQL optimal configurations for a single node process?

apache-spark,spark-streaming,apache-spark-sql
We're using Spark SQL's great in memory sql functionality to join and parse some local data files before uploading them elsewhere. While we're happy with the functionality, we'd like to tweak the configs to squeeze some extra performance out. We don't have a cluster, but will likely have 5 individual...

HBase Get values where rowkey in

hadoop,apache-spark,hbase,apache-spark-sql
How do I get all the values in HBase given Rowkey values? val tableName = "myTable" val hConf = HBaseConfiguration.create() val hTable = new HTable(hConf, tableName) val theget= new Get(Bytes.toBytes("1001-A")) // rowkey values (1001-A, 1002-A, 2010-A, ...) val result = hTable.get(theget) val values = result.listCells() The code above only works...
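A hedged sketch of a batch lookup: HTable.get also accepts a java.util.List of Gets, so one Get per rowkey can be issued in a single call (rowkeys taken from the comment in the excerpt):
    import scala.collection.JavaConverters._
    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes
    val rowKeys = Seq("1001-A", "1002-A", "2010-A")
    val gets = rowKeys.map(k => new Get(Bytes.toBytes(k))).asJava
    val results = hTable.get(gets)                 // Result[], one entry per Get
    results.foreach(r => println(r.listCells()))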

compute string length in Spark SQL DSL

documentation,apache-spark,string-length,apache-spark-sql
I've been trying to compute on the fly the length of a string column in a SchemaRDD for orderBy purposes. I am learning Spark SQL so my question is strictly about using the DSL or the SQL interface that Spark SQL exposes, or to know their limitations. My first attempt...
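A hedged sketch, assuming Spark 1.3+ (where SchemaRDD has become DataFrame) and a string column hypothetically named "name": a tiny UDF gives the DSL a length expression to sort on, and with a HiveContext the built-in LENGTH() is also available through SQL:
    import org.apache.spark.sql.functions.udf
    val strLen = udf { s: String => s.length }
    val ordered = df.orderBy(strLen(df("name")))
    // HiveContext alternative:
    // df.selectExpr("name", "LENGTH(name) AS name_len").orderBy("name_len")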

Spark: Group RDD Sql Query

sql,hadoop,apache-spark,rdd,apache-spark-sql
I have 3 RDDs that I need to join. val event1001RDD: schemaRDD = [eventtype,id,location,date1] [1001,4929102,LOC01,2015-01-20 10:44:39] [1001,4929103,LOC02,2015-01-20 10:44:39] [1001,4929104,LOC03,2015-01-20 10:44:39] val event2009RDD: schemaRDD = [eventtype,id,celltype,date1] (not grouped by id since I need 4 dates from this depending on celltype) [2009,4929101,R01,2015-01-20 20:44:39] [2009,4929102,R02,2015-01-20 14:00:00] (RPM) [2009,4929102,P01,2015-01-20 12:00:00] (PPM) [2009,4929102,R03,2015-01-20 15:00:00] (RPM)...

Using HiveContext in Maven project

eclipse,scala,maven,apache-spark,apache-spark-sql
I have built Spark 1.2.1 using Maven to enable Hive support, using the following command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package which resulted in some class files generated in the /spark-1.2.1/core/target/scala-2.10/classes folder. Now how do I use this newly built Spark in my Eclipse + Maven project? I want...

Spark Streaming : Join Dstream batches into single output Folder

hadoop,apache-spark,spark-streaming,apache-spark-sql,twitter-streaming-api
I am using Spark Streaming to fetch tweets from Twitter by creating a StreamingContext as: val ssc = new StreamingContext("local[3]", "TwitterFeed",Minutes(1)) and creating the twitter stream as: val tweetStream = TwitterUtils.createStream(ssc, Some(new OAuthAuthorization(Util.config)),filters) then saving it as a text file tweets.repartition(1).saveAsTextFiles("/tmp/spark_testing/") and the problem is that the tweets are being...

How is spark HiveContext/SQLContext retrieving schema/data?

apache-spark,apache-spark-sql
I can't seem to find much documentation on it, but when I pull data from Hive in Spark SQL, how is it retrieving the schema? Is it automatically looking in the Hive metastore? And is it Hive telling Spark to look at the file location to pull the data into...

How to extract hashtags (or other Arrays) from Twitter Tweets in Apache Spark

json,twitter,apache-spark,apache-spark-sql,arraybuffer
I am trying to do analysis on Twitter Tweet data with Apache Spark from a file of JSON Tweet objects. Here's how I'm loading it in with Spark's jsonFile method: val sqc = new org.apache.spark.sql.SQLContext(sc) val tweets = sqc.jsonFile("stored_tweets/*.json") tweets.registerTempTable("tweets") Next, I sample only the hashtag entities with the following...

Issue with UDF on a column of Vectors in PySpark DataFrame

apache-spark,apache-spark-sql,pyspark,spark-sql
I am having trouble using a UDF on a column of Vectors in PySpark which can be illustrated here: from pyspark import SparkContext from pyspark.sql import Row from pyspark.sql.types import DoubleType from pyspark.sql.functions import udf from pyspark.mllib.linalg import Vectors FeatureRow = Row('id', 'features') data = sc.parallelize([(0, Vectors.dense([9.7, 1.0, -3.2])), (1,...

Does any Cloudera Hadoop distribution support Apache Spark SQL?

apache-spark,cloudera-cdh,apache-spark-sql
I am new to Apache Spark. I heard that none of the versions of CDH support Apache Spark SQL as of now, and the same is the case with the Hortonworks distribution. Is that true? Also, I have CDH 5.0.0 installed on my PC; which version of Apache Spark...

Exception when using UDT in Spark DataFrame

apache-spark,apache-spark-sql
I'm trying to create a user defined type in spark sql, but I receive: com.ubs.ged.risk.stdout.spark.ExamplePointUDT cannot be cast to org.apache.spark.sql.types.StructType even when using their example. Has anyone made this work? My code: test("udt serialisation") { val points = Seq(new ExamplePoint(1.3, 1.6), new ExamplePoint(1.3, 1.8)) val df = SparkContextForStdout.context.parallelize(points).toDF() } @SQLUserDefinedType(udt...

Dataframe is not saved into Cassandra

java,cassandra,apache-spark,apache-spark-sql,spark-cassandra-connector
I have one application with Spark (version 1.4.0) and Spark-Cassandra-connector (version 1.3.0-M1), in which I am trying to store one dataframe into a Cassandra table which has two columns (total, message). I have already created the table in Cassandra with these two columns. Here is my code: scoredTweet.foreachRDD(new Function2<JavaRDD<Message>,Time,Void>(){ @Override public Void...

How to enable SQL on SchemaRDD via the JDBC interface? (is it even possible ?)

hive,apache-spark,scala-2.10,apache-spark-sql
UPDATING the problem statement We are using spark 1.2.0 (Hadoop 2.4). We have defined SchemaRDDs using data files in HDFS and would like to enable querying these as tables via HiveServer2. We are encountering runtime exceptions while trying to saveAsTable and would like guidance on how to proceed. Source code:...

Skip/Take with Spark SQL

sql,scala,apache-spark,datastax-enterprise,apache-spark-sql
How would one go about implementing a skip/take query (typical server-side grid paging) using Spark SQL? I have scoured the net and can only find very basic examples such as these here: https://databricks-training.s3.amazonaws.com/data-exploration-using-spark-sql.html I don't see any concept of ROW_NUMBER() or OFFSET/FETCH like with T-SQL. Does anyone know how...
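There is no built-in OFFSET/FETCH; a common workaround (a sketch under assumptions, not a definitive answer; the ordering column is made up) is to fix an ordering, number the rows with zipWithIndex, and filter on the index range:
    val skip = 20
    val take = 10
    val ordered = df.orderBy("someColumn")          // paging only makes sense over a fixed order
    val pageRDD = ordered.rdd
      .zipWithIndex()
      .filter { case (_, idx) => idx >= skip && idx < skip + take }
      .map { case (row, _) => row }
    val pageDF = sqlContext.createDataFrame(pageRDD, df.schema)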

How to install Apache Zeppelin on existing Apache Spark standalone cluster

amazon-web-services,apache-spark,bigdata,apache-spark-sql,apache-zeppelin
I have an existing Apache Spark (version 1.3) standalone cluster on AWS and I would like to install Apache Zeppelin. I have a very simple question: do I have to install Zeppelin on the Spark master? If the answer is yes, could I use this guide https://github.com/apache/incubator-zeppelin#build ? thank you...

Apache Spark MySQL JavaRDD.foreachPartition - why am I getting ClassNotFoundException?

java,mysql,apache-spark,classnotfoundexception,apache-spark-sql
I would like to save data from each partition to a MySQL database. To do that I created a class which implements VoidFunction<>: public class DatabaseSaveFunction implements VoidFunction<Iterator<String>> { /** * */ private static final long serialVersionUID = -7039277486852158360L; public void call(Iterator<String> it) { Connection connect = null; PreparedStatement preparedStatement =...

Spark SQL Stackoverflow

apache-spark,apache-spark-sql
I'm a newbie to Spark and Spark SQL and I was trying to run the example from the Spark SQL website: just a simple SQL query after loading the schema and data from a directory of JSON files, like this: import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val path =...

How to pass list of values, json pyspark

apache-spark,apache-spark-sql,pyspark,sparklines
>>> from pyspark.sql import SQLContext >>> sqlContext = SQLContext(sc) >>> rdd = sqlContext.jsonFile("tmp.json") >>> rdd_new= rdd.map(lambda x:x.name,x.age) This works properly. But there is a list of values list1=["name","age","gene","xyz",.....] When I pass For each_value in list1: `rdd_new=rdd.map(lambda x:x.each_value)` I get an error ...

Removing Unnecessary JSON fields using SPARK (SQL)

json,apache-spark,apache-spark-sql
I'm a new spark user currently playing around with Spark and some big data and I have a question related to Spark SQL or more formally the SchemaRDD. I'm reading a JSON file containing data about some weather forecasts and I'm not really interested in all of the fields that...

java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL

scala,jdbc,apache-spark,apache-spark-sql
I'm hitting a very strange problem when trying to load a JDBC DataFrame into Spark SQL. I've tried several Spark clusters - YARN, standalone cluster and pseudo distributed mode on my laptop. It's reproducible on both Spark 1.3.0 and 1.3.1. The problem occurs in both spark-shell and when executing the code with...
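A hedged sketch of the usual workaround: ship the JDBC driver jar to both the driver and the executors and, depending on the Spark version, name the driver class explicitly in the options (jar name, URL and table below are placeholders):
    // spark-submit --jars mysql-connector-java-5.1.35.jar \
    //              --driver-class-path mysql-connector-java-5.1.35.jar ...
    val jdbcDF = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://dbhost:3306/mydb",
      "dbtable" -> "my_table",
      "driver"  -> "com.mysql.jdbc.Driver"))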

Should I use registerDataFrameAsTable in Spark SQL?

apache-spark,apache-spark-sql,pyspark
During migration from PySpark to Spark with Scala I encountered a problem caused by the fact that SqlContext's registerDataFrameAsTable method is private. It made me think that my approach might be incorrect. In PySpark I do the following: load each table: df = sqlContext.load(source, url, dbtable), then register each sqlContext.registerDataFrameAsTable(df,...
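In Scala the public counterpart of registerDataFrameAsTable is a method on the DataFrame itself; a minimal sketch (source options, variable names and table name are placeholders):
    val df = sqlContext.load(source, Map("url" -> url, "dbtable" -> dbtable))
    df.registerTempTable("my_table")
    val result = sqlContext.sql("SELECT COUNT(*) FROM my_table")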

SparkSQL UDF Registration in Java8

apache-spark,java-8,dataframes,apache-spark-sql
I'm using Spark 1.3.0 on Java 8. I've got no issues setting up my SQLContext and creating dataframes; the Spark DSL is pretty smooth. But I want to use a custom UDF. According to the Spark documentation: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#udf-registration-moved-to-sqlcontextudf-java--scala sqlCtx.udf().register("strLen", (String s) -> { s.length(); }); should do it for registering...

Add new column in DataFrame based on existing column

scala,apache-spark,apache-spark-sql
I have a csv file with datetime column: "2011-05-02T04:52:09+00:00". I am using scala, the file is loaded into spark DataFrame and I can use jodas time to parse the date: val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df = new SQLContext(sc).load("com.databricks.spark.csv", Map("path" -> "data.csv", "header" -> "true")) val d...
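A hedged sketch of deriving a new column from the datetime string with a UDF (the column name "datetime" and the extracted field are assumptions):
    import org.apache.spark.sql.functions.udf
    import org.joda.time.DateTime
    val toYear = udf { s: String => new DateTime(s).getYear }
    val withYear = df.withColumn("year", toYear(df("datetime")))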

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

apache-spark,apache-spark-sql,pyspark
Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), ('Baz',22,'US',6), ('Baz',36,'US',6)]) What I would like to do is remove duplicate rows based on the values of the first,third and fourth columns only. Removing entirely duplicate rows is straightforward: data = data.distinct()...

Run Spark SQL on CDH 5.4.1: NoClassDefFoundError

hive,apache-spark,apache-spark-sql,pyspark
I set up CDH 5.4.1 to run some Spark SQL tests on Spark. Spark works well but Spark SQL has some issues. I start pyspark as below: /opt/cloudera/parcels/CDH-5.4.1-1.cdh5.4.1.p0.6/lib/spark/bin/pyspark --master yarn-client I want to select a table in Hive with Spark SQL: results = sqlCtx.sql("SELECT * FROM my_table").collect() It prints error logs:...

How to add any new library like spark-csv in Apache Spark prebuilt version

python,apache-spark,apache-spark-sql
I have built spark-csv and am able to use it from the pyspark shell using the following command: bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 I get an error running >>> df_cat.save("k.csv","com.databricks.spark.csv") Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/dataframe.py", line 209, in save self._jdf.save(source, jmode, joptions) File...

Join files using Apache Spark / Spark SQL

java,apache-spark,apache-spark-sql
I am trying to use Apache Spark to compare two different files based on some common field, get the values from both files, and write the result to an output file. I am using Spark SQL for joining both files (after storing the RDDs as tables). Is this the correct approach?...
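Registering both DataFrames as temp tables and joining them in SQL is a reasonable approach; a minimal sketch, assuming the files are already loaded as df1/df2 and that the column names (id, valueA, valueB) and output path are placeholders:
    df1.registerTempTable("t1")
    df2.registerTempTable("t2")
    val joined = sqlContext.sql(
      "SELECT t1.id, t1.valueA, t2.valueB FROM t1 JOIN t2 ON t1.id = t2.id")
    joined.saveAsParquetFile("joined_output.parquet")   // or write it out in any other format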

What's the meaning of “already computed partitions that can short-circuit the computation of a parent RDD”?

apache-spark,rdd,apache-spark-sql
The Spark thesis (http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf) says, as in the picture below... I don't understand what is meant by "already computed partitions that can short-circuit the computation of a parent RDD". Can you explain it to me and list one or two examples?...

How do I groupBy on a SchemaRDD

scala,apache-spark,apache-spark-sql
Suppose I have a SchemaRDD tableRDD. How can I groupBy on a certain column and get the count(*) of the resulting set as countGrouped? I am trying something like: tableRDD.groupBy('colname)(??).collect() I am not able to work out what my aggregate function (represented by ??) should be...
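A hedged sketch: on Spark 1.3+ (where SchemaRDD has become DataFrame) the grouping is a one-liner, and on earlier releases the SQL route over a registered table works:
    // Spark 1.3+ DataFrame API
    // val counts = df.groupBy("colname").count()
    // SQL route (works for a registered SchemaRDD as well)
    tableRDD.registerTempTable("t")
    val counts = sqlContext.sql(
      "SELECT colname, COUNT(*) AS countGrouped FROM t GROUP BY colname")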

spark - SparkContext and SqlContext - lifecycle and thread safety

apache-spark,apache-spark-sql
I am developing REST services on top of Spark which take a user query as input, formulate Spark SQL, and execute it on the Spark cluster. Here are my assumptions about JavaSparkContext and JavaSqlContext: 1) they are both thread-safe; 2) I can reuse a single instance throughout the application (Context!). Based on this I initialize only one...

Zeppelin SqlContext registerTempTable issue

scala,apache-spark,apache-spark-sql,apache-zeppelin
I am trying to access some JSON data using sqlContext.jsonFile in Zeppelin... the following code executes without any error: import sys.process._ val sqlCon = new org.apache.spark.sql.SQLContext(sc) val jfile = sqlCon.jsonFile(s"file:///usr/local/src/knoldus/projects/scaladay_data/scalaDays2015Amsterdam_tweets.json") import sqlContext.implicits._ jfile.registerTempTable("jTable01") output : import sys.process._ sqlCon: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@… jfile: org.apache.spark.sql.DataFrame = [id: struct,...

Is it possible to get and use a JavaSparkContext from within a task?

apache-spark,spark-streaming,apache-spark-sql
I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse. For every incoming record, I'd like to potentially launch a...

Spark SQL HiveContext - saveAsTable creates wrong schema

hive,apache-spark,apache-spark-sql
I am trying to store a DataFrame in a persistent Hive table in Spark 1.3.0 (PySpark). This is my code: sc = SparkContext(appName="HiveTest") hc = HiveContext(sc) peopleRDD = sc.parallelize(['{"name":"Yin","age":30}']) peopleDF = hc.jsonRDD(peopleRDD) peopleDF.printSchema() #root # |-- age: long (nullable = true) # |-- name: string (nullable = true) peopleDF.saveAsTable("peopleHive") The Hive...

Using Spark Shell (CLI) in standalone mode on distributed files

apache-spark,apache-spark-sql
I am using Spark 1.3.1 in standalone mode (No YARN/HDFS involved - Only Spark) on a cluster with 3 machines. I have a dedicated node for master (no workers running on it) and 2 separate worker nodes. The cluster starts healthy, and I am just trying to test my installation...

Scala: Spark sqlContext query

sql,hadoop,apache-spark,apache-spark-sql,parquet
I only have 3 events (3rd column) 01, 02, 03 in my file. The schema is unixTimestamp|id|eventType|date1|date2|date3 639393604950|1001|01|2015-05-12 10:00:18||| 639393604950|1002|01|2015-05-12 10:04:18||| 639393604950|1003|01|2015-05-12 10:05:18||| 639393604950|1001|02||2015-05-12 10:40:18|| 639393604950|1001|03|||2015-05-12 19:30:18| 639393604950|1002|02|2015-05-12 10:04:18||| In sqlContext, how do I merge the data by ID? I am expecting this for id 1001: 639393604950|1001|01|2015-05-12 10:00:18|2015-05-12 10:40:18|2015-05-12 19:30:18|...

Join two (non)paired RDDs to make a DataFrame

apache-spark,rdd,apache-spark-sql,pyspark
As the title describes, say I have two RDDs rdd1 = sc.parallelize([1,2,3]) rdd2 = sc.parallelize([1,0,0]) or rdd3 = sc.parallelize([("Id", 1),("Id", 2),("Id",3)]) rdd4 = sc.parallelize([("Result", 1),("Result", 0),("Result", 0)]) How can I create the following DataFrame? Id Result 1 1 2 0 3 0 If I could create the paired RDD [(1,1),(2,0),(3,0)]...

Getting NullPointerException when using Spark-submit [duplicate]

scala,apache-spark,user-defined-functions,apache-spark-sql
I have some code that reads a data file into a Spark dataframe, is supposed to filter out any rows with nulls in a certain field, and writes out the result. When I run it...

Reshaping/Pivoting data in Spark RDD and/or Spark DataFrames

apache-spark,apache-spark-sql,pyspark
I have some data in the following format (either RDD or Spark DataFrame): from pyspark.sql import SQLContext sqlContext = SQLContext(sc) rdd = sc.parallelize([('X01',41,'US',3), ('X01',41,'UK',1), ('X01',41,'CA',2), ('X02',72,'US',4), ('X02',72,'UK',6), ('X02',72,'CA',7), ('X02',72,'XX',8)]) # convert to a Spark DataFrame schema = StructType([StructField('ID', StringType(), True), StructField('Age', IntegerType(), True), StructField('Country', StringType(), True), StructField('Score', IntegerType(), True)]) df...

Writing RDD partitions to individual parquet files in their own directories

apache-spark,rdd,apache-spark-sql,parquet
I am struggling with a step where I want to write each RDD partition to a separate parquet file in its own directory. An example would be: <root> <entity=entity1> <year=2015> <week=45> data_file.parquet The advantage of this format is that I can use these directly in Spark SQL as columns and I will not have to repeat...
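If Spark 1.4+ is an option, DataFrameWriter.partitionBy produces exactly this layout; a minimal sketch, assuming the data has been converted to a DataFrame with entity/year/week columns (the output path is a placeholder):
    df.write
      .partitionBy("entity", "year", "week")
      .parquet("/path/to/root")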

Does the DStream returned by the updateStateByKey function contain only one RDD?

apache-spark,spark-streaming,apache-spark-sql,pyspark
Does the DStream returned by the updateStateByKey function contain only one RDD? If not, under what circumstances will the DStream contain more than one RDD?

How do I sort a dataframe by column in descending order using the scala api in spark?

scala,apache-spark,apache-spark-sql
I tried df.orderBy("col1").show(10) but it sorted in ascending order. df.sort("col1").show(10) also sorts in ascending order. I looked on Stack Overflow and the answers I found were all outdated or referred to RDDs. I'd like to use the native dataframe in Spark.
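A minimal sketch using the Column API, which has a desc modifier (Spark 1.3+):
    df.orderBy(df("col1").desc).show(10)
    // or: import org.apache.spark.sql.functions.desc
    //     df.orderBy(desc("col1")).show(10)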

Spark: Hive Query

hive,apache-spark,hiveql,apache-spark-sql,parquet
I have a log file, and the first column would be my partition in the Hive table. logSchemaRDD.registerTempTable("logs") hiveContext.sql("insert overwrite table logs_parquet PARTITION(create_date=select ? from logs) select * from logs") How do I construct the query to select the first column (marked as ? here) and ensure that the one I...