

In Apache Spark, why does RDD.union not preserve the partitioner?

apache-spark,partitioning,hadoop-partitioning
As everyone knows, the Spark partitioner has a huge performance impact on any "wide" operation, so it is usually customized. When I test the partitioner with the following code:

val rdd = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd.cogroup(rdd2)
println("cogrouped:"...

Which logic should be followed when using a custom partitioner in MapReduce to solve this?

java,hadoop,mapreduce,load-balancing,hadoop-partitioning
If the key distribution in a file is such that 99% of the words start with 'A' and 1% start with 'B' to 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?
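One possible answer can be sketched in plain Scala. The object name, partition counts, and split ratio below are illustrative assumptions, not part of the question; in a real job this logic would live in a class extending org.apache.hadoop.mapreduce.Partitioner. The idea is to reserve most partitions for the heavy 'A' keys and hash the rare 'B'–'Z' keys into the remainder, so the skewed load does not land on a single reducer:

```scala
// Sketch of skew-aware partitioning logic (hypothetical names and counts).
object SkewAwarePartitioner {
  val numPartitions = 10
  // Assumption: dedicate 8 of 10 partitions to the ~99% of keys
  // starting with 'A', spreading the skewed load across reducers.
  val aPartitions = 8

  def getPartition(word: String): Int =
    if (word.nonEmpty && word.head.toUpper == 'A')
      // 'A' words are spread over partitions 0..7 by a secondary hash
      Math.floorMod(word.hashCode, aPartitions)
    else
      // all other words share the remaining partitions 8..9
      aPartitions + Math.floorMod(word.hashCode, numPartitions - aPartitions)
}
```

Because the 'A' words are now split across several reducers, each reducer only produces a partial count for the letter 'A'; a small second aggregation step (or a final merge of the per-reducer outputs) is needed to sum those partials into the final per-letter counts.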

Data in HDFS files not seen under hive table

hadoop,hive,sqoop,hadoop-partitioning
I have to create a Hive table from data present in Oracle tables. I'm doing a Sqoop import, thereby converting the Oracle data into HDFS files, and then creating a Hive table on top of those files. The Sqoop job completes successfully and the files are also generated in the HDFS target directory....