
Which logic should be followed when using a custom partitioner in MapReduce to solve this?

If the key distribution in a file is skewed such that 99% of the words start with 'A' and only 1% start with 'B' through 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?
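One common way to attack this kind of skew is to emit the whole word (rather than just its first letter) as the key, so a custom partitioner can spread the hot 'A' keys over most of the reducers while the rare 'B'..'Z' words share the remainder; per-letter totals are then combined in a cheap follow-up step. Below is a minimal sketch of such a partitioner against the Hadoop MapReduce API, assuming Text keys and IntWritable counts; the class name and the exact split are illustrative, not a fixed recipe:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Skew-aware partitioner: words starting with 'A' (99% of the data) are
// hashed across all but one partition; everything else goes to the last one.
class SkewAwarePartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    if (numPartitions <= 1) return 0
    val word = key.toString
    if (word.nonEmpty && word.head.toUpper == 'A') {
      // Deterministic: the same word always lands in the same partition,
      // but the heavy 'A' keys are balanced over numPartitions - 1 reducers.
      (word.hashCode & Integer.MAX_VALUE) % (numPartitions - 1)
    } else {
      // The rare 'B'..'Z' words all fit comfortably in one reducer.
      numPartitions - 1
    }
  }
}
```

Note that if the mapper emitted only the first letter as the key, no partitioner could split the hot 'A' key, since all records with an identical key must reach the same reducer; salting the key or keying by the full word is what makes the spread possible.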

In Apache Spark, why does RDD.union not preserve the partitioner?

As everyone knows, the partitioner in Spark has a huge performance impact on any "wide" operation, so it is usually customized. When I test the partitioner with the following code:

```scala
val rdd = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
val cogrouped = rdd.cogroup(rdd2)
println("cogrouped:"...
```
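For the union question specifically, a short experiment clarifies the behavior. In recent Spark versions, union can only keep a partitioner when every input RDD has the same one (the result is then a PartitionerAwareUnionRDD); if the partitioners differ or one side has none, the combined partition layout no longer maps keys consistently, so the result is a plain UnionRDD with no partitioner. A minimal sketch, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(10)
val a = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(p)
val b = sc.parallelize(51 to 100).keyBy(_ % 10).partitionBy(p)
val c = sc.parallelize(200 to 230).keyBy(_ % 13) // no partitioner

println(a.union(b).partitioner) // Some(HashPartitioner) -- both sides agree
println(a.union(c).partitioner) // None -- layouts differ, partitioner dropped
```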

Data in HDFS files not visible in Hive table

I have to create a Hive table from data present in Oracle tables. I'm running a Sqoop import, thereby converting the Oracle data into HDFS files. Then I'm creating a Hive table on top of the HDFS files. The Sqoop job completes successfully and the files also get generated in the HDFS target directory....
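A common cause of this symptom is that the Hive table's LOCATION does not point at the directory where Sqoop wrote the files, or that the row format does not match Sqoop's output (comma-delimited text by default). Below is a minimal sketch of defining an external table over the import directory, written against a Hive-enabled SparkSession; the table name, columns, and path are hypothetical placeholders for your own:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-over-sqoop-files")
  .enableHiveSupport()
  .getOrCreate()

// External table: Hive reads the files in place instead of expecting
// them under its own warehouse directory.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS orders (
    order_id INT,
    customer_id INT,
    amount DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/etl/orders'
""")

// Quick sanity check that the files are actually being picked up.
spark.sql("SELECT COUNT(*) FROM orders").show()
```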