I have to create a Hive table from data present in Oracle tables. I'm running a Sqoop import, which converts the Oracle data into HDFS files, and then creating a Hive table on top of those files. The Sqoop job completes successfully and the files also get generated in the HDFS target directory....
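The question is cut off before the actual problem, but the flow it describes is a Sqoop import into an HDFS directory followed by a Hive table over that directory. A minimal sketch of the table-creation step via Spark SQL, where the table name, columns, delimiter, and HDFS path are all assumptions standing in for the real Sqoop output:

    import org.apache.spark.sql.SparkSession

    object HiveTableOnSqoopOutput {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-table-on-sqoop-output")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical Sqoop target directory and a made-up two-column
        // schema; the DDL must match the columns and delimiters that
        // Sqoop actually wrote (text imports default to comma-separated).
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS customers (
            |  id   INT,
            |  name STRING)
            |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            |STORED AS TEXTFILE
            |LOCATION 'hdfs:///user/etl/sqoop/customers'""".stripMargin)

        spark.sql("SELECT * FROM customers LIMIT 5").show()
        spark.stop()
      }
    }

A common pitfall in this flow is a mismatch between the Hive DDL and the delimiters Sqoop actually used, which makes the table appear empty or mis-parsed even though the files exist.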
If the key distribution in a file is skewed, say 99% of the words start with 'A' and only 1% start with 'B' through 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?
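A standard answer is to combine map-side aggregation with key salting, so the hot 'A' key is split across several reducers and the partial counts are merged afterwards. A minimal Spark sketch of that pattern; the input path and the salt fan-out of 10 are assumptions, and note that reduceByKey's map-side combining already absorbs most of the skew for a plain count, so the salting is really the general pattern for heavier per-key aggregations:

    import org.apache.spark.{SparkConf, SparkContext}

    object SkewedLetterCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("skewed-letter-count"))
        // Hypothetical input location; replace with the real file.
        val words = sc.textFile("hdfs:///tmp/words.txt")
          .flatMap(_.split("\\s+"))
          .filter(_.nonEmpty)

        val buckets = 10 // assumed fan-out for the hot 'A' key
        val counts = words
          .map { w =>
            val letter = w.head.toUpper
            // Spread the dominant 'A' key over salted sub-keys so no
            // single reducer receives ~99% of the records.
            val salt = if (letter == 'A') scala.util.Random.nextInt(buckets) else 0
            ((letter, salt), 1L)
          }
          .reduceByKey(_ + _)                            // partial counts per (letter, salt)
          .map { case ((letter, _), n) => (letter, n) }
          .reduceByKey(_ + _)                            // merge the salted partials per letter

        counts.collect().sorted.foreach(println)
        sc.stop()
      }
    }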
As is well known, Spark's partitioner has a huge performance impact on any "wide" operation, so it is usually customized for such operations. When I test the partitioner with the following code:

    val rdd = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
    val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
    val cogrouped = rdd.cogroup(rdd2)
    println("cogrouped:"...
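The snippet is truncated at the println, but the experiment it sets up can be reproduced end to end. A self-contained variant, with an assumed local master and assumed print statements, showing that cogroup reuses the left side's HashPartitioner(10) rather than re-shuffling rdd:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object CogroupPartitionerProbe {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("cogroup-partitioner-probe").setMaster("local[*]"))

        // Left side carries an explicit HashPartitioner(10).
        val rdd  = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
        // Right side has no partitioner of its own.
        val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)

        val cogrouped = rdd.cogroup(rdd2)

        // cogroup picks up the existing partitioner from rdd, so rdd is
        // not re-shuffled; both lines report the same HashPartitioner.
        println("rdd partitioner:       " + rdd.partitioner)
        println("cogrouped partitioner: " + cogrouped.partitioner)
        println("cogrouped partitions:  " + cogrouped.partitions.length)

        sc.stop()
      }
    }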