As everyone knows, the Spark partitioner has a huge performance impact on any "wide" operation, so it is usually customized. I tested the partitioner with the following code:

    import org.apache.spark.HashPartitioner

    val rdd = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
    val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13)
    val cogrouped = rdd.cogroup(rdd2)
    println("cogrouped:"...
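For reference, whether cogroup reuses an existing partitioner can be checked directly on the result. A minimal sketch, assuming a local SparkContext (the app name and master setting are illustrative):

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    val conf = new SparkConf().setAppName("partitioner-check").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Mirrors the snippet above: only `left` has an explicit partitioner.
    val left  = sc.parallelize(1 to 50).keyBy(_ % 10).partitionBy(new HashPartitioner(10))
    val right = sc.parallelize(200 to 230).keyBy(_ % 13)

    val cogrouped = left.cogroup(right)

    // cogroup picks a parent's partitioner when one exists, so this
    // prints the HashPartitioner inherited from `left` and 10 partitions.
    println(s"partitioner: ${cogrouped.partitioner}")
    println(s"partitions:  ${cogrouped.partitions.length}")

Because `left` is already hash-partitioned, cogroup can avoid reshuffling that side entirely; only `right` moves across the network.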
If the key distribution in a file is heavily skewed, say 99% of the words start with 'A' and 1% start with 'B' through 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?
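One standard answer to this kind of skew is two-phase aggregation with salted keys: spread the hot letter over several partial keys first, then do a cheap final merge over at most 26 keys. A minimal Spark sketch, assuming a text file of whitespace-separated words (countByFirstLetter, SALT, and the input path are illustrative names, not from the question):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val SALT = 16  // hypothetical salt factor; tune to the degree of skew

    def countByFirstLetter(sc: SparkContext, path: String): RDD[(Char, Long)] = {
      val words = sc.textFile(path).flatMap(_.split("\\s+")).filter(_.nonEmpty)
      words
        .map(w => ((w.head, (w.hashCode & Int.MaxValue) % SALT), 1L))  // salted key: (letter, salt)
        .reduceByKey(_ + _)                                            // phase 1: per salted key
        .map { case ((letter, _), n) => (letter, n) }                  // drop the salt
        .reduceByKey(_ + _)                                            // phase 2: at most 26 keys
    }

Because the salt is derived from the word itself rather than drawn at random, identical words still land on the same salted key, so map-side combining stays fully effective.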
I have to create a Hive table from data present in Oracle tables. I'm doing a Sqoop import, thereby converting the Oracle data into HDFS files, and then creating a Hive table on top of those files. The Sqoop import completes successfully and the files are generated in the HDFS target directory...
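For context, the usual pattern here is an external Hive table whose row format matches the delimiters Sqoop wrote. A minimal sketch issued through a Hive-enabled SparkSession (the table name, columns, and HDFS location are assumptions, since the question is truncated; Sqoop's default text delimiter of ',' is assumed rather than a custom --fields-terminated-by):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-on-sqoop-output")
      .enableHiveSupport()   // required so CREATE EXTERNAL TABLE goes to the Hive metastore
      .getOrCreate()

    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS customers (  -- hypothetical table and schema
        id      INT,
        name    STRING,
        created STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','          -- Sqoop's default field delimiter for text output
      STORED AS TEXTFILE
      LOCATION '/user/etl/customers'    -- hypothetical Sqoop --target-dir
    """)

If the table is created but every column reads back as NULL, a mismatch between Sqoop's output delimiter and the table's ROW FORMAT clause is the usual culprit.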