Hive output larger than dfs blocksize limit

Question:

Tag: hadoop,hive,hdfs,partitioning

I have a table test which was created in Hive. It is partitioned by idate, and existing partitions often need to be inserted into. This can leave files on HDFS that contain only a few rows.

hadoop fs -ls /db/test/idate=1989-04-01
Found 3 items
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000000_0
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000001_0
-rwxrwxrwx   3 deployer   supergroup        710 2015-04-26 11:33 /db/test/idate=1989-04-01/000002_0

I am trying to put together a simple script to combine these files, to avoid having many small files in my partitions:

insert overwrite table test partition (idate)
select * from test
where idate = '1989-04-01'
distribute by idate
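
Side note: a dynamic-partition insert like the one above usually needs dynamic partitioning enabled for the session. A minimal sketch, assuming stock Hive property names and defaults:

set hive.exec.dynamic.partition=true;             -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict;   -- allow a partition spec with no static values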

The script works: it creates a new file containing all the rows from the old ones. The problem is that when I run it on larger partitions, the output is still a single file:

hadoop fs -ls /db/test/idate=2015-04-25
Found 1 items
-rwxrwxrwx   3 deployer   supergroup 1400739967 2015-04-27 10:53 /db/test/idate=2015-04-25/000001_0

This file is over 1 GB in size, but the block size is set to 128 MB:

hive> set dfs.blocksize;
dfs.blocksize=134217728

I could manually set the number of reducers to keep the output files small, but shouldn't this be split up automatically? Why is Hive creating files larger than the allowed block size?


NOTE: These are compressed RCFiles, so I can't just cat them together.


Answer:

It's alright to have a large file in a splittable format, since downstream jobs can split it based on the block size. Generally you get one output file per reducer, and because distribute by idate sends every row for a given date to the same reducer, an insert that touches a single date ends up as a single file. To get more reducers, define bucketing on your table and tune the number of buckets until you get files of the size you want. For the bucket column, pick a high-cardinality column that you are likely to join on.
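
As a rough sketch of that advice (the table name test_bucketed, the columns id/value, and the bucket count 16 are placeholders I'm assuming, not taken from your schema; hive.enforce.bucketing is only needed on Hive 0.x/1.x, where it is off by default):

-- hypothetical bucketed copy of the table, so inserts run one reducer per bucket
create table test_bucketed (
  id    string,
  value string
)
partitioned by (idate string)
clustered by (id) into 16 buckets    -- pick a high-cardinality column you often join on
stored as rcfile;

set hive.enforce.bucketing=true;                  -- make inserts honour the bucket count
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table test_bucketed partition (idate)
select id, value, idate from test;

Each partition then gets written as (up to) 16 bucket files, and you can adjust the bucket count until the files land near your 128 MB block size.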

