How to log using log4j to local file system inside a Spark application that runs on YARN?

logging,log4j,apache-spark,yarn
I'm building an Apache Spark Streaming application and cannot make it log to a file on the local filesystem when running it on YARN. How can I achieve this? I've set up a log4j.properties file so that it can successfully write to a log file in the /tmp directory on the local file system...
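
In YARN mode the config file has to be shipped to every container and the target path must be writable on each node. A minimal sketch, assuming a log4j.properties like the one described and a placeholder app name:

    # log4j.properties shipped alongside the app
    log4j.rootLogger=INFO, file
    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.File=/tmp/myapp.log
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d %p %c: %m%n

    spark-submit --master yarn-cluster \
      --files log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      myapp.jar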

How can I get the memory and CPU usage of a Hadoop YARN application?

hadoop,memory,mapreduce,cpu-usage,yarn
I want to ask: after I've run my Hadoop MapReduce application, how can I get the total memory and CPU usage of that application? I've looked for it in the logs and on the resource manager web page, but I couldn't find it. Is it possible? Can I get it per job execution or...
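
One place to look is the application report from the YARN CLI; on recent Hadoop versions it includes an aggregate resource line in MB-seconds and vcore-seconds (the application ID below is a placeholder):

    yarn application -status application_1425000000000_0001
    # look for: Aggregate Resource Allocation : <N> MB-seconds, <M> vcore-seconds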

Spark Streaming on YARN: executor logs not available

logging,apache-spark,yarn,spark-streaming
I'm running the following code: .map{x => Logger.fatal("Hello World") x._2 } It's a Spark Streaming application that runs on YARN. I updated log4j and provided it with spark-submit (using --files). My log4j configuration was loaded, which I can see from the logs, and applied to the driver's logs (I see only my log level and...
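
For completeness, once log aggregation is enabled (yarn.log-aggregation-enable=true), executor logs for a finished YARN application can be pulled with the standard CLI (application ID is a placeholder):

    yarn logs -applicationId application_1425000000000_0001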

Which cluster type should I choose for Spark?

apache-spark,yarn,mesos
I am new to Apache Spark, and I just learned that Spark supports 3 types of clusters: Standalone - meaning Spark will manage its own cluster; YARN - using Hadoop's YARN resource manager; Mesos - Apache's dedicated resource manager project. Since I am new to Spark, I think I should...

YARN UNHEALTHY nodes

hadoop,distributed-computing,cloudera,yarn,cloudera-cdh
In our YARN cluster, which is 80% full, we are seeing that some of the YARN NodeManagers are marked as UNHEALTHY. After digging into the logs I found it's because disk space is 90% full for the data dir, with the following error: 2015-02-21 08:33:51,590 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Node hdp009.abc.com:8041 reported UNHEALTHY with details: 4/4...
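
The NodeManager flags itself UNHEALTHY once its local dirs cross a disk-utilization threshold, and that threshold is configurable. A sketch of the relevant yarn-site.xml property (the default is 90; 95 here is illustrative):

    <property>
      <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
      <value>95</value>
    </property>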

Apache Hadoop Yarn - Underutilization of cores

hadoop,apache-spark,yarn,resourcemanager
No matter how much I tinker with the settings in yarn-site.xml, i.e. using all of the options below: yarn.scheduler.minimum-allocation-vcores, yarn.nodemanager.resource.memory-mb, yarn.nodemanager.resource.cpu-vcores, yarn.scheduler.maximum-allocation-mb, yarn.scheduler.maximum-allocation-vcores, I still cannot get my application, i.e. Spark, to utilize all the cores on the cluster. The Spark executors seem to be correctly taking up all...
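
Assuming the CapacityScheduler, one common cause is that the default resource calculator only accounts for memory, so vcore settings never influence scheduling; switching to the dominant-resource calculator in capacity-scheduler.xml is the usual fix:

    <property>
      <name>yarn.scheduler.capacity.resource-calculator</name>
      <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
    </property>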

Spark resources not fully allocated on Amazon EMR

apache-spark,yarn,emr
I'm trying to maximize cluster usage for a simple task. The cluster is 1+2 x m3.xlarge, running Spark 1.3.1, Hadoop 2.4, Amazon AMI 3.7. The task reads all lines of a text file and parses them as CSV. When I spark-submit a task in yarn-cluster mode, I get one of...
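
A typical starting point is to size the executors explicitly instead of relying on defaults. The numbers below are only a sketch for m3.xlarge-class workers (4 vcores, 15 GB each), not tuned values:

    spark-submit --master yarn-cluster \
      --num-executors 2 \
      --executor-cores 4 \
      --executor-memory 10g \
      myapp.jar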

How/where to set limits to avoid the "container is running beyond physical memory limits" error

hadoop,yarn,cloudera-cdh
I know this type of question has been addressed in a few posts, but I cannot find an answer that provides the specific "how" or "where". I am using CDH 5.2, running an Oozie workflow that executes a shell command. Each time I run it, the NodeManager kills the job with the...
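
For MapReduce jobs these limits usually live in mapred-site.xml (or per-job -D flags), with the JVM heap kept comfortably below the container size. Illustrative values only:

    <property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
    <property><name>mapreduce.map.java.opts</name><value>-Xmx1638m</value></property>
    <property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
    <property><name>mapreduce.reduce.java.opts</name><value>-Xmx3276m</value></property>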

How to extract application ID from the PySpark context

apache-spark,yarn,pyspark
A previous question recommends sc.applicationId, but it is not present in PySpark, only in Scala. So, how do I figure out the application ID (for YARN) of my PySpark process?...
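
In Spark 1.x PySpark the ID can be reached through the underlying JVM context; a minimal sketch relying on that private gateway:

    # goes through the py4j gateway to the JVM SparkContext
    app_id = sc._jsc.sc().applicationId()
    print(app_id)  # e.g. application_1425000000000_0001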

What does it mean that my resource manager does not have an open port 8032?

hadoop,yarn,cloudera-cdh
I have my YARN resource manager on a different node than my namenode, and I can see that something is running, which I take to be the resource manager. Ports 8031 and 8030 are bound, but not port 8032, which my client tries to connect to. I am on CDH...
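
Port 8032 is the ResourceManager's default client port (yarn.resourcemanager.address); if it is not bound, it is worth checking that the address is set explicitly on the RM host rather than left at a default pointing elsewhere. A sketch with a placeholder hostname:

    <property>
      <name>yarn.resourcemanager.address</name>
      <value>rm-host.example.com:8032</value>
    </property>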

PySpark distributed processing on a YARN cluster

apache-spark,yarn,cloudera-cdh,pyspark
I have Spark running on a Cloudera CDH 5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark). I can submit jobs and they run successfully, but they never seem to run on more than one machine (the local machine I submit from). I have...
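
When jobs only ever run on the submitting machine, the app is often being launched with a local master by default; explicitly requesting YARN at submission time is the first thing to verify (Spark 1.x syntax, file name is a placeholder):

    spark-submit --master yarn-client --num-executors 4 my_app.py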

How much memory and how many vcores are allocated on Hadoop YARN?

hadoop,memory,allocation,core,yarn
I want to ask: in Hadoop YARN, both yarn-site.xml and mapred-site.xml have properties for minimum and maximum memory or vcores. I'm a little confused about how much memory and how many vcores are actually allocated in practice, since in the configuration we only write the minimum and maximum, not the actual size. If I...
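
In short: the scheduler grants something between the minimum and maximum, normalized up to a multiple of the minimum (CapacityScheduler behavior). A worked example with illustrative numbers:

    yarn.scheduler.minimum-allocation-mb = 1024
    request 1500 MB  -> container of 2048 MB (rounded up to a multiple of the minimum)
    request  800 MB  -> container of 1024 MB (raised to the minimum)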

YARN MapReduce job dies with a strange message

java,hadoop,yarn
I have a Hadoop-YARN cluster; when I try to run the Hadoop examples I get a strange error message in the container log: Error: Could not find or load main class 1638 My Java version is: java version "1.7.0_51" Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed...
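
One frequently reported cause of "Could not find or load main class <number>" is a malformed heap option in the MapReduce memory settings, so that the bare number is handed to java as if it were a class name. Purely illustrative of what to check:

    <!-- broken: the value is passed to java as a class name -->
    <property><name>yarn.app.mapreduce.am.command-opts</name><value>1638</value></property>
    <!-- intended -->
    <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Xmx1638m</value></property>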

How to find the exact hadoop jar command that ran my job?

hadoop,yarn,oozie,cascading,scalding
I'm using CDH 5.4. I'm running a Hadoop job which appears to be fine from the command line (when simply running with hadoop jar). However, if I run it from YARN it finishes silently with a single mapper and no reducers. I really suspect both 'runs' were running the same exact command....

Could not deallocate container for task attemptId NNN

hadoop,memory,mapreduce,bigdata,yarn
I'm trying to understand how containers are allocated memory in YARN and how they perform under different hardware configurations. The machine has 30 GB of RAM, and I picked 24 GB for YARN and left 6 GB for the system: yarn.nodemanager.resource.memory-mb=24576 Then I followed http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html to come up with some...
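
For reference, the linked guide's sizing heuristic is roughly the following (recapped here, with one illustrative evaluation):

    containers        = min(2*CORES, 1.8*DISKS, TOTAL_RAM / MIN_CONTAINER_SIZE)
    ram-per-container = max(MIN_CONTAINER_SIZE, TOTAL_RAM / containers)
    # e.g. 24576 MB for YARN with a 2048 MB minimum container allows at most 12 containers of 2048 MB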

Big input data for Spark job

apache-spark,yarn
I have 1800 *.gz files under an input folder. Each *.gz file is around 300 MB, and after unzipping each file is around 3 GB, so around 5400 GB in total when unzipped. I can't have a cluster with 5400 GB of executor memory. Is it possible to read all files under the input folder like below?...
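
Spark does not need to hold the whole dataset in executor memory just to read it, so a glob is fine; note that gzip files are not splittable, so each file becomes one partition and an explicit repartition is common afterwards. A PySpark sketch with placeholder paths and counts:

    # each .gz file yields exactly one partition (gzip is not splittable)
    rdd = sc.textFile("input/*.gz")
    rdd = rdd.repartition(1000)  # spread the unzipped work across the cluster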

In a Spark YARN cluster, how does the number of working containers depend on the number of RDD partitions?

apache,hadoop,apache-spark,yarn,rdd
I have a problem with Apache Spark (YARN cluster). In this code, although I create 10 partitions, only 3 containers work in the YARN cluster: val sc = new SparkContext(new SparkConf().setAppName("Spark Count")) val sparktest = sc.textFile("/spark_test/58GB.dat",10) val test = sparktest.flatMap(line=> line.split(" ")).map(word=>(word, 1)) In a Spark YARN cluster, how does...
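
The container count is driven by the executors requested at submission (plus one container for the ApplicationMaster), not by the partition count; the 10 partitions simply queue up as tasks on however many executors exist. Illustrative submission:

    spark-submit --master yarn-cluster --num-executors 10 --executor-cores 2 myapp.jar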

Amazon EMR Application Master web UI?

hadoop,yarn,hadoop2,amazon-emr
I have started running Pig jobs on Amazon EMR using Hadoop YARN (AMI 3.3.1); however, as there is no longer a JobTracker in YARN, I can't seem to find a web UI that lets me track the number of mappers and reducers for a MapReduce...
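
The YARN counterpart of the JobTracker UI is the ResourceManager web UI, which on EMR is usually reached through an SSH tunnel from your workstation (8088 is the stock Hadoop 2 port; some EMR AMIs of that era exposed it elsewhere, e.g. 9026, so treat the port as an assumption):

    ssh -i mykey.pem -N -L 8088:localhost:8088 hadoop@<master-public-dns>
    # then browse http://localhost:8088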

How to distribute a jar to Hadoop before job submission

java,hadoop,mapreduce,yarn
I want to implement a REST API to submit Hadoop jobs for execution. This is done purely via Java code. If I compile a jar file and execute it via "hadoop jar", everything works as expected. But when I submit a Hadoop job via Java code in my REST API, the job...

ImportTsv command gives "Container exited with a non-zero exit code 1" error

hadoop,hbase,classpath,yarn
I am trying to load a TSV file into an existing HBase table. I am using the following command: /usr/local/hbase/bin$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:value '-Dtable_name.separator=\t' Table-name /hdfs-path-to-input-file But when I execute the above command, I get the following error: Container id: container_1434304449478_0018_02_000001 Exit code: 1 Stack trace: ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)...
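
Exit code 1 from the container frequently turns out to be a classpath problem, i.e. the MapReduce tasks cannot see the HBase jars. One commonly suggested check on recent HBase releases (a hedged pointer, not a guaranteed fix):

    # print the classpath HBase recommends for MapReduce jobs
    hbase mapredcp
    # make it visible to the job before launching
    export HADOOP_CLASSPATH=$(hbase mapredcp)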

Apache Spark YARN mode startup takes too long (10+ secs)

hadoop,apache-spark,yarn
I'm running a Spark application in YARN-client or YARN-cluster mode, but it seems to take too long to start up: 10+ seconds to initialize the Spark context. Is this normal? Or can it be optimized? The environment is as follows: Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6) (tiny test cluster...
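
In Spark 1.x on YARN, part of every startup is uploading the Spark assembly jar to HDFS; pre-staging the assembly once and pointing spark.yarn.jar at it is a known mitigation (paths are placeholders):

    hdfs dfs -put spark-assembly-1.3.1-hadoop2.6.0.jar /user/spark/share/
    spark-submit --conf spark.yarn.jar=hdfs:///user/spark/share/spark-assembly-1.3.1-hadoop2.6.0.jar ...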

Why are we configuring mapred.job.tracker in YARN?

hadoop,mapreduce,yarn
What I know is that YARN was introduced and replaced the JobTracker and TaskTracker. I have seen some Hadoop 2.6.0/2.7.0 installation tutorials that configure mapreduce.framework.name as yarn and the mapred.job.tracker property as local or host:port. The description for the mapred.job.tracker property is "The host and port that the MapReduce job...
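
Under MRv2 only the framework selection matters; mapred.job.tracker is a Hadoop 1 property that YARN ignores, so tutorials carrying it over are preserving legacy config. The setting that actually routes jobs to YARN is:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>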

What do namespace and block pool mean in MapReduce 2.0 YARN?

hadoop,yarn
I understand that in MRv2 all datanodes report to multiple namenodes regarding blocks via heartbeats. Where exactly do these datanodes report so that the information is saved across all namenodes? If any of the namenodes goes down, will the cluster lose some block information?

YARN and MapReduce resource configuration

hadoop,mapreduce,yarn
I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB of RAM. My input files are between a few MB and ~68 MB (gzipped log files, which get uploaded to my server once they reach >60 MB, hence no fixed max size). I want...
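
As a hedged starting point for a single 32 GB machine, cap what the NodeManager may hand out and size containers from there (illustrative values, not a tuned profile):

    <!-- yarn-site.xml: give YARN 24 GB and 12 vcores of the machine -->
    <property><name>yarn.nodemanager.resource.memory-mb</name><value>24576</value></property>
    <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>12</value></property>
    <!-- mapred-site.xml: 2 GB map / 4 GB reduce containers -->
    <property><name>mapreduce.map.memory.mb</name><value>2048</value></property>
    <property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>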

Not able to format the namenode in a Hadoop 2.6.0 multi-node installation

ubuntu,hadoop,installation,yarn
I am trying to install Hadoop 2.6.0 on an Ubuntu 14.04 machine, in a 5-node cluster. But when I format the namenode, it gives me the following error: No command 'hdfs' found, did you mean: Command 'hfs' from package 'hfsutils-tcltk' (universe) Command 'hdfls' from package 'hdf4-tools' (universe) hdfs: command not found And in...
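
"No command 'hdfs' found" usually just means the Hadoop binaries are not on PATH for the user doing the formatting; a sketch assuming Hadoop is installed under /usr/local/hadoop:

    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    hdfs namenode -format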

Hadoop HDFS is not distributing blocks of data evenly

hadoop,filesystems,hdfs,yarn
I am currently running a cluster with 2 nodes. One node is master/slave and the other one is just a slave. I have a file, and I set the block size to half the size of that file. Then I do hdfs dfs -put file / The file gets copied to the...

Running Spark on the slave node (YARN) doesn't work

hadoop,apache-spark,yarn,master-slave
I can run the SparkPi example on the master node, but when I try the same command "spark-submit --class SparkPi --master yarn-client sparkpi.jar 10" on the slave node, I get an error: 2015-05-19 14:05:44,881 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: maintainer 2015-05-19 14:05:44,886 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) -...
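
A frequent cause on non-gateway nodes is that spark-submit cannot find the cluster configuration; YARN mode needs HADOOP_CONF_DIR (or YARN_CONF_DIR) pointing at the cluster's config directory. A sketch, with the path as an assumption:

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    spark-submit --class SparkPi --master yarn-client sparkpi.jar 10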

How to improve performance of loading data from a non-partitioned table into an ORC partitioned table in Hive

hadoop,hive,yarn,hdinsight
I'm new to Hive querying and I'm looking for best practices for retrieving data from Hive tables. We have enabled Tez as the execution engine and enabled vectorization. We want to do reporting from a Hive table; I read in the Tez documentation that it can be used for real-time reporting. The scenario is...
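
The usual pattern for filling a partitioned ORC table from a flat staging table is a dynamic-partition insert; a hedged HiveQL sketch with placeholder table and column names:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT OVERWRITE TABLE orc_table PARTITION (dt)
    SELECT col1, col2, dt FROM staging_table;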

How to clean up Hadoop MapReduce memory usage?

hadoop,memory,mapreduce,jobs,yarn
I want to ask: say, for example, I have 10 MB of free memory on each node after I run start-all.sh, i.e. start the namenode, datanode, secondary namenode, etc. Why, after I've run a Hadoop MapReduce job, does the memory decrease to, say, 5 MB?...

Running Hadoop with HBase: org.apache.hadoop.hbase.client.HTable.(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String

apache,hadoop,mapreduce,hbase,yarn
I'm trying to write a MapReduce program on Hadoop using HBase. I'm using Hadoop 2.5.1 with HBase 0.98.10.1. The program compiles successfully and is packaged into a jar file. But when I try to run the jar using "hadoop jar", the program shows an error saying: "org.apache.hadoop.hbase.client.HTable.(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String". Here is...

Why does YARN take a lot of memory for a simple count operation?

hadoop,mapreduce,hive,yarn,hortonworks-data-platform
I have a standard-configured HDP 2.2 environment with Hive, HBase and YARN. I've used Hive (with HBase) to perform a simple count operation on a table that has about 10 million rows, and it resulted in about 10 GB of memory consumption from YARN. How can I reduce this memory...

No active nodes in Hadoop cluster

hdfs,yarn,hadoop2
I set up Hadoop 2.6.0 with 1 master and 2 slaves according to "How to install Apache Hadoop 2.6.0 in Ubuntu (Multi node/Cluster setup)". Afterwards I checked jps on the master and slaves, and everything looked good: NameNode, SecondaryNameNode and ResourceManager on the master; DataNode and NodeManager on the slaves. But when I browsed...

Sample Hadoop config for a 4 GB server?

hadoop,apache-spark,yarn
I am currently trying to set up a small Hadoop demo system on a virtual server with only 4 GB of RAM. I know 4 GB is not very much for Hadoop, but that's all I have at the moment. The server should run HDFS, YARN and Spark (on YARN) plus...
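
On 4 GB the main knobs are shrinking both what YARN may allocate in total and what each container may request; the values below are only a sketch, not a tested profile:

    <property><name>yarn.nodemanager.resource.memory-mb</name><value>2048</value></property>
    <property><name>yarn.scheduler.minimum-allocation-mb</name><value>256</value></property>
    <property><name>yarn.scheduler.maximum-allocation-mb</name><value>2048</value></property>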

Oozie on YARN - oozie is not allowed to impersonate hadoop

hadoop,yarn,oozie,ambari
I'm trying to use Oozie from Java to start a job on a Hadoop cluster. I have very limited experience with Oozie on Hadoop 1, and now I'm struggling trying out the same thing on YARN. I'm given a machine that doesn't belong to the cluster, so when I try...
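
That impersonation error points at the Hadoop proxy-user settings in core-site.xml on the cluster; a deliberately permissive sketch (tighten hosts/groups for real deployments):

    <property><name>hadoop.proxyuser.oozie.hosts</name><value>*</value></property>
    <property><name>hadoop.proxyuser.oozie.groups</name><value>*</value></property>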

Benefits of YARN

hadoop,mapreduce,yarn
While reading about the benefits of YARN in this video, they said that there is improved cluster utilization, as the scheduler optimizes utilization based on certain criteria: i) capacity guarantees, ii) fairness, iii) SLAs. So I was confused: what are SLAs, and how do they factor into the optimization for scheduling...

How to make Hadoop YARN faster with memory and vcore configuration?

hadoop,memory,containers,core,yarn
On Hadoop YARN, if I have more containers to run map or reduce tasks, will a job be processed faster? If so, then by making the container allocation memory smaller than the default, I can get more containers running on a host and make the job faster....

Apache Spark: setting executor instances does not change the executors

apache-spark,yarn
I have an Apache Spark application running on a YARN cluster (Spark has 3 nodes on this cluster) in cluster mode. When the application is running, the Spark UI shows that 2 executors (each running on a different node) and the driver are running on the third node. I want the...
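
In YARN cluster mode the executor count is normally fixed at submission time; setting it inside the application after the context exists comes too late. Illustrative submission:

    spark-submit --master yarn-cluster --num-executors 6 myapp.jar
    # equivalently: --conf spark.executor.instances=6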

JMH Benchmark on Hadoop YARN

java,hadoop,yarn,microbenchmark,jmh
I have written a JMH benchmark for my MapReduce job. If I run my app in local mode it works, but when I run it with the yarn script on my Hadoop cluster, I get the following error: [[email protected] Desktop]$ ./launch_mapreduce.sh # JMH 1.10 (released 5 days ago) #...

Hadoop 2.6.0 does not run reduce tasks in the WordCount example

apache,hadoop,mapreduce,yarn
I've installed a Hadoop cluster on multiple (physical) nodes. I have one server for the NameNode, ResourceManager and JobHistory server, and two servers for DataNodes. I followed this tutorial while configuring. I tried to test MapReduce programs such as WordCount, TeraSort, TeraGen, etc., all of which I can launch from hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar...
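
A classic reason reduce tasks never run on a multi-node setup is a missing shuffle service in yarn-site.xml on every node; the standard stanza is:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>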

Not able to cast from TaggedInputSplit to FileSplit in MR 2.3

hadoop,mapreduce,yarn
I am getting this ClassCastException when I use MultipleInputs in my MR job. Error: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.TaggedInputSplit cannot be cast to org.apache.hadoop.mapreduce.lib.input.FileSplit at com.capitalone.integratekeys.mapreduce.mapper.IntegrationKeysMapperInput.setup(IntegrationKeysMapperInput.java:74) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142) at org.apache.hadoop.mapreduce.lib.input.DelegatingMapper.run(DelegatingMapper.java:55) at...

Best way to deploy Spark?

hadoop,amazon-ec2,apache-spark,yarn,amazon-emr
Are there substantial advantages to deploying Spark on top of YARN or EMR instead of EC2? This would be primarily for research and prototyping, probably using Scala. Our hesitation about moving off EC2 stems primarily from the extra infrastructure and complexity the other options involve, but perhaps they provide substantial...

YARN - why doesn't the task run out of heap space, but the container gets killed?

hadoop,yarn,hadoop2
If a YARN container grows beyond its heap size setting, the map or reduce task will fail with an error similar to the one below: 2015-02-06 11:58:15,461 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=10305,containerID=container_1423215865404_0002_01_000007] is running beyond physical memory limits. Current usage: 42.1 GB of 42 GB physical memory used; 42.9 GB of...
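
The monitor enforces total process memory (heap plus JVM overhead, thread stacks and off-heap buffers), not heap alone, which is why the task is killed without ever throwing OutOfMemoryError; common guidance is to keep -Xmx around 75-80% of the container size. Illustrative numbers:

    mapreduce.map.memory.mb = 4096       # what YARN enforces
    mapreduce.map.java.opts = -Xmx3276m  # heap, ~80% of the container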