

Pig how to format a semi-structured CSV with filters

csv,hadoop,apache-pig
I have semi-structured CSV , which looks something like this. VTS,01,0099,7022606164,SP,GP,33,060646,A,1258.9805,N,07735.9303,E,0.0,278.6,280515,0000,00,4000,11,999,842,4B61 VTS,01,0099,7022606164,NM,GP,20,060637,A,1258.9805,N,07735.9302,E,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++ VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++ VTS,01,0099,7022606164,NM,GP,22,060656,A,1258.9804,N,07735.9301,E,0.0,278.6,280515,0000,00,4000,11,999,843,8FEB+++...

Trying to get spark streaming to read data stream from website, what is the socket?

hadoop,apache-spark,spark-streaming,rdd
I am trying to get this data, http://stream.meetup.com/2/rsvps, into a Spark stream. They are JSON objects; I know the lines will be strings, and I just want it to work before I try JSON. I am not sure what to put as the port, and I assume that is the problem. SparkConf conf...

InstanceProfile is required for creating cluster - create python function to install module

python,hadoop,amazon-web-services,boto
I'm using Elastic MapReduce with boto. Everything was working fine, but since this week I'm getting this error: InstanceProfile is required for creating cluster. I'm trying to fix this issue, and it seems that now we need to create a default role for Elastic MapReduce. And I did this...

java.sql.SQLNonTransientConnectionException:Keyspace names must be composed of alphanumerics and underscores (parsed: '')

java,database,eclipse,hadoop,cassandra
I'm trying to connect to a Cassandra DB to verify users for login and sign-up, and I'm getting this error: Keyspace names must be composed of alphanumerics and underscores (parsed: '') at org.apache.cassandra.cql.jdbc.Utils.parseURL(Utils.java:195) at org.apache.cassandra.cql.jdbc.CassandraDriver.connect(CassandraDriver.java:85) at java.sql.DriverManager.getConnection(DriverManager.java:571) at java.sql.DriverManager.getConnection(DriverManager.java:215) at com.rest.inndata.services.ConnectCassandra.createConnection(ConnectCassandra.java:56)...

What is the difference between being Hadoop Admin and Hadoop DevOps

hadoop,devops
What is the difference between being a Hadoop Admin and Hadoop DevOps? What does each one do, and not do?...

issue monitoring hadoop response

hadoop,cluster-computing,ganglia,gmetad
I am using ganglia to monitor Hadoop. gmond and gmetad are running fine. When I telnet on gmond port (8649) and when I telnet gmetad on its xml answer port, I get no hadoop data. How can it be ? cluster { name = "my cluster" owner = "Master" latlong...

How to get raw hadoop metrics

hadoop,monitoring
sorry if this question is a duplicate, so far I haven't been able to find a satisfactory answer. Is it possible to get the raw data of hadoop2 metrics? (e.g. in text file/json format) According to https://wiki.apache.org/hadoop/GangliaMetrics, I know that I can use ganglia (or maybe nagios) to get the...

Hadoop distcp not working

hadoop,distcp
I am trying to copy data from one HDFS to another HDFS. Any suggestion why the 1st one works but not the 2nd? (works) hadoop distcp hdfs://abc.net:8020/foo/bar webhdfs://def.net:14000/bar/foo (does not work) hadoop distcp webhdfs://abc.net:50070/foo/bar webhdfs://def:14000/bar/foo Thanks!...

How to insert and Update simultaneously to PostgreSQL with sqoop command

postgresql,hadoop,hive,sqoop
I am trying to insert into a PostgreSQL DB with a sqoop command. sqoop export --connect jdbc:postgresql://10.11.12.13:1234/db --table table1 --username user1 --password pass1 --export-dir /hivetables/table/ --fields-terminated-by '|' --lines-terminated-by '\n' -- --schema schema It is working fine if there is no primary key constraint. I want to insert new records and update old records...
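
If upserts are the goal, Sqoop's --update-key and --update-mode flags are the usual starting point. A hedged sketch reusing the command above (the id update column is an assumption, and whether allowinsert actually inserts rows that do not yet exist depends on the connector; the generic PostgreSQL path may only support updateonly):

  sqoop export \
    --connect jdbc:postgresql://10.11.12.13:1234/db \
    --username user1 --password pass1 \
    --table table1 \
    --export-dir /hivetables/table/ \
    --fields-terminated-by '|' --lines-terminated-by '\n' \
    --update-key id \
    --update-mode allowinsert \
    -- --schema schema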

Get the count by iterating over a Data Bag, but with a separate count for each value associated with that field

hadoop,apache-pig
Below is the data I have and the schema for the same is- student_name, question_number, actual_result(either - false/Correct) (b,q1,Correct) (a,q1,false) (b,q2,Correct) (a,q2,false) (b,q3,false) (a,q3,Correct) (b,q4,false) (a,q4,false) (b,q5,flase) (a,q5,false) What I want is to get the count for each student i.e. a/b for total correct and false answer he/she has...
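
A minimal Pig sketch of one way to get those counts, assuming the triples are stored as plain comma-separated lines in a hypothetical path 'answers':

  grades = LOAD 'answers' USING PigStorage(',')
           AS (student_name:chararray, question_number:chararray, actual_result:chararray);
  -- group by the (student, result) pair so Correct and false are counted separately
  by_student_result = GROUP grades BY (student_name, actual_result);
  counts = FOREACH by_student_result
           GENERATE FLATTEN(group) AS (student_name, actual_result), COUNT(grades) AS total;
  DUMP counts;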

Installing findbugs on Ubuntu

maven,ubuntu,hadoop,findbugs
For the purpose of building Hadoop, I need to install findbugs. I tried to install it by following the link. I see that findbugs was installed properly. But when I run the Maven build command for Hadoop, I still see the same error at hadoop-common: [ERROR] Failed to execute goal...

From Hadoop logs how can I find intermediate output byte sizes & reduce output bytes sizes?

hadoop
From hadoop logs, How can I estimate the size of total intermediate outputs of Mappers(in Bytes) and the size of total outputs of Reducers(in Bytes)? My mappers and reducers use LZO compression, and I want to know the size of mapper/reducer outputs after compression. 15/06/06 17:19:15 INFO mapred.JobClient: map 100%...

Spark stream unable to read files created from flume in hdfs

hadoop,apache-spark,hdfs,spark-streaming,flume-ng
I have created a real-time application in which I am writing data streams to HDFS from weblogs using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I am...

Use of core-site.xml in mapreduce program

hadoop,mapreduce,bigdata
I have seen MapReduce programs using/adding core-site.xml as a resource in the program. What is core-site.xml, and how can it be used in MapReduce programs?
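
As a rough illustration, core-site.xml carries cluster-wide defaults such as fs.defaultFS, and a driver can add it to its Configuration explicitly when the file is not already on the classpath (the path below is a typical location, not necessarily yours):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  Configuration conf = new Configuration();
  // pick up fs.defaultFS and friends from the cluster's core-site.xml
  conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
  FileSystem fs = FileSystem.get(conf);   // now points at the filesystem named in fs.defaultFS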

Hadoop: setting HADOOP_HOME to bin/hadoop gives command not found

hadoop
After installing Hadoop and setting HADOOP_HOME to /usr/local/hadoop/bin/hadoop, when I run hadoop by just typing hadoop in the terminal, it says that I don't have privileges. Then I tried running it with sudo, and it says: sudo: command not found

hadoop complains about attempting to overwrite nonempty destination directory

hadoop,hdfs
I'm following Rasesh Mori's instructions to install Hadoop on a multinode cluster, and have gotten to the point where jps shows the various nodes are up and running. I can copy files into hdfs; I did so with $HADOOP_HOME/bin/hdfs dfs -put ~/in /in and then tried to run the wordcount...

Different ways of hadoop installation

hadoop,installation
I'm new to Hadoop and trying to install it on my local machine. I see that there are many ways of installing Hadoop, like installing VMware with Hortonworks and installing Hadoop on top of that, or installing Oracle VirtualBox, Cloudera, and then Hadoop. My question is that...

Why are column-oriented file formats not well suited to streaming writes?

hadoop,column-oriented
Hadoop: The Definitive Guide (4th edition) has a paragraph on page 137: Column-oriented formats need more memory for reading and writing, since they have to buffer a row split in memory, rather than just a single row. Also, it’s not usually possible to control when writes occur (via flush or sync...

Best way to store relational data in hdfs

sql,hadoop,hdfs
I've been reading a lot on Hadoop lately and I can say that I understand the general concept of it, but there is still (at least) one piece of the puzzle that I can't get my head around: what is the best way to store relational data in HDFS? First of...

Is it possible to set the Hadoop block size to 24 MB?

hadoop,size,hdfs,block,megabyte
I just want to ask your opinion about the HDFS block size. I set the HDFS block size to 24 MB and it runs normally. I remember that 24 MB is not a power of 2, which is the usual size convention on computers. So I want to ask all...
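
For reference, a sketch of how a 24 MB block size could be set, either per copy or cluster-wide; the main hard requirement is that the value be a multiple of the checksum chunk size (512 bytes by default), which 24 MB is:

  # per-file override at write time (24 * 1024 * 1024 = 25165824 bytes)
  hadoop fs -D dfs.blocksize=25165824 -put localfile /user/hadoop/localfile

  <!-- or the cluster-wide default in hdfs-site.xml (dfs.block.size on Hadoop 1.x) -->
  <property>
    <name>dfs.blocksize</name>
    <value>25165824</value>
  </property>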

MapReduce job not working in HADOOP-2.6.0

hadoop,mapreduce
I am trying to run wordcount example Here is the code import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public...

Will a file on HDFS with replication factor 3 be stored on 3 hosts?

hadoop,hdfs,replication
Will a file on HDFS with replication factor 3 be stored on exactly 3 hosts, or possibly on more than 3 hosts? ...

Why does YARN take a lot of memory for a simple count operation?

hadoop,mapreduce,hive,yarn,hortonworks-data-platform
I have a standard-configured HDP 2.2 environment with Hive, HBase and YARN. I've used Hive (with HBase) to perform a simple count operation on a table that has about 10 million rows, and it resulted in about 10 GB of memory consumption from YARN. How can I reduce this memory...

Perform SUBSTRING on a Pig Latin Command Parameter

hadoop,apache-pig
I have a pig script that is passed a command argument as part of oozie workflow, I want to create a new variable as a substring of the passed parameter. eg: %declare VAR1 SUBSTRING($INPUT, 0, 5); The error is usually; ParseException: Encountered " "0, "" at line 5, column 37....
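
%declare performs only text substitution before the script runs, so it cannot evaluate the SUBSTRING eval function, which is roughly what the parser is objecting to. Two hedged workarounds (alias and field names are assumptions): apply SUBSTRING to a field inside a FOREACH, or compute the value with a shell command in backquotes (some Pig versions may need the command wrapped in sh -c for the pipe to work):

  -- per-record substring inside the data flow:
  shortened = FOREACH data GENERATE SUBSTRING(some_field, 0, 5);

  -- script-level parameter computed by a command:
  %declare VAR1 `echo $INPUT | cut -c1-5`;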

Multiple file storage in HDFS using Apache Spark

hadoop,apache-spark,hdfs
I am doing a project that involves using HDFS for storage and Apache Spark for computation. I have a directory in HDFS which has several text files in it at the same depth. I want to process all these files using Spark and store their corresponding results back to HDFS with...

How to find the exact hadoop jar command which was running my job?

hadoop,yarn,oozie,cascading,scalding
I'm using CDH 5.4. I'm running a Hadoop job which from the command line appears to be OK (when simply running with hadoop jar). However, if I run it from YARN it finishes silently with a single mapper and no reducers. I really suspect both 'runs' were running the exact same command....

Error while creating external table in Hive using EsStorageHandler

hadoop,elasticsearch,hive
I am facing an error while creating an External Table to push the data from Hive to ElasticSearch. What I have done so far: 1) Successfully set up ElasticSearch-1.4.4 and is running. 2) Successfully set up Hadoop1.2.1, all the daemons are up and running. 3) Successfully set up Hive-0.10.0. 4)...

How can I parse a Json column of a Hive table using a Json serde?

json,hadoop,hive
I am trying to load de-serialized json events into different tables, based on the name of the event. Right now I have all the events in the same table, the table has only two columns EventName and Payload (the payload stores the json representation of the event): CREATE TABLE event(...

load struct or any other complex data type in hive

hadoop,hive,hiveql
I have a .xlsx file which contains data something like the below image, and I am trying to create a table using the below create query: create table aus_aboriginal( code int, area_name string, male_0_4 STRUCT<num:double, total:double, perc:double>, male_5_9 STRUCT<num:double, total:double, perc:double>, male_10_14 STRUCT<num:double, total:double, perc:double>, male_15_19 STRUCT<num:double, total:double, perc:double>, male_20_24 STRUCT<num:double, total:double, perc:double>,...

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanSetDropBehind issue in Eclipse

maven,hadoop,apache-spark,word-count
I have the below spark word count program : package com.sample.spark; import java.util.Arrays; import java.util.List; import java.util.Map; import org.apache.spark.SparkConf; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.Function2; import org.apache.spark.api.java.function.PairFlatMapFunction; import org.apache.spark.api.java.function.PairFunction; import...

hadoop large file does not split

performance,hadoop,split,mapreduce
I have an input file of size 136 MB, I launched some WordCount tests, and I observe only one mapper. Then I set dfs.blocksize to 64 MB in my hdfs-site.xml and I still get one mapper. Am I doing something wrong?
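
For what it's worth, changing dfs.blocksize only affects files written after the change; an existing 136 MB file keeps its original block layout, and the mapper count follows the input splits. If the aim is simply more mappers, the split size can be capped per job; a sketch assuming Hadoop 2.x property names and a placeholder examples jar:

  hadoop jar hadoop-mapreduce-examples.jar wordcount \
      -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
      /user/me/input /user/me/output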

Hive(Bigdata)- difference between bucketing and indexing

hadoop,mapreduce,hive,bigdata
What is the main difference between bucketing and indexing of a table in Hive?
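
Very roughly: bucketing is a physical layout choice baked into the table definition (useful for sampling and bucketed map joins), while an index is a separate structure built afterwards to speed up lookups on existing data. A hedged sketch of each, with made-up table and column names:

  -- bucketing: rows are hashed into a fixed number of files as they are written
  CREATE TABLE emp_bucketed (id INT, name STRING)
  CLUSTERED BY (id) INTO 32 BUCKETS;

  -- indexing: an auxiliary structure maintained alongside the base table
  CREATE INDEX emp_id_idx ON TABLE emp_bucketed (id)
  AS 'COMPACT' WITH DEFERRED REBUILD;
  ALTER INDEX emp_id_idx ON emp_bucketed REBUILD;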

Hadoop distributed mode

hadoop
This question may seem very trivial, but I'm new to Hadoop and currently confused by one question. When starting the daemons, how can the appropriate files be located on the slave nodes? I know you specify the masters and the slaves in the appropriate files, but how does it know...

What does namespace and block pool mean in MapReduce 2.0 YARN?

hadoop,yarn
I understood that in MRv2 all datanodes report to multiple namenodes regarding blocks via heartbeats. Where exactly do these datanodes report so that the information is saved across all namenodes? If any of the namenodes goes down, will the cluster lose some block information?

PySpark repartitioning RDD elements

hadoop,apache-spark,partitioning,rdd,pyspark
I have a spark job that reads from a Kafka stream and performs an action for each RDD in the stream. If the RDD is not empty, I want to save the RDD to HDFS, but I want to create a file for each element in the RDD. I've found...

On which hadoop node would the below scalding pre-process and post-process runs?

scala,hadoop,scalding
I have the below example code for some pre-processing before a Scalding job runs, and some post-processing. As the pre-process and post-process are calling a MySQL database, I would like to know on which Hadoop nodes Hadoop would potentially run them. (I need to open the port from these nodes to...

passing argument from shell script to hive script

bash,hadoop,hive
I've a concern which can be categorized in 2 ways: My requirement is of passing argument from shell script to hive script. OR within one shell script I should include variable's value in hive statement. I'll explain with an example for both: 1) Passing argument from shell script to hiveQL->...
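
A minimal sketch of both variants, assuming hypothetical names my_query.hql, my_table and run_date:

  #!/bin/bash
  RUN_DATE='2015-06-18'
  hive --hivevar run_date="$RUN_DATE" -f my_query.hql          # 1) pass the value into a script file
  hive -e "SELECT * FROM my_table WHERE dt = '$RUN_DATE';"     # 2) expand the value inline in the shell

  -- inside my_query.hql the variable is referenced as:
  SELECT * FROM my_table WHERE dt = '${hivevar:run_date}';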

What does it mean that my resource manager does not have an open port 8032?

hadoop,yarn,cloudera-cdh
I have my YARN resource manager on a different node than my namenode, and I can see that something is running, which I take to be the resource manager. Ports 8031 and 8030 are bound, but not port 8032, to which my client tries to connect. I am on CDH...

start-dfs.sh: command not found

hadoop,ubuntu-14.04
I have installed Hadoop 2.7.0 on Ubuntu 14.04, but the command start-dfs.sh is not working. When I run it, it returns start-dfs.sh: command not found. The start-dfs.sh, start-all.sh, stop-dfs.sh and stop-all.sh scripts are in the sbin directory. I have installed and set the paths of Java and Hadoop correctly....
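
Assuming the install really lives under /usr/local/hadoop (adjust to taste), the sbin scripts are only found by name if that directory is on the PATH; otherwise they have to be invoked with their full path, e.g. in ~/.bashrc:

  export HADOOP_HOME=/usr/local/hadoop
  export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"

  # or, without touching PATH:
  $HADOOP_HOME/sbin/start-dfs.sh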

How to pass a file as parameter in mapreduce

java,caching,hadoop
I want to search for particular words in a file and display its count. When the word to be searched is a single word, I am able to do it by setting the configuration in the driver like below : Driver class : Configuration conf = new Configuration(); conf.set("wordtosearch", "fun");...
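
For a whole file of search terms, the usual pattern is to ship the file with the job and read it once per task in setup(). A rough sketch against the Hadoop 2 API (imports and the surrounding driver/mapper classes are omitted; the HDFS path and field names are hypothetical):

  // driver
  Job job = Job.getInstance(conf, "word search");
  job.addCacheFile(new java.net.URI("hdfs:///user/hduser/wordstosearch.txt"));

  // mapper
  private Set<String> wordsToSearch = new HashSet<>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
      java.net.URI cached = context.getCacheFiles()[0];
      // the file is localized next to the task, so it can be opened by its base name
      try (BufferedReader reader = new BufferedReader(
              new FileReader(new java.io.File(cached.getPath()).getName()))) {
          String line;
          while ((line = reader.readLine()) != null) {
              wordsToSearch.add(line.trim());
          }
      }
  }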

How to flatMap a function on GroupedDataSet in Apache Flink

scala,hadoop,flink
I want to apply a function via flatMap to each group produced by DataSet.groupBy. Trying to call flatMap I get the compiler error: error: value flatMap is not a member of org.apache.flink.api.scala.GroupedDataSet My code: var mapped = env.fromCollection(Array[(Int, Int)]()) var groups = mapped.groupBy("myGroupField") groups.flatMap( myFunction: (Int, Array[Int]) => Array[(Int, Array[(Int,...
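
GroupedDataSet deliberately exposes no flatMap; the closest equivalent is reduceGroup, which hands the whole group to a function together with a Collector. A hedged Scala sketch (the sample data and the use of a tuple position instead of the "myGroupField" name are assumptions):

  import org.apache.flink.api.scala._
  import org.apache.flink.util.Collector

  val env = ExecutionEnvironment.getExecutionEnvironment
  val mapped = env.fromCollection(Seq((1, 10), (1, 20), (2, 30)))
  val groups = mapped.groupBy(0)

  // one output record per group: (key, all values of the group)
  val perGroup = groups.reduceGroup {
    (in: Iterator[(Int, Int)], out: Collector[(Int, Array[Int])]) =>
      val items = in.toArray
      out.collect((items.head._1, items.map(_._2)))
  }
  perGroup.print()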

Importtsv command gives : Container exited with a non-zero exit code 1 error

hadoop,hbase,classpath,yarn
I am trying to load a tsv file into an existing hbase table. I am using the following command: /usr/local/hbase/bin$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:value '-Dtable_name.separator=\t' Table-name /hdfs-path-to-input-file But when I execute the above command, I get the following error Container id: container_1434304449478_0018_02_000001 Exit code: 1 Stack trace: ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)...

How to find number of unique connection using hive/pig

hadoop,hive,apache-pig
I have a sample table like below: caller receiver 100 200 100 300 400 100 100 200 I need to find the number of unique connection for each number. For ex: 100 will have connections like 200,300 and 400. My output should be like: 100 3 200 1 300 1...
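
One hedged HiveQL sketch (the table name calls is a stand-in): treat every row as two directed edges, then count distinct partners per number:

  SELECT number, COUNT(DISTINCT other) AS unique_connections
  FROM (
    SELECT caller   AS number, receiver AS other FROM calls
    UNION ALL
    SELECT receiver AS number, caller   AS other FROM calls
  ) t
  GROUP BY number;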

Hbase: Having just the first version of each cell

hadoop,hbase
I was wondering how can I configure Hbase in a way to store just the first version of each cell? Suppose the following Htable: row_key cf1:c1 timestamp ---------------------------------------- 1 x t1 After putting ("1","cf1:c2",t2) in the scenario of ColumnDescriptor.DEFAULT_VERSIONS = 2 the mentioned Htable becomes: row_key cf1:c1 timestamp ---------------------------------------- 1...

Merging two columns into a single column and formatting the content to form an accurate date-time format in Hive?

sql,regex,hadoop,hive,datetime-format
these are the 2 columns(month,year). I want to create a single column out of them having an accurate date-time format('YYYY-MM-DD HH:MM:SS') and add as new column in the table. Month year 12/ 3 2013 at 8:40pm 12/ 3 2013 at 8:39pm 12/ 3 2013 at 8:39pm 12/ 3 2013 at...

How to change replication factor while running copyFromLocal command?

hadoop,hdfs
I'm not asking how to set replication factor in hadoop for a folder/file. I know following command works flawlessly for existing files & folders. hadoop fs -setrep -R -w 3 <folder-path> I'm asking, how do I set the replication factor, other than default (which is 4 in my scenario), while...
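
A minimal sketch of one common way to do this: pass the setting as a generic -D option, which applies only to that copy (the value 2 and the paths are placeholders):

  hadoop fs -D dfs.replication=2 -copyFromLocal localfile.txt /user/hduser/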

Flink error - org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

maven,hadoop,flink
I am trying to run a flink job using a file from HDFS. I have created a dataset as following - DataSource<Tuple2<LongWritable, Text>> visits = env.readHadoopFile(new TextInputFormat(), LongWritable.class,Text.class, Config.pathToVisits()); I am using flink's latest version - 0.9.0-milestone-1-hadoop1 (I have also tried with 0.9.0-milestone-1) whereas my Hadoop version is 2.6.0 But,...

Oozie on YARN - oozie is not allowed to impersonate hadoop

hadoop,yarn,oozie,ambari
I'm trying to use Oozie from Java to start a job on a Hadoop cluster. I have very limited experience with Oozie on Hadoop 1 and now I'm struggling trying out the same thing on YARN. I'm given a machine that doesn't belong to the cluster, so when I try...

Results of reduce and count differ in pyspark

python,hadoop,apache-spark
For my spark trials, I have downloaded the NY taxi csv files and merged them into a single file, nytaxi.csv . I then saved this in hadoop fs. I am using spark on yarn with 7 nodemanagers. I am connecting to spark over Ipython notebook. Here is a sample python...

Should Apache Kafka and Hadoop be installed separately (on a different cluster)?

hadoop,apache-kafka,kafka
Should Apache Kafka and Hadoop be installed separately (on a different cluster)?

Hadoop Basic - error while creating directory

hadoop,hdfs
I have started learning Hadoop recently and I am getting the below error while creating a new folder: [email protected]:~/Installations/hadoop-1.2.1/bin$ ./hadoop fs -mkdir helloworld Warning: $HADOOP_HOME is deprecated. 15/06/14 19:46:35 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) Request you to help. Below...

pcap to Avro on Hadoop

hadoop,mapreduce,pcap,avro
I need to know if there is any way I can convert a pcap file to Avro, so that I can write a MapReduce program on the Avro data using Hadoop. Otherwise, what is the best practice when dealing with pcap files on Hadoop? Thanks...

How to change ulimit in CentOS6

linux,hadoop,centos6
I am using CentOS6.6 and trying to install HDP2.2 When I do: ulimit -Sn Its value is 1024 When I do: ulimit -Hn Its value is 4096 The recommended maximum number of open file descriptors is 10000, or more. I am trying to increase this value. I have checked several...

Presence of “in” in Pig's UDF causes problems

hadoop,apache-pig,udf
I was trying my first UDF in pig and wrote the following function - package com.pig.in.action.assignments.udf; import org.apache.pig.EvalFunc; import org.apache.pig.PigWarning; import org.apache.pig.data.Tuple; import java.io.IOException; public class CountLength extends EvalFunc<Integer> { public Integer exec(Tuple inputVal) throws IOException { // Validate Input Value ... if (inputVal == null || inputVal.size() == 0...

Hadoop map reduce Extract specific columns from csv file in csv format

java,hadoop,file-io,mapreduce,bigdata
I am new to Hadoop and working on a big data project where I have to clean and filter a given CSV file. For example, if the given CSV file has 200 columns, then I need to select only 20 specific columns (so-called data filtering) as output for further operations. Also...

Save flume output to hive table with Hive Sink

hadoop,hive,flume
I am trying to configure flume with Hive to save flume output to hive table with Hive Sink type. I have single node cluster. I use mapr hadoop distribution. Here is my flume.conf agent1.sources = source1 agent1.channels = channel1 agent1.sinks = sink1 agent1.sources.source1.type = exec agent1.sources.source1.command = cat /home/andrey/flume_test.data agent1.sinks.sink1.type...

Apache Spark: Error while starting PySpark

python,hadoop,apache-spark,pyspark
On a Centos machine, Python v2.6.6 and Apache Spark v1.2.1 Getting the following error when trying to run ./pyspark Seems some issue with python but not able to figure out 15/06/18 08:11:16 INFO spark.SparkContext: Successfully stopped SparkContext Traceback (most recent call last): File "/usr/lib/spark_1.2.1/spark-1.2.1-bin-hadoop2.4/python/pyspark/shell.py", line 45, in <module> sc =...

Hadoop Definitive Guide “National Climatic Data Center” get the data

hadoop
I'm trying to learn Hadoop. I'm trying to get the National Climatic Data Center data onto my newly installed Hadoop master. What is the easiest way of getting the whole data set? Edit: since I got downvoted even though I got my answer, I think I should explain my question in detail....

Spark on yarn jar upload problems

java,hadoop,mapreduce,apache-spark
I am trying to run a simple Map/Reduce java program using spark over yarn (Cloudera Hadoop 5.2 on CentOS). I have tried this 2 different ways. The first way is the following: YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/; /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --jars /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar simplemr.jar This method gives the following error: diagnostics: Application application_1434177111261_0007...

Not able to format namenode in hadoop-2.6.0 multi node installation

ubuntu,hadoop,installation,yarn
I am trying to install Hadoop-2.6.0 in ubuntu 14-04 machine. 5 node cluster. But when I format the namenode, it gives me the following error No command 'hdfs' found, did you mean: Command 'hfs' from package 'hfsutils-tcltk' (universe) Command 'hdfls' from package 'hdf4-tools' (universe) hdfs: command not found And in...

Running HBase in standalone mode but get hadoop “retrying connect to server” message?

hadoop,hbase
I'm trying to run HBase in standalone mode following this tutorial: http://hbase.apache.org/book.html#quickstart I get the following exception when I try to run create 'test', 'cf' in the HBase shell ERROR: org.apache.hadoop.hbase.PleaseHoldException: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing I've seen questions here regarding this error, but the solutions haven't worked for me. What...

Add PARTITION after creating TABLE in hive

hadoop,hive,partition
I have created a non-partitioned table and loaded data into the table. Now I want to add a PARTITION on the basis of department to that table. Can I do this? If I do: ALTER TABLE Student ADD PARTITION (dept='CSE') location '/test'; it gives me the error: FAILED: SemanticException table is not...
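
Hive will not retrofit partitions onto a table created without PARTITIONED BY, so the usual route is to create a new partitioned table and reload the data into it. A hedged sketch with guessed column names:

  CREATE TABLE student_part (name STRING, age INT)
  PARTITIONED BY (dept STRING);

  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;

  INSERT OVERWRITE TABLE student_part PARTITION (dept)
  SELECT name, age, dept FROM Student;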

Create an external Hive table from an existing external table

csv,hadoop,hive
I have a set of CSV files in a HDFS path and I created an external Hive table, let's say table_A, from these files. Since some of the entries are redundant, I tried creating another Hive table based on table_A, say table_B, which has distinct records. I was able to...

Datanode and Nodemanager on slave machine are not able to connect to NameNode and ResourceManager on master machine

java,apache,sockets,hadoop,tcp
I have installed Hadoop on a two-node cluster: Node1 and Node2. Node1 is the master and Node2 is the slave. Node2's datanode and NodeManager are not able to connect to the NameNode and ResourceManager on Node1, respectively. However, Node1's own datanode and NodeManager are able to connect to the NameNode and ResourceManager on Node1. Node1: jps...

Input of the reduce phase is not what I expect in Hadoop (Java)

java,hadoop,mapreduce,reduce,emit
I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents an edge - in fact, this is a triangle graph): 1 3 3 1 3 2 2 3 Now, I want to use MapReduce to...

Does Hadoop not suffer disk seeks as it sits on top of the Linux filesystem?

hadoop,hdfs
I am new to Hadoop, and I know HDFS is 64 MB (minimum) per block and can increase depending on the system. But as HDFS is installed on top of a Linux filesystem which uses 4 KB blocks, does Hadoop not suffer disk seeks? Also, does HDFS interact with the Linux filesystem...

Why we are configuring mapred.job.tracker in YARN?

hadoop,mapreduce,yarn
What I know is that YARN was introduced and it replaced the JobTracker and TaskTracker. I have seen some Hadoop 2.6.0/2.7.0 installation tutorials where they configure mapreduce.framework.name as yarn and the mapred.job.tracker property as local or host:port. The description for the mapred.job.tracker property is "The host and port that the MapReduce job...

Data in HDFS files not seen under hive table

hadoop,hive,sqoop,hadoop-partitioning
I have to create a hive table from data present in oracle tables. I'm doing a sqoop, thereby converting the oracle data into HDFS files. Then I'm creating a hive table on the HDFS files. The sqoop completes successfully and the files also get generated in the HDFS target directory....

In a MapReduce , how to send arraylist as value from mapper to reducer [duplicate]

java,hadoop,arraylist,mapreduce
This question already has an answer here: Output a list from a Hadoop Map Reduce job using custom writable 1 answer How can we pass an arraylist as value from the mapper to the reducer. My code basically has certain rules to work with and would create new values(String)...

SQL Server 2012 & Polybase - 'Hadoop Connectivity' configuration option missing

sql-server,hadoop,sql-server-2012
As described in the title, I am using SQL Server 2012 Parallel Data Warehouse with the Polybase feature to try to access an HDInsight Hadoop cluster. As a starting point for every connection to Hadoop from SQL Server, I find I have to execute the command sp_configure @configname = 'hadoop connectivity', @configvalue =...

Use case HBase on EMR

hadoop,amazon-web-services,hbase,storage,emr
I read the documentation on AWS, but a point is still unclear. Is S3 the primary storage of an EMR cluster, or is the data in EC2 and S3 just a copy? In the doc: "HBase on Amazon EMR provides the ability to back up your HBase data...

Spark utf 8 error, non-English data becomes `??????????`

scala,hadoop,apache-spark
One of the fields in our data is in a non-English language (Thai). We can load the data into HDFS and the system displays the non-English field correctly when we run: hadoop fs -cat /datafile.txt However, when we use Spark to load and display the data, all the non-English data...

Hadoop append data to hdfs file and ignore duplicate entries

java,hadoop,mapreduce,hive,hdfs
How can I append data to HDFS files and ignore duplicate values? I have a huge HDFS file (MainFile) and I have 2 other new files from different sources, and I want to append data from these files to the MainFile. The MainFile and the other files have the same structure....

ERROR jdbc.HiveConnection: Error opening session Hive

java,hadoop,jdbc,hive
I try to run JDBC code for Hive2 and get an error. I have Hive version 1.2.0 and Hadoop version 1.2.1. On the command line, hive and beeline work fine without any problem, but with JDBC I get an error. import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import java.sql.Statement; import java.sql.DriverManager; public class HiveJdbcClient { private static...

Hadoop Job class not found

java,eclipse,hadoop,mapreduce,classpath
Hi I'm having trouble and I haven't been able to get help yet from similar threads. I am doing an example of a hadoop job and I'm just trying to run it from the IDE right now. Here is my source code package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.io.*;...

Data node not starting: here is the log file

hadoop
java.io.FileNotFoundException: File file:/hadoopuser/hdfs/datanode does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) at...

How to drop partition metadata from Hive when a partition is dropped using the alter drop command

hadoop,apache-hive
I have dropped all the partitions in the Hive table by using the alter command alter table emp drop partition (hiredate>'0'); After dropping the partitions I can still see the partition metadata. How do I delete this partition metadata? Can I use the same table for new partitions? ...

HIVE: apply delimiter until a specified column

hadoop,datatable,hive,delimiter
I am trying to move data from a file into a hive table. The data in the file looks something like this:- StringA StringB StringC StringD StringE where each string is separated by a space. The problem is that i want separate columns for StringA, StringB and StringC and one...

Cassandra WordCount Hadoop

hadoop,cassandra
Can anyone explain to me the following lines from Cassandra 2.1.15 WordCount example? CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3"); CqlConfigHelper.setInputCql(job.getConfiguration(), "select * from " + COLUMN_FAMILY + " where token(id) > ? and token(id) <= ? allow filtering"); How do I define concrete values which will be used to replace "?" in the query?...

where is the the default scheme configuration in hadoop?

hadoop
I'm trying to learning Hadoop. According to the Document: All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme...

How to use spark for map-reduce flow to select N columns, top M rows of all csv files under a folder?

hadoop,mapreduce,apache-spark,spark-streaming,pyspark
To be concrete, say we have a folder with 10k of tab-delimited csv files with following attributes format (each csv file is about 10GB): id name address city... 1 Matt add1 LA... 2 Will add2 LA... 3 Lucy add3 SF... ... And we have a lookup table based on "name"...

The reduce task is stopped by Too Many Fetch Failure message in Hadoop multi node (10x) cluster

java,linux,ubuntu,hadoop,distributed
I am using Hadoop 1.0.3 for a 10 Desktop cluster system each having Ubuntu 12.04LTS 32 bit OS. The JDK is 7 u 75. Each machine has 2 GB RAM with core 2-duo processor. For a research project, I need to run a hadoop job similar to "Word Count". And...

Amazon Redshift: query execution hangs

grails,hadoop,amazon-web-services,amazon-redshift
I use amazon redshift and sometimes the query execution hangs without any error messages e.g. this query will execute: select extract(year from date), extract(week from date),count(*) from some_table where date>'2015-01-01 00:00:00' and date<'2015-12-31 23:59:59' group by extract(year from date), extract(week from date) and this not: select extract(year from date), extract(week...

schedule and automate sqoop import/export tasks

shell,hadoop,automation,hive,sqoop
I have a sqoop job which requires to import data from oracle to hdfs. The sqoop query i'm using is sqoop import --connect jdbc:oracle:thin:@hostname:port/service --username sqoop --password sqoop --query "SELECT * FROM ORDERS WHERE orderdate = To_date('10/08/2013', 'mm/dd/yyyy') AND partitionid = '1' AND rownum < 10001 AND \$CONDITIONS" --target-dir /test1...

hadoop - vertica jar

hadoop,jar,vertica
I am trying to transfer data from Vertica to hive. According to the manual the following should be set as the input format: -inputformat com.vertica.hadoop.deprecated.VerticaStreamingInput But the hadoop-vertica jar has org.apache.hadoop.vertica.VerticaStreamingInput class and not the above. So it is throwing me the following exception: Exception in thread "main" java.lang.RuntimeException: class...

HDFS Path for Spark Submit and Flink on YARN

java,hadoop,apache-spark,hdfs,flink
I work with the Cloudera Live VM, where I have a Hadoop and Spark standalone cluster. Now I want to submit my jobs with the spark-submit and flink run scripts. This works, too, but my apps can't find the path to the input and output files in HDFS. I set the path...

Hive external table not reading entirety of string from CSV source

csv,hadoop,hive,hiveql
Relatively new to the Hadoop world so apologies if this is a no-brainer but I haven't found anything on this on SO or elsewhere. In short, I have an external table created in Hive that reads data from a folder of CSV files in HDFS. The issue is that while...

jets3t cannot upload file to s3

hadoop,amazon-s3,jets3t
I'm trying to upload files from local to s3 using hadoop fs and jets3t, but I'm getting the following error Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Request Error. HEAD '/project%2Ftest%2Fsome_event%2Fdt%3D2015-06-17%2FsomeFile' on Host 'host.s3.amazonaws.com' @ 'Thu, 18 Jun 2015 23:33:01 GMT' -- ResponseCode: 404, ResponseStatus: Not Found, RequestId: AVDFJKLDFJ3242, HostId: D+sdfjlakdsadf\asdfkpagjafdjsafdj I'm...

Spark - How to count number of records by key

hadoop,apache-spark,cloud
This is probably an easy problem but basically I have a dataset where I am to count the number of females for each country. Ultimately I want to group each count by the country but I am unsure of what to use for the value since there is not a...
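
A minimal Scala sketch of the usual pattern, assuming each line parses to (country, gender); the file name and layout are made up:

  val records = sc.textFile("people.csv")
    .map(_.split(","))
    .map(f => (f(0), f(1)))              // (country, gender)

  val femalesByCountry = records
    .filter { case (_, gender) => gender == "F" }
    .map { case (country, _) => (country, 1) }
    .reduceByKey(_ + _)                  // count per country

  femalesByCountry.collect().foreach(println)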

Vertica: Input record 1 has been rejected (Too few columns found)

hadoop,vertica
I am trying to copy a file from Hadoop to a Vertica table and get an error. The problem is that the same copy sometimes passes and sometimes fails. Any idea? The Error: Caused by: java.sql.SQLException: [Vertica]VJDBC ERROR: COPY: Input record 1 has been rejected (Too few columns found) at com.vertica.util.ServerErrorData.buildException(Unknown Source)...

How I can use map reduce program to check if a value of column match with a criteria

hadoop,mapreduce
How can we use the algorithm Map Reduce to check whether the values of a column in a data file correspond to a given criterion? e.g: for a column C1 we want to check that the values of this column match with the criterion : C1 in ("A", "B", "C")....

MapReduce (Hadoop-2.6.0)+ HBase-1.0.1.1 class not found exception

eclipse,hadoop,mapreduce,hbase
I have written a Map-Reduce program to fetch data from an input file and output it to a HBase table. But I am not able to execute. I am getting the following error Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration at beginners.VisitorSort.main(VisitorSort.java:123) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at...

JMH Benchmark on Hadoop YARN

java,hadoop,yarn,microbenchmark,jmh
I have written a JMH benchmark for my MapReduce job. If I run my app in local mode, it works, but when I run it with the yarn script on my hadoop cluster, then I get the following error: [[email protected] Desktop]$ ./launch_mapreduce.sh # JMH 1.10 (released 5 days ago) #...

How to run a hadoop application automatically?

hadoop
I know that a MapReduce program can be run once using the command line "hadoop jar *.jar". But now the program is required to run once every hour in the background. Are there any methods to make the MR program be submitted to Hadoop hourly and automatically?...
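
One low-tech option is a cron entry on an edge node (an Oozie coordinator is the heavier-weight alternative). A hedged crontab sketch with made-up paths; note that % must be escaped inside crontab entries:

  # run at minute 0 of every hour
  0 * * * * /usr/local/hadoop/bin/hadoop jar /home/user/myjob.jar com.example.MyJob /input /output_$(date +\%Y\%m\%d\%H) >> /var/log/myjob.log 2>&1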

Hive shell throws Filenotfound exception while executing queries, inspite of adding jar files using “ADD JAR”

java,hadoop,hive,hdfs,hiveql
1) I have added serde jar file using "ADD JAR /home/hduser/softwares/hive/hive-serdes-1.0-SNAPSHOT.jar;" 2) Create table 3) The table is creates successfully 4) But when I execute any select query it throws file not found exception hive> select count(*) from tab_tweets; Query ID = hduser_20150604145353_51b4def4-11fb-4638-acac-77301c1c1806 Total jobs = 1 Launching Job 1...

Data flow among InputSplit, RecordReader, Map instance and Mapper

hadoop,mapreduce
If I have a data file with 1000 lines and I use TextInputFormat for my Word Count program, then every line in the data file will be considered as one split. A RecordReader will feed each line (or split) as a (Key, Value) pair to the map() method. As...

What are the different ways to check if the mapreduce program ran successfully

hadoop,mapreduce,bigdata
If we need to automate a mapreduce program or run from a script, what are the different ways to check if the mapreduce program ran successfully? One way is to find is if _SUCCESS file is created in the output directory. Does the command "hadoop jar program.jar hdfs:/input.txt hdfs:/output" return...
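
Both checks can be combined in the calling script; the exit status is only meaningful if the driver exits non-zero on failure (e.g. System.exit(job.waitForCompletion(true) ? 0 : 1)). A sketch:

  hadoop jar program.jar hdfs:/input.txt hdfs:/output
  if [ $? -ne 0 ]; then
      echo "MapReduce job failed" >&2
      exit 1
  fi
  # the marker file a successful job writes into its output directory
  hadoop fs -test -e /output/_SUCCESS || { echo "_SUCCESS missing" >&2; exit 1; }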

Sqoop Export with Missing Data

sql,postgresql,shell,hadoop,sqoop
I am trying to use Sqoop to export data from HDFS into Postgresql. However, I receive an error partially through the export that it can't parse the input. I manually went into the file I was exporting and saw that this row had two columns missing. I have tried a...

Accessing csv file placed in hdfs using spark

csv,hadoop,apache-spark,pyspark
I have placed a CSV file into the HDFS filesystem using the hadoop fs -put command. I now need to access the CSV file using pyspark csv. Its format is something like `plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')` I am a newbie to HDFS. How do I find the address to be placed in hdfs://x.x.x.x?...
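
The host in that URI is the NameNode, and it comes from fs.defaultFS in core-site.xml; a hedged PySpark sketch (namenode-host:8020 and the path are placeholders, and 9000 is another common port):

  # explicit NameNode address
  plaintext_rdd = sc.textFile('hdfs://namenode-host:8020/user/me/blah.csv')

  # or omit the scheme and let Spark use the cluster's default filesystem
  plaintext_rdd = sc.textFile('/user/me/blah.csv')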

Weird dse hive integration in DSE 4.7

hadoop,hive,datastax,datastax-enterprise
I'm trying to run Hive query over existing C* table. Here is my C* table definition: drop table IF EXISTS mydata.site_users; CREATE TABLE IF NOT EXISTS appdata.site_users ( user_id text, user_test_uuid uuid, --for testing purposes, if we can use it in queries, there could be some serde problems? user_name text,...