

Hadoop MultipleOutputs to absolute path, but file is already being created by another attempt

hadoop,mapreduce,multipleoutputs
I use MultipleOutputs to output data to some absolute paths, instead of a path relative to OutputPath. Then I get the error: Error: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/test/convert.bak/326/201505110030/326-m-00035] for [DFSClient_attempt_1425611626220_29142_m_000035_1001_-370311306_1] on client [192.168.7.146], because this file is already being created by [DFSClient_attempt_1425611626220_29142_m_000035_1000_-53988495_1] on [192.168.7.149] at...

Unable to access HBase from MapReduce code

java,hadoop,mapreduce,hbase,zookeeper
I am trying to use an HDFS file as source and HBase as sink. My Hadoop cluster has the following specification: master 192.168.4.65 slave1 192.168.4.176 slave2 192.168.4.175 slave3 192.168.4.57 slave4 192.168.4.146 The ZooKeeper nodes are on the following IP addresses: zks1 192.168.4.60 zks2 192.168.4.61 zks3 192.168.4.66 The HBase nodes are on the following IP...

Hive (Big Data) - difference between bucketing and indexing

hadoop,mapreduce,hive,bigdata
What is the main difference between bucketing and indexing of a table in Hive?

How can I use a MapReduce program to check if a column value matches a criterion

hadoop,mapreduce
How can we use MapReduce to check whether the values of a column in a data file satisfy a given criterion? E.g., for a column C1 we want to check that the values of this column match the criterion C1 in ("A", "B", "C")....

How does mapper output get written to HDFS in case of Sqoop?

java,hadoop,mapreduce,hdfs,sqoop
From what I have learned about Hadoop MapReduce jobs, mapper output is written to local storage and not to HDFS, since it is ultimately throwaway data and there is no point in storing it in HDFS. But in the case of Sqoop, I see that the mapper output file part-m-00000 is written into...

Data flow among InputSplit, RecordReader, Map instance and Mapper

hadoop,mapreduce
If I have a data file with 1000 lines and I use TextInputFormat in my map method for my Word Count program, then every line in the data file will be considered as one split. A RecordReader will feed each line (or split) as a (Key, Value) pair to the map() method. As...

How to omit empty part-000x files from Python streaming MapReduce job

python,hadoop,mapreduce,hadoop-streaming
I created a Python mapper that I run as a Hadoop streaming MapReduce job. It validates the input and writes a message to output if the input is invalid. ... # input from STDIN for line in sys.stdin: indata = json.loads(line) try: jsonschema.validate(indata,schema) except jsonschema.ValidationError, error: # validation against schema...

Map Function on Spark returning 'NoneType'

python,mapreduce,apache-spark
I have written the following code in Python to run on Apache Spark: import sys from pyspark import SparkContext def generate_kdmer(seq): res = [] beg2, end2 = k+d, k+d+k last = len(seq) - end2 + 1 for i in range(last): res.append([seq[i:i+k], seq[i+beg2:i+end2]]) return res.sort() if __name__ == "__main__": if len(sys.argv)...
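
A minimal sketch (not from the original post) of the likely cause: in Python, list.sort() sorts in place and returns None, so returning res.sort() makes every mapped element None. Using sorted(res) keeps the intent but returns the list; the values of k and d below are assumptions, since the excerpt does not show where they are defined.

    from pyspark import SparkContext

    k, d = 3, 2  # assumed values; the original code presumably defines these elsewhere

    def generate_kdmer(seq):
        res = []
        beg2, end2 = k + d, k + d + k
        last = len(seq) - end2 + 1
        for i in range(last):
            res.append([seq[i:i + k], seq[i + beg2:i + end2]])
        return sorted(res)  # sorted() returns a new list; res.sort() would return None

    if __name__ == "__main__":
        sc = SparkContext(appName="kdmer-sketch")
        print(sc.parallelize(["ACGTACGTACGT"]).map(generate_kdmer).collect())
        sc.stop()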

How to transfer a Java .class file from one machine to another over the network?

java,serialization,mapreduce,rmi
So basically, I am trying to implement a MapReduce framework in Java. The problem is that I want slave nodes (machines) to know the user-defined map and reduce functions. At the beginning, only the master node knows what the user code is, because by default users write code on the master machine. However, slave nodes and master...

CouchDB sum by date range and type

mapreduce,couchdb
Simply put I want to _sum totals over a date range grouped by type. The original docs in the db are each for a single date, containing data by type. (For example, each doc has total apples, oranges, and pears picked on a date. We want to query for total...

Serializing Java object using Hadoop libraries

java,hadoop,serialization,mapreduce
I'm trying to serialize an object in Java and write it to a file so that my Map function can take that from the file and deserialize it to get the object back. I am of the opinion that Java serialization isn't very optimal. So I want to use hadoop...

Map a table of a cassandra database using spark and RDD

java,mapreduce,apache-spark,rdd
I have to map a table that stores the utilization history of an app. The table contains these tuples: <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> <AppId,date,cpuUsage,memoryUsage> AppId is always different, because it refers to many different apps; date is expressed in the format dd/mm/yyyy hh/mm; cpuUsage and memoryUsage are...

Uploading HFiles in Hbase fails because of method not found error

hadoop,mapreduce,hbase,hdfs
I am trying to upload HFiles to HBase using bulk load. While doing so I am encountering a method-not-found error. The command and logs are given below. Command: hadoop jar /usr/lib/hbase/lib/hbase-server-0.98.11-hadoop2.jar completebulkload /output NBAFinal2010 where output is the HFiles output folder and NBAFinal2010 is a table in HBase. Logs: 15/05/05...

Split size vs Block size in Hadoop

hadoop,mapreduce,hdfs
What is the relationship between split size and block size in Hadoop? As I read in this, split size must be n times the block size (n is an integer and n > 0); is this correct? Is there any required relationship between split size and block size?

What are the different ways to check if the mapreduce program ran successfully

hadoop,mapreduce,bigdata
If we need to automate a MapReduce program or run it from a script, what are the different ways to check if the MapReduce program ran successfully? One way is to check whether the _SUCCESS file is created in the output directory. Does the command "hadoop jar program.jar hdfs:/input.txt hdfs:/output" return...

How to import string data from impala database to wordcount mapreduce

java,hadoop,mapreduce
I was trying to use the WordCount code with Hadoop MapReduce. But almost all of the WordCount tutorials I have read import data from the file path in the job configuration. I want to connect an Impala database to the WordCount MapReduce job using Java. How do I proceed? Or just enter the...

Working With Reducer class in hadoop [duplicate]

java,hadoop,mapreduce,reduce
This question already has an answer here: How do I compare strings in Java? 23 answers I am building a map reduce job. The problem is that comparing is not working correctly. How can I compare these Strings? public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException...

CouchDB - Why is my rereduce always coming back as false? I am not able to reduce anything properly

mapreduce,couchdb,couchdb-futon
I am new to CouchDB. I have a 9 GB dataset loaded into my CouchDB. I am able to map everything correctly, but I cannot reduce any of the results using the code written in the reduce column. When I tried logging, the log shows the rereduce value as false. Do...

How do we improve a MongoDB MapReduce function that takes too long to retrieve data and gives out of memory errors?

performance,mongodb,delphi,mapreduce
Retrieving data from mongo takes too long, even for small datasets. For bigger datasets we get out of memory errors of the javascript engine. We've tried several schema designs and several ways to retrieve data. How do we optimize mongoDB/mapReduce function/MongoWire to retrieve more data quicker? We're not very experienced...

Clarification regarding this map reduce word count example?

hadoop,mapreduce
I am studying MapReduce, and I have a question regarding the basic word count example. Say my text is "My name is X Y X". Here is the map class I am referring to: public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {...

Hadoop Job class not found

java,eclipse,hadoop,mapreduce,classpath
Hi, I'm having trouble and I haven't been able to get help yet from similar threads. I am doing an example of a Hadoop job and I'm just trying to run it from the IDE right now. Here is my source code: package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.io.*;...

Input Reader dictionary on mapreduce.yaml

google-app-engine,mapreduce,yaml
I've been trying to launch a specific mapreduce straight from the /mapreduce dashboard, but for that I need input reader parameters passed in as a dictionary, or I get "BadReaderParamsError: Input reader parameters should be a dictionary". The problem is the YAML validation won't let me add any kind of nested...

Map keys to emit function

javascript,object,mapreduce,nosql,couchdb
I've got an object in CouchDB which looks like it contains several arrays. I'm new to CouchDB and I don't know how to access its keys. The document looks like this: { "_id": "113232", "_rev": "1-c967a81c0eccba6a7c92e3c4b352d4eb", "name": "Ezequiel Campion", "vorlesungen": [ { "Ethik": 1.7 }, { "Glaube...

MongoDB: sum the same field across an array for each document - Map-Reduce

javascript,arrays,mongodb,mapreduce
I have documents like this in my db: { "first_name": "John", "last_name": "Bolt", "account": [ { "cardnumber": "4844615935257045", "cardtype": "jcb", "currency": "CZK", "balance": 4924.99 }, { "cardnumber": "3552058835710041", "cardtype": "jcb", "currency": "BRL", "balance": 9630.38 }, { "cardnumber": "5108757163721629", "cardtype": "visa-electron", "currency": "CNY", "balance": 6574.18 } } And my question is...
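
A minimal sketch, assuming the goal is the per-document sum of account.balance: the aggregation pipeline ($unwind plus $group) is a common alternative to map-reduce for this. Written with pymongo; the database and collection names are assumptions.

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["bank"]["customers"]  # assumed names

    pipeline = [
        {"$unwind": "$account"},  # one document per array element
        {"$group": {
            "_id": "$_id",
            "first_name": {"$first": "$first_name"},
            "total_balance": {"$sum": "$account.balance"},
        }},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc["first_name"], doc["total_balance"])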

NullPointerException in MapReduce Sorting Program

java,sorting,hadoop,mapreduce
I know that SortComparator is used to sort the map output by their keys. I have written a custom SortComparator to understand the MapReduce framework better. This is my WordCount class with a custom SortComparator class: package bananas; import java.io.FileWriter; import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text;...

Aggregating heterogeneous documents in MongoDB

mongodb,mapreduce,aggregation-framework
I am writing a tool to inspect a MongoDB session store, in which the structure is namespaces containing variables containing information on the variables, roughly: { "some namespace" : { "some variable" : { "t" : "serialization type", "b" : "data value", "c" : "java class type" }, "some other...

Why does YARN take a lot of memory for a simple count operation?

hadoop,mapreduce,hive,yarn,hortonworks-data-platform
I have a standard-configured HDP 2.2 environment with Hive, HBase and YARN. I've used Hive (with HBase) to perform a simple count operation on a table that has about 10 million rows, and it resulted in 10 GB of memory consumption from YARN. How can I reduce this memory...

Exception in mapreduce code which is accessing Hbase table java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString

mapreduce,hbase
Hi, I am getting the following exception when running the MapReduce program. The code has access to an HBase table and does a Put operation. Exception in thread "main" java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString ...

In Spark, does the filter function turn the data into tuples?

mapreduce,apache-spark,cloud
Just wondering, does filter turn the data into tuples? For example: val filesLines = sc.textFile("file.txt") val split_lines = filesLines.map(_.split(";")) val filteredData = split_lines.filter(x => x(4)=="Blue") //from here if we wanted to map the data would it be using tuple format ie. x._3 OR x(3) val blueRecords = filteredData.map(x =>...
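
A small PySpark sketch (the question's snippet is Scala) illustrating the point: filter only drops elements, it never changes their type, so after splitting each line the elements are still arrays/lists and are indexed the same way before and after the filter.

    from pyspark import SparkContext

    sc = SparkContext(appName="filter-sketch")
    files_lines = sc.parallelize(["a;b;c;d;Blue;f", "g;h;i;j;Red;l"])  # stand-in for sc.textFile("file.txt")
    split_lines = files_lines.map(lambda line: line.split(";"))  # each element is a list of strings
    filtered = split_lines.filter(lambda x: x[4] == "Blue")      # same element type as split_lines
    blue_records = filtered.map(lambda x: (x[0], x[3]))          # still positional indexing, not tuple fields
    print(blue_records.collect())
    sc.stop()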

hadoop large file does not split

performance,hadoop,split,mapreduce
I have an input file of size 136 MB. I launched a WordCount test and I observe only one mapper. Then I set dfs.blocksize to 64 MB in my hdfs-site.xml and I still get one mapper. Am I doing something wrong?

pcap to Avro on Hadoop

hadoop,mapreduce,pcap,avro
I need to know if there is any way I can convert a pcap file to Avro, so that I can write a MapReduce program on the Avro data using Hadoop. Otherwise, what is the best practice when dealing with pcap files on Hadoop? Thanks...

How to aggregate time series documents in mongodb

mongodb,mapreduce,time-series,mongodb-query,nosql-aggregation
I have a Mongo sharded cluster where I save data from a virtual machine monitoring system (Zabbix, etc.). Now I want to get some information from the db, for example the average memfree over the last 2 days for one VM. I read the tutorials about aggregation and also the...
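
A minimal sketch of one way to get such an average with the aggregation pipeline rather than map-reduce, written with pymongo. The collection and field names (vm, timestamp, memfree) are assumptions; adapt them to the actual schema.

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["monitoring"]["metrics"]  # assumed names

    since = datetime.utcnow() - timedelta(days=2)
    pipeline = [
        {"$match": {"vm": "vm-01", "timestamp": {"$gte": since}}},  # one VM, last 2 days
        {"$group": {"_id": "$vm", "avg_memfree": {"$avg": "$memfree"}}},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc)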

Filter array field in couchbase

android,view,filter,mapreduce,couchbase
I'm working on Couchbase Lite Android. I have a series of documents; every document contains a field whose value is an array of strings. Now I want to filter on the values of this field. { type : "customer", name: "customerX", states: [ "IL" , "IO" , "NY" , "CA" ] },...

Map reducing a collection of coordinate pairs

javascript,d3.js,mapreduce
I've been a developer for a couple years now, and one concept I don't seem to quite get is map reduce. I have a collection of coordinates that define squares, with each value being an array of two arrays. Each internal array is itself an array of two numeric values....

Processing HUGE number of small files independently

hadoop,amazon-web-services,amazon-ec2,mapreduce,elastic-map-reduce
The task is to process HUGE (around 10,000,000) number of small files (each around 1MB) independently (i.e. the result of processing file F1, is independent of the result of processing F2). Someone suggested Map-Reduce (on Amazon-EMR Hadoop) for my task. However, I have serious doubts about MR. The reason is...

Hadoop map reduce Extract specific columns from csv file in csv format

java,hadoop,file-io,mapreduce,bigdata
I am new to Hadoop and working on a big data project where I have to clean and filter a given CSV file. For example, if the given CSV file has 200 columns, then I need to select only 20 specific columns (so-called data filtering) as output for further operations. Also...

Running Scala Map Reduce code using a web app

scala,hadoop,web-applications,mapreduce
I have a specific requirement where I have to write some Hadoop MR code in Scala, then fire that code from a web app, and finally show the results in a web page. Is this possible? If yes, is there any framework which I can make use...

groupByKey not properly working in spark

scala,mapreduce,apache-spark
So, I have an RDD which has key-value pairs like the following. (Key1, Val1) (Key1, Val2) (Key1, Val3) (Key2, Val4) (Key2, Val5) After groupByKey, I expect to get something like this: Key1, (Val1, Val2, Val3) Key2, (Val4, Val5) However, I see that the same keys are being repeated even after doing groupByKey()....
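
For reference, a PySpark sketch (the question is Scala, but the behaviour is the same) of what groupByKey is expected to produce; if "identical" keys still end up in separate groups, they usually differ in some invisible way (trailing whitespace, different types), so normalising the keys before grouping is a common fix.

    from pyspark import SparkContext

    sc = SparkContext(appName="groupByKey-sketch")
    pairs = sc.parallelize([("Key1", "Val1"), ("Key1", "Val2"), ("Key1", "Val3"),
                            ("Key2", "Val4"), ("Key2", "Val5")])

    # normalise keys first in case they carry stray whitespace
    cleaned = pairs.map(lambda kv: (kv[0].strip(), kv[1]))

    grouped = cleaned.groupByKey().mapValues(list)
    print(grouped.collect())  # e.g. [('Key1', ['Val1', 'Val2', 'Val3']), ('Key2', ['Val4', 'Val5'])]
    sc.stop()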

SUM, AVG, in Pig are not working

hadoop,mapreduce,apache-pig
I am analyzing Cluster user log files with the following code in pig: t_data = load 'log_flies/*' using PigStorage(','); A = foreach t_data generate $0 as (jobid:int), $1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray), $7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as...

Word count program with two input files and single output file

java,hadoop,mapreduce,word-count
I am new to Hadoop. I have done the word count program with a single input file and a single output file. Now I want to take 2 files as input and write the output to a single file. I tried this: FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1])); FileOutputFormat.setOutputPath(conf, new Path(args[2])); This is...

MultipleOutputs to be of different FileOutputFormat

hadoop,mapreduce
I am trying to write multiple output files using MultipleOutputs. However, I want my FileOutputFormat to be of two different formats, i.e. Text and SequenceFile format, for different files. Is there any way I can achieve this?

How to compile MapReduce job source code on Hadoop 2.7.0?

java,hadoop,compiler-errors,mapreduce
I am running Hadoop 2.7.0 on Ubuntu 14.0.2, and I created wordcount.java with the nano text editor; the source code is copied from the Apache Hadoop 2.7.0 documentation. After I compile wordcount.java with the command javac -classpath hadoop-2.7.0-core.jar -d MyJava wordcount.java, here are the error messages I got: public class WordCount2...

RavenDB count index including zero values

c#,mapreduce,nosql,ravendb
I have a list of items {Id, Name, CategoryId} and a list of categories {Id, Name, IsActive}. How to get a list {CategoryId, Count} including categories that have zero items. Currently I have such index: public class CategoryCountIndex : AbstractIndexCreationTask<Item, CategoryCountIndex.Result> { public class Result { public string CategoryId {...

wrong value class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable

java,hadoop,mapreduce
I have used one mapper, one reducer and one combiner class, but I am getting the error below: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:199) at org.apache.hadoop.mapred.Task$CombineOutputCollector.collect(Task.java:1307) at org.apache.hadoop.mapred.Task$NewCombinerRunner$OutputConverter.write(Task.java:1623) at...

When to prefer Hadoop MapReduce over Spark?

hadoop,mapreduce,apache-spark
A very simple question: in which cases should I prefer Hadoop MapReduce over Spark? (I hope this question has not been asked yet - at least I didn't find it...) I am currently doing a comparison of those two processing frameworks, and from what I have read so far, everybody seems...

Customized function in $group operator of aggregation framework in MongoDB

javascript,mongodb,mapreduce,aggregate-functions
I want to use a customized $sum (let's call it $boolSum) function that returns the number of true elements in an array. e.g. group :{ _id: { a : '$a' b : '$b', c : '$c' }, d1: { $boolSum : '$d1'} d2: { $boolSum : '$d2'} } But it...
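
A minimal sketch of how a similar effect is often obtained without a custom operator, assuming d1 and d2 are boolean fields on each document: $sum over a $cond that maps true to 1 and anything else to 0. Written with pymongo; the database and collection names are assumptions.

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycoll"]  # assumed names

    pipeline = [
        {"$group": {
            "_id": {"a": "$a", "b": "$b", "c": "$c"},
            "d1": {"$sum": {"$cond": [{"$eq": ["$d1", True]}, 1, 0]}},  # counts documents where d1 is true
            "d2": {"$sum": {"$cond": [{"$eq": ["$d2", True]}, 1, 0]}},
        }},
    ]
    for doc in coll.aggregate(pipeline):
        print(doc)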

How to cleaning hadoop mapreduce memory usage?

hadoop,memory,mapreduce,jobs,yarn
I want to ask: say for example I have 10 MB of memory on each node after I run the start-all.sh process, so the namenode, datanode, secondary namenode, etc. are running. But after I've finished the Hadoop MapReduce job, why does the memory decrease to, for example, 5 MB?...

How can get memory and CPU usage of hadoop yarn application?

hadoop,memory,mapreduce,cpu-usage,yarn
I want to ask: after I've run my Hadoop MapReduce application, how can I get the total memory and CPU usage of that application? I've looked at the logs and the Resource Manager web page but I couldn't find it. Is it possible? Can I get it per job execution or...

Spark on yarn jar upload problems

java,hadoop,mapreduce,apache-spark
I am trying to run a simple Map/Reduce java program using spark over yarn (Cloudera Hadoop 5.2 on CentOS). I have tried this 2 different ways. The first way is the following: YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/; /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --jars /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar simplemr.jar This method gives the following error: diagnostics: Application application_1434177111261_0007...

Do values come into a Cloudant reducer in key order?

mapreduce,cloudant,reducers
I'm writing map/reduce code for a database on Cloudant. Do the values come in to the reduce(keys, values, rereduce) function in key order when rereduce=false? I assume they would because that's how I am accustomed to things working in Hadoop, but I can't find anything in the Cloudant documentation that...

passing variable from static inner class to top class

java,mapreduce
So I want to set the value of my top class's variable, foo, from within my static nested class. My end goal here is to figure out how to pass an argument from the Map method to the Reduce method in a MapReduce program I am writing. I simplified the...

Use of core-site.xml in mapreduce program

hadoop,mapreduce,bigdata
I have seen MapReduce programs using/adding core-site.xml as a resource in the program. What is core-site.xml, and how can it be used in MapReduce programs?

Location of hdfs files in pseudodistributed single node cluster?

java,hadoop,mapreduce,bigdata
I have hadoop installed on a single node, in a pseudodistributed mode. The dfs.replication value is 1. Where are the files in the hdfs stored by default? The version of hadoop I am using is 2.5.1.

Why does MapReduce bother mapping every value to 1 in the map step?

hadoop,mapreduce
I'm trying to figure out MapReduce and so far I think I'm gaining an okay understanding. However, one thing confuses me. In every example and explanation of MapReduce I can find, the map step maps all values to 1. For instance, in the most common example (counting occurrences of words...
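
A toy Python simulation (not Hadoop itself) of the reasoning: each emitted 1 just means "one occurrence seen here"; the shuffle groups equal words together, and the reduce step sums those 1s into the total count.

    from itertools import groupby

    text = "FindMe the quick FindMe fox FindMe"

    # map: every word becomes (word, 1)
    mapped = [(word, 1) for word in text.split()]

    # shuffle/sort: bring identical keys together
    mapped.sort(key=lambda kv: kv[0])

    # reduce: sum the ones per word
    for word, pairs in groupby(mapped, key=lambda kv: kv[0]):
        print(word, sum(count for _, count in pairs))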

Scalding convert one row into multiple

mapreduce,scalding
So, I have a scalding pipe that contains entries of the form (String, Map[String, Int]). I need to convert each instance of this row into multiple rows. That is, if I had ( "Type A", ["a1" -> 2, "a2" ->2, "a3" -> 3] ) I need as output 3 rows...

I'm getting NoSuchMethodException mapmethod required while compiling my mapreduce code

mapreduce
I tried to find the top N words in my input text file, but I am unable to compile the code; I'm getting a run-time exception () not found in mapper. Please help me with this; I am very new to Hadoop and trying to gain expertise in this field. Any...

MapReduce job not working in HADOOP-2.6.0

hadoop,mapreduce
I am trying to run the wordcount example. Here is the code: import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public...

Input of the reduce phase is not what I expect in Hadoop (Java)

java,hadoop,mapreduce,reduce,emit
I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents an edge - in fact, this is a triangle graph): 1 3 3 1 3 2 2 3 Now, I want to use MapReduce to...

How to get a “fieldcount” (like wordcount) on CouchDB/Cloudant?

javascript,mapreduce,couchdb,word-count,cloudant
Trying to get a count of fields, just like the classic word count example. I thought this would be trivial... but I got this useless result... {"rows":[ {"key":null,"value":212785214} ]} How can I get what I wanted... an inventory of all fields used in my documents, with a count of how...

Incorrect response to mapReduce query in mongo-db

mongodb,mapreduce
I have 1000 user records in a collection, of which 459 documents have gender male and the remaining are female //document structure > db.user_details.find().pretty() { "_id" : ObjectId("557e610d626754910f0974a4"), "id" : 0, "name" : "Leanne Flinn", "email" : "[email protected]", "work" : "Unilogic", "dob" : "Fri Jun 11 1965 20:50:58 GMT+0530 (IST)", "age" :...

Storing multiple Strings in the Value field of a Map

java,dictionary,mapreduce
In one of my banking projects I have a RecordFile which contains records in the format: CustomerNumber,AccountNumber,FirstName,LastName, some other fields... In some transactional records, which are present in a different file altogether, either CustomerNumber or AccountNumber or (rarely) both get populated. The purpose of the mapreduce...

Hadoop job fails, Resource Manager doesn't recognize AttemptID

hadoop,mapreduce,oozie
I'm trying to aggregate some data in an Oozie workflow. However, the aggregation step fails. I found two points of interest in the logs: The first is an error(?) that seems to occur repeatedly: after a container finishes, it gets killed but exits with non-zero Exit code 143. It finishes:...

What should be the size of the file in HDFS for best MapReduce job performance

hadoop,mapreduce,filesystems,hdfs
I want to copy text files from external sources to HDFS. Let's assume that I can combine and split the files based on their size; what should be the size of the text files for the best custom MapReduce job performance? Does size matter?

How to use spark for map-reduce flow to select N columns, top M rows of all csv files under a folder?

hadoop,mapreduce,apache-spark,spark-streaming,pyspark
To be concrete, say we have a folder with 10k tab-delimited csv files with the following attribute format (each csv file is about 10 GB): id name address city... 1 Matt add1 LA... 2 Will add2 LA... 3 Lucy add3 SF... ... And we have a lookup table based on "name"...
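
A minimal PySpark sketch of the flow under stated assumptions: the path, the chosen column positions, and the interpretation of "top M" as "the first M rows" are all placeholders, and any header lines are filtered out first.

    from pyspark import SparkContext

    sc = SparkContext(appName="select-columns-sketch")

    COLUMNS = [1, 3]   # e.g. keep name and city (0-based positions; an assumption)
    M = 100            # number of rows to take

    rows = (sc.textFile("hdfs:///data/folder/*.csv")             # assumed path; files are tab-delimited
              .filter(lambda line: not line.startswith("id\t"))  # drop header lines, if present
              .map(lambda line: line.split("\t"))
              .map(lambda fields: [fields[i] for i in COLUMNS]))

    for r in rows.take(M):  # use takeOrdered(M, key=...) if "top" means sorted by some column
        print(r)
    sc.stop()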

Grouping over multiple fields in MongoDb

mongodb,mapreduce,grouping,aggregation
How would I go about grouping over multiple fields? I need to get a unique count for case insensitive true over multiple independent documents. I've looked at both map/reduce and aggregation and I don't quite know what would be the best approach. Lets say I have the following data in...

Order by and Join in SQL or spark or mapreduce

sql,join,mapreduce,apache-spark,phoenix
I have two tables whose content is as below. Table 1: ID1 ID2 ID3 ID4 NAME DESCR STATUS date 1 -12134 17773 8001300701101 name1 descr1 INACTIVE 20121203 2 -12136 17773 8001300701101 name1 descr1 INACTIVE 20121202 3 -12138 17785 9100000161822 name3 descr3 INACTIVE 20121201 4 -12140 17785 9100000161822 name3 descr3 ACTIVE...

out of memory error when reading csv file in chunk

python,csv,mapreduce,out-of-memory,chunks
I am processing a csv-file which is 2.5 GB big. The 2.5 GB table looks like this: columns=[ka,kb_1,kb_2,timeofEvent,timeInterval] 0:'3M' '2345' '2345' '2014-10-5',3000 1:'3M' '2958' '2152' '2015-3-22',5000 2:'GE' '2183' '2183' '2012-12-31',515 3:'3M' '2958' '2958' '2015-3-10',395 4:'GE' '2183' '2285' '2015-4-19',1925 5:'GE' '2598' '2598' '2015-3-17',1915 And I want to groupby ka and kb_1...
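
A minimal pandas sketch of one way to avoid the out-of-memory error: aggregate each chunk and then combine the partial results, so the whole 2.5 GB table never has to be in memory at once. The file name, chunk size, and the choice of summing timeInterval are assumptions for illustration.

    import pandas as pd

    chunks = pd.read_csv("big_table.csv", chunksize=10**6)  # assumed path; header row assumed present

    partials = []
    for chunk in chunks:
        # partial aggregate per chunk (example: sum of timeInterval per (ka, kb_1))
        partials.append(chunk.groupby(["ka", "kb_1"])["timeInterval"].sum())

    # combine the partial sums into the final result
    result = pd.concat(partials).groupby(level=[0, 1]).sum()
    print(result.head())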

How to configure java memory heap space for hadoop mapreduce?

java,hadoop,mapreduce,heap,shuffle
I've tried to run a MapReduce job on about 20 GB of data, and I got an error in the reduce shuffle phase saying it was caused by memory heap space. Then I read in many sources that I have to decrease the mapreduce.reduce.shuffle.input.buffer.percent property in mapred-site.xml from the default value 0.7....

MapReduce (Hadoop-2.6.0)+ HBase-1.0.1.1 class not found exception

eclipse,hadoop,mapreduce,hbase
I have written a Map-Reduce program to fetch data from an input file and output it to an HBase table, but I am not able to execute it. I am getting the following error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration at beginners.VisitorSort.main(VisitorSort.java:123) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at...

How to bring data from external sources (mainly Restful) to HDFS?

java,rest,hadoop,mapreduce,oozie
I guess this is more of a design-related question. I am a Java developer and new to the Hadoop big data world; I am learning Hadoop in my Hortonworks HDP Sandbox (a single-node pseudo cluster provided as a VM by Hortonworks). I have designed a Java RESTful API that interacts with...

MapReduce - reducer emits the output in one line

java,hadoop,mapreduce
I have a simple MapReduce job which is supposed to read a dictionary from a text file and then process another huge file line by line and compute the inverse document matrix. The output is supposed to look like this: word-id1 docX:tfX docY:tfY word-id2 docX:tfX docY:tfY etc... However, the output...

HADOOP - Problems copying text files into HDFS

hadoop,mapreduce,hdfs,word-count
I am implementing a Hadoop single-node cluster following the well-known Michael Noll tutorial. The cluster is working; checking with jps shows that all components are running after execution of start-all.sh. I face a problem reproducing the wordcount example using some downloaded texts. I downloaded the files into /tmp/gutenberg and checked that they are...

How the value is set in the default LineRecordReader in Hadoop

hadoop,mapreduce,mapper
Once the JobTracker gets the splits via the getSplits() function of the InputFormat class, it assigns map tasks based on the storage location of the split. The map task calls the createRecordReader() method of the InputFormat class, which in turn uses the LineRecordReader class. The initialize function gets the start and end positions, and nextKeyValue() sets the key and value....

“Java Heap space Out Of Memory Error” while running a mapreduce program

java,hadoop,mapreduce
I'm facing an Out Of Memory error while running a MapReduce program. If I keep 260 files in one folder and give it as input to the MapReduce program, it shows a Java heap space Out of Memory error. If I give only 100 files as input, it runs fine. Then how can I...

Hadoop streaming with Python: splitting input files manually

hadoop,mapreduce,hadoop-streaming
I am new to Hadoop and am trying to use its streaming feature with a mapper and reducer written in Python. The problem is that my original input file will contain sequences of lines which are to be identified by a mapper. If I let Hadoop split the input file, it might...

Pig in local mode on a large file

mapreduce,apache-pig,bigdata
I am running Pig in local mode on a large 54 GB file. I observe it spawning a lot of map tasks sequentially. What I am expecting is that maybe each map task reads 64 MB worth of lines. So if I want to optimize this and maybe read...

In MapReduce, how to send an ArrayList as a value from mapper to reducer [duplicate]

java,hadoop,arraylist,mapreduce
This question already has an answer here: Output a list from a Hadoop Map Reduce job using custom writable 1 answer How can we pass an ArrayList as a value from the mapper to the reducer? My code basically has certain rules to work with and would create new values (String)...

Spark RDD repeated reduce operations yielding inconsistent results

scala,mapreduce,apache-spark,reduce,rdd
Consider the following code in Spark that should return the sum of the sqrt's of a sequence of integers: // Create an RDD of a sequence of integers val data = sc.parallelize(Range(0,100)) // Transform RDD to sequence of Doubles val x = data.map(_.toDouble) // Reduce the sequence as the sum...
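
A PySpark counterpart (the question's code is Scala) showing the shape of a deterministic version: reduce() must be given a commutative, associative function, and plain addition of the mapped square roots satisfies that, so repeated runs agree up to floating-point rounding.

    import math
    from pyspark import SparkContext

    sc = SparkContext(appName="sqrt-sum-sketch")
    data = sc.parallelize(range(100))

    total = data.map(math.sqrt).reduce(lambda a, b: a + b)  # associative, commutative combine
    print(total)
    sc.stop()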

When to use map reduce over Aggregation Pipeline in MongoDB?

mongodb,mapreduce,aggregation-framework
While looking at documentation for map-reduce, I found that: NOTE: For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline. I did not understand much from it. What are the use...

Why we are configuring mapred.job.tracker in YARN?

hadoop,mapreduce,yarn
What I know is that YARN was introduced and it replaced the JobTracker and TaskTracker. I have seen some Hadoop 2.6.0/2.7.0 installation tutorials, and they configure mapreduce.framework.name as yarn and the mapred.job.tracker property as local or host:port. The description of the mapred.job.tracker property is "The host and port that the MapReduce job...

Riak mapReduce fails with > 15 records

mapreduce,erlang,riak
Problem: I've been learning Riak and ran into an issue with mapReduce. My mapReduce functions work fine when there are 15 records, but after that it throws a stack trace error. I'm new to Riak and Erlang, so I'm unsure whether it's my code or Riak. Any advice on how...

Hadoop - word count per node

java,hadoop,mapreduce,word-count
I am implementing a customized version of WordCount.java in Hadoop where I am interested in outputting the word counts per node. For example, given text: FindMe FindMe ..... .... .... .. more big text ... FindMe FindMe FindMe FindMe node01: 2 FindMe node02: 3 Here is a snippet from my...

How do I do a basic indexed sum in Cloudant map/reduce?

mapreduce,cloudant
I have a Cloudant database containing the following documents {"test": 1, "value": 10} {"test": 1, "value": 20} {"test": 2, "value": 100} {"test": 2, "value": 200} I want to create a view that takes the sum of all the values for a given test. So the view would contain {"rows":[ {"key":1,"value":...
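
A minimal sketch of such a view and the query against it, driven from Python over Cloudant's CouchDB-style HTTP API: the map emits (test, value) and the reduce is the built-in _sum, so querying with group=true yields one summed row per test. The account, database, design-document and credential names are placeholders.

    import requests

    BASE = "https://ACCOUNT.cloudant.com/testdb"   # placeholder account/database
    AUTH = ("user", "password")                   # placeholder credentials

    design_doc = {
        "views": {
            "sum_by_test": {
                "map": "function(doc) { emit(doc.test, doc.value); }",
                "reduce": "_sum"
            }
        }
    }
    requests.put(BASE + "/_design/sums", json=design_doc, auth=AUTH)

    # group=true sums per distinct key: {"key": 1, "value": 30}, {"key": 2, "value": 300}
    resp = requests.get(BASE + "/_design/sums/_view/sum_by_test",
                        params={"group": "true"}, auth=AUTH)
    print(resp.json())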

Does record splitting need to generate unique keys for each record in hadoop?

java,hadoop,mapreduce
I am relatively new to the hadoop world. I have been following examples I could find to understand how the record splitting step works for mapreduce jobs. I noticed that TextInputFormat splits file into records with key as the byte offset and value as a string. In this case, we...

Copy Hbase table to another with different queue for map reduce

hadoop,mapreduce,hbase
I run the CopyTable action on HBase: hbase -Dhbase.client.scanner.caching=100000 -Dmapred.map.tasks.speculative.execution=false org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=desc src but the map reduce job is spawned on the default queue. How do I run this task on a different application queue?...

How to abort App Engine pipelines gracefully?

python,google-app-engine,mapreduce,appengine-pipeline
Problem I have a chain of pipelines: class PipelineA(base_handler.PipelineBase): def run(self, *args): # do something class PipelineB(base_handler.PipelineBase): def run(self, *args): # do something class EntryPipeline(base_handler.PipelineBase): def run(self): if some_condition(): self.abort("Condition failed. Pipeline aborted!") yield PipelineA() mr_output = yield mapreduce_pipeline.MapreducePipeline( # mapreduce configs here # ... ) yield PipelineB(mr_output) p =...

Which logic should be followed using a custom partitioner in map reduce to solve this

java,hadoop,mapreduce,load-balancing,hadoop-partitioning
If, in a file, the key distribution is such that 99% of the words start with 'A' and 1% start with 'B' to 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?

Yarn and MapReduce resource configuration

hadoop,mapreduce,yarn
I currently have a pseudo-distributed Hadoop system running. The machine has 8 cores (16 virtual cores) and 32 GB of RAM. My input files are between a few MB and ~68 MB (gzipped log files, which get uploaded to my server once they reach >60 MB, hence no fixed max size). I want...

How to tune Spark application with hadoop custom input format

hadoop,mapreduce,apache-spark
My Spark application processes files (average size 20 MB) with a custom Hadoop input format and stores the result in HDFS. The following is the code snippet. Configuration conf = new Configuration(); JavaPairRDD<Text, Text> baseRDD = ctx .newAPIHadoopFile(input, CustomInputFormat.class,Text.class, Text.class, conf); JavaRDD<myClass> mapPartitionsRDD = baseRDD .mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {...

Hadoop append data to hdfs file and ignore duplicate entries

java,hadoop,mapreduce,hive,hdfs
How can I append data to HDFS files and ignore duplicate values? I have a huge HDFS file (MainFile) and I have 2 other new files from different sources, and I want to append data from these files to the MainFile. The main file and the other files have the same structure....

What is the equivalent of BlobstoreLineInputReader for targeting Google Cloud Storage?

python,google-app-engine,mapreduce,pipeline
This is a python appengine question, mapreduce library 1.9.21 . I have code writing lines to a blob in the local blobstore, then processing that using mapreduce BlobstoreLineInputReader. Given that the files api is going away, I thought I'd retarget all my processing to cloud storage. I would expect to...

Hadoop - find out the resource utilization of every node and distribute load equally in a cluster

hadoop,mapreduce,cluster-computing,resource-utilization
I want to find out the resource utilization (CPU, RAM) and the data processing taking place at every node in the Hadoop cluster. Is there any way, using MapReduce or HDFS commands, to find out the load distributed across each node? Also, if one node is busy (overloaded) and another...

Trying to read HDFS output in Hadoop

hadoop,mapreduce
This is my program. I want to read from my HDFS output, which I created using a MapReduce program, but it does not display any output. There are no compile-time or run-time errors. import java.io.BufferedReader; import java.io.InputStreamReader; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; public class Cat{ public...

Sorting records by reddit algorithm using mongodb

c#,algorithm,mongodb,mapreduce,reddit
I'm trying to implement the reddit algorithm as a sorting option in my app but I'm constantly hitting walls all over the place. I started my implementation using this (Sorting mongodb by reddit ranking algorithm) post as a guide line. I tried to convert it to c#; below is my...

Facing error while starting map job

python,google-app-engine,mapreduce
While starting a map job I am getting this error. ERROR 2015-05-11 06:03:45,719 webapp2.py:1528] __init__() got an unexpected keyword argument '_user_agent' Traceback (most recent call last): File "/home/rshah/google_appengine/lib/webapp2-2.3/webapp2.py", line 1511, in __call__ rv = self.handle_exception(request, response, e) File "/home/rshah/google_appengine/lib/webapp2-2.3/webapp2.py", line 1505, in __call__ rv = self.router.dispatch(request, response) File "/home/rshah/google_appengine/lib/webapp2-2.3/webapp2.py", line...

Reducers for Hive data

mapreduce,hive
I'm a novice. I'm curious to know how reducers are set for different Hive data sets. Is it based on the size of the data processed, or is there a default number of reducers for all? For example, how many reducers does 5 GB of data require? Will the same number of reducers be set...

reduce function in hadoop doesn't work

java,hadoop,mapreduce,word-count
I am learning Hadoop. I wrote a simple program in Java. The program has to count words (and create a file with the words and the number of times each word appears), but the program only creates a file with all the words and the number "1" next to every word. It looks like: rmd 1 rmd 1 rmd...

Spark: Group RDD by id

sql,hadoop,mapreduce,apache-spark,rdd
I have 2 RDDs. In Spark Scala, how do I join event1001RDD and event2009RDD if they have the same id? val event1001RDD: schemaRDD = [eventtype,id,location,date1] [1001,4929102,LOC01,2015-01-20 10:44:39] [1001,4929103,LOC02,2015-01-20 10:44:39] [1001,4929104,LOC03,2015-01-20 10:44:39] val event2009RDD: schemaRDD = [eventtype,id,date1,date2] [2009,4929101,2015-01-20 20:44:39,2015-01-20 20:44:39] [2009,4929102,2015-01-20 15:44:39,2015-01-20 21:44:39] [2009,4929103,2015-01-20 14:44:39,2015-01-20 14:44:39] [2009,4929105,2015-01-20...
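
A small PySpark sketch (the question is Scala/SchemaRDD) of the usual pattern: key each RDD by the id field, then call join, which pairs up the rows that share an id. The sample rows below are copied from the excerpt and truncated.

    from pyspark import SparkContext

    sc = SparkContext(appName="join-by-id-sketch")

    event1001 = sc.parallelize([
        ["1001", "4929102", "LOC01", "2015-01-20 10:44:39"],
        ["1001", "4929103", "LOC02", "2015-01-20 10:44:39"],
    ])
    event2009 = sc.parallelize([
        ["2009", "4929102", "2015-01-20 15:44:39", "2015-01-20 21:44:39"],
        ["2009", "4929103", "2015-01-20 14:44:39", "2015-01-20 14:44:39"],
    ])

    # key both RDDs by the id field (position 1), then join on it
    joined = (event1001.map(lambda row: (row[1], row))
                       .join(event2009.map(lambda row: (row[1], row))))  # inner join by id
    print(joined.collect())
    sc.stop()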

is it more efficient to use unions rather than joins in apache spark, or does it not matter?

python,performance,join,mapreduce,apache-spark
Recently I was running a job on an apache spark cluster and I was going to do an inner join on two rdds. However I then thought that for this calculation I could avoid a join by using union, reduceByKey and filter instead. But is this basically what join is...

akka: pattern for combining messages from multiple children

scala,concurrency,mapreduce,akka
Here's the pattern I have come across: An actor A has multiple children C1, ..., Cn. On receiving a message, A sends it to each of its children, which each do some calculation on the message, and on completion send it back to A. A would then like to combine...