Hadoop vs Mahout and Machine learning Issue?


Tag: hadoop,machine-learning,mahout

I started researching data science and machine learning development with Mahout, and then I came across Hadoop. The two left me confused:

  1. What is the relationship between Hadoop and Mahout?
  2. For data science and machine learning, which is the best one to start with?


Hadoop is a framework for distributed storage and distributed processing of large data sets. It has a distributed storage layer called the Hadoop Distributed File System (HDFS) and a distributed processing layer called MapReduce. Hadoop is designed to run on commodity hardware and is written in Java.
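To make the MapReduce idea concrete, here is a minimal sketch in plain Python that imitates the map, shuffle, and reduce steps for a word count on a single machine. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and on a real cluster the framework would run the map and reduce steps in parallel over HDFS blocks.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # In Hadoop each input split would be mapped on a different node;
    # here the "splits" are just list elements processed in a loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Mahout uses Hadoop"]))
```

The point of the split into three functions is that the mapper and reducer only ever see one record or one key at a time, which is what lets the framework distribute them across machines.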

Mahout is a member of the Hadoop ecosystem that contains implementations of various machine learning algorithms. Mahout uses Hadoop's parallel processing capability, so the end user can apply these algorithms to large data sets without much complexity. Users can either reuse the algorithms directly or customize them, without having to worry much about the complexities of the MapReduce implementation of each algorithm.
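As a rough illustration of what "a MapReduce implementation of an algorithm" means, the sketch below expresses one k-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). This is plain Python written for this answer, not Mahout's code or API; all function names are hypothetical, and the handling of empty clusters is an assumption.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Mapper: emit (cluster index, point) for one data point."""
    return nearest(point, centroids), point


def kmeans_reduce(points):
    """Reducer: average all points assigned to one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce pass and return the updated centroids."""
    grouped = defaultdict(list)
    for point in points:
        index, value = kmeans_map(point, centroids)
        grouped[index].append(value)
    # Assumption: a cluster that received no points keeps its previous centroid.
    return [
        kmeans_reduce(grouped[i]) if i in grouped else tuple(centroids[i])
        for i in range(len(centroids))
    ]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans_iteration(data, [(1.0, 1.0), (9.0, 9.0)]))
```

A library like Mahout hides this decomposition, plus the job setup and iteration control, so you mostly supply the input data and parameters.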

For data science and machine learning, you should first learn how the algorithms work and how they are used. Then you can concentrate on Mahout. Since Mahout jobs in distributed mode are MapReduce jobs, you should also learn Hadoop fundamentals and MapReduce programming.


Use of core-site.xml in mapreduce program

I have seen mapreduce programs using/adding core-site.xml as a resource in the program. What is or how can core-site.xml be used in mapreduce programs ?

SQL Server 2012 & Polybase - 'Hadoop Connectivity' configuration option missing

As described in the title, I am using SQL Server 2012 Parallel Data Warehouse with Polybase feature to try to access an HDInsight Hadoop cluster. As a starting point for every connection to Hadoop from SQL Server, I find to execute the command sp_configure @configname = 'hadoop connectivity', @configvalue =...

Why can't I calculate CostFunction J

This is my implementation of CostFunctionJ: function J = CostFunctionJ(X,y,theta) m = size(X,1); predictions = X*theta; sqrErrors =(predictions - y).^2; J = 1/(2*m)* sum(sqrErrors); But when I try to enter the command in MATLAB as: >> X = [1 1; 1 2; 1 3]; >> y = [1; 2; 3];...

Merging two columns into a single column and formatting the content to form an accurate date-time format in Hive?

these are the 2 columns(month,year). I want to create a single column out of them having an accurate date-time format('YYYY-MM-DD HH:MM:SS') and add as new column in the table. Month year 12/ 3 2013 at 8:40pm 12/ 3 2013 at 8:39pm 12/ 3 2013 at 8:39pm 12/ 3 2013 at...

Hadoop append data to hdfs file and ignore duplicate entries

How can I append data to HDFS files and ignore duplicate values? I have a huge HDFS file (MainFile) and I have 2 other new files from different sources and I want to append data from this files to the MainFile. Main File and the other files has same structure....

Save flume output to hive table with Hive Sink

I am trying to configure flume with Hive to save flume output to hive table with Hive Sink type. I have single node cluster. I use mapr hadoop distribution. Here is my flume.conf agent1.sources = source1 agent1.channels = channel1 agent1.sinks = sink1 agent1.sources.source1.type = exec agent1.sources.source1.command = cat /home/andrey/flume_test.data agent1.sinks.sink1.type...

Why is there only one hidden layer in a neural network?

I recently made my first neural network simulation which also uses a genetic evolution algorithm. It's simple software that just simulates simple organisms collecting food, and they evolve, as one would expect, from organisms with random and sporadic movements into organisms with controlled, food-seeking movements. Since this kind of organism...

How to insert and Update simultaneously to PostgreSQL with sqoop command

I am trying to insert into postgreSQL DB with sqoop command. sqoop export --connect jdbc:postgresql:// --table table1 --username user1 --password pass1--export-dir /hivetables/table/ --fields-terminated-by '|' --lines-terminated-by '\n' -- --schema schema It is working fine if there is not primary key constrain. I want to insert new records and update old records...

Vertica: Input record 1 has been rejected (Too few columns found)

I am trying to copy file from Hadoop to a Vertica table and get the an error. The problem is same copy sometimes pass and some times fails,any idea? The Error: Caused by: java.sql.SQLException: [Vertica]VJDBC ERROR: COPY: Input record 1 has been rejected (Too few columns found) at com.vertica.util.ServerErrorData.buildException(Unknown Source)...

Apache Spark: Error while starting PySpark

On a Centos machine, Python v2.6.6 and Apache Spark v1.2.1 Getting the following error when trying to run ./pyspark Seems some issue with python but not able to figure out 15/06/18 08:11:16 INFO spark.SparkContext: Successfully stopped SparkContext Traceback (most recent call last): File "/usr/lib/spark_1.2.1/spark-1.2.1-bin-hadoop2.4/python/pyspark/shell.py", line 45, in <module> sc =...

Which Spark MLlib algorithm to use?

I'm newbie to machine learning and would like to understand what algorithm (Classification algorithm or co-relation algorithm?) to use in order to understand what is the relationship between one or more attributes. for example consider I have following set of attributes, Bill No, Bill Amount, Tip amount, Waiter Name and...

Sqoop Export with Missing Data

I am trying to use Sqoop to export data from HDFS into Postgresql. However, I receive an error partially through the export that it can't parse the input. I manually went into the file I was exporting and saw that this row had two columns missing. I have tried a...

hadoop complains about attempting to overwrite nonempty destination directory

I'm following Rasesh Mori's instructions to install Hadoop on a multinode cluster, and have gotten to the point where jps shows the various nodes are up and running. I can copy files into hdfs; I did so with $HADOOP_HOME/bin/hdfs dfs -put ~/in /in and then tried to run the wordcount...

Which classifiers provide weight vector?

What machine learning classifiers exists which provide after the learning phase a weight vector? I know about SVM, logistic regression, perceptron and LDA. Are there more? My goal is to use these weight vector to draw an importance map....

Hive external table not reading entirety of string from CSV source

Relatively new to the Hadoop world so apologies if this is a no-brainer but I haven't found anything on this on SO or elsewhere. In short, I have an external table created in Hive that reads data from a folder of CSV files in HDFS. The issue is that while...

hadoop large file does not split

I have an input file of size 136MB. I launched a WordCount test and observed only one mapper. Then I set dfs.blocksize to 64MB in my hdfs-site.xml and I still get one mapper. Am I doing something wrong?

Input of the reduce phase is not what I expect in Hadoop (Java)

I'm working on a very simple graph analysis tool in Hadoop using MapReduce. I have a graph that looks like the following (each row represents and edge - in fact, this is a triangle graph): 1 3 3 1 3 2 2 3 Now, I want to use MapReduce to...

How to run a Hadoop application automatically?

I know that a MapReduce program can be run once from the command line with "hadoop jar *.jar". But now the program needs to run once every hour in the background. Is there any way to have the MR program submitted to Hadoop automatically every hour?...

JMH Benchmark on Hadoop YARN

I have written a JMH benchmark for my MapReduce job. If I run my app in local mode, it works, but when I run it with the yarn script on my hadoop cluster, then I get the following error: [[email protected] Desktop]$ ./launch_mapreduce.sh # JMH 1.10 (released 5 days ago) #...

Best way to store relational data in hdfs

I've been reading a lot on hadoop lately and I can say that I understand the general concept of it, but there is still (at least)one piece of the puzzle that I can't get my head around. What is the best way to store relationnal data in hdfs. First of...

Using Python to find correlation pairs

NAME PRICE SALES VIEWS AVG_RATING VOTES COMMENTS Module 1 $12.00 69 12048 5 3 26 Module 2 $24.99 12 52858 5 1 14 Module 3 $10.00 1 1381 -1 0 0 Module 4 $22.99 46 57841 5 8 24 ................. So, Let's say I have statistics of sales. I...

Create an external Hive table from an existing external table

I have a set of CSV files in a HDFS path and I created an external Hive table, let's say table_A, from these files. Since some of the entries are redundant, I tried creating another Hive table based on table_A, say table_B, which has distinct records. I was able to...

Datanode and Nodemanager on slave machine are not able to connect to NameNode and ResourceManager on master machine

I have installed Hadoop on a two-node cluster: Node1 and Node2. Node1 is the master and Node2 is the slave. Node2's DataNode and NodeManager are not able to connect to the NameNode and ResourceManager on Node1, respectively. However, Node1's DataNode and NodeManager are able to connect to the NameNode and ResourceManager on Node1. Node1: jps...

Why we are configuring mapred.job.tracker in YARN?

What I know is YARN is introduced and it replaced JobTracker and TaskTracker. I have seen is some Hadoop 2.6.0/2.7.0 installation tutorials and they are configuring mapreduce.framework.name as yarn and mapred.job.tracker property as local or host:port. The description for mapred.job.tracker property is "The host and port that the MapReduce job...

From Hadoop logs how can I find intermediate output byte sizes & reduce output bytes sizes?

From hadoop logs, How can I estimate the size of total intermediate outputs of Mappers(in Bytes) and the size of total outputs of Reducers(in Bytes)? My mappers and reducers use LZO compression, and I want to know the size of mapper/reducer outputs after compression. 15/06/06 17:19:15 INFO mapred.JobClient: map 100%...

Does Andrew Ng's ANN from Coursera use SGD or batch learning?

What type of learning is Andrew Ng using in his neural network excercise on Coursera? Is it stochastic gradient descent or batch learning? I'm a little confused right now......

Flink error - org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

I am trying to run a flink job using a file from HDFS. I have created a dataset as following - DataSource<Tuple2<LongWritable, Text>> visits = env.readHadoopFile(new TextInputFormat(), LongWritable.class,Text.class, Config.pathToVisits()); I am using flink's latest version - 0.9.0-milestone-1-hadoop1 (I have also tried with 0.9.0-milestone-1) whereas my Hadoop version is 2.6.0 But,...

Oozie on YARN - oozie is not allowed to impersonate hadoop

I'm trying to use Oozie from Java to start a job on a Hadoop cluster. I have very limited experience with Oozie on Hadoop 1 and now I'm struggling trying out the same thing on YARN. I'm given a machine that doesn't belong to the cluster, so when I try...

Different ways of hadoop installation

I'm new to hadoop and trying to install it on my local machine. I see that there are many ways in installing hadoop like install vmware Horton works and install hadoop on top of that or install Oracle virtual box , Cloudera and then Hadoop . My question is that...

Spark on yarn jar upload problems

I am trying to run a simple Map/Reduce java program using spark over yarn (Cloudera Hadoop 5.2 on CentOS). I have tried this 2 different ways. The first way is the following: YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/; /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class MRContainer --master yarn-cluster --jars /var/tmp/spark/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar simplemr.jar This method gives the following error: diagnostics: Application application_1434177111261_0007...

Is it Item based or content based Collaborative filtering?

I am currently working on an existing system that recommends items that are similar to previous items that the user has liked. It uses Alternating least squares Collaborative Filtering to find feature vectors of users and items. It then uses the feature vectors of the items and uses the cosine...

Extract Patterns from the device log data

I am working on a project, in which we have to extract the patterns(User behavior) from the device log data. Device log contains different device actions with a timestamp like when the devices was switched on or when they was switched off. For example: When a person enters a room....

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanSetDropBehind issue in Eclipse

I have the below spark word count program : package com.sample.spark; import java.util.Arrays; import java.util.List; import java.util.Map; import org.apache.spark.SparkConf; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.api.java.function.Function; import org.apache.spark.api.java.function.Function2; import org.apache.spark.api.java.function.PairFlatMapFunction; import org.apache.spark.api.java.function.PairFunction; import...

Dimension Reduction of Feature in Machine Learning

Is there any way to reduce the dimension of the following features from 2D coordinate (x,y) to one dimension? ...

In a MapReduce , how to send arraylist as value from mapper to reducer [duplicate]

This question already has an answer here: Output a list from a Hadoop Map Reduce job using custom writable 1 answer How can we pass an arraylist as value from the mapper to the reducer. My code basically has certain rules to work with and would create new values(String)...

HIVE: apply delimiter until a specified column

I am trying to move data from a file into a hive table. The data in the file looks something like this:- StringA StringB StringC StringD StringE where each string is separated by a space. The problem is that i want separate columns for StringA, StringB and StringC and one...

issue monitoring hadoop response

I am using ganglia to monitor Hadoop. gmond and gmetad are running fine. When I telnet on gmond port (8649) and when I telnet gmetad on its xml answer port, I get no hadoop data. How can it be ? cluster { name = "my cluster" owner = "Master" latlong...

How to drop partition metadata from Hive when a partition is dropped using the alter drop command

I have dropped the all the partitions in the hive table by using the alter command alter table emp drop partition (hiredate>'0'); After droping partitions still I can see the partitions metadata.How to delete this partition metadata? Can I use the same table for new partitions? ...

Add PARTITION after creating TABLE in hive

i have created a non partitioned table and load data into the table,now i want to add a PARTITION on the basis of department into that table,can I do this? If I do: ALTER TABLE Student ADD PARTITION (dept='CSE') location '/test'; It gives me error: FAILED: SemanticException table is not...

jets3t cannot upload file to s3

I'm trying to upload files from local to s3 using hadoop fs and jets3t, but I'm getting the following error Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Request Error. HEAD '/project%2Ftest%2Fsome_event%2Fdt%3D2015-06-17%2FsomeFile' on Host 'host.s3.amazonaws.com' @ 'Thu, 18 Jun 2015 23:33:01 GMT' -- ResponseCode: 404, ResponseStatus: Not Found, RequestId: AVDFJKLDFJ3242, HostId: D+sdfjlakdsadf\asdfkpagjafdjsafdj I'm...

How avoid error “TypeError: invalid data type for einsum” in Python

I try to load CSV file to numpy-array and use the array in LogisticRegression etc. Now, I am struggling with error is shown below: import numpy as np import pandas as pd from sklearn import preprocessing from sklearn.linear_model import LogisticRegression dataset = pd.read_csv('../Bookie_test.csv').values X = dataset[1:, 32:34] y = dataset[1:,...

Nominal valued dataset in machine learning

What's the best way to use nominal value as opposed to real or boolean ones for being included in a subset of feature vector for machine learning? Should I map each nominal value to real value? For example, if I want to make my program to learn a predictive model...

How to specify the prior probability for scikit-learn's Naive Bayes

I'm using the scikit-learn machine learning library (Python) for a machine learning project. One of the algorithms I'm using is the Gaussian Naive Bayes implementation. One of the attributes of the GaussianNB() function is the following: class_prior_ : array, shape (n_classes,) I want to alter the class prior manually since...

How configure Stanford QNMinimizer to get similar results as scipy.optimize.minimize L-BFGS-B

I want to configurate the QN-Minimizer from Stanford Core NLP Lib to get nearly similar optimization results as scipy optimize L-BFGS-B implementation or get a standard L-BFSG configuration that is suitable for the most things. I set the standard paramters as follow: The python example I want to copy: scipy.optimize.minimize(neuralNetworkCost,...

how to programmatically create ensembles in weka?

Does there already exist a class in weka that takes care of voting/averaging different models, or do I have to come up with my own scheme? I already looked for that kind of functionality on the web, but I couldn't find any specific information....

Hadoop map reduce Extract specific columns from csv file in csv format

I am new to hadoop and working on a big data project where I have to clean and filter given csv file. like if given csv file has 200 columns then I need to select only 20 specific columns (so called data filtering) as a output for further operation. also...

Importtsv command gives : Container exited with a non-zero exit code 1 error

I am trying to load a tsv file into an existing hbase table. I am using the following command: /usr/local/hbase/bin$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,cf:value '-Dtable_name.separator=\t' Table-name /hdfs-path-to-input-file But when I execute the above command, I get the following error Container id: container_1434304449478_0018_02_000001 Exit code: 1 Stack trace: ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)...

Hadoop Basic - error while creating directory

I have started learning hadoop recently and I am getting the below error while creating new folder - [email protected]:~/Installations/hadoop-1.2.1/bin$ ./hadoop fs -mkdir helloworld Warning: $HADOOP_HOME is deprecated. 15/06/14 19:46:35 INFO ipc.Client: Retrying connect to server: localhost/ Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) Request you to help.below...

ERROR jdbc.HiveConnection: Error opening session Hive

I try to run JDBC code for Hive2 and get an error. I have Hive version 1.2.0 and Hadoop version 1.2.1. On the command line, hive and beeline work fine without any problem, but with JDBC I get an error. import java.sql.SQLException; import java.sql.Connection; import java.sql.ResultSet; import java.sql.Statement; import java.sql.DriverManager; public class HiveJdbcClient { private static...

Prediction based on large texts using Vowpal Wabbit

I want to use the resolution time in minutes and the client description of the tickets on Zendesk to predict the resolution time of next tickets based on their description. I will use only this two values, but the description is a large text. I searched about hashing the feature...