

DataStax Cassandra binding with Apache Cassandra

hbase,bigdata,amazon-dynamodb,cassandra-2.0,bigtable
I am trying to use DataStax Cassandra (Community Edition), but I am not able to figure out the DataStax git repo for it. Can someone please help me figure out which release of Apache Cassandra is used by DataStax Cassandra (Community Edition)? Or does...

What is the best (and most cost-effective) way to run automated server operations on a certain day or time by itself?

amazon-web-services,amazon-ec2,webserver,bigdata,analytics
I am using a load balancer (and server) to start other servers to run large database updates and analytics, etc. I am paying quite a bit for running servers. I assumed this enterprise solution would cost quite a bit but I think I might be doing things a little differently...

R code hangs partway through with large data?

r,bigdata
I am dealing with a DB with around 5 lakh+ (500,000+) records. I want to count the words in the data. This is my code: library(tm) library(RPostgreSQL) drv <- dbDriver("PostgreSQL") con <- dbConnect(drv,user="postgres",password="root", dbname="pharma",host="localhost",port=5432) query<-"select data->'PubmedArticleSet'->'PubmedArticle'->'MedlineCitation'->'Article'->'Journal'->>'Title' from searchresult where id BETWEEN 1 AND (select max(id) from searchresult)" der<-dbGetQuery(con,query) der<- VectorSource(der) der<-...

How to parse bigdata json file (wikidata) in C++ efficiently?

c++,json,bigdata,rapidjson,wikidata
I have a single JSON file which is about 36 GB (coming from Wikidata) and I want to access it more efficiently. Currently I'm using rapidjson's SAX-style API in C++, but parsing the whole file takes about 7,415,200 ms (≈120 minutes) on my machine. I want to access the...

Running python script on Microsoft Azure

python,azure,cloud,bigdata,azure-virtual-machine
I'll soon have a Linux virtual machine set up on Microsoft Azure. I need to run some data mining/graph analysis algorithms on Azure because I work with big data. I don't want to use the Azure Machine Learning offering; I just want to run my own Python code. What...

How to be a faster Panda with groupbys

python,performance,pandas,bigdata,dataframes
I have a Pandas dataframe with 150 million rows. Within that there are about 1 million groups I'd like to do some very simple calculations on. For example, I'd like to take some existing column 'A' and make a new column, 'A_Percentile' that expresses the values of 'A' as percentile...
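A minimal sketch of one common approach, assuming hypothetical column names "group" and "A": GroupBy.rank(pct=True) computes each value's percentile within its own group without an explicit Python loop.

    import numpy as np
    import pandas as pd

    # toy stand-in for the 150M-row frame; "group" and "A" are assumed column names
    df = pd.DataFrame({
        "group": np.random.randint(0, 1000, size=100000),
        "A": np.random.randn(100000),
    })

    # rank(pct=True) gives each row's percentile of 'A' within its own group
    df["A_Percentile"] = df.groupby("group")["A"].rank(pct=True)
    print(df.head())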

How to find widest paths collection on a directed weighted graph

algorithm,math,graph,path,bigdata
Consider the following graph: nodes 1 to 6 are connected with transition edges that have a direction and a volume property (red numbers). I'm looking for the right algorithm to find paths with a high volume. In the above example the output should be: Path: [4,5,6] with a minimal...
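For reference, the widest (maximum-bottleneck) path is usually found with a Dijkstra variant that maximizes the minimum edge volume along the path. A minimal Python sketch, with a hypothetical adjacency-list representation of the graph:

    import heapq

    def widest_path(graph, source, target):
        # graph: {u: [(v, volume), ...]}, directed; returns (bottleneck, path) or None
        best = {source: float("inf")}
        prev = {}
        heap = [(-float("inf"), source)]   # negate widths so the min-heap pops the widest first
        visited = set()
        while heap:
            neg_w, u = heapq.heappop(heap)
            if u in visited:
                continue
            visited.add(u)
            if u == target:
                break
            for v, vol in graph.get(u, []):
                w = min(-neg_w, vol)       # bottleneck of the path extended by this edge
                if w > best.get(v, 0):
                    best[v] = w
                    prev[v] = u
                    heapq.heappush(heap, (-w, v))
        if target not in best:
            return None
        path, node = [], target
        while node != source:
            path.append(node)
            node = prev[node]
        path.append(source)
        return best[target], path[::-1]

    print(widest_path({4: [(5, 3)], 5: [(6, 2)], 1: [(2, 9)]}, 4, 6))   # -> (2, [4, 5, 6])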

Hive (Big Data) - difference between bucketing and indexing

hadoop,mapreduce,hive,bigdata
What is the main difference between bucketing and indexing of a table in Hive?

Unexpected behavior of apply v. for loop in R

r,bigdata,apply
I want to use apply instead of a for loop to speed up a function that creates a character string vector from paste-collapsing each row in a data frame, which contains strings and numbers with many decimals. The speed up is notable, but apply forces the numbers to fill the...

Big Data Analytics using Redshift vs Spark, Oozie Workflow Scheduler with Redshift Analytics

apache-spark,analytics,bigdata,oozie,amazon-redshift
We want to do Big Data Analytics on our data stored in Amazon Redshift (currently in Terabytes, but will grow with time). Currently, it seems that all our Analytics can be done through Redshift queries (and hence, no distributed processing might be required at our end) but we are not...

Can't verify by visiting the external URL of the server using a browser when installing Apache Ranger

hadoop,bigdata,monitoring
I am installing Apache Ranger on CentOS by following their instructions: https://cwiki.apache.org/confluence/display/RANGER/Ranger+Installation+Guide. But after installing and running the command service ranger-admin start, Ranger starts but I can't verify by visiting the external URL of the server using a browser, for example: http://:6080/. I followed every instruction step by step as they...

More jobs than expected running in Apache Spark

apache-spark,bigdata,pyspark
I am trying to learn Apache Spark. This is the code I am trying to run, using the PySpark API: data = xrange(1, 10000) xrangeRDD = sc.parallelize(data, 8) def ten(value): """Return whether value is below ten. Args: value (int): A number. Returns: bool: Whether `value` is less than ten....
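As an aside on why extra jobs show up: transformations such as filter() are lazy, and each action (count(), collect(), take(), plus any implicit evaluation a notebook adds) launches its own job. A small sketch, assuming a local SparkContext:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "below-ten")

    data = range(1, 10000)
    xrangeRDD = sc.parallelize(data, 8)

    def ten(value):
        """Return whether value is below ten."""
        return value < 10

    filtered = xrangeRDD.filter(ten)   # lazy: nothing runs yet
    print(filtered.count())            # this action triggers one job
    print(filtered.count())            # another action triggers another job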

Neo4J - Finding the widest path on very large graphs

algorithm,graph,neo4j,apache-spark,bigdata
I have created a very large directed weighted graph, and I'm trying to find the widest path between two points. Each edge has a count property. Here is a small portion of the graph: I have found this example and modified the query, so the path collection would be directional...

How to define a Where query with 2 conditions, with QueryBuilder in Cassandra?

cassandra,bigdata,query-builder
I have this statement: SELECT * FROM users WHERE id='12' AND fname ='aaa'; How do I express the same statement with QueryBuilder, like: Statement statement = QueryBuilder.select().all().from("users").where(eq("id", 12))); ...

Barplot large data set

matlab,plot,bigdata
I have a matrix of student scores (600x10), where 600 is the number of students, 9 columns are different subjects, and the 10th column is their percentage. I want to plot a barplot for each column (1-9) against column 10 to see the distribution of the average in each subject. Like...

Case-insensitive search in Pig Latin

hadoop,apache-pig,bigdata
Beginner in Pig Latin here. I am trying to count the occurrences of multiple strings in an input file. The search has to be case-insensitive. I know there is a LOWER built-in function in Pig, but how do I use it? For example (input file): 28-Oct-13,7:00PM,Viraj,New to hadoop...

extracting n grams from huge text

python,performance,nlp,bigdata,text-processing
For example, we have the following text: "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..." I need all possible sections of this text respectively, for one word by one...
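A generator keeps memory flat when extracting n-grams from huge text; a minimal sketch (tokenization by whitespace is an assumption):

    def ngrams(tokens, n):
        # yield n-grams lazily so the full list never has to sit in memory
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

    text = ("Spark is a framework for writing fast, distributed programs. "
            "Spark solves similar problems as Hadoop MapReduce does.")
    tokens = text.split()

    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            pass  # count, store, or stream each n-gram here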

Shuffled vs non-shuffled coalesce in Apache Spark

scala,apache-spark,bigdata,distributed-computing
What is the difference between the following transformations when they are executed right before writing RDD to a file? coalesce(1, shuffle = true) coalesce(1, shuffle = false) Code example: val input = sc.textFile(inputFile) val filtered = input.filter(doSomeFiltering) val mapped = filtered.map(doSomeMapping) mapped.coalesce(1, shuffle = true).saveAsTextFile(outputFile) vs mapped.coalesce(1, shuffle = false).saveAsTextFile(outputFile)...
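The same pair can be written in PySpark for experimentation; with shuffle=False the single output partition pulls all upstream work into one task, while shuffle=True keeps the earlier stages parallel and adds a shuffle before the write. A hedged sketch (paths and the filter/map stand-ins are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "coalesce-demo")

    mapped = (sc.textFile("input.txt")
                .filter(lambda line: line)        # stand-in for doSomeFiltering
                .map(lambda line: line.upper()))  # stand-in for doSomeMapping

    # shuffle=False: the whole pipeline runs in a single task for the one partition
    mapped.coalesce(1, shuffle=False).saveAsTextFile("out_noshuffle")
    # shuffle=True: upstream stages stay parallel; a shuffle gathers results at the end
    mapped.coalesce(1, shuffle=True).saveAsTextFile("out_shuffle")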

What is the benefit of the in-memory processing engines with a huge amount of data? [closed]

hadoop,apache-spark,bigdata,batch-processing
Spark performs best if the dataset fits in memory; if the dataset doesn't fit, it will use the disk, and so it is about as fast as Hadoop. Let's assume that I'm dealing with tera/petabytes of data with a small cluster. Obviously, there is no way to...

Use of core-site.xml in mapreduce program

hadoop,mapreduce,bigdata
I have seen MapReduce programs using/adding core-site.xml as a resource in the program. What is core-site.xml, and how can it be used in MapReduce programs?

Big data using Microsoft SQL Server

.net,sql-server,bigdata
I'm currently designing a solution to manage tolling transactions. Let's say we have about 2-3M transactions/day and about 1k requests/s. I don't have any experience managing that kind of "big" (in my opinion) data. Could you give me any info about the capabilities of SQL Server? Can it...

Best algorithm to find N unique random numbers in VERY large array

performance,algorithm,big-o,bigdata,asymptotic-complexity
I have an array with, for example, 1,000,000,000,000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? Elements must be unique in the whole array, not just within the list of N (3 in my example) picked elements. I read about Reservoir...
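When the array is far too large to shuffle, one common trick is to sample distinct positions and read only those. A rough sketch, where read_element() stands for whatever random-access read the storage layer offers (hypothetical), and values are re-drawn on collision so the returned elements are unique across the whole array:

    import random

    def pick_unique(arr_len, n, read_element):
        # read_element(i) is a hypothetical random-access read into the huge array
        seen, picked = set(), []
        while len(picked) < n:
            for idx in random.sample(range(arr_len), n - len(picked)):
                value = read_element(idx)
                if value not in seen:      # re-draw if this value was already taken
                    seen.add(value)
                    picked.append(value)
        return picked

    # example against an in-memory stand-in for the giant array
    big = list(range(1000))
    print(pick_unique(len(big), 3, big.__getitem__))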

Omniture Data Warehouse Segments Issue

bigdata,data-warehouse,adobe-analytics
Currently, I'm trying to create a segment filter called "Only Search Page" which filters out one particular server from a list of several thousand. Currently, I'm a little stuck and it might be easier to explain with screenshots. In the Segment Manager I set up a segment to check for...

How does hadoop store data and use MapReduce?

hadoop,mapreduce,hdfs,bigdata
While trying to understand the Hadoop architecture, I want to figure out some problems. When there is a big data input, HDFS will divide it into many chunks (64 MB or 128 MB per chunk) and then replicate them many times to store them in blocks, right? However, I still don't know...

esper fixed window based on event starting time

java,performance,bigdata,esper
I am using Esper for Aggregating my Sensor Data. Data may arrive in any interval i.e. 1 seconds to 120 seconds. Each data point contains TimeStamp and Value. I want Min TimeStamp, Max TimeStamp, Average value and Count of data points in 30 min window. Start point and end point...

Is there a way to use some kind of cache for results of the most often used queries in Spark?

hadoop,mapreduce,apache-spark,bigdata
Is there a way to use some kind of cache for results of the most often used queries in Spark (or by using other Hadoop libraries)?
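Spark's own persistence is the usual answer here; a minimal PySpark sketch (the input path and parsing are illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "cache-demo")

    parsed = sc.textFile("hdfs:///data/events").map(lambda line: line.split("\t"))

    # cache() keeps the parsed RDD in executor memory after the first action,
    # so later queries over the same data skip re-reading and re-parsing it
    parsed.cache()

    print(parsed.count())                                          # computes and caches
    print(parsed.filter(lambda f: f and f[0] == "ERROR").count())  # served from cache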

How to speed up: opening a data file in python

python,excel,bigdata
I am using a single line of code to open a big .xls data file: workbook = xlrd.open_workbook('file_name.xls'). It takes considerable time to execute. I am using Python 2. Is there a way to speed up this step?
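One knob worth trying is xlrd's on_demand flag, which defers loading worksheets until they are requested (it helps most when the workbook has several sheets); a short sketch:

    import xlrd

    # on_demand=True avoids parsing every sheet up front
    workbook = xlrd.open_workbook("file_name.xls", on_demand=True)
    sheet = workbook.sheet_by_index(0)   # only this sheet is loaded now
    print(sheet.nrows)
    workbook.unload_sheet(0)             # free the sheet's memory when finished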

Could not deallocate container for task attemptId NNN

hadoop,memory,mapreduce,bigdata,yarn
I'm trying to understand how the container allocates memory in YARN and their performance based on different hardware configuration. So, the machine has 30 GB RAM and I picked 24 GB for YARN and leave 6 GB for the system. yarn.nodemanager.resource.memory-mb=24576 Then I followed http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html to come up with some...

HBase on Hadoop not connecting in distributed mode

hadoop,hbase,bigdata,ubuntu-14.04,distributed
Hi, I am trying to set up HBase (hbase-0.98.12-hadoop2) on Hadoop (hadoop-2.7.0). Hadoop is running on localhost:560070 and it's running fine. My hbase-site.xml is as shown below: <configuration> <property> <name>hbase.rootdir</name> <value>hdfs://localhost:9000/hbase</value> </property> <property> <name>hbase.cluster.distributed</name> <value>true</value> </property> <property> <name>hbase.zookeeper.quorum</name>...

NoSuchMethodError when hive.execution.engine value is tez

java,apache,hadoop,hive,bigdata
I am using Hive 1.0.0 and Apache Tez 0.4.1. When I configure Hive to use Tez I get an exception. In hive-site.xml, when the hive.execution.engine value is mr it works fine, but if I set it to tez I get this error: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.tez.mapreduce.hadoop.MRHelpers.updateEnvBasedOnMRAMEnv(Lorg/apache/hadoop/conf/Configuration;Ljava/util/Map;)V at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:169)...

Moving Google Cloud Storage bucket to another project

bigdata,google-cloud-storage
What is the best way to move an existing Google Cloud Storage bucket to another project? I don't want to copy it outside Google Cloud Storage for the transfer, have two copies of the data or use another bucket name. How close can I get to these requirements?...

How to reindex csv data efficiently?

python,pandas,bigdata
I have a file of tick data that I downloaded from the internet. It looks like this; the file is relatively "large": time,bid,bid_depth,bid_depth_total,offer,offer_depth,offer_depth_total 20150423T014501,81.79,400,400,81.89,100,100 20150423T100001,81.,100,100,84.36,100,100 20150423T100017,81.,100,100,83.52,500,500 20150423T115258,81.01,500,500,83.52,500,500 ... I then want to reindex the data so that I can access it through a time-based query: from pylab import * from...
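A minimal pandas sketch for this kind of reindexing, assuming the timestamps parse with the %Y%m%dT%H%M%S format shown and the file name is illustrative:

    import pandas as pd

    df = pd.read_csv("ticks.csv")
    df["time"] = pd.to_datetime(df["time"], format="%Y%m%dT%H%M%S")
    df = df.set_index("time").sort_index()

    # with a sorted DatetimeIndex, time-range queries become simple label slices
    morning = df.loc["2015-04-23 09:00":"2015-04-23 12:00"]
    print(morning[["bid", "offer"]].head())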

How to convert a Date String from UTC to Specific TimeZone in HIVE?

hadoop,timezone,hive,bigdata,hive-udf
My Hive table has a date column with UTC date strings. I want to get all rows for a specific EST date. I am trying to do something like the below: Select * from TableName T where TO_DATE(ConvertToESTTimeZone(T.date)) = "2014-01-12" I want to know if there is a function for...

What are the different ways to check if the mapreduce program ran successfully

hadoop,mapreduce,bigdata
If we need to automate a MapReduce program or run it from a script, what are the different ways to check whether the MapReduce program ran successfully? One way is to check whether a _SUCCESS file is created in the output directory. Does the command "hadoop jar program.jar hdfs:/input.txt hdfs:/output" return...

How to load nested collections in hive with more than 3 levels

hadoop,hive,bigdata
I'm struggling to load data into Hive, defined like this: CREATE TABLE complexstructure ( id STRING, date DATE, day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>> ) row format delimited fields terminated by ',' collection items terminated by '|' map keys terminated by ':'; The day_data field contains a complex structure difficult to load with load...

Merge a large list of logical vectors

r,list,merge,bigdata
I have a large list of TRUE/FALSE logical vectors (144 list elements, each ~ 23 million elements long). I want to merge them using any to produce one logical vector. If any of the first elements of each list element are TRUE then TRUE is returned and so on for...

MySQL - querying a table with over 10M rows

mysql,sql,bigdata
I am maintaining a web project using Java & MySQL. One MySQL table has over 10 million records. I partitioned the table by date to reduce the rows in each partition. Indexes are also added properly according to the queries. In most queries, only the first 1 or 2...

Pros and cons of Datameer vs Alteryx [closed]

hadoop,bigdata,analytics
I am trying to evaluate Datameer and Alteryx for our big data analytics needs. What are the pros and cons of these two tools?

Cassandra query flexibility

hadoop,cassandra,apache-spark,bigdata,cql
I'm pretty new to the field of big data and am currently stuck on a fundamental decision. For a research project I need to store millions of log entries per minute in my Cassandra-based data center, which works pretty well (single data center, 4 nodes). Log Entry ------------------------------------------------------------------ | Timestamp...

Pig: FLATTEN keyword

hadoop,mapreduce,apache-pig,bigdata
I am a little confused with the use of FLATTEN keyword in PIG. Consider the below dataset: tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)} Without using the FLATTEN I can access a field (suppose firstname) like this: display_firstname = FOREACH tuple_record GENERATE details.firstname; Now, using the FLATTEN keyword: flatten_record =...

DataNodes can't talk to NameNode

hadoop,bigdata,hortonworks-data-platform,ambari,hortonworks
I set up a Hadoop cluster of 3 nodes. One of them has both NameNode and DataNode roles while the other two are just DataNodes. I started all nodes and services, but the summary shows only one DataNode's status as live. The status of the other nodes is not even showing. My...

Split a single-use large IEnumerable in half using a condition

c#,xml,performance,linq,bigdata
Let's say we have a Foo class: public class Foo { public DateTime Timestamp { get; set; } public double Value { get; set; } // some other properties public static Foo CreateFromXml(Stream str) { Foo f = new Foo(); // do the parsing return f; } public static IEnumerable<Foo>...

Finding gaps in huge event streams?

sql,algorithm,mongodb,postgresql,bigdata
I have about 1 million events in a PostgreSQL database that are of this format: id | stream_id | timestamp ----------+-----------------+----------------- 1 | 7 | .... 2 | 8 | .... There are about 50,000 unique streams. I need to find all of the events where the time between any...

Counter grouped by category, author and date in Redis

redis,bigdata,counter,hyperloglog
I am implementing a system that stores a large amount of data in a relational DB. Data can be classified into categories and each item has an author. I want to get the number of items grouped by date, category and author, and the sum of all items of each category grouped...
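With Redis the usual shape is one counter key per (date, category, author) combination, incremented with INCR; a rough sketch using redis-py (the key scheme is an assumption):

    import datetime
    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    def record_item(category, author, when=None):
        day = (when or datetime.date.today()).isoformat()
        r.incr("count:%s:%s:%s" % (day, category, author))   # per day, category, author
        r.incr("count:%s:%s" % (day, category))              # per day and category only

    record_item("books", "alice")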

Distribute computing on multiple devices

java,machine-learning,bigdata,distributed-computing
My project takes a very long time to run. I created threads and distributed the data and processing across my processor cores, but it still takes a long time. I optimized the code as much as I could. How can I distribute the computation across multiple laptops?

Converting “individual clock in/out time logs” to “total occupancy of building over time” efficiently

r,bigdata,ff
So I have data in .csv form showing the times at which specific users walk into and out of a building over a few months. I am trying to use R to tabulate the building occupancy every 15/30 minutes for analysis. The data has been cleaned and is in the form...

Pig: Unable to Load BAG

hadoop,mapreduce,apache-pig,bigdata
I have a record in this format: {(Larry Page),23,M} {(Suman Dey),22,M} {(Palani Pratap),25,M} I am trying to LOAD the record using this: records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray}); But I am getting this error: 2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting...

Spark as an engine for Hive

hadoop,hive,apache-spark,bigdata
Can we use Spark as an engine for Hive? We have many legacy systems and code base in Hive and would like to use Spark with Hive. Best,...

Apache Spark - Controlling scheduling of map functions

apache,apache-spark,bigdata
I have a 3 node cluster and I'm trying to come up with a benchmark. The use case is that for an application all the map functions need to run on a particular machine and all the reduce functions on the other. Is there any scheduling property in Spark through...

Read a file from byte a until byte b

file,sed,bigdata,head
I mainly want to break a big file into smaller files. I use streams because I do not want to keep the big file on my disk. What I am looking for is something similar to: sed -n 'a,bp' # this uses lines in the file, while I want bytes or:...
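In Python the same thing can be done by seeking to the start offset and streaming a bounded number of bytes, so the large file never has to be held in memory; a small sketch (paths are illustrative):

    def copy_byte_range(src_path, start, end, dst_path, chunk_size=1 << 20):
        # stream bytes [start, end) of a big file into a smaller one
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            src.seek(start)
            remaining = end - start
            while remaining > 0:
                chunk = src.read(min(chunk_size, remaining))
                if not chunk:
                    break
                dst.write(chunk)
                remaining -= len(chunk)

    copy_byte_range("big.bin", 1000000, 2000000, "part1.bin")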

SolrException Plugin init failure for [schema.xml] fieldType “pint”: Error loading class 'solr.IntField'

apache,solr,tomcat7,bigdata,solr-schema
I am getting this error: collection1: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core collection1: Plugin init failure for [schema.xml] fieldType "pint": Error loading class 'solr.IntField'. when I am trying to import the collection1 (Solr 4.5) schema into Solr 5.1. I only copied collection1 from a different machine where Solr 4.5...

flume for collecting syslog data

hadoop,bigdata,router,syslog,flume
I am trying to collect syslog from 10 devices (routers). I learned that I can use the syslog source, but I need clarification about the host and ports in the properties: are they the local host and port on the machine where the Flume agent is running? Also, how do I redirect syslogs to...

AVG on grouped data throwing ERROR 1046: Use an Explicit Cast

hadoop,mapreduce,apache-pig,bigdata
I have a MAP of data in a txt file: [age#27,height#5.8] [age#25,height#5.3] [age#27,height#5.10] [age#25,height#5.1] I want to display the average height for each age group. This is the LOAD statement: records = LOAD '~/Documents/Pig_Map.txt' AS (details:map[]); records: {details: map[]} Then I grouped the data based on age: group_data =...

How do I determine the size of my HBase tables? Is there any command to do so?

hadoop,export,hbase,bigdata
I have multiple tables in my HBase shell that I would like to copy onto my file system. Some tables exceed 100 GB. However, I only have 55 GB of free space left in my local file system. Therefore, I would like to know the size of my HBase tables so that I...

Group data into small chunk (big data issue)

r,grouping,bigdata
I was looking for an answer to group data into small chunks in R. Let's say I have df = data.frame(a = c(1, 2, 3, 1, 5), b = c(2, 3, 2, 4, 4)) I want to have a new column to specify the group id. Rows having same value...

Unable to run SparkPi on Apache Spark cluster

apache-spark,cluster-computing,bigdata
The following is my Spark master UI, which shows 1 registered worker. I am trying to run the SparkPi application on the cluster, using the following submit script: ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://159.8.201.251:7077 \ /opt/Spark/spark-1.2.1-bin-cdh4/lib/spark-examples-1.2.1-hadoop2.0.0-mr1-cdh4.2.0.jar \ 1 but it keeps giving the following warning, and never finishes...

Designing an API on top of BigQuery

google-app-engine,bigdata,google-bigquery
I have an AppEngine app that tracks various sorts of user impression data across several websites. Currently we're gathering roughly 40 million records a month, the main BigQuery table is closing in on 15 GB in size after 6 weeks of gathering data, and our estimates show that within 6...

optimize pandas query on multiple columns / multiindex

python,numpy,pandas,bigdata
I have a very large table (currently 55 million rows, could be more), and I need to select subsets of it and perform very simple operations on those subsets, lots and lots of times. It seemed like pandas might be the best way to do this in python, but I'm...
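A common pattern is to build a sorted (Multi)Index once and then select with .loc, which replaces repeated full scans with index lookups; a small sketch with hypothetical key columns:

    import numpy as np
    import pandas as pd

    n = 1000000   # stand-in for the 55M-row table
    df = pd.DataFrame({
        "key1": np.random.randint(0, 100, n),
        "key2": np.random.randint(0, 100, n),
        "value": np.random.randn(n),
    })

    # set_index + sort_index pays a one-time cost, then .loc is much cheaper per query
    indexed = df.set_index(["key1", "key2"]).sort_index()
    subset = indexed.loc[(42, 7)]
    print(subset["value"].mean())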

video storing and synchronous streaming software (can Hadoop do it?)

hadoop,video-streaming,bigdata,video-capture,video-processing
I need a software solution for video storage. I will have a few IP cameras which have to stream to disk. This so-called database records those streams. On demand, I should be able to stream any of these videos, or a few of them. I just do not want to merge...

Column Logic in SQL [closed]

sql,sql-server,bigdata,sql-server-2014
First-time poster here! I would like people's opinions. I am collecting daily stock data for the past 10 years (so approximately 2500 rows of data, which is not significant); however, I have over 200 stocks (possibly growing to 1000 over time) with about 30 individual fields per stock....

Ambari 2.0 installation, “” failure

hadoop,bigdata,hortonworks-data-platform,ambari,hortonworks
Trying to establish a Hadoop cluster via Ambari 2.0, however failure occurs at installation phase. Here failure log from one of the datanodes: stderr: /var/lib/ambari-agent/data/errors-416.txt Traceback (most recent call last): File "/var/lib/ambari-agent/cache/stacks/HDP/2.0.6/hooks/before-ANY/scripts/hook.py", line 34, in <module> BeforeAnyHook().execute() File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 214, in execute method(env) File...

Pig: Unable to load data using PigStorage

hadoop,mapreduce,apache-pig,bigdata
I have this sample dataset in a txt file (Format: Firstname,Lastname,age,gender) (Eric,Ack,27,M),(Jeremy,Ross,29,F) (Jenny,Dicken,27,F),(Vijay,Sampath,40,M) (Angs,Dicken,28,M),(Venu,Rao,28,M) (Mahima,Mohanty,29,F),(Kenny,Oath,28,M) I am trying to load this data like this: tuple_record = LOAD '~/Documents/Pig_Tuple.txt' USING PigStorage(',') AS (details:tuple(firstname:chararray,lastname:chararray,age:int,sex:chararray)); But this is not working: DUMP tuple_record; I got this when running this command (i.e. it returns nothing)...

Pig: Invalid field Projection; Projected Field does not exist

hadoop,mapreduce,apache-pig,bigdata
describe filter_records; This gives me the below format: filter_records: {details1: (firstname: chararray,lastname: chararray,age: int,gender: chararray),details2: (firstname: chararray,lastname: chararray,age: int,gender: chararray)} I want to display the firstname from both details1 and details2. I tried this: display_records = FOREACH filter_records GENERATE display1.firstname; But I am getting the error: Invalid field projection. Projected...

Rows with identical keys

hbase,bigdata
When I need to create an HBase row, I have to call the Put(row_key) method. What happens if I call the Put() method again with the same row_key value? Will the existing row be updated, or will HBase create a new row? Is it possible to create 2 rows with identical keys?...

What is the difference between broadcast_address and broadcast_rpc_address in cassandra.yaml?

cassandra,bigdata
GOAL: I am trying to understand the best way to configure my Cassandra cluster so that several different drivers across several different networking scenarios can communicate with it properly. PROBLEM/QUESTION: It is not entirely clear to me, after reading the documentation what the difference is between these two settings: broadcast_address...

BigQuery streaming best practice

bigdata,google-bigquery
I have been using Google BigQuery for some time now via file uploads. As I get some delays with this method, I am now trying to convert my code to streaming. I am looking for the best solution here: what is more correct when working with BQ? 1. Using multiple (up to 40) different streaming machines...

Pig: UDF not returning expected resultset

java,hadoop,mapreduce,apache-pig,bigdata
This is the sample data on which I was working: Peter Wilkerson 27 M James Owen 26 M Matt Wo 30 M Kenny Chen 28 M I created a simple UDF for filtering the age like this: public class IsApplicable extends FilterFunc { @Override public Boolean exec(Tuple tuple) throws IOException...

Using partition key along with secondary index

cassandra,nosql,bigdata,cassandra-2.0
Following are the two queries that I need to perform. select * from where dept = 100 and emp_id = 1; select * from where dept = 100 and name = 'One'; Which of the below options is better ? Option 1: Use secondary index along with a partition key....

Why is Spark fast at word count?

parallel-processing,streaming,apache-spark,bigdata,rdd
Test case: word counting over 6 GB of data in 20+ seconds with Spark. I understand the MapReduce, FP and stream programming models, but I couldn't figure out why the word counting is so amazingly fast. I think it's an I/O-intensive computation in this case, and it's impossible to scan 6 GB of files in 20+...
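Part of the answer is that the classic word count shuffles very little data: reduceByKey pre-aggregates per partition, so only small (word, count) pairs cross the network while the 6 GB is read once, in parallel, across partitions. A minimal PySpark version for reference (the path is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    counts = (sc.textFile("hdfs:///data/corpus")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))   # combines locally before the shuffle

    print(counts.take(10))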

Apache Falcon's role in the Hadoop ecosystem

apache,hadoop,hdfs,bigdata,hortonworks-data-platform
I am supposed to work on cluster mirroring, where I have to set up a similar HDFS cluster (same master and slaves) to an existing one, copy the data to the new cluster, and then run the same jobs as-is. I have read about Falcon as a feed-processing and...

How do I widen this data frame by removing the duplicates and adding frequencies of occurrences instead in R?

r,bigdata,data-manipulation
I have tried the following code, but the frequency column just gives me 0s and 1s. I want the actual count. data2 <- as.data.frame(table(unique.data.frame(data)))) The data frame originally looked something like this (but large): ID Rating 12 Good 12 Good 16 Good 16 Bad 16 Very Bad 34 Very Good...

How do I split my Hbase Table(which is huge) into equal parts so that I can store it into local file system?

hadoop,export,hbase,bigdata,software-engineering
I have a Hbase Table of Size 53 GB that I want to store into my local file system. However I have only two drives of size 30gb each and I can't store the file completely into one drive. Could anyone please tell me how do I split and store...

D3.js selecting a part of data to be visualized from a large dataset

d3.js,bigdata
I have large real-time incoming data for visualization. I have the speed and time in the dataset. If you consider CSV format, it's like the following: Speed, Time s1, t1 ....... sn,tn But I want to visualize, say, only the speed for t1-t10. How can I do that?...

How to create the input file for the wordcount MapReduce program [closed]

bigdata
What does this mean? /user/joe/wordcount/input - input directory in HDFS /user/joe/wordcount/output - output directory in HDFS ...

Error when running program in Hadoop?

java,hadoop,runtime-error,bigdata
My wordcount program on Hadoop 2.7 gives an error in the terminal when run, even though it doesn't show any error in Eclipse: hadoop jar WordCount.jar WordCount user/amandeep/file.txt wordcountoutput The error shown is below: Exception in thread "main" java.lang.ClassNotFoundException: WordCount at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at...

Convert Json Data into specific table format using Pig

json,hadoop,apache-pig,bigdata,cloudera
I have a JSON file in the following format: "Properties2":[{"K":"A","T":"String","V":"M "}, {"K":"B","T":"String","V":"N"}, {"K":"D","T":"String","V":"O"}] "Properties2":[{"K":"A","T":"String","V":"W"},{"K":"B","T":"String","V":"X"},{"K":"C","T":"String","V":"Y"},{"K":"D","T":"String","V":"Z"}] I want to extract data in table format from the above-mentioned JSON by using Pig. Expected format: Note: in the first record the C column should be blank or null because in the first record there is no...

Is Spark Appropriate for Analyzing (Without Redistributing) Logs from Many Machines?

apache-spark,aggregate,analytics,bigdata
I've got a number of logs spread across a number of machines, and I'd like to collect / aggregate some information about them. Maybe first I want to count the number of lines which contain the string "Message", then later I'll add up the numbers in the fifth column of...

Data Mining and Text Mining

nlp,bigdata,nltk,data-mining,text-mining
What is the difference between Data Mining and Text Mining? Both refer to the extraction of unstructured data into structured form. Do both work in the same fashion? Please provide some clarity on that.

Hadoop MapReduce: extract specific columns from a CSV file in CSV format

java,hadoop,file-io,mapreduce,bigdata
I am new to Hadoop and working on a big data project where I have to clean and filter a given CSV file. For example, if the given CSV file has 200 columns, then I need to select only 20 specific columns (so-called data filtering) as output for further operations. Also...

Is there an efficient alternative for growing dictionary in python? [closed]

python,dictionary,bigdata
I am reading a large text corpus in xml format and storing counts of some word occurrences in a dictionary where a key is a tuple of three elements {('a','b','c'):1}. This dictionary continuously grows in size while its values get updated. I need to keep a dictionary in memory all...
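If the counts only need to be updatable rather than fully resident in RAM, a disk-backed mapping such as shelve is one option; a rough sketch (the tab-joined key encoding is an assumption, since shelve keys must be strings):

    import shelve

    trigrams = [("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")]

    # shelve stores the mapping on disk, so it no longer has to fit in memory;
    # the trade-off is string keys and slower updates than a plain dict
    with shelve.open("trigram_counts.db") as counts:
        for a, b, c in trigrams:
            key = "\t".join((a, b, c))
            counts[key] = counts.get(key, 0) + 1
        print(counts["a\tb\tc"])   # -> 2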

Subsetting very large data frames in R efficiently

r,data.frame,bigdata,large-data
So I have a data frame of 16 columns and ~17 million rows. I would first like to do some ddply on the data frame and then look at the correlations between the different columns. What’s the best and most efficient way to achieve this? My current approach takes too...

In spark join, does table order matter like in pig?

hadoop,apache-spark,apache-pig,bigdata
Related to Spark - Joining 2 PairRDD elements When doing a regular join in pig, the last table in the join is not brought into memory but streamed through instead, so if A has small cardinality per key and B large cardinality, it is significantly better to do join A,...

Apache Hadoop vs Google Bigdata

hadoop,comparison,hdfs,bigdata,gfs
Can anyone explain to me the key difference between Apache Hadoop and Google's big data stack? Which one is better (Hadoop or Google big data)? ...

Accessing a large number of unsorted array elements in Python

python,r,bigdata,sparse-matrix,large-data
I'm not very skilled in Python; however, I'm pretty handy with R. Yet I do have to use Python since it has an up-to-date interface with CPLEX. I'm also trying to avoid all the extra coding I would have to do in C/C++. That being said, I have issues with...

Apache Storm: Nimbus not starting on Port 6627

java,apache,bigdata,storm
I can't see anything on port 6627 after starting Nimbus. I am getting the Connection Refused error. Following errors are thrown in Nimbus Log: 6899 [main] ERROR com.smarterme.intake.EmbeddedTopologyRunner - Toplogy submitting failed.....org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused java.lang.RuntimeException: org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused at backtype.storm.utils.NimbusClient.getConfiguredClient(NimbusClient.java:38) at...

Where do HDFS directories reside in Linux?

hadoop,hdfs,bigdata,hadoop2
I am running my first MapReduce program. I created a directory in HDFS using hdfs dfs -mkdir input. The directories created this way reside in the HDFS home dir, i.e. /usr/hdfs (..?). But I couldn't find the directory 'input' I created above anywhere in my Linux OS. Any thoughts?...

Sending text message using Log4j2 with Flume

hadoop,log4j,bigdata,log4j2,flume
I have Log4j2 configuration: <?xml version="1.0" encoding="UTF-8"?> <configuration> <appenders> <Console name="console" target="SYSTEM_OUT"> <PatternLayout pattern="%d %-5p - %m%n"/> </Console> <Flume name="flume" > <MarkerFilter marker="FLUME" onMatch="ACCEPT" onMismatch="DENY"/> <Agent host="IP_HERE" port="6999"/> </Flume> <File name="file" fileName="flume.log"> <MarkerFilter marker="FLUME" onMatch="ACCEPT" onMismatch="DENY"/> </File> </appenders>...

Pig in local mode on a large file

mapreduce,apache-pig,bigdata
I am running Pig in local mode on a large file (54 GB). I observe it spawning a lot of map tasks sequentially. What I am expecting is that each map task reads 64 MB worth of lines. So if I want to optimize this and maybe read...

Trouble running Apache Giraph on YARN cluster (Hadoop 2.5.2)

java,hadoop,graph,bigdata,giraph
I'm trying to run the basic ShortestPaths example using Giraph 1.1 on Hadoop 2.5.2. I'm running in actual cluster mode (i.e., not pseudo-distributed) and I can run standard MapReduce jobs OK. But when I try to run the Giraph example, it seems to hang unless I set -ca giraph.SplitMasterWorker=false and...

how to FAST import a giant sql script for mysql?

mysql,bigdata
Currently I have a situation where I need to import a giant SQL script into MySQL. The SQL script content is mainly INSERT operations, but there are so many records that the file size is around 80 GB. The machine has 8 CPUs and 20 GB of memory. I have done something...

Looking for a python datastructure for cleaning/annotating large datasets

python,pandas,iterator,bigdata
I'm doing a lot of cleaning, annotating and simple transformations on very large twitter datasets (~50M messages). I'm looking for some kind of datastructure that would contain column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing...
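pandas itself can come close to this: read_csv with chunksize returns an iterator of DataFrames, so each chunk keeps the column-aware API without holding the whole dataset in memory. A small sketch, assuming hypothetical file names and a "text" column:

    import pandas as pd

    first = True
    for chunk in pd.read_csv("tweets.csv", chunksize=100000):
        chunk["text"] = chunk["text"].str.lower()   # example cleaning step
        chunk.to_csv("tweets_clean.csv", mode="a", header=first, index=False)
        first = False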

Update of a large number of NDB Entities fails

python,google-app-engine,bigdata,app-engine-ndb
I have a very simple task. After a migration and adding a new field (a repeated, composite property) to an existing NDB Entity (~100K entities), I need to set a default value for it. I tried this code first: q = dm.E.query(ancestor=dm.E.root_key) for user in q.iter(batch_size=500): user.field1 = [dm.E2()] user.put() But it fails with such...

Location of hdfs files in pseudodistributed single node cluster?

java,hadoop,mapreduce,bigdata
I have hadoop installed on a single node, in a pseudodistributed mode. The dfs.replication value is 1. Where are the files in the hdfs stored by default? The version of hadoop I am using is 2.5.1.

cost of keys in JSON document database (mongodb, elasticsearch)

json,database,mongodb,elasticsearch,bigdata
I would like to know if anyone has experience with the speed or optimization effects of JSON key size in a document-store database like MongoDB or Elasticsearch. So, for example, I have 2 documents doc1: { keeeeeey1: 'abc', keeeeeeey2: 'xyz') doc2: { k1: 'abc', k2: 'xyz') Let's say I...

How to parse a very large file in F# using FParsec

parsing,f#,bigdata,large-files,fparsec
I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with...

How to install Apache Zeppelin on existing Apache Spark standalone cluster

amazon-web-services,apache-spark,bigdata,apache-spark-sql,apache-zeppelin
I have an existing Apache Spark (version 1.3) standalone cluster on AWS and I would like to install Apache Zeppelin. I have a very simple question: do I have to install Zeppelin on the Spark master? If the answer is yes, could I use this guide: https://github.com/apache/incubator-zeppelin#build? Thank you...

Apache spark applying map transformation on RDDs

apache-spark,bigdata,rdd
I have a HadoopRDD from which I'm creating a first RDD with a simple map function, then a second RDD from the first RDD with another simple map function. Something like: HadoopRDD -> RDD1 -> RDD2. My question is whether Spark will iterate over the HadoopRDD record by record...
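For reference, chained narrow transformations like these are pipelined: within each partition a record flows through both map functions in a single pass, rather than one full scan per RDD. A tiny PySpark sketch (the input path is illustrative):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "map-chaining")

    hadoop_rdd = sc.textFile("hdfs:///data/input")    # stand-in for the HadoopRDD
    rdd1 = hadoop_rdd.map(lambda line: line.strip())  # first simple map
    rdd2 = rdd1.map(len)                              # second simple map

    # both maps run back-to-back on each record when this action executes
    print(rdd2.sum())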

How is social media data unstructured data?

hadoop,bigdata,data-mining
I recently began reading up on big data, and how there are tools like hadoop or BigInsights that can manage both structured and unstructured data. Social Media Analytics is something that can be done on BigInsights, and it takes unstructured data and analyzes/structures it accordingly. This got me wondering, how...

How do I create an RDD from input directory containing text files?

machine-learning,apache-spark,bigdata,analysis,mllib
I am working with the 20 Newsgroups dataset. Basically, I have a folder and n text files. The files in each folder belong to the topic the folder is named after. I have 20 such folders. How do I load all this data into Spark and make an RDD out of...
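sc.wholeTextFiles is the usual starting point: it yields (path, content) pairs, one per file, so the folder name can be recovered as the topic label. A small sketch, assuming a hypothetical root directory:

    import os
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "newsgroups")

    # one (path, content) pair per file across the 20 topic folders
    pairs = sc.wholeTextFiles("file:///data/20_newsgroups/*")
    labeled = pairs.map(lambda kv: (os.path.basename(os.path.dirname(kv[0])), kv[1]))

    print(labeled.map(lambda kv: kv[0]).distinct().count())   # expect 20 topics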

How do I export a table or multiple tables from Hbase shell to a text format?

hadoop,export,hbase,bigdata,software-engineering
I have a table in my HBase shell with huge amounts of data, and I would like to export it in a text format onto a local file system. Could anyone suggest how to do it? I would also like to know if I could export the HBase table...