r,bash,shell,unix,terminal , Why is Unix/Terminal faster than R?

Why is Unix/Terminal faster than R?


Tag: r,bash,shell,unix,terminal

I'm new to Unix, however, I have recently realized that very simple Unix commands can do very simple things to large data set very very quickly. My question is why are these Unix commands so fast relative to R?

Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.

Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all basic R functions, like Unix commands, are written in low-level languages like C/C++.

I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it most first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate data both most obtain the data from disk.

Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?


UPDATE: Sorry for being vague. This was done on purpose, I was hoping to discuss this idea in general, rather than focus on a specific example.

Regardless, I'll generate an example of counting the number of rows

First I'll generate a big data set.

row = 1e7
col = 50

Doing it with Unix

time wc -l df.csv
real    0m12.261s
user    0m1.668s
sys     0m2.589s

Doing it with R

system.time({ nrow(fread("df.csv")) })
user   system   elapsed
26.77    1.67     47.07

Notice that elapsed/real > user + system. This suggests that the CPU is waiting on the disk.

I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:

user  system  elapsed
34.69   2.81    47.41

My question is how is the I/O different for Unix and R. Why?


I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.

Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.

[updated in response to your additional information]

Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.

I/O. We can think about three ways of reading characters from a file, especially if the style of processing we're doing affects the way that's most convenient to do the reading.

  1. Unbuffered small, random reads. We ask the operating system for 1 or a few characters at a time, and process them as we read them.
  2. Unbuffered large, block-sized reads. We ask the operating for big chunks of memory -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
  3. Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.

Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than 2 and 3. (I've seen factors of 10 or 100.) But no well-written programs use #1, so we can pretty much forget about it. As long as you're using 2 or 3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can get a little efficiency increase by using 2 instead of 3, if you can.)

Now let's talk about the way each program processes the data. wc has basically 5 steps:

  1. Read characters one at a time. (I can assure you it uses method 3.)
  2. For each character read, add one to the character count.
  3. If the character read was a newline, add one to the line count.
  4. If the character read was or wasn't a word-separator character, update the word count.
  5. At the very end, print out the counts of lines, words, and/or characters, as requested.

So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)

In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence


R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem

 1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1

to get the value 12345.

But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.

Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)

But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.

So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.

So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty sure there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.

So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")

Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.

Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.

I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.


Bash script using sed acts differently when passing variable

I have a script that I am writing that checks a value and then based on the value modifies it. I am trying to understand why it works this one way and not the other. Based on the google and stackoverflow searches I did, nothing really fits what I am...

How (in a vectorized manner) to retrieve single value quantities from dataframe cells containing numeric arrays?

I've got a dataframe that includes columns like the one on the right here: lengthArray speed_max 1 4 24, 18, 24, 18 2 10 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 3 4 -999, -999, -999, -999 4 2 -999, -999 5 2 18, 18 6 1...

ggplot2 & facet_wrap - eliminate vertical distance between facets

I'm working with some data that I want to display as a nxn grid of plots. Edit: To be more clear, there's 21 categories in my data. I want to facet by category, and have those 21 plots in a 5 x 5 square grid (where the orphan is by...

How to plot data points at particular location in a map in R

I have a dataset that looks like this: LOCALITY numbers 1 Airoli 72 2 Andheri East 286 3 Andheri west 208 4 Arya Nagar 5 5 Asalfa 7 6 Bandra East 36 7 Bandra West 72 I want to plot bubbles (bigger the number bigger would be the bubble) inside...

Skip some lines with fread

I am interested to skip some lines of my data frame before the header names . How can i do it by skiping all the lines before ID_REF or if ID_REF is not present, check for the pattern ILMN_ and deleting all the lines keeping immediate first if not containing...

Histogram-like summary for interval data

How do I get a histogram-like summary of interval data in R? My MWE data has four intervals. interval range Int1 2-7 Int2 10-14 Int3 12-18 Int4 25-28 I want a histogram-like function which counts how the intervals Int1-Int4 span a range split across fixed-size bins. The function output should...

linux - running a process and tailing a file simultaneously

I want a run a long task on a remote machine (with python fabric using ssh). It logs to a file on the remote machine. What I want to do is to run that script and tail (actively display) the log file content until the script execution ends. The problem...

Serial modification of objects in R

I have a number of matrices of the same size: m1.m <- matrix(c(1,2,3,4), nrow=2, ncol=2) m2.m <- matrix(c(5,6,7,8), nrow=2, ncol=2) ... I want to set uniform column and row names to all of them. Currently I am doing it like this: new_col_names <- c("Col1","Col2") new_row_names <- c("Row1","Row2") change_names <- function(m,...

copy a list of data.tables

I have the following situation: 1) a list of data tables 2) For testing purposes I deliberately want to (deeply) copy the whole list including the data tables 3) I want to take some element from the copied list and add a new column. Here is the code: library(data.table) x...

R: recursive function to give groups of consecutive numbers

Given a sorted vector x: x <- c(1,2,4,6,7,10,11,12,15) I am trying to write a small function that will yield a similar sized vector y giving the last consecutive integer in order to group consecutive numbers. In my case it is (defining groups 2, 4, 7, 12 and 15): > y...

Using R to Assign Treatments to Groups

We have seven exposures and 24 groups. We would like to randomly assign five of the seven exposures to groups while also ensuring that we end up with a consistent count for each exposure, meaning that each exposure ends up being exposed about the same number of times. I have...

R — frequencies within a variable for repeating values

I've got a column A, which has several values, some of them repeating. So, example: A = c(5, 9, 6, 5, 5). I need to go through A and count the frequencies of each of the values in A. So, for this example, for the set of 5s in A,...

Aggregating data in R

user_id date datetime page 217568 6/12/2015 49:23.9 Vodafone | How to get in touch with Vodafone 135437 6/10/2015 43:35.7 My Vodafone – Manage your Vodafone Pay Monthly Account Online – Vodafone 196094 6/13/2015 33:39.4 Check the status of Vodafone’s mobile network in real-time 74197 6/6/2015 52:46.1 undefined 153501 6/5/2015 02:55.5...

how to get values from selectInput with shiny

I am playing around with the shiny packages for some hours now, and wanted to make a select input widget that enables me to download a certain data set from the server. So i figured out a way to get me this data frame containing all my IDs for downloading:...

Subtract time in r, forcing unit of results to minutes [duplicate]

This question already has an answer here: Getting consist units from diff command in R 4 answers I successfully subtracted two POSIXct cols of df1 (below). However, since the time differences are >= 1 hour in all rows, R gives the results in hours. I know that this make...

Store every value in a sequence except some values

If I do the following to a string of letters: x <- 'broke' y <- nchar(x) z <- sequence(y) How do I store every value of the z that isn't the first, last, or middle values of the sequence. In this example if z is 1 2 3 4 5...

Convert strings of data to “Data” objects in R [duplicate]

This question already has an answer here: as.Date with dates in format m/d/y in R 2 answers My problem is that the as.Date function does not convert the values in a "date" column of a data frame into Date objects. I have a data.frame nmmaps. Here is a short...

Fitting a subset model with just one lag, using R package FitAR

I am trying to fit a subset model with only lag 4. In the manual it's written "you must use p=c(0,0,0,4) since p=4 will fit a full AR(4)". I did this. #fit a subset model with just lag 4 Fit=FitAR(p=c(0,0,0,4), lag.max = "default", ARModel = "ARz") However, I get the...

Remove quotes to use result as dataset name

I've got a vector with a long list of dataset names. E.g myvector<-c('ds1','ds2,'ds3') I'd like to use the names ds1..ds3 to write a file, taking the file name from the vector. Like this: write.csv(dataset[i],file=paste(myvector[i],'.csv',sep='') with dataset being d1...ds3, but without quotes. How can I remove the quotes and refer to...

Macports switch PHP CLI version

I'm trying to switch my Terminal PHP version to 5.4 because I ran into some issues with Drush while updating my Drupal core. http://drupal.stackexchange.com/questions/112090/drush-command-errors The reason for these issues is my Terminal PHP version is different then my localhost. php -v in Terminal returns PHP 5.5.13 (cli) but my localhost...

R Program Vector, record Column Percent

This is my vector head(sep) I must find percent of all SEP 11 in each row. For instance, in first row, percent of SEP 11 is 100 * ((63 + 124)/ (63 + 124 + 0 + 0)) And would like this stored in newly created 8th column Thanks dput...

shell script for counting replacements

Ten files located in a directory. Each file content has "apple" word. write a bash shell script to replace "apple" with "banana" in all ten files and Print the number of replacements for each file. Have tried in this way but dont know how to get number of replacement. can...

ggplot equivalent for matplot

Is there an equivalent in ggplot2 to plot this dataset? I use matplot, and read that qplot could be used, but it really does not work. ggplot/matplot data<-rbind(c(6,16,25), c(1,4,7), c(NA, 1,2), c(NA, NA, 1)) as.data.frame(data) matplot(data, log="y",type='b', pch=1) ...

How to split a text into two meaningful words in R

I had a text data frame having sentences, and as I wanted the list of separate words in another dataframe I used the "qdap package" function "all_words" Words = all_words(df$problem_note_text, begins.with=NULL , alphabetical = FALSE, apostrophe.remove = TRUE, char.keep = char2space, char2space = "~~") Now have a dataframe which has...

Converting column from military time to standard time

I'm trying to convert a column showing the time of road traffic accidents from military time to standard time. The data looks like this: Col1 Time..24hr. 1 1404 2 322 3 1945 4 1005 5 945 I'd then like to convert to 12hr so for '322' I'd like to make...

Find multiple consecutive empty lines

I'm trying to chop up a text file into the articles it contains. Usually this is done by identifying a pattern each article begins with. Unfortunately the database I downloaded the articles from doesn't have that. The only pattern I can find is that after each article there are 3...

Bash modify CSV to change a field

I have a very big CSV file (aprox. 10.000 rows and 400 columns) and I need to modify certain columns (like 15, 156, 220) to change format from 20140321132233 to 2014-03-21 13:22:33. All fields that I need to modify are datetime. I saw some examples using awk but for math...

R stops displaying maps

Few days ago I was familiarizing myself with displaying maps, plotting points on the map from http://rpubs.com/nickbearman/r-google-map-making Today, I have intermittent success in displaying maps. library(ggmap) map <- qmap('Anaheim', zoom = 10, maptype = 'roadmap') Outputs Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Anaheim&zoom=10&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false And when I go to the URL...

Return Column Names when True in R

I am using R for a project and I have a data frame in in the following format: A B C 1 1 0 0 2 0 1 1 I want to return a data frame that gives the Column Name when the value is 1. i.e. Impair1 Impair2 1...

Correlate by levels of a variable in R

I would like to correlate two variables and have the output reported separately for levels of a third variable. My data are similar to this example: var1 <- c(7, 8, 9, 10, 11, 12) var2 <- c(18, 17, 16, 15, 14, 13) categories <- c(1, 2, 3, 1, 2, 3)...

optimization algorithm for circular data

Background: I am interested in localizing a sound source from a suite of audio recorders. Each audio array consists of 6 directional microphones spaced evenly every 60 degrees (0, 60, 120, 180, 240, 300 degrees). I am interested in finding the neighboring pair of microphones with the maximum set of...

Python: can't access newly defined environment variables

I can't access my env var: import subprocess, os print os.environ.get('PATH') # Works well print os.environ.get('BONSAI') # doesn't work But the env var is well added in my /home/me/.bashrc: BONSAI=/home/me/Utils/bonsai_v3.2 export BONSAI And I can access this env var from a new terminal....

Subsetting rows by passing an argument to a function

I have the following data frame which I imported into R using read.table() (I incorporated read.table() within read_data() which is a function I created that also throw messages in case the file name is not written appropriately): > raw_data <- read_data("n44.txt") [1] #### Reading txt file #### > head(raw_data) subject...

Rbind in variable row size not giving NA's

The initial data frame mergedDf is PROD_CODE 1 PRD0900033,PRD0900135,PRD0900220,PRD0900709 2 PRD0900097,PRD0900550 3 PRD0900121 4 PRD0900353 5 PRD0900547,PRD0900614 After calling mergedDf<-data.frame(do.call('rbind', strsplit(as.character(mergedDf$PROD_CODE),',',fixed=TRUE))) Output becomes X1 X2 X3 X4 1 PRD0900033 PRD0900135 PRD0900220 PRD0900709 2 PRD0900097 PRD0900550 PRD0900097 PRD0900550 3 PRD0900121 PRD0900121 PRD0900121 PRD0900121 4 PRD0900353 PRD0900353 PRD0900353 PRD0900353 5 PRD0900547 PRD0900614...

Identifying when a file is changed- Bash

So in my bash shell script, I have it running through a for loop. Inside the for loop, I use find "$myarray[i]" >> $tmp to look for a certain directory each time through the loop. Sometimes, it finds the variable in myarray[i] and sometimes it doesn't. When it does find...

Linear multivariate regression in R

I want to model that a factory takes an input of, say, x tonnes of raw material, which is then processed. In the first step waste materials are removed, and a product P1 is created. For the "rest" of the material, it is processed once again and another product P2...

Keep the second occurrence in a column in R

I have quite a simple dataset: ID Value Time 1 censored 1 1 censored 2 1 uncensored 3 1 uncensored 4 1 censored 5 1 censored 6 2 censored 1 2 uncensored 2 2 uncensored 3 2 uncensored 4 2 censored 5 I want to keep the first uncensored occurrence,...

Translating Stata to R: collapse

Just came across a .do file that I need to translate into R because I don't have a Stata license; my Stata is rusty, so can someone confirm that the code is doing what I think it is? Here's the Stata code: collapse (min) MinPctCollected = PctCollected /// (mean) AvgPctCollected...

Select / subset spatial data in R

I am working on a large data set with spatial data (lat/long). My data set contains some positions that I don´t want in my analysis (it makes the files to heavy to process in ArcMap- many Go of data). This is why I want to subset the relevant data for...

shell script cut from variables

The file is like this aaa&123 bbb&234 ccc&345 aaa&456 aaa$567 bbb&678 I want to output:(contain "aaa" and text after &) 123 456 I want to do in in shell script, Follow code be consider #!/bin/bash raw=$(grep 'aaa' 1.txt) var=$(cut -f2 -d"&" "$raw") echo $var It give me a error like...

How to build a 'for' loop with input$i in R Shiny

In my shiny app, I build a a number of checkboxes using a for loop, like this: landelist <- c("Danmark", "Tjekkiet", "Østrig", "Belgien", "Tyskland", "Sverige", "USA", "Norge", "Island") landecheckbox <- c() for (land in landelist){ landechek <- paste0("<label class=\"checkbox inline\"><input id=\"", land, "\" type=\"checkbox\" checked><span>", land, "</span></label>") landecheckbox <- c(landechek,...

Highlighting specific ranges on a Graph in R

library(season) plot(CVD$yrmon, CVD$cvd, type = 'o',pch = 19,ylab = 'Number of CVD deaths per month',xlab = 'Time') if i wanted to highlight a region of the graph based on x values say from 1994-1998 how do i do this? Any thought would be appreciated Thanks....

How to quickly read a large txt data file (5GB) into R(RStudio) (Centrino 2 P8600, 4Gb RAM)

I have a large data set, one of the files is 5GB. Can someone suggest me how to quickly read it into R (RStudio)? Thanks

Sleep Shiny WebApp to let it refresh… Any alternative?

I have a WebApp that have some renderUI({})... and some of them depend on the input of another. This makes that, briefly, a red error in the webpage appear when I select some options. Because the if() clause of some renderUI({}) depend on the input of a selectizer. The error...

how to call Java method which returns any List from R Language? [on hold]

How to call java method which returns list from R Language.

Count number of rows meeting criteria in another table - R PRogramming

I have two tables, one with property listings and another one with contacts made for a property (i.e. is someone is interested in the property they will "contact" the owner). Sample "listings" table below: listings <- data.frame(id = c("6174", "2175", "9176", "4176", "9177"), city = c("A", "B", "B", "B" ,"A"),...

While loop in bash using variable from txt file

I am new to bash and writing a script to read variables that is stored on each line of a text file (there are thousands of these variables). So I tried to write a script that would read the lines and automatically output the solution to the screen and save...

Appending a data frame with for if and else statements or how do put print in dataframe

How do I put what I printed in a dataframe with a for loop and if else statements? Basically, this code: list<-c("10","20","5") for (j in 1:3){ if (list[j] < 8) print("Greater") else print("Less") }) #[1] "Less" #[1] "Less" #[1] "Greater" Or should it be something more like this? f3 <-...

Assign and use of a variable in the same subshell

I was doing something very simple like: v=5 echo "$v" and expected it to print 5. However, it does not. The value that was just set is not available for the next command. I recently learnt that "In most shells, each command of a pipeline is executed in a separate...

Bash alias function with predefined argument

I have an alias gl which is an wrapper for git log. Basically like this function gitLog(){ if [ $# -eq 0 ] then git log else git log -n $1 } alias gl=gitLog I want to add an alias which just calls gitLog with an argument like this alias...