FAQ Database Discussion Community


How to subset a data table according to several conditions

r,data.table
I have a question which is giving me unexpected troubles. The question, which seems imbarassingly simple, is that I would like to subset a data.table according to a few conditions. Let's consider this example; imagine I have a data.table like the one below. Now, I would like to extrapolate from...

data.table sum and subset

r,sum,filtering,data.table
I have a data.table that I am wanting to aggregate library(data.table) dt1 <- data.table(year=c("2001","2001","2001","2002","2002","2002","2002"), group=c("a","a","b","a","a","b","b"), amt=c(20,40,20,35,30,28,19)) I am wanting to sum the amt by year and group and then filter where the summed amt for any given group is greater than 100. I've got the data.table sum nailed. dt1[, sum(amt),by=list(year,group)]...

How to solve gaps and island problems in R and performance vs SQL?

r,data.table,gaps-and-islands
I was wondering whether the island and gaps problems can be solved in R efficiently, similar to SQL. I have the following data available, if we examine one ID: ID StartDate StartTime EndDate EndTime 1 19-05-2014 19:00 19-05-2014 20:00 1 19-05-2014 19:30 19-05-2014 23:30 1 19-05-2014 16:00 19-05-2014 18:00 1...

Fill in missing rows with R data.table

r,statistics,data.table
I have a data.table in R that was fetched from a database that looks like this: date,identifier,description,location,value1,value2 2014-03-01,1,foo,1,100,200 2014-03-01,1,foo,2,200,300 2014-04-01,1,foo,1,100,200 2014-04-01,1,foo,2,100,200 2014-05-01,1,foo,1,100,200 2014-05-01,1,foo,2,100,200 2014-03-01,2,bar,1,100,200 2014-04-01,2,bar,1,100,200 2014-05-01,2,bar,1,100,200 2014-03-01,3,baz,1,100,200 2014-03-01,3,baz,2,200,300 2014-04-01,3,baz,1,100,200 2014-04-01,3,baz,2,100,200 2014-05-01,3,baz,1,100,200...

R: Roll up column values containing NA's by sum while grouping by ID's

r,data.table,aggregate,plyr,dplyr
I have a data frame that I got from ID <- c("A","A","A","A","B","B","B","B") Type <- c(45,45,46,46,45,45,46,46) Point_A <- c(10,NA,30,40,NA,80,NA,100) Point_B <- c(NA,32,43,NA,65,11,NA,53) df <- data.frame(ID,Type,Point_A,Point_B) ID Type Point_A Point_B 1 A 45 10 NA 2 A 45 NA 32 3 A 46 30 43 4 A 46 40 NA 5 B...

Pairwise combinations along with counts in data.table

r,data.table
I have a data.frame d as follows. d <- structure(list(sno = 1:7, list = c("SD1, SD44, SD384, SD32", "SD23, SD1, SD567", "SD42, SD345, SD183", "SD345, SD340, SD387", "SD455, SD86, SD39", "SD12, SD315, SD387", "SD32, SD1, SD40")), .Names = c("sno", "list"), row.names = c(NA, -7L), class = "data.frame") d sno list...

R: RunningTotal in the last 365 days window by Name

r,data.table,dplyr,zoo
This is what my data looks like. The rightmost column is my Desired Column. Name EventType EventDate SalesAmount RunningTotal Runningtotal(prior365Days) John Email 1/1/2014 0 0 0 John Sale 2/1/2014 10 10 10 John Sale 7/1/2014 20 30 30 John Sale 4/1/2015 30 60 50 John Webinar 5/1/2015 0 60 50...

subset data.table keeping only elements greater than certain value applied to all columns

r,data.table,subset
I would like to subset news (below) to create news2 (further below) which will only include the rows/columns where the abs(value) in each element of news > 0.01. Below is the code that I have tried: gr <- data.frame(which(abs(news[, 1:ncol(news), with = FALSE]) > 0.01, arr.ind = TRUE)) news2a <-...

Include column after grouping using datatable

r,include,data.table,subset
My goal is to calculate a group % column by zip. I created the % column by zip, but keep losing my group ('cgrp') variable. How can I keep this in my end results? My data table script is giving me the below results: zip V1 1: 12007 19.35484 2:...

Is there a way to automagically group by (almost) all columns in data.table

r,data.table
Basically I have a dataset which has a large number of columns, and it might even grow in the future. Now before I analyse the data, in most cases it makes sense to group by all columns. I can manually type everything, I know, but I was wondering if there...

data table string concatenation of SD columns for by group values

r,data.table
I have a big data set with many variables that looks similar to this : > data.table(a=letters[1:10],b=LETTERS[1:10],ID=c(1,1,1,2,2,2,2,3,3,3)) a b ID 1: a A 1 2: b B 1 3: c C 1 4: d D 2 5: e E 2 6: f F 2 7: g G 2 8: h...

R create graphs of average in time difference

r,data.frame,data.table
I have a big data.table that contains the following cols: timestamp, value, house The value is a cumulative value of eg energy of that one house. So here is a small sample: time value house 2014-10-27 11:40:00 100 2 2014-10-27 15:40:00 150 2 2014-10-27 19:40:30 160 2 2014-10-28 00:00:01 170...

Grouping key/value columns into single rows

r,data.frame,data.table,tidyr
I'm trying to take key-value combinations and put all the values on the same row as the keys. I'm pretty sure I knew how to do this at one point (I think with data.table) and I've been looking at the usual suspects reshape2, tidyr, data.table, etc, but I can't seem...

Speed up random Markov Chain in R using data.table or parellelisation

r,performance,matrix,data.table,mcmc
I am trying to speed up a Monte Carlo simulation of a discrete time-inhomogeneous Markov chain using data.table or some form of parallelisation. Using random dummy transition matrices TM, I am simulating nSteps time steps in each of N simulations and starting from a an initial state vector initialState record...

Subset of data.table excluding some columns

r,data.table
I need just to exclude set of columns from my data.table. I can do it by using -c() with numbers of these columns: dti <- data.table(iris) dti1 <- dti[, -c(3,5), with=F] but I would like to use column names instead of numbers, something like colsExcl <- c('Petal.Length', 'Species'); dti1 <-...

Summing rows by month in R

r,data.table,aggregate,dplyr,lubridate
So I have a data frame that has a date column, an hour column and a series of other numerical columns. Each row in the data frame is 1 hour of 1 day for an entire year. The data frame looks like this: Date Hour Melbourne Southern Flagstaff 1 2009-05-01...

Transposing a data.table with reshape2:::dcast

r,data.table,transpose,reshape2
I am transposing a data.table and have chosen to use reshape2:::dcast, however I am plagued by a strange handling of the data.table... here is a toy data set that replicates the behavior: > library(data.table) > library(reshape2) > DT <- structure(list(STORE = c(32123L, 32469L, 33177L, 33484L, 34772L, 34875L), VOLUME = c(343.87205,...

Fastest way to filter a data.frame list column contents in R / Rcpp

r,performance,data.table,dplyr,rcpp
I have a data.frame: df <- structure(list(id = 1:3, vars = list("a", c("a", "b", "c"), c("b", "c"))), .Names = c("id", "vars"), row.names = c(NA, -3L), class = "data.frame") with a list column (each with a character vector): > str(df) 'data.frame': 3 obs. of 2 variables: $ id : int 1...

use column-name range in data.table like in dplyrs select

r,data.table,dplyr
I would like to select multiple columns from a data.table (with 1200 column-names) specifying a range by column-name, like one can do with dplyr, for example: library(data.table) library(dplyr) dt <- data.table(w = sample(100, 50), x = sample(100, 50), y = sample(100, 50), z = sample(100, 50)) select(dt, w:y) Currently I...

How to update all columns of a R data.table by row?

r,data.table
I am trying to create a data.table object by taking pieces from other data.tables and combining them. Here's a simple example: a <- data.frame(x=1:30) b <- data.frame(x=10:39) c <- data.frame(x=20:49) d <- data.frame(x=50:79) e <- data.frame(x=60:89) f <- data.frame(x=70:99) DT <- data.table(matrix(ncol = 3, nrow = 30)) for (i in...

Duplicated rows emerging when using group_by and summarise

r,data.table,dplyr
I have a data table and I get duplicated rows when using the summarise function combined with group_by. I will give a simplified example of my problem. First, I am using group_by and mutate to add the sum of 'value' for each id. dt <- data.table(id = rep(1:5, each=10), cpc...

Translating Stata to R: collapse

r,data.table,stata,code-translation
Just came across a .do file that I need to translate into R because I don't have a Stata license; my Stata is rusty, so can someone confirm that the code is doing what I think it is? Here's the Stata code: collapse (min) MinPctCollected = PctCollected /// (mean) AvgPctCollected...

data.table in place modification in R

r,data.table,deep-copy
Consider a data.table DT as follows. DT <- iris setDT(DT) ad <- address(DT) DT[, a := NA_integer_] identical(address(DT), ad) I am trying to insert the some information sequentially in DT$a using a loop. a1 <- sample(1:1000, 50) a2 <- sample(1:1000, 50) a3 <- sample(1:1000, 50) As you can see this...

Generating random variable efficiently by groups in data.table (R)

r,data.table
I have a following sample data, and I am trying to create random numbers by large number of groups, but it is taking too much time. I wonder whether there is an efficient way for this: library(data.table) d <- mtcars n <- 10000 k1<-rbindlist(replicate(n, d, simplify = FALSE)) k1[,factor_var:=rep(seq(1,80000),4)] #sample...

Aggregating while merging two dataframes in R

r,merge,data.table,aggregate,dplyr
The ultimate goal is to sum the total quantity(transact_data$qty) for each record in product_info where the transact_data$productId exists in product_info, and where transact_data$date is between product_info$beg_date and product_info$end_date. The dataframes are below: product_info <- data.frame(productId = c("A", "B", "A", "C","C","B"), old_price = c(0.5,0.10,0.11,0.12,0.3,0.4), new_price = c(0.7,0.11,0.12,0.11,0.2,0.3), beg_date = c("2014-05-01", "2014-06-01",...

Simplify conditional table loop without matrix notation in R

r,data.table,dplyr,lapply
Using the example below, I was wondering if there is a more efficient package or function to do conditional counting and tables on matching string elements--e.g, using the data.table package, dplyr package, lapply() like function? produce = c("apple", "blueberry", "blueberry", "corn", "horseradish", "rutabega", "rutabega", "tomato") # Long list veggies =...

How to extact all columns when using data.table in R?

r,data.table
library(data.table) I am trying to do temporal matching between two datasets. wd <- structure(list(Year = c(2006L, 2006L, 2006L), day = c(361L, 361L, 360L), hour = c(14L, 8L, 8L), mint = c(30L, 0L, 30L), valu1 = c(0.5, 0.3, 0.4), Date = structure(c(1167229800, 1167206400, 1167121800 ), class = c("POSIXct", "POSIXt"), tzone =...

Data.Table Merge - Result is larger than input Datatables

r,merge,left-join,data.table
I have two data tables. df1, ref_df Internal Structures are as follows: Classes ‘data.table’ and 'data.frame': 10153986 obs. of 18 variables: $ chr_no : chr "1" "1" "1" "1" ... $ pos : int 238 324 340 353 355 357 380 420 435 571 ... $ ref : chr "C"...

R: Avoid loop or row apply function

r,merge,data.frame,data.table
I've following two data frame df_sales and df_supply. I want to merge the sale to supply in such a manner that my df_sales table have DATE_SUPPLY and QNT_SUPPLY from df_supply on below conditions *Condition: DATE_SUPPLY should be recent DATE_SUPPLY of corresponding "ITEM" for corresponding "STORE", i.e, DATE_SALE <- max(df_supply[df_supply$DATE_SUPPLY <=...

Use function instead of {} in j data.table

r,data.table
could you help me to understand why in data.table I could access column by name in {}, e.g. dt <- data.table(x=1:2, y=1:2) dt[,{ list(z = x + y) }] But can't do the same in a function test_sum <- function() { list(z = x + y) } dt[, test_sum()] Is...

Error with durations created from a data.table using lubridate & dplyr

r,data.table,dplyr,lubridate
I'm trying to aggregate some data stored in a data.table, and then create durations (from lubridate) from the aggregated data. When I try that, however, I get an error. Here's a reproducible example: library(lubridate) library(data.table) library(dplyr) data(lakers) lakers.dt <- data.table(lakers, key = "player") durations <- lakers.dt %>% mutate(better.date = ymd(date))...

Issue with filtering data.table for dates - works with fixed date but not with variable

r,date,data.table
I want to filter a data.table for dates - given start- and end-dates, I want to have all the rows which include a given date. Now my problem: It works when I use a fixed date, but it returns an empty table when I want to store the date in...

join two data tables and use only one column from second dt in R

r,join,merge,data.table
Edited to make it cleaner. Let's say I have two data tables (dt1 and dt2), and I want to get dt3 using data tables. A,B,C,E,F,G,H are column names. dt1 key is column A, and dt2 key is column E. Data tables have different number of rows. I want to keep...

Simultaneous order, row-filter and column-select with data.table

r,data.table
I am trying to do multiple steps in one line in R to select a value from a data.table (dt) with multiple criteria. For example: set.seed(123) dt <- data.table(id = rep(letters[1:2],2), time = rnorm(4), value = rnorm(4)*100) # id time value # 1: a -0.56047565 12.92877 # 2: b -0.23017749...

Reading aligned column data with fread

r,data.table
I came across a file like this: COL1 COL2 COL3 weqw asrg qerhqetjw weweg ethweth rqerhwrtjw rhqerhqerhq qergqer qerhqew5h qerh qergqer wetjwryerj I could not load it directly with fread so I replaced \s+ by , with sed than I gave to fread and it solved it. But is there...

finding corresponding value for minimal number - r

r,data.table
I have a data frame, where I would like to get the corresponding values the min value. library(data.table) df <- data.frame(cbind(c(1,1,1,1,2,2,2,2), c(180,170,180,190,160,170,170,180), c(80,75,76,81,67,65,66,70), c(5,6,7,8,9,5,6,7) )) df_stat <- setDT(df)[, list( min = min(X2) ), by = X1] The outcome should be looking like df_stat <- data.frame(cbind(c(1,2), c(170,160), c(75,67), c(6,9) )) I...

R Data Aggregation With WHERE Clause on Group

r,data.table,plyr
As an example, I have the data.table shown below. I want to do a simple aggregation where b=sum(b). For c, however I want the value of the record in c where b is maximum. The desired output is shown below (data.aggr). This leads to a few questions: 1) Is there...

fill a data.table based on value in another data.table

r,data.frame,data.table
I'm very new to data.table but would like to solve my problem with it, as I have the feeling it would be 1000 times faster than with "regular" data.frames. Here is my problem: What I have: 2 data.tables dt1 and dt2 like so: dt1 <- data.table(SID=paste0("S", 1:15), Chromo=rep(1:3, e=5), PP=rep(1:5,...

Evaluating text in data table

r,eval,data.table
I'm trying to aggregate a lot of variables in a very large data table by a single variable. The issue I'm having is with eval(parse(text=...)) which I believe is because I'm evaluating several expressions rather than just one. Here is a quick example: library(data.table) data(mtcars) mtcars<-as.data.table(mtcars) mtcars[,j=list( eval(parse(text='mean_mpg=mean(mpg), sum_mpg=sum(mpg)')) ),by=gear]...

Update class of a subset of columns in data.table

r,data.table
I want to change a couple of data.table columns from factor to character library(data.table) ir <- as.data.table(iris) ir[, Species2 := Species] I can identify which columns I need to change facs <- which(sapply(ir, is.factor)) facs And I can update the columns by name: ir[, c("Species", "Species2") := lapply(.SD, as.character), .SDcols...

How do I take a rolling product using data.table

r,data.table
dt <- data.table(x=c(1, .9, .8, .75, .5, .1)) dt x 1: 1.00 2: 0.90 3: 0.80 4: 0.75 5: 0.50 6: 0.10 For each row, how do I get the product of x for that row and the next two rows? x Prod.3 1: 1.00 0.7200 2: 0.90 0.5400 3:...

Replace Inf in R data.table / Show number of Inf in colums

r,data.table,infinite,na
I can't figure out how to use an is.na(x) like function for infinite numbers in R with a data table or show per column how many Inf's there are: colSums(is.infinite(x)) I use the following example data set: DT <- data.table(a=c(1/0,1,2/0),b=c("a","b","c"),c=c(1/0,5,NA)) DT a b c 1: Inf a Inf 2: 1...

R data.table get unique rows droping some columns as well

r,data.table
How to get unique rows from data.table also removing some columns from result, where uniqueness is defined by values in more then one column? For example: tbl = data.table( reader_id = c(10,20,20,30,50), book_id = c(1,2,2,4,5), date = c('d1','d2','d3','d4','d5'), inf = c('i1','i2','i3','i4','i5') ) >tbl reader_id book_id date inf 1: 10 1...

R - Filtering character dates in data.table

r,data.table
The default way of reading dates in data.table with fread is that dates are stored as character values. Using this default, I noticed that filtering in i for a date range using logical comparison versus %in% operator are dramatically different in execution time: library(data.table) CharDateRange <- function(start.date, end.date) { sapply(seq(as.Date(start.date),...

data.table: merge() one-key table with two-keys table

r,data.table
I refer also to this questions that is now almost 3 years old. I have a similar problem. I have these three tables: require(data.table) profile <- data.table(id = seq(11,20)) setkey(profile, "id") post <- data.table(id = seq(1,10)) setkey(post, "id") comment <- data.table(post_id = seq(1,10), profile_id = seq(11,20)) setkeyv(comment, c("post_id","profile_id")) Now I...

R: Create 2 columns with difference and percentages values of another column

r,data.frame,data.table,plyr,dplyr
I have a dataframe like this ID <- c(101,101,101,102,102,102,103,103,103) Pt_A <- c(50,100,150,20,30,40,60,80,90) df <- data.frame(ID,Pt_A) +-----+------+ | ID | Pt_A | +-----+------+ | 101 | 50 | | 101 | 100 | | 101 | 150 | | 102 | 20 | | 102 | 30 | | 102 |...

r %in% operator behavior for data table factors?

r,data.table
I can't seem to get the %in% operator to behave for data table factor columns. I know I'm probably missing some secret syntax for data tables, but I can't to find it... I've searched all over. Here's a tiny example illustrating my pain. Of course the simple answer would be...

copy a list of data.tables

r,data.table
I have the following situation: 1) a list of data tables 2) For testing purposes I deliberately want to (deeply) copy the whole list including the data tables 3) I want to take some element from the copied list and add a new column. Here is the code: library(data.table) x...

R data.table replace “NULL” with `NA` when columns are factors

r,data.table
I pulled some data from a SQL database through ODBC and the columns are automatically set to factor. It is something like the following: library(RODBC) library(data.table) data <- data.table(sqlQuery(channel, query)) My data looks like this, just with a lot more columns: data <- data.table("C1"=as.factor(c(letters[1:4], "NULL", letters[5])), "C2"=as.factor(c(rnorm(3), "NULL", rnorm(2))), "C3"=as.factor(c(letters[1],...

How can I replace TRUE and FALSE values with 1 and 0 when exporting data in R?

r,data.table
I am exporting data from R with the command: write.table(output,file="data.raw", na "-9999",sep="\t",row.names=F,col.names=F) that exports my data correctly, but it exports all of the logical variables as TRUE and FALSE. I need to read the data into another program that can only process numeric values. Is there an efficient way to...

how can I select all rows of a table that match those in another table

r,data.table
I am trying to do something very simple with data.table and I lost the idiomatic way to do it library(data.table) set.seed(1) DT = data.table(a=sample(letters,1e5,T), b=sample(letters,1e5,T), c=rnorm(1e5)) DT2 = data.table(a=sample(letters,5,T), b=sample(letters,5,T)) DT2 a b 1: k h 2: e v 3: f n 4: m q 5: w v I want...

Emulate the window function of SQL in R

r,data.table
I have a table as following: id name amount year 001 A 10 2010 001 A 10 2011 001 A 12 2012 ----------------------- 002 A 3 2012 002 A 4 2013 ----------------------- 003 B 20 2011 003 B 20 2012 (Note two entities have the same name A but they...

How to calculate count by group, then keep only one per group

r,merge,data.table,aggregate
Say that I have this data.frame, data: data <- data.frame(val=c(rep(6,10), rep(7, 15), rep(8, 20), rep(9, 25), rep(10, 100), rep(11, 20), rep(12, 15), rep(13, 10))) data$plus <- data$val + 100 My goal is to create a new data.frame that has the frequencies of each val, and the associated plus value. My...

R: Convert obscure table into matrix

r,data.table
I have table that looks like this: Row Col Value 1 1 31 1 2 56 1 8 13 2 1 83 2 2 51 2 9 16 3 2 53 I need to convert this table into matrix (Row column represents rows and Col column represents columns). For the...

How to select a column of a data.table via a variable

r,variables,data.table
I've got a data table, and instead of addressing a particular column (say column "x") by (a hardcoded) "dt$x", I'd like to get this done in a dynamic manner by facilitating a variable instead, ie. "dt${var}". I've already tried "dt[, eval(quote(var)), with=FALSE)", as well as "dt[, c(var), with=FALSE)", but unfortunately...

Creating an index variable in R based on row index

r,indexing,row,data.table
I think it's time to ask for help. Suppose I have this data.frame or data.table State Date Event CA Oct27 1 CA Oct28 0 CA Oct29 0 CA Oct30 0 CA Oct31 1 TX Oct27 0 TX Oct28 1 TX Oct29 1 TX Oct30 0 TX Oct31 0 TX Nov1...

Data table, removing leading missing values by group

r,data.table
Below is an example data table, which I would like to remove the rows where value is NA and no earlier row has a value, i.e. also NA, and by group. As not all have the same number leading missing values I'm getting stuck and not having any luck searching....

Events in last 21 days for every row by Name

r,data.frame,data.table,dplyr
This is what my dataframe looks like. The two rightmost columns are my desired columns.These two columns check the condition whether in the last 21 days there is an "Email" ActivityType and whether in the last 21 days there is a "Webinar" ActivityType. Name ActivityType ActivityDate Email(last21days) Webinar(last21day)** John Email...

Faster way to summarise variables based on column in R

r,data.table,dplyr
I wish to summarise some variables in my data frame based on a column. However my data frame is rather large (>30,000,000 rows) and using the summarise function in dplyr takes ages to run. Is there a faster way in R to speed up the summarising process? I have a...

Mean of Groups of means in R

r,data.table
I have the following data Exp = my data frame dt<-data.table(Game=c(rep(1,9),rep(2,3)), Round=rep(1:3,4), Participant=rep(1:4,each=3), Left_Choice=c(1,0,0,1,1,0,0,0,1,1,1,1), Total_Points=c(5,15,12,16,83,7,4,8,23,6,9,14)) > dt Game Round Participant Left_Choice Total_Points 1: 1 1 1 1 5 2: 1 2 1 0 15 3: 1 3 1 0 12 4: 1 1 2 1 16 5: 1 2 2...

Merge two data frames with non-unique keys

r,data.table
I want to select the rows of data1 if the value of the first column exists in data2: data1<-data.table(a=c(1,2,3,1),b=c(1,4,6,3)) data2<-data.table(a=c(1,3,1,5,6)) Desired output: a b 1 1 3 6 1 3 The merge does not work since the keys are not unique, any other idea?...

Finding gaps between intervals using data.table

r,data.table
I have the following problem: given a set of non-overlapping intervals in a data.table, report the gaps between the intervals. I have implemented this once in SQL, however I am struggling with data.table due to the lack of a lead function or lag function. For completeness, I have here the...

simple lookup in R by ID and Time

r,data.table,lookup
I have a dataset that looks like this: set.seed(1234) DT<-data.table(id=c(rep(c("a","b","c","d"),5)), year=rep(seq(from = 2010.5,to=2012.5,by = .5),each=4), value=rnorm(20,10,1)) DT id year value 1: a 2010.5 8.792934 2: b 2010.5 10.277429 3: c 2010.5 11.084441 4: d 2010.5 7.654302 5: a 2011.0 10.429125 6: b 2011.0 10.506056 7: c 2011.0 9.425260 8: d...

How to aggregate data.table on multiple dimensoins

r,data.table
I have a data table for which I want to aggregate the data based on multiple fields. Here is a simplified example of my data: # each record is the number of pages read # by a student in a given day pages_per_day <- data.table( student_id = c(1,1,1,2,2,2), week_of_semester =...

Why doesn't this conditional lookup work?

r,data.table
EDIT: Revamped reflecting comments below I've got some data on workers across time. They may work in more than one position in any given year; I want to subset the data to get workers who have at least one position with certain characteristics. Here's my toy data: set.seed(1643) dt<-data.table(id=rep(1:1000,10), area=sample(letters,1e4,replace=T),...

data.table operations by column name with spaces fails

r,data.table
Reproducible example #Use the Iris data set library(data.table) iris colnames(iris)[3] <- "Petal Length" iris <- as.data.table(iris) Accessing column without space is fine iris[,Petal.Width] however access a column where the name contains a space doesn't work iris[,Petal Length] iris[,'Petal Length'] The only solution seems to be iris[,iris$'Petal Length'] Comments I'm new...

Plotting pre-computed statistics on POSIX dates with ggplot's geom_boxplot

r,ggplot2,data.table,lubridate
I'm trying to create a series of boxplots showing a distribution of date values. I compute the quantiles using data.table, then I feed them to ggplot to be plotted. When I try to plot them, however, I get an error that says "Error: '/' not defined for "POSIXt" objects". Here's...

R data.table - different join behavior for integer/numeric and character columns

r,data.table
I have two data.tables DT and ADT and I want to join them on columns a, new.a: R> DT a b 1: 1 1.0 2: 1 1.0 3: 2 2.0 4: 3 3.5 5: 4 4.5 6: 5 5.5 R> ADT new.a type 1: 1 3 2: 1 5 3:...

Access data.table columns with strings

r,string,indexing,data.table
Apologies for a question that probably makes it obvious that I usually work in Python/pandas, but I'm stuck with this. How do I select a data.table column using a string? dt$"string" dt$as.name("string") dt$get("string") I'm sure this is super simple, but I'm not getting it. Any help is greatly appreciated! --------------...

Changing multiple Columns in data.table r

r,data.table
I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically as well as a second input, I wasn't able to find an answer. The idea is to index two or more series on a certain date by...

Cumulative total by group

r,data.table
For the following dataset: d = data.frame(date = as.Date(as.Date('2015-01-01'):as.Date('2015-04-10'), origin = "1970-01-01"), group = rep(c('A','B','C','D'), 25), value = sample(1:100)) head(d) date group value 1: 2015-01-01 A 4 2: 2015-01-02 B 32 3: 2015-01-03 C 46 4: 2015-01-04 D 40 5: 2015-01-05 A 93 6: 2015-01-06 B 10 .. can anyone...

In data.table: iterating over the rows of another data.table

r,data.table
I have two data.tables: k1 <- mtcars[1:4,1:6] k11 <- as.data.table(k1) k2 <- iris[1:3,1:2] k22 <- as.data.table(k2) I am trying to perform some column operation on the first data.table by iterating over rows of columns of second data.table k3 <- lapply(1:nrow(k2),function(j){ mpg=k1[,"mpg"]*k2[j,"Sepal.Width"] #get the new value of mpg equals to mpg*first...

R get value at minimum and maximum time of day

r,data.table
I have some data that I need to analyse easily. I want to create a graph of the average usage per day of a week. The data is in a data.table with the following structure: time value 2014-10-22 23:59:54 7433033.0 2014-10-23 00:00:12 7433034.0 2014-10-23 00:00:31 7433035.0 2014-10-23 00:00:49 7433036.0 ......

Read blank as missing(NA) in R data table

r,function,data.table
I am trying to create a function which can find missing location and impute the missing in a data table. Now this function uses is.na() extensively to find out the missing location and also to replace it with imputation value. It is working fine for all type of variable until...

error in count_combinations in statar

r,data.table
I'm trying to test out the count_combinations function in the statar package, and I'm encountering the following problem: Here's the code: id <- c(1, 1, 2, 2) name <- c("coca cola company", "coca cola incorporated", "apple incorporated", "apple corp") count_combinations(name, id = id) And here's the error: Error in setorderv(x,...

Refer to previous observation and concatenate strings with a condition in R

r,data.table
So I have this kind of data: a <- data.table("a"=c("1 42"," 84","2 10","3 12"," 24")) a 1: 1 42 2: 84 3: 2 10 4: 3 12 5: 24 The data is a single sorted vector consisting of strings with IDs 1, 2 and 3 and data on the IDs....

Extracting text strings using data.table in R

regex,r,data.table
I have a data.table similar to the one as follows Data library(data.table) DT <- structure(list(N = 1:6, VN = c("v1", "v3", "v6", "v7a", "v18", "v23"), T1 = c("bigby (wolf)", "white", "red (rose)", "piggy (straw)", "(curse) beast", "prince"), T2 = c("jack (bean)", "snow (dwarves)", "beard (blue)", "bhageera (jungle) mowgli (book)", "beauty",...

R: cut by date and grouping by ID with data.table

r,data.table
I have a data.table with with a list of actors uniquely identified by id doing things on a date. There is no limit to number of things done by an actor on a particular date. require(data.table) set.seed(28100) df.in <- data.table(id = sample(1:10, 100, replace=TRUE), date = sample(2001:2012, 100, replace=TRUE)) Now...

Paste multiple elements of a list of vectors

r,list,data.table,paste
I have a data table like this ID Name 1: 2760925 01_HOOFD_010 2: 2760925 01_HOOFD_015 3: 2771451 01_HOOFD_010 4: 2771451 01_HOOFD_190_2 5: 2771451 01_HOOFD_030_2 6: 2771451 08_AWB45_020_2 7: 2771451 08_AWB45_040 8: 2771451 01_HOOFD_065_2 For the "Name" field, I want to have the part after the second underscore in a separate...

dplyr: Difference between unique and distinct

r,data.table,dplyr
Seems the number of resulting rows is different when using distinct vs unique. The data set I am working with is huge. Hope the code is OK to understand. dt2a <- select(dt, mutation.genome.position, mutation.cds, primary.site, sample.name, mutation.id) %>% group_by(mutation.genome.position, mutation.cds, primary.site) %>% mutate(occ = nrow(.)) %>% select(-sample.name) %>% distinct() dim(dt2a)...

For each `pop` get frequencies of the elements of `id`

r,table,data.table
Consider this data: m = data.frame(pop=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4), id=c(0,1,1,1,1,1,0,2,1,1,1,2,1,2,2,2)) > m pop id 1 1 0 2 1 1 3 1 1 4 1 1 5 2 1 6 2 1 7 2 0 8 2 2 9 2 1 10 3 1 11 3 1 12 3 2 13 3 1...

How to create a random matching between the rows of two data.tables (or data.frames)

r,data.table
For this example, I'll use the data.table package. Suppose you have a table of coaches coaches <- data.table(CoachID=c(1,2,3), CoachName=c("Bob","Sue","John"), NumPlayers=c(2,3,0)) coaches CoachID CoachName NumPlayers 1: 1 Bob 2 2: 2 Sue 3 3: 3 John 0 and a table of players players <- data.table(PlayerID=c(1,2,3,4,5,6), PlayerName=c("Abe","Bart","Chad","Dalton","Egor","Frank")) players PlayerID PlayerName 1: 1...

R: subsetting and ordering large data.frame without forloop

r,for-loop,data.table,dplyr,split-apply-combine
I have long table with 97M rows. Each row contains the information of an action taken by a person and the timestamp for that action, in the form: actions <- c("walk","sleep", "run","eat") people <- c("John","Paul","Ringo","George") timespan <- seq(1000,2000,1) set.seed(28100) df.in <- data.frame(who = sample(people, 10, replace=TRUE), what = sample(actions, 10,...

r count combinations of elements in groups

r,data.table
I wish to count the number of times each combination of two elements appears in the same group. For example, with: > dat = data.table(group = c(1,1,1,2,2,2,3,3), id=c(10,11,12,10,11,13,11,13)) > dat group id 1: 1 10 2: 1 11 3: 1 12 4: 2 10 5: 2 11 6: 2 13...

Row indices in data.table in R

r,data.table
How do I control the row indices in data.table in R? I want to check if the value in a row matches the previous one: patient produkt output 1 Meg Initiation 1 Meg Continue 1 Gem Switch 2 Pol Initiation 2 Pol Continue 2 Pol Continue Where the output column...

Loading Data.Table and dplyr one after other gives errors

r,data.table,dplyr
I ran into a problem that I have not encountered before. When I load the data.table package (version 1.9.4) alone, and then try to subset a dataset to remove a variable I get no issues. However, when I load plyr (version 1.8.2) and dplyr (version 0.4.1) as well, I get...

Speed up for loop with two if statements

r,for-loop,data.table
I have a data table DT with 15,000+ rows. I have a for loop that runs correctly, but it takes 30+ seconds and is the slowest part of the code overall. Here is the for loop: for (i in 2:nrow(DT)) { if(DT$C1[i] == DT$C1[i+1] & DT$C2[i] != DT$C2[i+1] & DT$C3[i+1]...

Combine the result of the function on a row in one column

r,data.table
I have a large data.table where one column contains text, here is a simple example: x = data.table(text = c("This is the first text", "Second text")) I would like to get a data.table with one column containing all the words of all the texts. Here was my try: x[, strsplit(text,...

Transforming irregular data into usable format in R

r,data.table,transformation
Suppose I work in a company that provides a number of different services to their customers. I have been given a data report on the services provided that I need to analyse. The report is formatted in a way to be easily read and printed, but not suitable for data...

In place modification of matrices in R

r,matrix,data.table,copy-on-write
I there any way to avoid copy-on-modify for in-place modifications of matrices in R? I am trying to copy a smaller matrix to a slice of larger matrix as follows. library(data.table) y <- matrix(c(11,21,31,12,22,32),nrow=3,ncol=2) address(y) [1] "08429190" y[2:3,] <- matrix(c(1,1,8,12),nrow=2) address(y) [1] "0E033D28" ...

How to delete columns from a data.table based on values in column

r,data.table
Background I have some financial data (1.5 years SP500 stocks) that I have manipulated into a wide format using the data.table package. After following the whole data.table course on Datacamp, I'm starting to get a hang of the basics, but after searching for hours I'm at a loss on how...

data.table complex manipulation with rows elimination [closed]

r,data.table
Data sample is below: I have 3million rows. Date,id,type,qty 9/30/14,1,’A’,10 9/30/14,2,’Z’,12 9/30/14,3,’B’,15 9/30/14,1,’B’,20 9/30/14,1,’Z’,20 9/30/14,1,’A’,20 9/30/14,2,’B’,20 9/30/14,3,’B’,5 9/30/14,3,’A’,40 I want result as below: Date,id,type,Qty 9/30/14,1,A,20 9/30/14,1,B,20 9/30/14,2,B,20 9/30/14,3,B,5 9/30/14,3,’A’,40 Logic is below: On the same date, pick the latest qty (from the later record) for each id and type.Ignore types...

updating “multiple” columns of “selected” rows of a data table with duplicate key values

r,data.table
I am working on a data table with a million odd rows. It has a non-unique key. One particular column(say v1) has NA for few rows. I need to update this column(v1) and another column(v2) when v1 has NA. Values for updating these columns come from another data table. The...

Select columns of data.table based on regex

regex,r,data.table
How can I select columns of a data.table based on a regex? Consider a simple example as follows: library(data.table) mydt <- data.table(foo=c(1,2), bar=c(2,3), baz=c(3,4)) Is there a way to use columns of bar and baz from the datatable based on a regex? I know that the following solution works but...

Binning data across interval boundaries

r,data.table
Say I have these data: start end duration 1 2.67026 2.903822 0.233562 2 4.40529 5.606470 1.201180 3 9.24340 10.010818 0.767418 4 11.87930 13.414140 1.534840 5 14.78210 15.182492 0.400392 6 16.51720 16.817494 0.300294 7 22.08930 25.125610 3.036310 8 32.13240 33.667240 1.534840 9 45.47880 45.912558 0.433758 10 52.85270 54.454270 1.601570 11 55.62210...

Create sequence with data.table

r,data.table,aggregate-functions
I have a data.table of the format id | pet | name 2011-01-01 | "dog" | "a" 2011-01-02 | "dog" | "b" 2011-01-03 | "cat" | "c" 2011-01-04 | "dog" | "a" 2011-01-05 | "dog" | "some" 2011-01-06 | "cat" | "thing" I want to perform an aggregate that concatenates...

Eliminate all duplicates according to a key + keep records from a table which are not in another table

r,data.table
Question 1: I am relatively new to R and I have two distinct questions. I need to eliminate duplicates according to a key (single or multiple) but all of them so unique wouldn't do it. I also found the function duplicated but it will mark as true only from the...

R Data.table divide values in column based on another column

r,data.table
I have a main data.table which has 364 rows and the 3 columns: Date Weekday Weight 2012-01-01 Monday 100 2013-01-02 Tuesday 200 ... and a help data.table with 7 rows 2 columns: Weekday Coefficient Monday 0.91 Tuesday 0.84 Wednesday 0.99 ... Now i would like to create a 4th column...

Order by group sum

r,data.table,dplyr
take a diamond data set in ggplot as an example I want to order the data frame by the total price for each color , so if for example this is the total price for each color H 5000 I 4000 E 1000 J 3000 I want to sort the...

R data.table Conditional Sum: Cleaner way

r,data.table
This of course is a very often encountered problem, so I have expected many questions here on SO regarding this. However, all the answers that I could find were very specific to the question and often encountered workarounds (you don't have to do this, foobar is much better in this...