R: split string into numeric and return the mean as a new column in a data frame

r,recursion,dplyr,strsplit
I have a large data frame with columns that are a character string of numbers such as "1, 2, 3, 4". I wish to add a new column that is the average of these numbers. I have set up the following example: set.seed(2015) library(dplyr) a<-c("1, 2, 3, 4", "2, 4,...
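One possible approach to the task described above, as a minimal sketch. The original example is truncated, so the data frame `df`, its column `a`, and the helper `mean_of_string` are hypothetical names with made-up values, not the asker's data.

```r
library(dplyr)

# Hypothetical data: the question's own example is cut off, so these values are invented.
df <- data.frame(a = c("1, 2, 3, 4", "2, 4, 6", "10"), stringsAsFactors = FALSE)

# Split each string on commas, convert the pieces to numeric, and average them.
# vapply() guarantees exactly one numeric result per input string.
mean_of_string <- function(s) {
  vapply(strsplit(s, ",", fixed = TRUE),
         function(parts) mean(as.numeric(trimws(parts))),
         numeric(1))
}

# Add the per-row mean as a new column.
df <- df %>% mutate(avg = mean_of_string(a))
```

Wrapping the split-and-average step in a vectorised helper keeps the `mutate()` call readable and avoids row-wise grouping.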

slice reoccurring values using dplyr

r,dplyr
Taking the following data: A <- c(4,4,4,5,5,5,5,6,6) B <- c(1,2,3,1,3,4,3,2,7) data1 <- data.frame(A,B) I want to remove the duplicated B values for each A, so my new table should drop data1[7,]. I want to use the dplyr package and have tried the following code: data2 <- data1 %>% group_by(A) %>%...
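A hedged sketch of one way to get the result described, using the data from the question. It swaps the truncated `group_by()` attempt for `distinct()`, which is an assumption about what fits here rather than the asker's own solution.

```r
library(dplyr)

A <- c(4, 4, 4, 5, 5, 5, 5, 6, 6)
B <- c(1, 2, 3, 1, 3, 4, 3, 2, 7)
data1 <- data.frame(A, B)

# Keep the first occurrence of every (A, B) combination.
# Row 7 (A = 5, B = 3) repeats row 5, so it is dropped.
data2 <- data1 %>% distinct(A, B)
```

If the data frame had further columns to retain, `distinct(A, B, .keep_all = TRUE)` would keep them in recent dplyr versions.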

randomly assign teachers to a school with dplyr or similar?

r,dplyr
Suppose I have a data frame with 8 schools and its characteristics, and another with 48 teachers and its characteristics. I can generate some fake data with the following code: library(dplyr) library(geosphere) set.seed(6232015) n.schools <-8 n.teachers <- 48 makeRandomString <- function(pre, n=1, length=12) { randomString <- c(1:n) # initialize vector...

Select first observed data and utilize mutate

r,dplyr
I am running into an issue with my data where I want to take the first observed score for each individual id and subtract it from the last observed score. The problem with asking for the first observation minus the last observation is that sometimes the first observation...

dplyr: optional parameter in mutate_each

r,dplyr
I use the dplyr package in R. Using it, I want to create a function like require(dplyr) aFunction <- function(x, optionalParam1="abc"){ cat(optionalParam1, "\n") return(x) } myFun <- function(data, ...){ result <- data %>% mutate_each(funs(aFunction(., ...))) } and then call it like data = data.frame(c1=c(1,2,3), c2=c(1,2,3)) myFun(data) # works myFun(data, optionalParam1="xyz")...

dplyr: filter rows according to aggregated function result

r,filter,aggregate-functions,dplyr
I have a table listing (amount, year and month) and I want to filter the rows corresponding to complete years. I.e. I want to omit the last 4 rows of the sample dataframe I give below, which refer to 2015, and keep the remaining 60. Is it possible to do...

Duplicated rows emerging when using group_by and summarise

r,data.table,dplyr
I have a data table and I get duplicated rows when using the summarise function combined with group_by. I will give a simplified example of my problem. First, I am using group_by and mutate to add the sum of 'value' for each id. dt <- data.table(id = rep(1:5, each=10), cpc...

using limits to define unique identities

r,data.frame,dplyr
Person <- c(1,2,3) Age <- c(10,22,30) Height <- c(140,185,160) Weight <- c(65, 80, 75) People <- data.frame(Person, Age, Height, Weight) Age_cats_type1 [5-15], [20-30], [35-45] Age_cats_type2 [8-13], [14-16], [18-40] Height_cat_Type1 [100-120], [121-140], [141-186] Height_cat_type2 [110-125], [126-145], [146-190] Weight_cat_Type1 [50-60], [61-78], [79-85] Weight_cat_Type2 [55-75], [76-90], [91-100] For People[1,2] (age=10), this fits within...

R ggvis interactive slider for calculating y values (e.g. for background correction)

r,dplyr,ggvis
I would like to use a ggvis slider to visually do a background correction of my data: library("dplyr") library("ggvis") library("lubridate") data <- data.frame(timestamp = Sys.time() - hours(10:1), signal = rnorm(10), temperature = rnorm(10)) mySlider <- input_slider(0, 2, value = 1, step = 0.1, label = "T-correction") data %>% ggvis(~timestamp, ~signal)...

dplyr filter () ignores the rows with the value 0, how to change that

r,dplyr
In my case, I need the filter function to consider the rows with value 0, but it ignores that. So I have this data attached and I want to filter out the values of flowers separately for s1 and s2 names. My table is case1 s1= filter(case1, Flower,Name=="s1") s2= filter(case1,...

Plotting (ggplot) numeric values from mixed long format column of class character

r,ggplot2,dplyr,tidyr
Following the tidy data standard, I have my data in long format with a key and a value column. The values for some keys are numeric, for others are characters, and so R has the entire column set as character class. When I use filter() to pipe only the numeric...

Use group_by to filter specific cases while keeping NAs

r,dplyr
I want to filter my dataset to keep cases with observations in a specific column. To illustrate: help <- data.frame(deid = c(5, 5, 5, 5, 5, 12, 12, 12, 12, 17, 17, 17), score.a = c(NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, 1, NA)) Creates deid score.a...

Extra column based on paired data (mutate)

r,merge,dplyr
I have a dataset with paired data (members of same household). Id is the individual identifier and householdid is the identifier of the partner (and vice versa). What I need is to add an extra column (the occupation) for each id of his/her partner. My data look like this dta...

Grouping by factor absent in dataset

r,group-by,dplyr
I'm using dplyr to calculate the grouping of a variable [mode], for a series of files. The goal is to build a data frame where you copy (cbind) the results for every file (any ideas on a better data structure welcome). So files are like this: FILE 1: id |mode...

scalars comparison in dplyr filter

r,dplyr
In the following reproducible example I want to filter the rows in df where an expression of the variables gives a specific result. Rows 1 and 3 meet the requirements and should be returned. However my first attempt was not correct since I used == for comparing scalars. df <-...

Do dplyr functions on a database tbl execute locally or remotely?

sql,r,dplyr
I've been using dplyr for a bit locally and I've found it a very powerful tool. One thing that gets showcased in a lot of the intro talks I've found is how you can use it to operate on a database table "to only work with the data you want"...

dplyr multiple inputs from Shiny

r,shiny,dplyr
I have a Shiny app that takes input from a radio button and then uses it to filter the data frame with dplyr on the server side. It works, but now I want to expand it to take multiple inputs to filter, and I have no idea how to...

how to write the following code using %>% in R [duplicate]

r,dplyr,piping,magrittr
This question already has an answer here: filter for complete cases in data.frame using dplyr (case-wise deletion) 4 answers I am trying to use more and more of the %>% operator with dplyr in my code but I find I am not able to figure out how to use %>%...

Create new variable that is linear combination of many other variables

r,dplyr
Suppose I have a data frame that looks like this: df1 <- as.data.frame(matrix( rnorm(100*50,mean=0,sd=1), 100, 50)) I want to create a new variable y that is the sum alpha_i*V_i with i from 1 to 50 and where alpha is a random number drawn from a uniform distribution (0,1). What is...

R: subsetting and ordering large data.frame without forloop

r,for-loop,data.table,dplyr,split-apply-combine
I have a long table with 97M rows. Each row contains the information of an action taken by a person and the timestamp for that action, in the form: actions <- c("walk","sleep", "run","eat") people <- c("John","Paul","Ringo","George") timespan <- seq(1000,2000,1) set.seed(28100) df.in <- data.frame(who = sample(people, 10, replace=TRUE), what = sample(actions, 10,...

Create Conditional new variables in R

r,dplyr,tidyr
I need to recreate the original variables of a very large data frame (900+ variables). Here is an example of what I'm trying to do: dat <- data.frame( id=c('user1','user2','user3'), agePanel1=c(20,25,32), agePanel2=c(21,NA,33), favColPanel1=c('blue','red','blue'), favColPanel2=c('red',NA,'red') ) id agePanel1 agePanel2 favColPanel1 favColPanel2 1 user1 20 21 blue red 2 user2 25 NA red...

Summarise with dplyr and export table with means and sd (+/-)

r,export,dplyr,mean
I have a data frame with different observations in three compartments (COMP1, COMP2 and COMP3) for ten products (PRO01-PRO10) and for three different treatments (A, B and C). My data is this: TRA PRO COMP1 COMP2 COMP3 A PRO01 2.98 6.35 45.57 A PRO01 2.97 9.17 66.52 A PRO01 3.83...

Best practice to get a dropped column in dplyr tbl_df

r,data.frame,dplyr
I remember a comment on r-help in 2001 saying that drop = TRUE in [.data.frame was the worst design decision in R history. dplyr corrects that and does not drop implicitly. When trying to convert old code to dplyr style, this introduces some nasty bugs when d[, 1] or d[1]...

Summing rows by month in R

r,data.table,aggregate,dplyr,lubridate
So I have a data frame that has a date column, an hour column and a series of other numerical columns. Each row in the data frame is 1 hour of 1 day for an entire year. The data frame looks like this: Date Hour Melbourne Southern Flagstaff 1 2009-05-01...

Using dplyr window functions to calculate percentiles

r,dplyr,tidyr
I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions. Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon...

Find value in previous and next year

r,datetime,dplyr
I have a dataframe with timeseries observations. For each observation, I would like to add a variable with the value at the closest similar date in the previous year and the closest similar date in the next year (e.g. for a value of 15 May 2014, this might be 13...

bind_rows in dplyr throwing unusual error

r,dplyr
Hopefully I'm not duplicating some previously existing issue. I'm working on a 32-bit Win7 machine, R V=3.2.0, dplyr V=0.4.1, RStudio 0.98.1103. The files in question are two CSV files read into vars (x,y / sep = "|", header = TRUE, stringsasFactors = FALSE), that originated from the same Oracle table....

dplyr: Create a new variable as a function of all existing variables without defining their names

r,dplyr
In the following dataframe I want to create a new variable as the following function of all existing ones: as.numeric(paste0(df[i,],collapse="")) However, I don't want to define the column names explicitly because their number and names may be different each time. How can I do that using dplyr? The equivalent in base...

How to convert class of several variables at once using dplyr

r,dplyr
So I have a data frame with several character variables that I want to convert to numeric. Each of these variables starts with "sect1". I can do this easily one at a time, but I'm wondering if this can be accomplished all at once. I've done this in...

using intervals to assign categorical values

r,data.frame,dplyr,intervals
Take the following generic data A <- c(5,7,11,10,23,30,24,6) B <- c(1,2,3,1,2,3,1,2) C <- data.frame(A,B) and the following intervals library(intervals) interval1 <- Intervals( matrix( c( 5, 15, 15, 25, 25, 35, 35, 100 ), ncol = 2, byrow = TRUE ), closed = c( TRUE, FALSE ), type = "Z" )...

Add new column as result of a condition between groups in dplyr

r,dplyr
I need to know if a person belongs to a unique group or several groups and add a new column with boolean values that describe this condition. Example data: df <- structure(list(group = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 1L, 3L), person = c(955563L, 955563L, 855563L, 855563L,...

How to aggregate data with the dplyr package in R

r,data.frame,grouping,dplyr
I would like to understand how to write the following code using the dplyr package: averageStepsDayType <- aggregate( NAreplacement$steps, by=list(interval=NAreplacement$interval, dayType=NAreplacement$dayType), mean ) This is the original data frame: > head(NAreplacement) steps date interval dayType 1 1.7169811 2012-10-01 0 weekday 2 0.3396226 2012-10-01 5 weekday 3 0.1320755 2012-10-01 10 weekday...

Using dplyr summary function on yearmon from zoo

r,dplyr,zoo
I have a data frame with values associated to a year and month. I use yearmon class from zoo package to store the year-month info. My aim is to count the average of those values from the same year-month. However, using dplyr seems to give me an error. The variable...

How do I turn sets into indicators of set membership?

r,postgresql,dplyr
For each observation, the data I have contain a set of "flavors". I would like to turn those sets (which exist as text[] arrays in PostgreSQL) into indicators for the presence of individual flavors, as I'd like to examine how flavors either do or do not go together. What I...

Loading Data.Table and dplyr one after other gives errors

r,data.table,dplyr
I ran into a problem that I have not encountered before. When I load the data.table package (version 1.9.4) alone, and then try to subset a dataset to remove a variable I get no issues. However, when I load plyr (version 1.8.2) and dplyr (version 0.4.1) as well, I get...

Select (dplyr) operator with '-'

r,dplyr
How to use SELECT (dplyr library) operator with name containing '-'? For example: AdultUCI %>% select(capital-gain) caused: ...
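A minimal sketch of the usual workaround: wrapping a non-syntactic column name in backticks makes R treat it as a single name instead of the subtraction `capital - gain`. The data frame `df` below is a hypothetical stand-in for `AdultUCI`, with invented values.

```r
library(dplyr)

# Hypothetical stand-in for AdultUCI; check.names = FALSE preserves the hyphen
# instead of letting data.frame() rewrite the name to capital.gain.
df <- data.frame(`capital-gain` = c(0, 2174),
                 age = c(39, 50),
                 check.names = FALSE)

# Backticks quote the name so select() sees one column, not an expression.
res <- df %>% select(`capital-gain`)
```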

merge and plot multiple text files

r,ggplot2,dplyr
I have sixty text files, each with two columns as shown below, each representing a unique sample, and headed 'Coverage' and 'counts'. The length of each file differs by a few rows, because for some values of Coverage, the Count is zero, therefore not printed. Each file is about 1000...

dplyr: max value in a group, excluding the value in each row?

r,dplyr
I have a data frame that looks as follows: > df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4)) > df Source: local data frame [6 x 2] g x 1 A 7 2 A 3 3 B 5 4 B...

recursive error in dplyr mutate

r,dplyr
Just learning dplyr (and R) and I do not understand why this fails or what the correct approach to this is. I am looking for a general explanation rather than something specific to this contrived dataset. Assume I have 3 file sizes with multipliers and I'd like to combine them...

R spreading multiple columns with tidyr [duplicate]

r,dplyr,tidyr
This question already has an answer here: tidyr repeated measures multiple variables (wide format) 3 answers Take this sample variable df <- data.frame(month=rep(1:3,2), student=rep(c("Amy", "Bob"), each=3), A=c(9, 7, 6, 8, 6, 9), B=c(6, 7, 8, 5, 6, 7)) I can use spread from tidyr to change this to wide...

dplyr full_join does not work as expected

r,dplyr
Here is a toy example (where merge is from the base package and full from dplyr): require(dplyr) a = data.frame(Day=Sys.Date()+1:5,x=1:5) b = data.frame(Day=Sys.Date()-1:5,x=3*(1:5)) x1 = b x2 = b for(i in 1:10){ x1=full_join(x1,a,by="Day") x2 = merge(x2,a,by="Day",all=T) } x1 and x2 are different. I would expect x2 since "a" is appended...

conditional calculations in data frame

r,dplyr
I frequently have to calculate new variables from existing ones in a data frame based on a condition of a factor variable. Edit Getting 4 answers in like 2 mins, I realised I have oversimplified my example. Please see below. Simple example: df <- data.frame(value=c(1:5),class=letters[1:5]) df value class 1 a...

How to use logical functions with %>% operator (dplyr)

r,logic,dplyr,which
I was wondering how to use logical (for example which, any, all) functions with %>% operator from dplyr package in R. I have a vector of values aaa <- sample(1:5, 10, replace = TRUE) I would like to find out which of them are equal to 4. When I try...
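As a hedged sketch, comparison operators can be called as ordinary functions by quoting them in backticks, which lets them sit inside a `%>%` chain. The seed below is arbitrary and only there to make the sketch reproducible; the names `idx` and `has_four` are illustrative.

```r
library(dplyr)  # re-exports %>% from magrittr

set.seed(1)  # arbitrary seed, an assumption added for reproducibility
aaa <- sample(1:5, 10, replace = TRUE)

# `==`(4) receives the piped vector as its first argument, i.e. aaa == 4,
# and the resulting logical vector is passed on to which().
idx <- aaa %>% `==`(4) %>% which()

# any() and all() can be chained the same way.
has_four <- aaa %>% `==`(4) %>% any()
```

For a single step like this, plain `which(aaa == 4)` is equivalent and arguably easier to read; the piped form mainly helps in the middle of a longer chain.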

R - Merging Two Data.Frames with Row-Level Conditional Variables

python,mysql,r,merge,dplyr
Short version: I have a slightly trickier than usual merge operation I'd like help optimizing with dplyr or merge. I have a number of solutions already, but these run quite slow over large datasets and I am curious if there exists a faster method in R (or in SQL or...

Simple Table with dplyr on Sequence Data

r,count,dplyr,summary
I would like to make a simple table with dplyr and summarise, but I can't really figure out how... (even though it should be quite simple). I have a matrix of sequences. When I simply tabulate table(dta) I have the result I want. dta acquaintance alone child notnotnot nuclear...

R: Using plyr to perform fuzzy string matching between matching subsets of two data sources

r,plyr,dplyr,fuzzy-comparison
Say I have a list of counties with varying amounts of spelling errors or other issues that differentiate them from the 2010 FIPS dataset (code to create fips dataframe below), but the states in which the misspelled counties reside are entered correctly. Here's a sample of 21 random observations from...

change a value based on a subset of another variable in dplyr

r,dplyr,piping
I have a data set that has muscle activity data from a group of athletes who have had an ACL reconstruction. I want to reassign the limb side to indicate the ACLR limb and uninjured limb. Looking at the dataset called EMG below, suppose John had a left ACLR and...

Faster way to summarise variables based on column in R

r,data.table,dplyr
I wish to summarise some variables in my data frame based on a column. However my data frame is rather large (>30,000,000 rows) and using the summarise function in dplyr takes ages to run. Is there a faster way in R to speed up the summarising process? I have a...

How can I perform mathematical operations on array elements referencing i+1 or i-1 indexes?

r,dplyr
I could write the following code in almost any language. subtractPrevious takes an array and subtracts the i+1st value from the ith value. How do I do this kind of calculation using R? It seems like a job for lapply or maybe a dplyr function. x <- c(1,2,3,4,5,6,7,8,9,10) subtractPrevious <-...

Tidy data frame to matrix in R

r,matrix,dplyr,tidyr
This is the glimpse() of my data frame: $ Row (int) 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 1, 1, 1, 1,... $ Col (int) 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,...

R: Create 2 columns with difference and percentages values of another column

r,data.frame,data.table,plyr,dplyr
I have a dataframe like this ID <- c(101,101,101,102,102,102,103,103,103) Pt_A <- c(50,100,150,20,30,40,60,80,90) df <- data.frame(ID,Pt_A) +-----+------+ | ID | Pt_A | +-----+------+ | 101 | 50 | | 101 | 100 | | 101 | 150 | | 102 | 20 | | 102 | 30 | | 102 |...

adding default values to item x group pairs that don't have a value (df %>% spread %>% gather seems strange)

r,dplyr,tidyr
Short version How to do the operation df1 %>% spread(groupid, value, fill = 0) %>% gather(groupid, value, one, two) in a more natural way? Long version Given a data frame df1 <- data.frame(groupid = c("one","one","one","two","two","two", "one"), value = c(3,2,1,2,3,1,22), itemid = c(1:6, 6)) for many itemid and groupid pairs we...

Filtering on a date using dplyr without changing the variable format

r,filter,dplyr,mutated
I want to use a list of years to filter a database by date years<-c("2014") yearsdata <- data.frame(animal=c("cow","pig"), mydate=c(as.Date("2015-01-01"), as.Date("2014-01-01"))) yearsdata %>% mutate(mydate =format(mydate, "%Y") %>% as.character()) %>% filter(is.null(years) | mydate %in% years) The above code works and lets me filter my dataset but it also formats the date column. Is...

conditional grouping and summarising data frame in [R]

r,group,dplyr
I have a data frame like this: df <- data.frame(ID = c("A", "A", "B", "B", "C", "C"), time = c(3.1,3.2,6.5,12.3, 3.2, 3.4), intensity = c(10, 20, 30, 40, 50, 60)) |ID | time| intensity| |:--|----:|---------:| |A | 3.1| 10| |A | 3.2| 20| |B | 6.5| 30| |B | 12.3|...

Extract a List of values Filtered by criteria

r,dplyr,lapply
I have a list of data in which some rows need to be filtered. I have some criteria for extracting those rows, which I call crit. For each crit that qualifies data, I want to get a subset of the data as output. Sometimes there is a set of functions that can extract...

Why is sum() working in this dplyr expression while quantile() isn't?

r,plyr,dplyr
I want to calculate the quantiles of each row of a data frame and return the result as a matrix. Since I want to calculate an arbitrary number of quantiles (and I imagine that it is faster to calculate them all at once, rather than re-running the function), I tried...

dplyr rename not working with regular expression

r,dplyr
The select function works fine when I try to rename variables according to certain conditions require(dplyr) select(iris, petal = starts_with("Petal")) However when I try to keep all the other variables using rename(iris, petal = starts_with("Petal")) Error: Arguments to rename must be unquoted variable names. Arguments petal are not. I have...

dplyr - summarise weighted data

r,dplyr,summary,weight
Is there a possibility to use weights with dplyr summarise function ? Let us imagine I want to calculate a weighted table dta = structure(list(PHHWT14 = c(530, 457, 416, 497, 395, 480, 383, 420, 499, 424, 504, 497, 449, 406, 492, 470, 418, 407, 403, 362, 393, 368, 423, 448,...

Transform and Count Difference of Unique Customers over Time in R

r,dplyr,reshape2
I've got a data frame in R that looks like the following: cust = c("A", "B", "C", "A", "B", "E", "A", "F", "A", "G") period = as.Date(c("2013/1/1", "2013/1/1", "2013/1/1", "2013/1/2", "2013/1/2", "2013/1/2", "2013/1/3", "2013/1/3", "2013/1/4", "2013/1/4")) df = data.frame(cust, period) I wanted to transform it in a way that I...

Double left join in dplyr to recover values

r,left-join,dplyr
I've checked this issue but couldn't find a matching entry. Say you have 2 DFs: df1:mode df2:sex 1 1 2 2 3 And a DF3 where most of the combinations are not present, e.g. mode | sex | cases 1 1 9 1 1 2 2 2 7 3 1...

dplyr - Get last value for each year

r,dplyr
I have a tbl_df that looks like this: > d Source: local data frame [3,703 x 3] date value year 1 2001-01-01 0.1218 2001 2 2001-01-02 0.1216 2001 3 2001-01-03 0.1216 2001 4 2001-01-04 0.1214 2001 5 2001-01-05 0.1214 2001 .. ... ... ... where dates range across several years....

conditional cumulative sum using dplyr

r,dplyr,zoo
My dataframe looks like this and I want two separate cumulative columns, one for fund A and the other for fund B Name Event SalesAmount Fund Cum-A(desired) Cum-B(desired) John Webinar NA NA NA NA John Sale 1000 A 1000 NA John Sale 2000 B 1000 2000 John Sale 3000 A...

Identify unique rows for a data.frame that is grouped by multiple variables

r,dplyr
everyone, I have been trying to get this to work. Basically, I have a data.frame like the following: C1 C2 C3 C4 a aa aaa aaaa a bb aaa bbbb b aa aaa aaaa b aa aaa aaaa b bb aaa aaaa What I want for output is something like...

Events in last 21 days for every row by Name

r,data.frame,data.table,dplyr
This is what my dataframe looks like. The two rightmost columns are my desired columns.These two columns check the condition whether in the last 21 days there is an "Email" ActivityType and whether in the last 21 days there is a "Webinar" ActivityType. Name ActivityType ActivityDate Email(last21days) Webinar(last21day)** John Email...

Generate a pandoc table without repeated values

r,dplyr,pandoc,pander
I use pander to generate pandoc style tables, often I need to suppress repeated values of variables like this example: ta <- data.frame(class=c(rep("C1" ,3), rep("C2", 6)), name=rep(c("L", "V", "C"), each=3), num=rpois(9, 10)) pandoc.table(ta) The output is -------------------- class name num ------- ------ ----- C1 L 15 C1 L 11 C1...

How to generate column dependent random variable with dplyr

r,dplyr
I want to generate a column of normal random variables with mean defined in the dep variable. However, I got non-random results. I know there are other ways to do it like apply functions (sapply(1:5, rnorm, n=1)), but I am just curious to know how to do it using...

How to use Dplyr's Summarize and which() to lookup min/max values

r,dplyr
I have the following data: Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed") Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43) Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") data <- data.frame(Name, Age, Group) And I'd like to use dplyr to...

Incomplete filtering in dplyr…why?

r,dplyr
Below I have included a subset of my data and some dplyr code. The AA filter statement works as expected. However, the BB filter statement fails to remove all groups specified in Filter.taxa. Any ideas why this is occurring and potential fixes? I have already ensured that spelling and spacing...

Efficient way to perform running total in the last 365 day window

r,vectorization,dplyr,zoo,rollapply
This is what my data frame looks like: Name EventType EventDate SalesAmount RunningTotal Runningtotal(prior365Days) John Email 1/1/2014 0 0 0 John Sale 2/1/2014 10 10 10 John Sale 7/1/2014 20 30 30 John Sale 4/1/2015 30 60 50 John Webinar 5/1/2015 0 60 50 Tom Email 1/1/2014 0 0 0...

How to use dplyr SE with “invalid” names (ie containing spaces)?

r,dplyr
I can't figure out how to use SE dplyr function with invalid variable names, for example selecting a variable with a space in it. Example: df <- dplyr::data_frame(`a b` = 1) myvar <- "a b" If I want to select a b variable, I can do it with dplyr::select(df, `a...

dplyr>Get rows with minimum and maximum of variable

r,dplyr
I have a data.frame which I want to return min and max time observations of value. df<- data.frame( time=c(24594.55, 29495.45, 24594.55, 39297.27, 24594.55, 34396.36, 19693.64, 14792.73, 29495.45), Mz=c(-0.04729751, -0.50902297, -0.04376393, -0.22218980, -0.36407263, -0.38341534, -0.34597255, -0.01480776, -0.00999671), set_nbr=c(1, 1,1, 2, 2, 2, 3, 3, 3)) library(dplyr) min_time <- df %>% group_by(set_nbr)...

dplyr summarize: create variables from named vector

r,dplyr
Here's my problem: I am using a function that returns a named vector. Here's a toy example: toy_fn <- function(x) { y <- c(mean(x), sum(x), median(x), sd(x)) names(y) <- c("Right", "Wrong", "Unanswered", "Invalid") y } I am using group_by in dplyr to apply this function for each group (typical split-apply-combine)....

Convert columns i to j to percentage

r,dplyr
Suppose I have the following data: df1 <- data.frame(name=c("A1","A1","B1","B1"), somevariable=c(0.134,0.5479,0.369,NA), othervariable=c(0.534, NA, 0.369, 0.3333)) In this example, I want to convert columns 2 and 3 to percentages (with one decimal point). I can do it with this code: library(scales) df1 %>% mutate(somevariable=try(percent(somevariable),silent = T), othervariable=try(percent(othervariable),silent = T)) But I'm hoping...

R: RunningTotal in the last 365 days window by Name

r,data.table,dplyr,zoo
This is what my data looks like. The rightmost column is my Desired Column. Name EventType EventDate SalesAmount RunningTotal Runningtotal(prior365Days) John Email 1/1/2014 0 0 0 John Sale 2/1/2014 10 10 10 John Sale 7/1/2014 20 30 30 John Sale 4/1/2015 30 60 50 John Webinar 5/1/2015 0 60 50...

Aggregating while merging two dataframes in R

r,merge,data.table,aggregate,dplyr
The ultimate goal is to sum the total quantity(transact_data$qty) for each record in product_info where the transact_data$productId exists in product_info, and where transact_data$date is between product_info$beg_date and product_info$end_date. The dataframes are below: product_info <- data.frame(productId = c("A", "B", "A", "C","C","B"), old_price = c(0.5,0.10,0.11,0.12,0.3,0.4), new_price = c(0.7,0.11,0.12,0.11,0.2,0.3), beg_date = c("2014-05-01", "2014-06-01",...

Sample by group with a condition (r)

r,dplyr,sample
I need to randomly select a diary for each individual (id) but only for those who filled more than one. Let us suppose my data look like this dta = rbind(c(1, 1, 'a'), c(1, 2, 'a'), c(1, 3, 'b'), c(2, 1, 'a'), c(3, 1, 'b'), c(3, 2, 'a'), c(3, 3,...

Change Row Text using dplyr

r,dplyr
I want to ask if dplyr can change the row text. For example, if I have a table like this: Fruit Cost apple 6 apple 7 orange 3 orange 4 How can I change all "apple" in Fruit column to "lemon" using dplyr. If dplyr cannot do that, is there...
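A minimal sketch of one way to do this with `mutate()`, using the table from the question. It assumes `Fruit` is stored as character rather than as a factor, and the data frame name `fruit` is illustrative.

```r
library(dplyr)

# The table from the question; stringsAsFactors = FALSE keeps Fruit as character.
fruit <- data.frame(Fruit = c("apple", "apple", "orange", "orange"),
                    Cost = c(6, 7, 3, 4),
                    stringsAsFactors = FALSE)

# Replace "apple" with "lemon" and leave every other value unchanged.
fruit <- fruit %>%
  mutate(Fruit = ifelse(Fruit == "apple", "lemon", Fruit))
```

If `Fruit` were a factor, `ifelse()` would return the underlying integer codes, so the column would need converting with `as.character()` first.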

how to use select for multiple fields using dplyr

r,dplyr
I have a character vector of field names that I want to select using dplyr. I'm using the underscore version of select_(). select(mtcars, mpg) # works OK select(mtcars, mpg, disp, am) # works OK for multiple fields now let's use the underscore version fie <- c("mpg") select_(mtcars, fie) # works...

dplyr: Difference between unique and distinct

r,data.table,dplyr
Seems the number of resulting rows is different when using distinct vs unique. The data set I am working with is huge. Hope the code is OK to understand. dt2a <- select(dt, mutation.genome.position, mutation.cds, primary.site, sample.name, mutation.id) %>% group_by(mutation.genome.position, mutation.cds, primary.site) %>% mutate(occ = nrow(.)) %>% select(-sample.name) %>% distinct() dim(dt2a)...

Clean way to calculate both group and overall statistics

r,dplyr
I would like to calculate the median not only for different groups of my data, but also the median over all groups and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both...

dplyr rename_ produces an error when renaming columns with spaces

r,dplyr
rename_ works as expected for non-pathological column names %>% rename_(foo = 'testcol') But what if I'd like to rename a column which has a space? %>% rename_(foo = 'test col') I get an error that says: Error in parse(text = x) (from #12) : <text>:1:6: unexpected symbol I could use...

R: Roll up column values containing NA's by sum while grouping by ID's

r,data.table,aggregate,plyr,dplyr
I have a data frame that I got from ID <- c("A","A","A","A","B","B","B","B") Type <- c(45,45,46,46,45,45,46,46) Point_A <- c(10,NA,30,40,NA,80,NA,100) Point_B <- c(NA,32,43,NA,65,11,NA,53) df <- data.frame(ID,Type,Point_A,Point_B) ID Type Point_A Point_B 1 A 45 10 NA 2 A 45 NA 32 3 A 46 30 43 4 A 46 40 NA 5 B...

Julia equivalent of dplyr's bind_cols and bind_rows

r,dplyr,julia-lang
Is there a Julia equivalent of dplyr's bind_cols and bind_rows? Specifically, I'm looking for a bind_rows function which will match column names regardless of order and fill in NA's for non-matching columns Edit: R Example of both: library(dplyr) df1 = data.frame(a = 1, b = 1) df2 = data.frame(b =...

dplyr + “meta”-columns: when a column contains names of other columns to use instead of the data

r,dplyr
I wonder if the following question has an elegant solution in dplyr. To provide a simple reproducible example, consider the following data.frame: df <- data.frame( a=1:5, b=2:6, c=3:7, ref=c("a","a","b","b","c"), stringsAsFactors = FALSE ) Here a,b,c are regular numeric variables while ref is meant to reference which column is the "main"...

When trying to call an object with get() within group_by and mutate, it brings up the entire object and not the grouped object. How do I fix this?

r,get,group-by,dplyr,mutated
Here is my code: data(iris) spec<-names(iris[1:4]) iris$Size<-factor(ifelse(iris$Sepal.Length>5,"A","B")) for(i in spec){ attach(iris) output<-iris %>% group_by(Size)%>% mutate( out=mean(get(i))) detach(iris) } The for loop is written around some graphing and report writing that uses object 'i' in various parts. I am using dplyr and plyr. Sepal.Length Sepal.Width Petal.Length Petal.Width Species Size out 1...

dplyr in reactive function for shiny app using rmarkdown

r,shiny,dplyr
I am trying to combine the answers to the following two questions: Reactive subset in ddply for rmarkdown shiny Maintain data frame rows after subet In the first question I was shown how to properly use reactive to subset in shiny / rmarkdown. In the second I was shown how...

Extracting event types from last 21 day window

r,dplyr,zoo
My dataframe looks like this. The two rightmost columns are my desired columns. **Name ActivityType ActivityDate Email(last 21 says) Webinar(last21)** John Email 1/1/2014 NA NA John Webinar 1/5/2014 NA NA John Sale 1/20/2014 Yes Yes John Webinar 3/25/2014 NA NA John Sale 4/1/2014 No Yes John Sale 7/1/2014 No No...

R dplyr: rename variables using string functions

regex,r,rename,dplyr
(Somewhat related question: Enter new column names as string in dplyr's rename function) In the middle of a dplyr chain (%>%), I would like to replace multiple column names with functions of their old names (using tolower or gsub, etc.) library(tidyr); library(dplyr) data(iris) # This is what I want to...
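A minimal sketch of renaming every column with a function of its old name, assuming base R's setNames() is acceptable inside the chain. The result name renamed and the choice of replacing dots with underscores are illustrative assumptions, not taken from the original post.

```r
# Build new names from the old ones: replace "." with "_" and lower-case
new_names <- tolower(gsub(".", "_", names(iris), fixed = TRUE))

# setNames() returns a copy of the data frame with the new names
renamed <- setNames(iris, new_names)
names(renamed)
```

Because setNames() takes the data as its first argument, it can sit in the middle of a %>% chain, for example `iris %>% setNames(tolower(names(.)))`. Recent dplyr versions (1.0.0 and later) also provide `rename_with()`, so `iris %>% rename_with(tolower)` is another option there.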

use column-name range in data.table like in dplyr's select

r,data.table,dplyr
I would like to select multiple columns from a data.table (with 1200 column-names) specifying a range by column-name, like one can do with dplyr, for example: library(data.table) library(dplyr) dt <- data.table(w = sample(100, 50), x = sample(100, 50), y = sample(100, 50), z = sample(100, 50)) select(dt, w:y) Currently I...
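One way to get a name-based range without dplyr is to look up the positions of the two end columns and take every name between them. The helper cols_between below is hypothetical (it is not part of data.table or dplyr), and the sketch uses a plain data.frame so that it runs with base R alone.

```r
# Hypothetical helper: names of all columns from `first` to `last`, inclusive
cols_between <- function(df, first, last) {
  idx <- match(c(first, last), names(df))
  if (anyNA(idx)) stop("column not found")
  names(df)[idx[1]:idx[2]]
}

set.seed(1)
df <- data.frame(w = sample(100, 50), x = sample(100, 50),
                 y = sample(100, 50), z = sample(100, 50))

# Select columns w through y by name
sel <- df[, cols_between(df, "w", "y"), drop = FALSE]
names(sel)
```

With a data.table, the same character vector should be usable as `dt[, .SD, .SDcols = cols_between(dt, "w", "y")]`.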

Refer particular value in `dplyr::mutate()`

r,dplyr
I have the following code: library(dplyr) library(quantmod) # inflation data getSymbols("CPIAUCSL", src='FRED') avg.cpi <- apply.yearly(CPIAUCSL, mean) cf <- avg.cpi/as.numeric(avg.cpi['1991']) # using 1991 as the base year cf <- as.data.frame(cf) cf$year <- rownames(cf) cf <- tail(cf, 25) rownames(cf) <- NULL cf$year <- lapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1))) rm(CPIAUCSL) # end of...

Select the rows at the end of a dataframe which is frequently accumulating rows

r,dplyr
I want to cut out the last 14 rows of a dataframe and make them into their own dataframe. The problem is that every few days, we add more rows to the end, but we always want the last 14 rows. The dataframe is ordered by dates, for example: Date...
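A minimal sketch using base R's tail(), which always returns the final n rows regardless of how many rows have been appended. The toy data frame and the name last14 are made up for illustration.

```r
# Toy data standing in for the growing, date-ordered data frame
df <- data.frame(
  Date  = seq(as.Date("2015-01-01"), by = "day", length.out = 30),
  value = 1:30
)

# Always the last 14 rows, however long df has become
last14 <- tail(df, 14)
last14
```

Since the row count is not hard-coded anywhere, rerunning the same line after new rows arrive picks up the new last 14.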

ifelse works with mutate in R, why not in knitr?

r,knitr,dplyr
I have this line of code working in "normal" R: damage <- damage %>% mutate( EVTYPE = ifelse(grepl("winter", EVTYPE, ignore.case=T), "Winter Weather", EVTYPE)) However, the exact same line of code in knitr results in strange behavior. Instead of the value of EVTYPE being left alone if the grep fails, it...

What is the dplyr equivalent of plyr::ldply(tapply) in R?

r,plyr,dplyr,tidyr
Ultimately, I am trying to achieve something similar to the following, but leveraging dplyr instead of plyr: library(dplyr) probs = seq(0, 1, 0.1) plyr::ldply(tapply(mtcars$mpg, mtcars$cyl, function(x) { quantile(x, probs = probs) })) # .id 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% # 1 4 21.4 21.50...

Filter function

r,dplyr
For the data: Id res 1 9 1 8 1 8 1 6 1 5 1 4 2 6 2 6 2 0 2 0 2 0 2 0 I want top four from each group. When I use dat %>% group_by(Id) %>% top_n(4,res) I get Id res 1 9...
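top_n() keeps all tied rows, so a group can return more than four rows. Assuming the goal is exactly four rows per group even when values tie, a base R sketch is to sort and keep the first four positions within each Id. The names ord and top4 are illustrative.

```r
# Data from the question
dat <- data.frame(Id  = rep(1:2, each = 6),
                  res = c(9, 8, 8, 6, 5, 4, 6, 6, 0, 0, 0, 0))

# Sort by Id, then by res descending
ord <- dat[order(dat$Id, -dat$res), ]

# ave() numbers the rows 1, 2, 3, ... within each Id; keep the first four
top4 <- ord[ave(ord$res, ord$Id, FUN = seq_along) <= 4, ]
top4
```

In dplyr, a comparable approach would be `dat %>% group_by(Id) %>% arrange(desc(res)) %>% slice(1:4)`, since slice() selects by position within each group and so ignores ties.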

How do I use dplyr to generate a new column based on rowwise data?

r,dplyr
I want to add a new column to a data frame which is based on a row-wise calculation. Suppose I have a data frame such as this one: x <-as.data.frame(matrix(1:10, 5, 2)) V1 V2 1 1 6 2 2 7 3 3 8 4 4 9 5 5 10 If...

Getting rid of ghost/artifact variables as a result of dplyr::filter()

r,dplyr
I have a dataframe: x <- c(rep("A", 3), rep("B", 3), rep("C", 3)) y <- 1:9 z <- seq(from = 9, to = 81, by = 9) df <- data.frame(x, y, z); rm(x, y, z) Where I would like to keep just "A" and "C": library(dplyr) df <- df %>% filter(x...

dplyr: Create a new column by evaluating an expression in another

r,dplyr
In the following example, variable expr contains an expression in the form of a string. However, the last mutate command does not evaluate that expression correctly. Could you help me understand why? symbols <- expand.grid(data.frame(matrix(rep(c("+","-",""),8),ncol=8))) df <- symbols %>% transmute(expr=paste0(1,X1,2,X2,3,X3,4,X4,5,X5,6,X6,7,X7,8,X8,9)) %>% mutate(eval=eval(parse(text=expr))) ...
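The likely cause is that eval(parse(text = ...)) is not vectorised: parse() turns a character vector into several expressions, and eval() evaluates them in turn but returns only the last value, which mutate() then recycles to every row. A small base R sketch with made-up strings (expr, last_only and vals are illustrative names) shows the difference and one way around it.

```r
# Three expression strings standing in for the expr column
expr <- c("1+2", "3-1", "12")

# Evaluating the whole vector at once yields only the last result
last_only <- eval(parse(text = expr))

# Evaluating one string at a time yields one result per element
vals <- vapply(expr, function(e) eval(parse(text = e)),
               numeric(1), USE.NAMES = FALSE)
vals
```

Inside the chain, the per-element version could replace the original call, for example `mutate(eval = vapply(expr, function(e) eval(parse(text = e)), numeric(1)))`.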

Error with durations created from a data.table using lubridate & dplyr

r,data.table,dplyr,lubridate
I'm trying to aggregate some data stored in a data.table, and then create durations (from lubridate) from the aggregated data. When I try that, however, I get an error. Here's a reproducible example: library(lubridate) library(data.table) library(dplyr) data(lakers) lakers.dt <- data.table(lakers, key = "player") durations <- lakers.dt %>% mutate(better.date = ymd(date))...

Changing behaviour of stats::lag when loading dplyr package

r,dplyr
I am having trouble with the stats::lag function when using the dplyr package. Specifically, I get different results from the lag function before and after loading dplyr. For example, here is a sample time series. If I calculate the lag with k = -1, the lagged series starts in 1971....
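dplyr exports its own lag(), which masks stats::lag() once the package is attached, so an unqualified lag() call changes meaning after library(dplyr). A minimal sketch, using a made-up annual series, is to call the stats version explicitly with the namespace prefix.

```r
# Hypothetical annual series starting in 1970
x <- ts(1:5, start = 1970)

# The namespace prefix guarantees the time-series lag,
# whether or not dplyr is attached
lagged <- stats::lag(x, k = -1)
start(lagged)
```

The prefix form works the same before and after loading dplyr, which keeps scripts independent of package load order.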

Keep only groups of data with multiple observations

r,dplyr
I am attempting to keep only deids with multiple observations. I have the below code help <- data.frame(deid = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12), session.number = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4), days.since.last = c(0, 0, 7, 14, 93,...
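A minimal base R sketch of the idea: count the rows per deid and keep only rows whose group count exceeds one. It uses the two columns fully visible in the excerpt; the names help_df and multi are illustrative.

```r
# First two columns of the example data from the question
help_df <- data.frame(
  deid           = c(1, 5, 5, 5, 5, 5, 5, 12, 12, 12, 12),
  session.number = c(1, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4)
)

# ave() returns, for every row, the number of rows sharing its deid
n_per_id <- ave(help_df$deid, help_df$deid, FUN = length)

# Keep only deids observed more than once
multi <- help_df[n_per_id > 1, ]
multi
```

The dplyr counterpart would be `help_df %>% group_by(deid) %>% filter(n() > 1)`.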