I am interested in knowing how to calculate a ranking score from the ratings of a product. For example, take the Apple App Store. There are two products, A and B. Both have the same average rating, but 100 reviewers have rated A whereas 1000 reviewers have rated B. Intuitively it seems B should...
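
One common way to let the number of ratings influence the rank is a Bayesian (damped) average, which pulls products with few ratings toward a prior mean. The sketch below is only an illustration of that idea, not necessarily what any app store does; `prior_mean`, `prior_weight` and the 4.5-star average are assumed values chosen for the example.

```python
def bayesian_average(rating_sum, rating_count, prior_mean=3.0, prior_weight=50):
    """Shrink a product's mean rating toward prior_mean.

    prior_mean and prior_weight are illustrative assumptions: the prior acts
    like prior_weight extra ratings, each equal to prior_mean.
    """
    return (prior_weight * prior_mean + rating_sum) / (prior_weight + rating_count)


# Products A and B share an assumed average of 4.5 stars but differ in review count.
score_a = bayesian_average(rating_sum=4.5 * 100, rating_count=100)
score_b = bayesian_average(rating_sum=4.5 * 1000, rating_count=1000)
print(score_a, score_b)
```

With these assumed numbers B ends up ranked above A, because its larger sample outweighs the prior more strongly.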

I'm grouping my data on some frequency, but it appears that TimeGrouper creates a last group on the right for some "left over" data. df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping'].plot() I expect the data to be fairly constant over time, but the last data point at 2013 drops by almost half. I expect this...

I have several weighted values for which I am taking a weighted average. I want to calculate a weighted standard deviation using the weighted values and weighted average. How would I modify the typical standard deviation to include weights on each measurement? This is the standard deviation formula I am...
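
A minimal sketch of one usual definition, assuming the weights are non-negative and that a population-style formula (dividing by the sum of the weights, with no bias correction) is what is wanted. The function names and sample values are made up for illustration.

```python
import math


def weighted_mean(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_std(values, weights):
    """Population-style weighted standard deviation (no bias correction)."""
    mu = weighted_mean(values, weights)
    variance = sum(w * (x - mu) ** 2 for x, w in zip(values, weights)) / sum(weights)
    return math.sqrt(variance)


values = [2.0, 4.0, 4.0, 5.0]    # made-up measurements
weights = [1.0, 2.0, 2.0, 1.0]   # made-up weights
print(weighted_mean(values, weights), weighted_std(values, weights))
```

With equal weights this reduces to the ordinary population standard deviation, which is a quick sanity check.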

Problem statement: A person has to visit a set of locations in a sequence: Seq: L1 L2 L3 L4 L5 L6 (assume L1, L2, ... are locations). But he visited the locations in a different sequence: ActualSeq: L3 L1 L2 L4 L5 L6 Now I need to find out what is...

Is there a way to start a Java application when launching Apache Server? I need to generate a report of the server's statistics, periodically, with a separate application.

With Scipy gamma distribution, one can only specify shape, loc, and scale. How do I create a gamma variable with shape and rate?
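
Since the rate is the reciprocal of the scale, one way to do this is to pass `scale=1/rate`. A short sketch, with the values of `shape` and `rate` chosen arbitrarily for the example:

```python
from scipy import stats

shape, rate = 2.0, 4.0   # illustrative values, not from the question

# SciPy parameterizes the gamma by shape (a) and scale; scale = 1 / rate.
dist = stats.gamma(a=shape, scale=1.0 / rate)

samples = dist.rvs(size=5, random_state=0)
print(dist.mean(), dist.var())   # should match shape / rate and shape / rate**2
```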

I am trying to calculate a set of FDR-corrected p-values using Benjamini & Hochberg's method. However, the vector I am trying to run this on contains over 10 billion values. Given the amount of data, the normal method from statsmodels' multicomp module quickly runs out of memory. Looking at the...

Using R... I have a list of tables. # Example data z <- list(cbind(c(1,2), c(3,4)), cbind(c(1,2), c(3,4,5,6)), cbind(c(1,2), c(1,2,3,4,5,6)), cbind(c(1,2), c(3,4)), cbind(c(1,2), c(3,4,5,6,9,4,5,6))) z <- setNames(z, c("Ethnicity", "Country", "Age Band", "Marital Status", "Hair Color")) z $Ethnicity [,1] [,2] [1,] 1 3 [2,] 2 4 $Country [,1] [,2] [1,] 1 3...

I have a matrix X, the size of which is 100*2000 double. I want to know which kind of scaling technique is applied to matrix X in the following command, and why it does not use z-score to do scaling? X = X./repmat(sqrt(sum(X.^2)),size(X,1),1); ...

Does anyone know, if there's a site with precipitation (amount of rain in mm/inches) history for different cities? I need the data of Helsinki, Finland. I'm currently using Dark Sky Forecast API to get the current precipitation levels, but it doesn't seem to support that with history calls. I'll cache...

I'm working on a project to build several models for data analysis and reporting using R and the amazing Shiny framework for web development. I'm getting started with R and Shiny but I've had an amazing experience so far, yet, I'd like to get some help in case someone has...

I have a csv file with precipitation data per year and per weather station. It looks like this: station_id year Sum 210018 1916 65.024 210018 1917 35.941 210018 1918 28.448 210018 1919 68.58 210018 1920 31.115 215400 1916 44.958 215400 1917 31.496 215400 1918 38.989 215400 1919 74.93 215400 1920...

I apologize in advance if this is poorly worded. If I have a stdDev = 1, mean = 0, scipy.stats.cdf(-1, loc = 0, scale = 1) will give me the probability that a normally distributed random variable will be <= -1, and that is 0.15865525393145707. Given 0.15865..., how do I...
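
Assuming the goal is to go from the probability back to the value -1, the operation is the inverse CDF (the quantile function). In SciPy this is the `ppf` method of a distribution such as `scipy.stats.norm`; the sketch below shows the same round trip with the standard library's `statistics.NormalDist` so that it runs without extra packages.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

p = standard_normal.cdf(-1.0)      # probability that X <= -1
x = standard_normal.inv_cdf(p)     # quantile function maps p back to about -1
print(p, x)
```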

library(Hmisc) #10% difference n1 = 30 n2 = 30 n = 60 p1 = seq(0.1, 0.9, 0.1) p2 = p1 + 0.1 > bpower(p1, p2, n, n1, n2, alpha = 0.05) Power1 Power2 Power3 Power4 Power5 Power6 Power7 Power8 Power9 0.9997976 0.9992461 0.9933829 0.9670958 0.8995984 0.7799309 0.6141349 0.4211642 0.2252629 #20%...

I need to implement random sampling from a number of common probability distributions (normal, binomial, gamma, ...) in my java program. I found Random.nextGaussian() and was just wondering if there's any other built in support for distributions other than normal? Or are my only options third party library or DIY?

I have a data frame of variables, some occur more than once, e.g.: a, b, b, b, c, c, d, e, f I would then like to get an output (in two columns) like this: a 1; b 3; c 2; d 1; e 1; f 1. Bonus question: I'd...

I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in Python. MATLAB provides the nice function: lassoglm(X,y, distr) where distr can be poisson, binomial etc. I had a look at both statsmodels and scikit-learn but I did not find any...

I would like to know how to find out the number of degrees of freedom for a Student's t distribution of standardized residuals of a GARCH model (using garchFit in R from the fGarch package). Is there another package or any other way to estimate this parameter?...

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the...

I have spring boot application with ehcache as below @Bean public EhCacheManagerFactoryBean ehCacheManagerFactoryBean() { EhCacheManagerFactoryBean ehCacheManagerFactoryBean = new EhCacheManagerFactoryBean(); ehCacheManagerFactoryBean.setConfigLocation(new ClassPathResource("ehcache.xml")); //ehCacheManagerFactoryBean.setCacheManagerName("messageCache"); ehCacheManagerFactoryBean.setShared(true); return ehCacheManagerFactoryBean; } @Bean public EhCacheCacheManager cacheManager() { return new...

Good afternoon! I am having some trouble with my dataset. I am using a Google AdWords export for data analysis and I want to fit a logit regression model to the data to determine whether an experiment I have conducted impacts the conversion. The problem is that the data is aggregated...

I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps. Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days =...

I am confused about marks and covariates in point processes. I am trying to create a model of a marked point pattern with a few covariates in R by using spatstat, but I am not sure about the relationship between marks and covariates. Could anyone help me? Thanks. ---- update I have...

How do I get a histogram-like summary of interval data in R? My MWE data has four intervals. interval range Int1 2-7 Int2 10-14 Int3 12-18 Int4 25-28 I want a histogram-like function which counts how the intervals Int1-Int4 span a range split across fixed-size bins. The function output should...

If it has one feature it's easy. Just graph it. One of the records there looks like (18, 15). Simple. But if we have multiple features that adds more dimensions to the graph, right? So how can you visualize your data set and determine whether or not linear regression is...

I have data in this form: proprete.freq <- table(cnData$proprete) proprete.freq.genre <-table(cnData$genre,cnData$proprete) I am using these functions (barplot and pie) to plot the data: barplot(proprete.freq.genre, col = heat.colors(length(rownames(proprete.freq.genre))) , main="Proprete", beside = TRUE) pie(proprete.freq, col=rainbow(3), names.arg=avis, main="Propreté") Here is the result: Question: How to include the value just on top of...

I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

How can I make ggplot plot geom_smooth(method="lm"), but only if it fits some criteria? For instance, if I only want to draw lines if the slope is statistically significant (i.e. p from the lm fit is less than 0.01). EDIT: Updated to a more complex example involving facets. Instead of...

I have an array that measures about 2000 elements long, and I would like to figure out the standard deviation of it centered at each pixel by sliding a make-believe window of some relatively small width over it, and computing the StDev of the elements in each region, yielding an...
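
One possible sketch with NumPy, assuming only full windows are needed (so the result is shorter than the input by `width - 1`; padding the edges to keep the length is a separate choice). The function name `rolling_std`, the window width and the random data are all made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_std(values, width):
    """Standard deviation of every full window of `width` consecutive elements."""
    windows = sliding_window_view(np.asarray(values, dtype=float), width)
    return windows.std(axis=-1)


data = np.random.default_rng(0).normal(size=2000)   # stand-in for the real array
result = rolling_std(data, width=5)
print(result.shape)   # one value per full window: 2000 - 5 + 1 entries
```

`sliding_window_view` returns a view rather than a copy, so this avoids an explicit Python loop over the roughly 2000 positions.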

I calculated a mean angle of below two angles. 337.477792 324.8119785 I used the formula to calculate a mean angle (see below #Distribution of the mean) http://en.wikipedia.org/wiki/Directional_statistics What I got was 28.85511475 / -28.8551147. The values don't look right ... Wonder if someone can explain this result for me? Thank...
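
For what it's worth, -28.855 degrees and 331.145 degrees are the same direction (they differ by exactly 360), so the negative value is a plausible mean of the two angles; the positive 28.855 is what tends to appear when the quadrant is lost, for example with a plain arctangent. A small sketch that uses `atan2` and wraps the result into [0, 360); the function name is made up:

```python
import math


def circular_mean_degrees(angles):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles)
    # atan2 keeps the quadrant; a plain atan(sin/cos) would lose it.
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


mean_angle = circular_mean_degrees([337.477792, 324.8119785])
print(mean_angle)   # roughly 331.145, i.e. -28.855 + 360
```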

So I have two matrices. Lets name them controls and patients. Each row is a sample, and each column is a concentration of a certain protein. It looks like this: V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 sample1 1533.34 9.88 6.82 17.88 70.75 350.07 20.67 13.96...

An interesting problem came up the other day. I have two data streams that are being output continuously. Say, A and B. (Different values) In the ideal world, A and B are exactly the opposite. If A increases by x percent, then B will decrease by x percent, and vice...

I have standardized normal values for heart rates and respiratory rates in children from a recent article. I copied them into a csv to use as a dataset in R. The data is simply different age ranges (ex 3 months to 6 months, or 1 year to 2 years) and...

I am trying to do some statistics in Python. I have data with several missing values, filled with np.nan, and I am not sure should I remove it manually, or scipy can handle it. So I tried both: import scipy.stats, numpy as np a = [0.75, np.nan, 0.58337, 0.75, 0.75,...

I am concerned about the following question. If I apply the function by(productivity$LOC, productivity$extension, stat.desc, norm = TRUE, basic = TRUE) how can I round the output values of the by() function? ...

On a player stat page. How can I make my anchor point the year "2014" and grab specific numbers in the 2014 column (scrape numbers to the right of 2014) The code below is skipping the "Passing" table (with all of the career passing stats) and trying to grab stats...

I have a variable I created based on certain data. Now, in this new data, I need to calculate different statistical parameters, but with conditions, for example: *median of this new var only for observations whose birth country is not Italy. *mean of a different var only when...

I am confused about the concept of the p-value. In general, if the p-value is greater than alpha, which is generally 0.05, we fail to reject the null hypothesis, and if the p-value is less than alpha, we reject the null hypothesis. As I understand, if the p-value is greater than alpha,...

I'm using the caret package to play around with many classification methods. Currently I want to use a leave group out cross validation method (I know there are better methods). This is the train control I am using: train_control <- trainControl(method = "LGOCV", p = .7, number = 1) My...

I currently have species abundance data for multiple lakes along with measurements of some environmental variables of those lakes. I decided to do Canonical Correspondence Analysis of the data in R, as demonstrated by ter Braak and Verdenschot (1995), see link: http://link.springer.com/article/10.1007%2FBF00877430 (section: "Ranking environmental variables in importance") I am not...

First of all, what I want to know is whether I am making a systematic error or messing up the math, or do you have any ideas what could be wrong? I was trying to write a little random number generator whose numbers can be influenced /...

So I have the following table: 118.00 12.00 25.00 161.00 26.00 2.00 9.00 47.00 76.00 218.00 1.00 21.00 11.00 64.00 0.00 9.00 53.00 124.00 2.00 51.00 86.00 25.00 25.00 0.00 20.00 14.00 212.00 104.00 38.00 46.00 I parse it in the following way: data2 <- read.table('to_r.txt', fill=T) I then want...

I'm trying to work out how to solve what seems like a simple problem, but I can't convince myself of the correct method. I have time-series data that represents the pdf of a Power output (P), varying over time, also the cdf and quantile functions - f(P,t), F(P,t) and q(p,t)....

I am working on making a prediction in R using time-series models. I used the auto.arima function to find a model for my dataset (which is a ts object). fit<-auto.arima(data) I can then plot the results of the prediction for the 20 following dates using the forecast function: plot(forecast(fit,h=20)) However...

How can I do this with LinQ? I have a txt file.(Its about 100 lines long.) 6 7 0/3 # ##t#kon tu#i#do#b#n ko#yk####a#s##ttak###lk##ul$$$$#$#$$$####$$$$#$$$$$$#$$#$$$$$#$ I stored it in a Dictionary (The two lines). alap = File.ReadAllLines("veetel.txt"); Dictionary<string,string> statisztikaDictionary = new Dictionary<string,string>(); for (int z = 1; z < alap.Length; z+=2) {...

I loaded a table like this: V1 V2 V3 pat1 1 2 pat1 3 1 pat1 4 2 pat2 3 3 pat3 1 4 pat3 2 3 and I need to format it into something like the following, with V1 indicating the row, V2 indicating the column, and the values...

I am trying to get the diskstats data in the same way that the file does. Is there any way to reach those values without reading that file? How are the values placed there? Is there any ".c" file that processes the data to place it in diskstats? And for proc/stat and...

I'm trying to generate random integers with logarithmic distribution. I use the following formula: idx = Math.floor(Math.log((Math.random() * Math.pow(2.0, max)) + 1.0) / Math.log(2.0)); This works well and produces a sequence like this for 1000 iterations (each number represents how many times that index was generated): [525, 261, 119, 45, 29,...

I implemented an Authorship attribution project where I was able to train my KNN model with articles from two authors using KNN. Then, I classify the author of a new article to be either author A or author B. I use knn() function to generate the model. The output of...

I found this material online, and I do not understand why equation (5) is equal to equation (6). How is it derived? Given a dictionary D, a vector x has sparsity s if it can be written exactly as a linear combination of s columns of D. An important result that underlies...

I have used Statsmodels to generate a OLS linear regression model to predict a dependent variable based on about 10 independent variables. The independent variables are all categorical. I am interested in looking closer at the significance of the coefficients for one of the independent variables. There are 4 categories,...

this is probably really easy to do but I am looking to calculate the length of consecutive positive occurrences in a list in python. For example, I have a and I am looking to return b: a=[0,0,1,1,1,1,0,0,1,0,1,1,1,0] b=[0,0,4,4,4,4,0,0,1,0,3,3,3,0] I note a similar question on Counting consecutive positive value in Python...
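
One possible approach, sketched with `itertools.groupby`: group the list into runs of positive and non-positive values, then repeat each positive run's length across the run. The function name `run_lengths` is made up for the example.

```python
from itertools import groupby


def run_lengths(values):
    """Replace each positive entry by the length of the consecutive run it belongs to."""
    out = []
    for is_positive, group in groupby(values, key=lambda v: v > 0):
        run = list(group)
        fill = len(run) if is_positive else 0   # non-positive runs stay 0
        out.extend([fill] * len(run))
    return out


a = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
b = run_lengths(a)
print(b)   # [0, 0, 4, 4, 4, 4, 0, 0, 1, 0, 3, 3, 3, 0]
```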

I am using the approach from this Yale page on fractals: http://classes.yale.edu/fractals/MultiFractals/Moments/TSMoments/TSMoments.html which is also expounded on this set of lecture slides (slide 32): http://multiscale.emsl.pnl.gov/docs/multifractal.pdf The idea is that you get a dataset, and examine it through many histograms with increasing numbers of bars i.e. resolution. Once resolution is high...

I've been trying to scipy.mstats.zscore a dataset that is intentionally organized into a nested list, and it gives: TypeError: unsupported operand type(s) for /: 'list' and 'long' which probably suggests that scipy.stats doesn't work for nested lists. What can I do about it? Does a for loop affect the nature...

I am trying to solve the following problem: A person can be classified as either GroupA, GroupB or GroupC. I want to know how attribute1 (or attribute2) affects the proportion of observations in these groups. Note that attribute1:attribute2 has a 1:N relationship. Attribute1 has five possible values, A,B,C,D,E whilst attribute2...

I have a data set of n = 1000 realizations of a random variable X and is univariate -- X = {x1, x2,...,xn}. Data is generated by varying a parameter on which the random variable depends. For example, let the r.v be Area of a circle. So, by varying the...

I've recently come across an interesting problem while trying to create a custom database. my rows are in form: 183746IGH 105928759UBS and so on (so basically an integer concatenated with a string, both of relatively random sizes.). What I'm trying to do is somehow separate the whole number in column...

I am working with a linear model, say y<-rnorm(20) x1<-rgamma(20,2,1) x2<-rpois(20,3) fit<-lm(y~x1*x2) summary(fit) and I was wondering, is there a way to access the regression variables through lm? One option would be to simply use fit$model and what you get is y x1 x2 1 1.52366782 1.1741392 4 2 -0.23640711...

I have some t-values and degrees of freedom and want to find the p-values from them (it's two-tailed). In the real world I would use a t-test table in the back of a Statistics textbook; however, I am using stdtr or stats.t.sf function in python. Both of them work fine...

Let's say that I have two variables, weight and age. I have to find the confidence interval with level 99% in this case: By the ordinate (Y-Axis), if we did a linear regression a=lm(weight~age) I know that the ordinate is directly the intercept but why won't this work: predict(a, newdata=data.frame(age=intercept),...

I used a Random Forest Classifier in Python and MATLAB. With 10 trees in the ensemble, I got ~80% accuracy in Python and barely 30% in MATLAB. This difference persisted even when MATLAB's random forests were grown with 100 or 200 trees. What could be the possible reason for this...

If one uses the scipy.mstats.theilslopes routine on a data set with missing values, the results of the lower and upper bounds for the slope estimate are incorrect. The upper bound is often/always(?) NaN, while the lower bound is simply wrong. This happens, because the theilslopes routine computes an index into...

What is the test used in auto.arima() function in R to determine stationarity i.e to determine the value of "d" Can that logic be implemented in python?

I have a data.table in R that was fetched from a database that looks like this: date,identifier,description,location,value1,value2 2014-03-01,1,foo,1,100,200 2014-03-01,1,foo,2,200,300 2014-04-01,1,foo,1,100,200 2014-04-01,1,foo,2,100,200 2014-05-01,1,foo,1,100,200 2014-05-01,1,foo,2,100,200 2014-03-01,2,bar,1,100,200 2014-04-01,2,bar,1,100,200 2014-05-01,2,bar,1,100,200 2014-03-01,3,baz,1,100,200 2014-03-01,3,baz,2,200,300 2014-04-01,3,baz,1,100,200 2014-04-01,3,baz,2,100,200 2014-05-01,3,baz,1,100,200...

I used the statsmodels package to estimate my OLS regression. Now I want Breusch Pagan test. I used the pysal package for this test but this function returns an error: import statsmodels.api as sm import pysal model = sm.OLS(Y,X,missing = 'drop') rs = model.fit() pysal.spreg.diagnostics.breusch_pagan(rs) returned error: AttributeError: 'OLSResults' object...

How can I get this to calculate properties about arrays of doubles? If everything else is an int inside, would that still count as an array of doubles? Or is it still an array of doubles anyway because of the method type? Here is my class. Thanks so much! import java.util.*;...

I have a question regarding dates manipulation in R. I've looked around for days but couldn't find any help online. I have a dataset where I have id and two dates and another dataset with the same id variable, date and price. For example: x = data.frame(id = c("A","B","C","C"), date1...

I have a problem in understanding how the init and fixed parameters are specified in the arima function in R. For example, I will use R's built-in dataset lh to illustrate the idea: The line below works fine arima(lh, order = c(1,0,0)) But this line does not work as expected...

I work on calibration of probabilities. I'm using a probability mapping approach called generalized additive models. The algorithm I wrote is: probMapping = function(x, y, datax, datay) { if(length(x) < length(y))stop("train smaller than test") if(length(datax) < length(datay))stop("train smaller than test") datax$prob = x # trainset: data and raw probabilities datay$prob...

I am using the following code to perform t-test: def t_stat(na,abar,avar,nb,bbar,bvar): logger.info("T-test to be performed") logger.info("Set A count = %f mean = %f variance = %f" % (na,abar,avar)) logger.info("Set B count = %f mean = %f variance = %f" % (nb,bbar,bvar)) adof = na - 1 bdof = nb -...

I am trying to find the best way to draw from a normal distribution. I want to be able to use the normal probability density function (and its cumulative) in Haskell. To put it simply, I want to use the functions on this page without using the GSL binding... I...

I would like to demean the variables from the big.matrix (panel) structure. I tried different methods but the one which works in bigmemory setting is tapply (provided by bigtabulate package). I have the following code to calculate means of variable var1 by groups represented by panel_id data <- read.big.matrix ("data.csv",...

I have a matrix with x rows (i.e. the number of draws) and y columns (the number of observations). They represent a distribution of y forecasts. Now I would like to make sort of a 'heat map' of the draws. That is, I want to plot a 'confidence interval' (not...

I would like to know the commands used in R to work with linear regression problems and confidence intervals, and why these ones are incorrect. For example let's say we have the following data: A <- c(12,11,12,15,13,16,13,18,11,14) # this is the width B <- c(50,51,62,45,63,76,53,68,51,74) # this is the height...

rm(list=ls()) myData <-read.csv(file="C:/Users/Documents/myfile.csv",header=TRUE, sep=",") for(i in names(myData)) { colNum <- grep(i,colnames(myData)) ##asigns a value to each column if(is.numeric(myData[3,colNum])) ##if row 3 is numeric, the entire column is { ##print(nxeData[,i]) fit <- lm(myData[,i] ~ etch_source_Avg, data=myData) #does a regression for each column in my csv file against my independent variable 'etch'...

I am trying to find the local minimum of a function, and the parameters have a fixed sum. For example, Fx = 10 - 5x1 + 2x2 - x3 and the conditions are as follows, x1 + x2 + x3 = 15 (x1,x2,x3) >= 0 Where the sum of x1,...

I'm struggling to use the functions MultinormalDistribution and InverseCDF in MultivariateStatistics package. Essentially << MultivariateStatistics` sig = .5; u = .5; dist = MultinormalDistribution[{0, 0}, sig*IdentityMatrix[2]]; delta=InverseCDF[dist, 1 - u] The output is InverseCDF[ MultinormalDistribution[{0, 0}, {{0.5, 0}, {0, 0.5}}], {0.5}] can someone correct the above code? If I've understood...

Suppose I pass "2015-01-01 01:50:50", then it should return "2015-01-01 01:00:00" and "2015-01-01 02:00:00". How to calculate these values in R?

I am using PART algorithm in R (via package RWeka) for multi-class classification. Target attribute is time bucket in which an invoice will be paid by customer (like 7-15 days, 15-30 days etc). I am using following code for fitting and predicting from the model : fit <- PART(DELAY_CLASS ~...

My question is about how handle missing values when using train for fitting models with caret. A small sample of my data would be like that: df <- dput(dat) structure(list(LagO3 = c(NA, NA, NA, 40, 45, NA), RH = c(69.4087524414062, 79.9608383178711, 64.4592437744141, 66.4207077026367, 66.0899200439453, 91.3353729248047), SR = c(298.928888888889, 300.128888888889, 303.688888888889,...

I am given two estimators, T1 = ((X1+X2)/2)*((Y1+Y2)/2) and T2 = (X1*Y1+X2*Y2)/2. These are used for estimating an area, for example, T1 = ((503+505)/2)*((334+330)/2) and T2 = ((503*334 + 505*330)/2). X1, X2 are normally distributed with average µ1 and variance σ^2 and Y1,Y2 are normally distributed with average µ2...

I have a homework assignment that I was doing with Minitab to find quartiles and the interquartile range of a data set. When I tried to replicate the results using NumPy, the results were different. After doing some googling, I see that there are many different algorithms for computing quartiles:...

Let me explain the question: I know the functions table or xtabs compute contingency tables, but they expect a data.frame, which is always stored in RAM. It's really painful when trying to do this on a big file (say 20 GB, the maximum I have to tackle). On the other...

I have list of probabilities in column 1. How I can fill column 2 with 0 and 1 based on the corresponding probabilities? 0.5 1 0.2 0 0.9 1 0.35 1 0.1 0 ...
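
Assuming each row should be an independent draw that equals 1 with the probability in column 1, a minimal sketch is to compare a uniform random number against each probability. The function name and the fixed seed are only there for illustration.

```python
import random


def bernoulli_draws(probabilities, rng=None):
    """Return one 0/1 draw per probability p, equal to 1 with probability p."""
    rng = rng or random.Random()
    return [1 if rng.random() < p else 0 for p in probabilities]


probabilities = [0.5, 0.2, 0.9, 0.35, 0.1]
draws = bernoulli_draws(probabilities, rng=random.Random(42))  # seeded for repeatability
print(list(zip(probabilities, draws)))
```

Because the draws are random, a single run will not necessarily reproduce the 1 0 1 1 0 pattern shown in the question.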

I have a graph input where the X axis is time (going forwards). The Y axis is generally stable but has large drops and raises at different points (marked as the red arrows below) Visually it's obvious but how do I efficiently detect this from within code? I'm not sure...

I want to compute the variance of an input txt file like this one: 1, 5 2, 5 3, 5 4, 10 And I want the output to be like: 1, 0 2, 0 3, 0 4, 4.6875 I've used this line: awk '{c[NR]=$2; s=s+c[NR]; avg= s / NR; var=var+(($2...
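
The expected output looks like a running population variance of the second column (for 5, 5, 5, 10 the mean is 6.25 and the variance is 18.75 / 4 = 4.6875). Under that assumption, here is a sketch of the same computation in Python rather than awk, using Welford's update so the data is read only once; the function name is made up.

```python
def running_variance(lines):
    """Yield (label, population variance so far) for 'label, value' lines."""
    count, mean, m2 = 0, 0.0, 0.0
    for line in lines:
        label, value = (part.strip() for part in line.split(","))
        x = float(value)
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # running sum of squared deviations
        yield label, m2 / count


rows = ["1, 5", "2, 5", "3, 5", "4, 10"]
results = list(running_variance(rows))
for label, variance in results:
    print(f"{label}, {variance:g}")
```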

I have a big continuous array of values that ranges from (-100, 100) Now for this array I want to calculate the weighted average described here since it's continuous I want also to set breaks for the values every 20 i.e the values should be discrete as -100 -80 -60...

I have a continuous variable (in this case, fees spent). How do I determine % spending cutoffs? i.e. how do I know what dollar amount separates the bottom 50% from the top 50% (similarly for any other % I may be interested in). Thank you very much for any help
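
These cutoffs are quantiles (percentiles) of the variable: the 50% cutoff is the median, and other percentages work the same way. A small sketch with Python's standard library, using made-up fee amounts; note that different interpolation methods can give slightly different cutoffs on small samples.

```python
import statistics

fees = [10, 20, 30, 40, 50]   # made-up dollar amounts

median_cutoff = statistics.median(fees)   # separates bottom 50% from top 50%
quartile_cutoffs = statistics.quantiles(fees, n=4, method="inclusive")  # 25%, 50%, 75%
print(median_cutoff, quartile_cutoffs)
```

Passing `n=100` instead of `n=4` returns the 99 percentile cut points, from which any other percentage can be picked.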

I am exploring the results of Dynamic Time Warping as implemented in the dtw package. While doing some sanity checks I came across a result which I cannot rationalize. At some points along the warp path, the cumulative distance appears to decrease. Example below: mat=...

I am looking for any function or method to create 2D array of random numbers whose median value is predefined like : array=generateNumbers(medianValue) will return 2D array with median value = medianValue Is it possible ? ...

I'm searching for a method or library for java (can be java-8) that is capable of generating a random sample (preferable with a fixed seed for deterministic testing) based on the numbers that make up a boxplot. So imagine having the boxplot: ---------- |-----| | |-----------| ---------- min A avg...

I have a list of numbers like -17, -50, 100, 120, 5, 20. How can I convert this series to percentages? I have a problem with the negative numbers when converting to percent. For example, I want to map these numbers to between 0 and 1, or 0% to 100%.
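
One simple option is min-max scaling, which shifts by the minimum and divides by the range so that negative values are handled like any others. This is a sketch of that one technique (other mappings are possible); the function name is made up.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values equal: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


numbers = [-17, -50, 100, 120, 5, 20]
scaled = min_max_scale(numbers)
print([f"{s:.1%}" for s in scaled])
```

Here -50 maps to 0% and 120 maps to 100%, with everything else in between.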

I have a dataset in which there are some outliers due to input errors. I have written a function to remove these outliers from my data frame (source): remove_outliers <- function(x, na.rm = TRUE, ...) { qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...) H <- 1.5 * IQR(x,...

I have an xts time series object: head(mtrx) ADS.DE.Close ALV.DE.Close BAS.DE.Close BAYN.DE.Close BEI.DE.Close BMW.DE.Close CBK.DE.Close CON.DE.Close DAI.DE.Close 2007-12-28 01:00:00 51.26 147.95 101.41 62.53 53.00 42.35 21.04 86.06 66.50 2008-01-02 01:00:00 50.00 145.92 100.94 61.45 52.39 42.73 20.75 83.76 64.68 2008-01-03 01:00:00 50.09 144.93 101.60 61.71 51.18 42.09 20.48 81.74 62.91...

I have a dataset including three factor variables in r and the output of my glm model consistently gives estimates for each individual categorical value. I tried to correct this by using the as.numeric command as shown below and I used the factor command in the glm model but I...

I am generating some data whose plots are as shown below. In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this? I am basically trying to...

I have a data frame with eight variables. I would like to calculate the average mean of annual weighted average percent loss. However, not all variables exist for each year in my dataset. What would be the simplest method to do so? Included below is a sample dataset and final...

I'm successfully using Welford's method to compute running variance and standard deviation as described many times on Stack Overflow and John D Cook's excellent blog post. However in the stream of samples, sometimes I encounter a "rollback", or "remove sample" order, meaning that a previous sample is no longer valid...
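
Welford's update can be run in reverse, assuming the removed value really was added earlier. The sketch below inverts the usual add step algebraically; the class and method names are made up, and repeated removals can accumulate floating-point cancellation error, so treat it as an illustration rather than a hardened implementation.

```python
class RunningStats:
    """Running mean/variance with add and remove (population variance)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        """Undo a previous add(x); x must be a value that was added earlier."""
        if self.n == 0:
            raise ValueError("no samples to remove")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        old_mean = self.mean
        self.n -= 1
        self.mean = old_mean - (x - old_mean) / self.n
        self.m2 -= (x - self.mean) * (x - old_mean)

    def variance(self):
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for sample in [1.0, 2.0, 3.0, 4.0, 10.0]:
    stats.add(sample)
stats.remove(10.0)   # roll back the last sample
print(stats.mean, stats.variance())
```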

I have a classification task with 4 classes which I solve with machine learning classifiers (SVM etc.). Which statistical measures can be used for 4 classes? I will for sure use p-value (with permutation test) but I need some more. Some interesting measures are true positive rate, true negative rate,...

I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package. My...