Test for statistical difference in proportions using R

r,statistics,glm
I am trying to solve the following problem: A person can be classified as either GroupA, GroupB or GroupC. I want to know how attribute1 (or attribute2) affects the proportion of observations in these groups. Note that attribute1:attribute2 has a 1:N relationship. Attribute1 has five possible values, A,B,C,D,E whilst attribute2...

R plot 'Heat map' of set of draws

r,plot,statistics,confidence-interval
I have a matrix with x rows (i.e. the number of draws) and y columns (the number of observations). They represent a distribution of y forecasts. Now I would like to make sort of a 'heat map' of the draws. That is, I want to plot a 'confidence interval' (not...

Angular statistics doesn't make sense

statistics,mean,angle,direction
I calculated the mean angle of the two angles below: 337.477792 and 324.8119785. I used the formula for the mean angle (see #Distribution of the mean) at http://en.wikipedia.org/wiki/Directional_statistics What I got was 28.85511475 / -28.8551147. The values don't look right ... I wonder if someone can explain this result for me? Thank...
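One way to check the result is to average the unit vectors and take `atan2` of the summed sines and cosines. The sketch below is a minimal illustration, not the asker's code; the function name `circular_mean_deg` is an assumption. Under this formula, -28.8551147 degrees and roughly 331.14 degrees describe the same direction, so the negative value may simply be the same answer reported in the (-180, 180] range.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    # Sum the unit vectors that correspond to each angle.
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    # atan2 returns a value in (-180, 180]; the modulo maps it to [0, 360).
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


mean_angle = circular_mean_deg([337.477792, 324.8119785])
print(mean_angle)
```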

R: how to use the bpower function to calculate 2-sample binomial test power

r,statistics
library(Hmisc) #10% difference n1 = 30 n2 = 30 n = 60 p1 = seq(0.1, 0.9, 0.1) p2 = p1 + 0.1 > bpower(p1, p2, n, n1, n2, alpha = 0.05) Power1 Power2 Power3 Power4 Power5 Power6 Power7 Power8 Power9 0.9997976 0.9992461 0.9933829 0.9670958 0.8995984 0.7799309 0.6141349 0.4211642 0.2252629 #20%...

Precipitation history of a city

statistics,weather
Does anyone know, if there's a site with precipitation (amount of rain in mm/inches) history for different cities? I need the data of Helsinki, Finland. I'm currently using Dark Sky Forecast API to get the current precipitation levels, but it doesn't seem to support that with history calls. I'll cache...

How to convert a list of tables into one big table in R

r,list,table,statistics,do.call
Using R... I have a list of tables. # Example data z <- list(cbind(c(1,2), c(3,4)), cbind(c(1,2), c(3,4,5,6)), cbind(c(1,2), c(1,2,3,4,5,6)), cbind(c(1,2), c(3,4)), cbind(c(1,2), c(3,4,5,6,9,4,5,6))) z <- setNames(z, c("Ethnicity", "Country", "Age Band", "Marital Status", "Hair Color")) z $Ethnicity [,1] [,2] [1,] 1 3 [2,] 2 4 $Country [,1] [,2] [1,] 1 3...

Pruning rule based classification tree (PART algorithm)

r,statistics,classification,decision-tree,rweka
I am using PART algorithm in R (via package RWeka) for multi-class classification. Target attribute is time bucket in which an invoice will be paid by customer (like 7-15 days, 15-30 days etc). I am using following code for fitting and predicting from the model : fit <- PART(DELAY_CLASS ~...

Comparing datasets to nonstandard probability distributions in Python

python,statistics,scipy,probability
I have a few large sets of data which I have used to create non-standard probability distributions (using numpy.histogram to bin the data, and scipy.interpolate's interp1d function to interpolate the resulting curves). I have also created a function which can sample from these custom PDFs using the scipy.stats package. My...

How to explain a higher percentage of point variability using kmeans clustering? [closed]

r,statistics,cluster-analysis,k-means
I'm doing some kmeans clustering: Regardless of how many clusters I choose to use, the percentage of point variability does not change: Here's how I am plotting my data: # Prepare Data mydata <- read.csv("~/student-mat.csv", sep=";") # Let's only grab the numeric columns mydata <- mydata[,c("age","Medu","Fedu","traveltime","studytime","failures","fam mydata <- na.omit(mydata) #...

Sequence following percentage

algorithm,statistics
Problem statement: A person has to visit a set of locations in sequence: Seq: L1 L2 L3 L4 L5 L6 (assume L1, L2, etc. are locations). But he went to the locations in a different sequence: ActualSeq: L3 L1 L2 L4 L5 L6 Now I need to find out what is...

Generalized additive models for calibration

r,statistics,probability,prediction,calibration
I work on calibration of probabilities. I'm using a probability mapping approach called generalized additive models. The algorithm I wrote is: probMapping = function(x, y, datax, datay) { if(length(x) < length(y))stop("train smaller than test") if(length(datax) < length(datay))stop("train smaller than test") datax$prob = x # trainset: data and raw probabilities datay$prob...

How the function auto.arima() in R determines d?

python,r,statistics,statsmodels
What is the test used in auto.arima() function in R to determine stationarity i.e to determine the value of "d" Can that logic be implemented in python?

Removing outliers in one step

r,data,statistics,analytics,outliers
I have a dataset in which there are some outliers due to input errors. I have written a function to remove these outliers from my data frame (source): remove_outliers <- function(x, na.rm = TRUE, ...) { qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...) H <- 1.5 * IQR(x,...

Memory-efficient Benjamini-Hochberg FDR correction using numpy/h5py

python,numpy,statistics,hdf5,h5py
I am trying to calculate a set of FDR-corrected p-values using Benjamini & Hochberg's method. However, the vector I am trying to run this on contains over 10 billion values. Given the amount of data, the normal method from statsmodels' multicomp module quickly runs out of memory. Looking at the...

Include values to the barplot and pie charts in R

r,statistics,bar-chart,pie-chart
I have data in this form: proprete.freq <- table(cnData$proprete) proprete.freq.genre <-table(cnData$genre,cnData$proprete) I am using these functions (barplot and pie) to plot the data: barplot(proprete.freq.genre, col = heat.colors(length(rownames(proprete.freq.genre))) , main="Proprete", beside = TRUE) pie(proprete.freq, col=rainbow(3), names.arg=avis, main="Propreté") Here is the result: Question: How to include the value just on top of...

how to: Separate a continuous variable by % proportions?

statistics,stata
I have a continuous variable (in this case, fees spent). How do I determine % spending cutoffs? i.e. how do I know what dollar amount separates the bottom 50% from the top 50% (similarly for any other % I may be interested in). Thank you very much for any help
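A cutoff that separates the bottom p% from the rest is a percentile (quantile) of the variable. The question is about Stata; the following is only a language-neutral sketch in Python, and the `fees` values are made up for illustration.

```python
import statistics

# Hypothetical fee amounts, not data from the question.
fees = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# The 50% cutoff is the median.
median_cutoff = statistics.median(fees)

# n=4 returns the three cut points that split the data into quarters.
quartile_cutoffs = statistics.quantiles(fees, n=4)

print(median_cutoff, quartile_cutoffs)
```

Other percentages follow the same pattern, for example `n=10` for deciles.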

How to choose a value between the two, with given probability in Excel?

excel,statistics,probability
I have list of probabilities in column 1. How I can fill column 2 with 0 and 1 based on the corresponding probabilities? 0.5 1 0.2 0 0.9 1 0.35 1 0.1 0 ...
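The usual approach is to draw a uniform random number per row and output 1 when it falls below that row's probability; in Excel this would presumably be a formula along the lines of `=IF(RAND()<A1,1,0)`. The Python sketch below shows the same idea; the function name `bernoulli_column` and the `seed` argument are assumptions added for illustration.

```python
import random


def bernoulli_column(probabilities, seed=None):
    """Return one 0/1 draw per probability; 1 occurs with that probability."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the comparison is true with probability p.
    return [1 if rng.random() < p else 0 for p in probabilities]


probs = [0.5, 0.2, 0.9, 0.35, 0.1]
draws = bernoulli_column(probs, seed=42)
print(draws)
```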

stdtr in python giving nan for p-value while doing t-test

python,statistics,scipy,p-value
I am using the following code to perform t-test: def t_stat(na,abar,avar,nb,bbar,bvar): logger.info("T-test to be performed") logger.info("Set A count = %f mean = %f variance = %f" % (na,abar,avar)) logger.info("Set B count = %f mean = %f variance = %f" % (nb,bbar,bvar)) adof = na - 1 bdof = nb -...

Python p value from t-statistic giving nan

python,statistics,scipy,p-value
I have some t-values and degrees of freedom and want to find the p-values from them (it's two-tailed). In the real world I would use a t-test table in the back of a Statistics textbook; however, I am using stdtr or stats.t.sf function in python. Both of them work fine...

Generate random numbers with logarithmic distribution and custom slope

javascript,algorithm,math,statistics,distribution
I'm trying to generate random integers with a logarithmic distribution. I use the following formula: idx = Math.floor(Math.log((Math.random() * Math.pow(2.0, max)) + 1.0) / Math.log(2.0)); This works well and produces a sequence like this for 1000 iterations (each number represents how many times that index was generated): [525, 261, 119, 45, 29,...

missing values when creating training and testing data with caret

r,statistics,r-caret
My question is about how handle missing values when using train for fitting models with caret. A small sample of my data would be like that: df <- dput(dat) structure(list(LagO3 = c(NA, NA, NA, 40, 45, NA), RH = c(69.4087524414062, 79.9608383178711, 64.4592437744141, 66.4207077026367, 66.0899200439453, 91.3353729248047), SR = c(298.928888888889, 300.128888888889, 303.688888888889,...

Multiply Probability Distribution Functions

r,statistics,probability-density
I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps. Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days =...

Weighted mean in numpy/python

python,numpy,statistics,mean,weighted
I have a big continuous array of values that ranges from (-100, 100) Now for this array I want to calculate the weighted average described here since it's continuous I want also to set breaks for the values every 20 i.e the values should be discrete as -100 -80 -60...

SAS statistic parameters of specific variables from data with conditions

variables,statistics,sas,conditional-statements,proc
I have a variable that I created from certain data. Now, in this new data, I need to calculate different statistical parameters, but with conditions, for example: * the median of this new variable only for observations whose birth country is not Italy. * the mean of a different variable only when...

OLS Breusch Pagan test in Python

python,statistics,canopy,pysal
I used the statsmodels package to estimate my OLS regression. Now I want Breusch Pagan test. I used the pysal package for this test but this function returns an error: import statsmodels.api as sm import pysal model = sm.OLS(Y,X,missing = 'drop') rs = model.fit() pysal.spreg.diagnostics.breusch_pagan(rs) returned error: AttributeError: 'OLSResults' object...

Pandas TimeGrouper: Drop “non full groups”

python,pandas,statistics
I'm grouping my data on some frequency, but it appears that TimeGrouper creates a last group on the right for some "left over" data. df.groupby([pd.TimeGrouper("2AS", label='left')]).sum()['shopping'].plot() I expect the data to be fairly constant over time, but the last data point at 2013 drops by almost half. I expect this...

ehCache Statistics with spring boot

statistics,spring-boot,ehcache
I have spring boot application with ehcache as below @Bean public EhCacheManagerFactoryBean ehCacheManagerFactoryBean() { EhCacheManagerFactoryBean ehCacheManagerFactoryBean = new EhCacheManagerFactoryBean(); ehCacheManagerFactoryBean.setConfigLocation(new ClassPathResource("ehcache.xml")); //ehCacheManagerFactoryBean.setCacheManagerName("messageCache"); ehCacheManagerFactoryBean.setShared(true); return ehCacheManagerFactoryBean; } @Bean public EhCacheCacheManager cacheManager() { return new...

Calculate variance in bash

linux,bash,awk,statistics,variance
I want to compute the variance of an input txt file like this one: 1, 5 2, 5 3, 5 4, 10 And I want the output to be like: 1, 0 2, 0 3, 0 4, 4.6875 I've used this line: awk '{c[NR]=$2; s=s+c[NR]; avg= s / NR; var=var+(($2...
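The expected output (0, 0, 0, 4.6875) matches a running population variance over the second column. The question asks for awk; the sketch below only illustrates the arithmetic in Python using Welford's update, and the function name is an assumption.

```python
def running_population_variance(values):
    """Yield the population variance of the first i values, for i = 1, 2, ..."""
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for n, x in enumerate(values, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        yield m2 / n  # divide by n (population variance), not n - 1


variances = list(running_population_variance([5, 5, 5, 10]))
print(variances)
```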

Draw geom_smooth only for fits that are significant

r,ggplot2,statistics
How can I make ggplot plot geom_smooth(method="lm"), but only if it fits some criteria? For instance, if I only want to draw lines if the slope is statistically significant (i.e. p from the lm fit is less than 0.01). EDIT: Updated to a more complex example involving facets. Instead of...

R dtw package: cumulative cost matrix decreases at some points along the path?

r,statistics
I am exploring the results of Dynamic Time Warping as implemented in the dtw package. While doing some sanity checks I came across a result which I cannot rationalize. At some points along the warp path, the cumulative distance appears to decrease. Example below: mat=...

Matlab: What are the ways to determine the distribution of the data

matlab,statistics,distribution
I have a data set of n = 1000 realizations of a random variable X and is univariate -- X = {x1, x2,...,xn}. Data is generated by varying a parameter on which the random variable depends. For example, let the r.v be Area of a circle. So, by varying the...

Access variables of a regression model in R

r,statistics
I am working with a linear model, say y<-rnorm(20) x1<-rgamma(20,2,1) x2<-rpois(20,3) fit<-lm(y~x1*x2) summary(fit) and I was wondering, is there a way to access the regression variables through lm? One option would be to simply use fit$model and what you get is y x1 x2 1 1.52366782 1.1741392 4 2 -0.23640711...

The confidence interval by the intercept with linear regression in R

r,statistics
Let's say that I have two variables, weight and age, and I have to find the 99% confidence interval for the ordinate (Y-axis intercept) after fitting a linear regression a=lm(weight~age). I know that the ordinate is simply the intercept, but why won't this work: predict(a, newdata=data.frame(age=intercept),...

python consecutive counts of an occurence with length

python,statistics
this is probably really easy to do but I am looking to calculate the length of consecutive positive occurrences in a list in python. For example, I have a and I am looking to return b: a=[0,0,1,1,1,1,0,0,1,0,1,1,1,0] b=[0,0,4,4,4,4,0,0,1,0,3,3,3,0] I note a similar question on Counting consecutive positive value in Python...
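One possible approach is to group consecutive elements by whether they are positive and then repeat each run's length across the run. This is a minimal sketch; the function name `run_lengths` is an assumption.

```python
from itertools import groupby


def run_lengths(values):
    """Replace each positive run by its length; non-positive entries become 0."""
    out = []
    # groupby splits the list into maximal runs with the same key (positive or not).
    for is_positive, group in groupby(values, key=lambda v: v > 0):
        n = len(list(group))
        out.extend([n if is_positive else 0] * n)
    return out


a = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
b = run_lengths(a)
print(b)
```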

Automatically truncating a curve to discard outliers in matlab

matlab,plot,statistics,outliers
I am generating some data whose plots are shown below. In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this? I am basically trying to...

How to perform operation on each row in R (apply()?)

r,statistics
So I have the following table: 118.00 12.00 25.00 161.00 26.00 2.00 9.00 47.00 76.00 218.00 1.00 21.00 11.00 64.00 0.00 9.00 53.00 124.00 2.00 51.00 86.00 25.00 25.00 0.00 20.00 14.00 212.00 104.00 38.00 46.00 I parse it in the following way: data2 <- read.table('to_r.txt', fill=T) I then want...

walking through two tables (matrices) matching columns and applying a function in R

r,statistics
So I have two matrices. Let's name them controls and patients. Each row is a sample, and each column is a concentration of a certain protein. It looks like this: V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 sample1 1533.34 9.88 6.82 17.88 70.75 350.07 20.67 13.96...

Count occurrence of unique variables

r,count,statistics
I have a data frame of variables, some occur more than once, e.g.: a, b, b, b, c, c, d, e, f I would then like to get an output (in two columns) like this: a 1; b 3; c 2; d 1; e 1; f 1. Bonus question: I'd...
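The question is about R, where a frequency table is the natural tool. As a language-neutral illustration, the same two-column count can be sketched in Python with `collections.Counter`; the variable names are assumptions.

```python
from collections import Counter

values = ["a", "b", "b", "b", "c", "c", "d", "e", "f"]

# Counter maps each distinct value to the number of times it occurs.
counts = Counter(values)

# Sort by value to get a stable two-column listing.
table = sorted(counts.items())
for value, n in table:
    print(value, n)
```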

Quantifying importance of variables in Canonical Correspondence Analysis using R? (x-post from researchgate)

r,statistics,correlation,vegan
I currently have species abundance data for multiple lakes along with measurements of some environmental variables of those lakes. I decided to do Canonical Correspondence Analysis of the data in R, as demonstrated by ter Braak and Verdenschot (1995), see link: http://link.springer.com/article/10.1007%2FBF00877430 (section: "Ranking environmental variables in importance") I am not...

P-value, significance level and hypothesis

statistics,p-value
I am confused about the concept of the p-value. In general, if the p-value is greater than alpha, which is typically 0.05, we fail to reject the null hypothesis, and if the p-value is less than alpha, we reject the null hypothesis. As I understand, if the p-value is greater than alpha,...

How can I reformat a table in R?

r,statistics,reformatting
I loaded a table like this: V1 V2 V3 pat1 1 2 pat1 3 1 pat1 4 2 pat2 3 3 pat3 1 4 pat3 2 3 and I need to format it into something like the following, with V1 indicating the row, V2 indicating the column, and the values...

Factor variables in r

r,statistics,glm,categorical-data
I have a dataset including three factor variables in r and the output of my glm model consistently gives estimates for each individual categorical value. I tried to correct this by using the as.numeric command as shown below and I used the factor command in the glm model but I...

How do you know if a data set is right for linear regression if it has multiple features?

machine-learning,statistics,linear-regression
If it has one feature it's easy. Just graph it. One of the records there looks like (18, 15). Simple. But if we have multiple features that adds more dimensions to the graph, right? So how can you visualize your data set and determine whether or not linear regression is...

What is the relationship between marks and covariates in point process

process,statistics,point,spatstat
I am confused about marks and covariates in point processes. I am trying to create a model of a marked point pattern with a few covariates in R by using spatstat, but I am not sure about the relationship between marks and covariates. Could anyone help me? Thanks. ---- update I have...

Normal probability density function - GSL equivalent in Haskell

haskell,statistics,gsl
I am trying to find the best way to draw from a normal distribution. I want to be able to use the normal probability density function (and its cumulative) in Haskell. To put it simply, I want to use the functions on this page without using the GSL binding... I...

Detect two data stream deviated from each other

algorithm,math,statistics,dynamic-programming
An interesting problem came up the other day. I have two data streams that are being output continuously, say A and B (with different values). In an ideal world, A and B are exact opposites: if A increases by x percent, then B will decrease by x percent, and vice...

Random number generator, what's wrong with my approach/statistics? [JS]

javascript,random,statistics,sha512
First of all, what I want to know is if I am doing a systematic fault or am I messing up with the math, or do you have any ideas what could be wrong? I was trying to write a little random number generator which numbers can be influenced /...

python: finding the value of a random variable for a cdf

python,statistics,scipy,normal-distribution,cdf
I apologize in advance if this is poorly worded. If I have a stdDev = 1, mean = 0, scipy.stats.cdf(-1, loc = 0, scale = 1) will give me the probability that a normally distributed random variable will be <= -1, and that is 0.15865525393145707. Given 0.15865..., how do I...
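Going from a probability back to the value is the inverse CDF (also called the quantile or percent-point function); in SciPy the analogous call should be `scipy.stats.norm.ppf`. The sketch below uses only the standard library's `statistics.NormalDist` so it runs without SciPy.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Forward direction: P(X <= -1) for a standard normal variable.
p = std_normal.cdf(-1.0)

# Inverse direction: the x for which P(X <= x) equals p.
x = std_normal.inv_cdf(p)

print(p, x)
```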

How many degrees of freedom R package: 'fGarch'

r,statistics
I would like to know how to find out the number of degrees of freedom for a Student's t distribution of the standardized residuals of a GARCH model (using garchFit in R from the fGarch package). Is there another package or any other way to estimate this parameter?...

How to work with linear regression and confidence intervals in R?

r,statistics
I would like to know the used commands in R to work with linear regression problems and confidence intervals, and why these ones are incorrect. For example let's say we have the following data: A <- c(12,11,12,15,13,16,13,18,11,14) # this is the width B <- c(50,51,62,45,63,76,53,68,51,74) # this is the height...

How do I calculate the standard deviation between weighted measurements?

statistics,standard-deviation,weighted-average
I have several weighted values for which I am taking a weighted average. I want to calculate a weighted standard deviation using the weighted values and weighted average. How would I modify the typical standard deviation to include weights on each measurement? This is the standard deviation formula I am...

Cannot generalize my Genetic Algorithm to new Data

statistics,genetic-algorithm,prediction,generalization
I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the...

R - Need help simulating estimators

r,statistics,normal-distribution
I am given two estimators, T1 = ((X1+X2)/2)*((Y1+Y2)/2) and T2 = (X1*Y1+X2*Y2)/2. These are used for estimating an area, for example, T1 = ((503+505)/2)*((334+330)/2) and T2 = (503*334 + 505*330)/2. X1, X2 are normally distributed with mean µ1 and variance σ^2 and Y1, Y2 are normally distributed with mean µ2...

Which statistical measures for 4 class classification?

machine-learning,statistics,classification,multilabel-classification
I have a classification task with 4 classes which I solve with machine learning classifiers (SVM etc.). Which statistical measures can be used for 4 classes? I will for sure use p-value (with permutation test) but I need some more. Some interesting measures are true positive rate, true negative rate,...

Weighted Average by Year in ragged data frame in R

r,statistics,weighted-average
I have a data frame with eight variables. I would like to calculate the average mean of annual weighted average percent loss. However, not all variables exist for each year in my dataset. What would be the simplest method to do so? Included below is a sample dataset and final...

Java generate random number based on boxplot

java,statistics,boxplot,random-sample
I'm searching for a method or library for java (can be java-8) that is capable of generating a random sample (preferable with a fixed seed for deterministic testing) based on the numbers that make up a boxplot. So imagine having the boxplot: ---------- |-----| | |-----------| ---------- min A avg...

Removing a prior sample while using Welford's method for computing single pass variance

algorithm,math,statistics,variance,standard-deviation
I'm successfully using Welford's method to compute running variance and standard deviation as described many times on Stack Overflow and John D Cook's excellent blog post. However in the stream of samples, sometimes I encounter a "rollback", or "remove sample" order, meaning that a previous sample is no longer valid...
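Welford's update can be run in reverse, provided the value being removed really was added earlier. The sketch below is one possible formulation rather than the asker's implementation; the class name `RunningStats` and its methods are assumptions.

```python
class RunningStats:
    """Single-pass mean and variance with support for removing a prior sample."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        """Undo a previous add(x); x must have been added before."""
        if self.n == 0:
            raise ValueError("no samples to remove")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        # Mean of the remaining samples, then reverse the m2 increment.
        old_mean = (self.n * self.mean - x) / (self.n - 1)
        self.m2 -= (x - self.mean) * (x - old_mean)
        self.m2 = max(self.m2, 0.0)  # guard against tiny negative rounding error
        self.mean = old_mean
        self.n -= 1

    def sample_variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
stats.add(100)     # a sample that later turns out to be invalid
stats.remove(100)  # roll it back
print(stats.mean, stats.sample_variance())
```

Repeated removals can accumulate rounding error, so recomputing from scratch now and then may be worthwhile for long streams.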

Determining percentile based on reference table

r,statistics
I have standardized normal values for heart rates and respiratory rates in children from a recent article. I copied them into a csv to use as a dataset in R. The data is simply different age ranges (ex 3 months to 6 months, or 1 year to 2 years) and...

Random Forest Classifier Matlab v/s Python

python,matlab,machine-learning,statistics,random-forest
I used a Random Forest Classifier in Python and MATLAB. With 10 trees in the ensemble, I got ~80% accuracy in Python and barely 30% in MATLAB. This difference persisted even when MATLAB's random forests were grown with 100 or 200 trees. What could be the possible reason for this...

Python, beautifulsoup scraping specific or exact numbers from a stat table

python,table,statistics,beautifulsoup,screen-scraping
On a player stat page. How can I make my anchor point the year "2014" and grab specific numbers in the 2014 column (scrape numbers to the right of 2014) The code below is skipping the "Passing" table (with all of the career passing stats) and trying to grab stats...

R optimization with equality and inequality constraints

r,statistics,mathematical-optimization,minimization
I am trying to find the local minimum of a function, and the parameters have a fixed sum. For example, Fx = 10 - 5x1 + 2x2 - x3 and the conditions are as follows, x1 + x2 + x3 = 15 (x1,x2,x3) >= 0 Where the sum of x1,...

Starting an application when starting Apache Server

java,apache,statistics,server
Is there a way to start a Java application when launching Apache Server? I need to generate a report of the server's statistics, periodically, with a separate application.

Error with applying data type in method to variables in main tester class

java,arrays,methods,statistics
How can I get this to calculate properties of arrays of doubles? If everything else inside is an int, would that still count as an array of doubles? Or is it still an array of doubles anyway because of the method type? Here is my class. Thanks so much! import java.util.*;...

C# Linq Dictionary IO

c#,linq,dictionary,io,statistics
How can I do this with LinQ? I have a txt file.(Its about 100 lines long.) 6 7 0/3 # ##t#kon tu#i#do#b#n ko#yk####a#s##ttak###lk##ul$$$$#$#$$$####$$$$#$$$$$$#$$#$$$$$#$ I stored it in a Dictionary (The two lines). alap = File.ReadAllLines("veetel.txt"); Dictionary<string,string> statisztikaDictionary = new Dictionary<string,string>(); for (int z = 1; z < alap.Length; z+=2) {...

Disaggregate one row of data to multiple rows

r,excel,statistics,dataset,google-adwords
Good afternoon! I am having some trouble with my dataset. I am using a Google AdWords export for data analysis and I want to fit a logit regression model to the data to determine whether an experiment I have conducted impacts the conversion. The problem is that the data is aggregated...

Which scaling technique does it use?

matlab,statistics,matlab-guide,matlab-deployment
I have a matrix X, the size of which is 100*2000 double. I want to know which kind of scaling technique is applied to matrix X in the following command, and why it does not use z-score to do scaling? X = X./repmat(sqrt(sum(X.^2)),size(X,1),1); ...
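Read literally, the MATLAB expression divides every column by the square root of its sum of squares, which is scaling each column to unit Euclidean (L2) norm; no mean is subtracted, so it is not a z-score. A plain-Python sketch of the same arithmetic follows; the function name and the small example matrix are assumptions.

```python
import math


def normalize_columns(X):
    """Divide each column of a row-major matrix by its Euclidean norm."""
    n_cols = len(X[0])
    # One norm per column: sqrt of the column's sum of squares.
    # A column of all zeros would cause a division by zero below.
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in X]


X = [[3.0, 1.0],
     [4.0, 1.0]]
Xn = normalize_columns(X)
print(Xn)
```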

Calculating Kendall's tau using scipy and groupby

python,pandas,statistics,scipy
I have a csv file with precipitation data per year and per weather station. It looks like this: station_id year Sum 210018 1916 65.024 210018 1917 35.941 210018 1918 28.448 210018 1919 68.58 210018 1920 31.115 215400 1916 44.958 215400 1917 31.496 215400 1918 38.989 215400 1919 74.93 215400 1920...

Integrating Power pdf to get energy pdf?

statistics,probability
I'm trying to work out how to solve what seems like a simple problem, but I can't convince myself of the correct method. I have time-series data that represents the pdf of a Power output (P), varying over time, also the cdf and quantile functions - f(P,t), F(P,t) and q(p,t)....

How does the proc/diskstats work to present that values? And for proc/stat and meminfo?

linux,unix,statistics,monitoring,proc
I am trying to get the diskstats data by same way that the file does. Is there any way to reach that values without reading that file? How the values are placed there? Is there any ".c" file that processes the data to place on diskstats? And for proc/stat and...

Why equation (5) is equal to equation (6)? [closed]

math,statistics
I found this material online, and I do not understand why equation (5) is equal to equation (6). How is it derived? Given a dictionary D, a vector x has sparsity s if it can be written exactly as a linear combination of s columns of D. An important result that underlies...

R: How should I specify the “init” and “fixed” parameters in the arima function in R?

r,statistics
I have a problem in understanding how the init and fixed parameters are specified in the arima function in R. For example, I will use R's built-in dataset lh to illustrate the idea: The line below works fine arima(lh, order = c(1,0,0)) But this line does not work as expected...

Detecting significant changes in data

c#,algorithm,data,graph,statistics
I have a graph input where the X axis is time (going forwards). The Y axis is generally stable but has large drops and rises at different points (marked by the red arrows below). Visually it's obvious, but how do I efficiently detect this from within code? I'm not sure...

Lasso Generalized linear model in Python

python,statistics,scikit-learn,statsmodels,cvxopt
I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in Python. Matlab provides the nice function lassoglm(X,y, distr), where distr can be poisson, binomial etc. I had a look at both statsmodels and scikit-learn but I did not find any...

Rolling average pairwise correlation - code doesn't work as expected

r,statistics,time-series,correlation,xts
I have an xts time series object: head(mtrx) ADS.DE.Close ALV.DE.Close BAS.DE.Close BAYN.DE.Close BEI.DE.Close BMW.DE.Close CBK.DE.Close CON.DE.Close DAI.DE.Close 2007-12-28 01:00:00 51.26 147.95 101.41 62.53 53.00 42.35 21.04 86.06 66.50 2008-01-02 01:00:00 50.00 145.92 100.94 61.45 52.39 42.73 20.75 83.76 64.68 2008-01-03 01:00:00 50.09 144.93 101.60 61.71 51.18 42.09 20.48 81.74 62.91...

Is R able to compute contingency tables on big file without putting the whole file in RAM?

r,file-io,statistics,contingency
Let me explain the question: I know the functions table or xtabs compute contingency tables, but they expect a data.frame, which is always stored in RAM. It's really painful when trying to do this on a big file (say 20 GB, the maximum I have to tackle). On the other...

Convert list of numbers to percent between positive and negative number

math,statistics
I have a list of numbers like -17, -50, 100, 120, 5, 20. How can I convert this series to percentages? I have a problem with the negative numbers when converting to percent. For example, I want to map these numbers to the range 0 to 1, or 0% to 100%.
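One common reading of this request is min-max scaling: subtract the minimum and divide by the range, so the smallest value maps to 0 and the largest to 1. Whether that is the mapping the asker wants is an assumption; the sketch below only illustrates it.

```python
def min_max_scale(values):
    """Linearly map values so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


numbers = [-17, -50, 100, 120, 5, 20]
scaled = min_max_scale(numbers)
percent = [100.0 * s for s in scaled]
print(scaled)
```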

Histogram-like summary for interval data

r,statistics,histogram
How do I get a histogram-like summary of interval data in R? My MWE data has four intervals. interval range Int1 2-7 Int2 10-14 Int3 12-18 Int4 25-28 I want a histogram-like function which counts how the intervals Int1-Int4 span a range split across fixed-size bins. The function output should...

Subtract mean from a variable by group in bigmemory in R

r,functional-programming,statistics
I would like to demean the variables from the big.matrix (panel) structure. I tried different methods but the one which works in bigmemory setting is tapply (provided by bigtabulate package). I have the following code to calculate means of variable var1 by groups represented by panel_id data <- read.big.matrix ("data.csv",...

Get a variables value from one dataset if falling in a range defined by two variables in another dataset in R

r,date,statistics,dataset
I have a question regarding dates manipulation in R. I've looked around for days but couldn't find any help online. I have a dataset where I have id and two dates and another dataset with the same id variable, date and price. For example: x = data.frame(id = c("A","B","C","C"), date1...

Regression loop in R for data frames

r,loops,statistics,data.frame,regression
rm(list=ls()) myData <-read.csv(file="C:/Users/Documents/myfile.csv",header=TRUE, sep=",") for(i in names(myData)) { colNum <- grep(i,colnames(myData)) ##asigns a value to each column if(is.numeric(myData[3,colNum])) ##if row 3 is numeric, the entire column is { ##print(nxeData[,i]) fit <- lm(myData[,i] ~ etch_source_Avg, data=myData) #does a regression for each column in my csv file against my independent variable 'etch'...

Splitting strings from integers in R

r,statistics
I've recently come across an interesting problem while trying to create a custom database. my rows are in form: 183746IGH 105928759UBS and so on (so basically an integer concatenated with a string, both of relatively random sizes.). What I'm trying to do is somehow separate the whole number in column...
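Assuming every row is a run of digits followed by a run of letters, as in the two examples given, a regular expression with two capture groups separates the parts. The question is about R; this Python sketch only illustrates the pattern, and the function name is an assumption.

```python
import re

# Digits first, then letters; nothing else is allowed in the token.
PATTERN = re.compile(r"(\d+)([A-Za-z]+)")


def split_number_and_code(token):
    """Split e.g. '183746IGH' into (183746, 'IGH')."""
    match = PATTERN.fullmatch(token)
    if match is None:
        raise ValueError(f"unexpected format: {token!r}")
    return int(match.group(1)), match.group(2)


rows = ["183746IGH", "105928759UBS"]
parts = [split_number_and_code(r) for r in rows]
print(parts)
```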

Find previous hour and next hour in R

r,datetime,statistics
Suppose I pass "2015-01-01 01:50:50", then it should return "2015-01-01 01:00:00" and "2015-01-01 02:00:00". How to calculate these values in R?
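The question asks for R; the logic itself is to truncate the timestamp to the hour and then add one hour. A Python sketch of that logic follows, with the function name as an assumption. Note that a timestamp exactly on the hour is returned unchanged as the "previous" hour here.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def surrounding_hours(stamp):
    """Return the hour boundary at or before stamp and the one after it."""
    t = datetime.strptime(stamp, FMT)
    # Drop minutes, seconds and microseconds to floor to the hour.
    previous_hour = t.replace(minute=0, second=0, microsecond=0)
    next_hour = previous_hour + timedelta(hours=1)
    return previous_hour.strftime(FMT), next_hour.strftime(FMT)


bounds = surrounding_hours("2015-01-01 01:50:50")
print(bounds)
```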

How scipy.stats handles nans?

python,numpy,statistics,scipy,missing-data
I am trying to do some statistics in Python. I have data with several missing values, filled with np.nan, and I am not sure whether I should remove them manually or whether scipy can handle them. So I tried both: import scipy.stats, numpy as np a = [0.75, np.nan, 0.58337, 0.75, 0.75,...

How to specify gamma distribution using shape and rate in Python?

python,statistics,scipy
With Scipy gamma distribution, one can only specify shape, loc, and scale. How do I create a gamma variable with shape and rate?
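Rate is the reciprocal of scale, so a shape/rate gamma is obtained by passing `scale = 1 / rate`; with SciPy that would presumably look like `scipy.stats.gamma(a=shape, scale=1.0/rate)`. The sketch below uses the standard library's `random.gammavariate`, whose second argument is a scale, so it runs without SciPy; the helper name and the example parameters are assumptions.

```python
import random


def gamma_from_rate(shape, rate, rng=random):
    """Draw one gamma variate parameterised by shape and rate."""
    # gammavariate takes (shape, scale), and scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


rng = random.Random(0)
shape, rate = 2.0, 4.0
samples = [gamma_from_rate(shape, rate, rng) for _ in range(20000)]

# The theoretical mean of a gamma distribution is shape / rate.
sample_mean = sum(samples) / len(samples)
print(sample_mean)
```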

How to calculate KNN Variable Importance in R

r,machine-learning,statistics,classification
I implemented an Authorship attribution project where I was able to train my KNN model with articles from two authors using KNN. Then, I classify the author of a new article to be either author A or author B. I use knn() function to generate the model. The output of...

Matlab: trying to estimate multifractal spectrum from time series by histogram box-counting

matlab,statistics,time-series,histogram,fractals
I am using the approach from this Yale page on fractals: http://classes.yale.edu/fractals/MultiFractals/Moments/TSMoments/TSMoments.html which is also expounded on this set of lecture slides (slide 32): http://multiscale.emsl.pnl.gov/docs/multifractal.pdf The idea is that you get a dataset, and examine it through many histograms with increasing numbers of bars i.e. resolution. Once resolution is high...

How to compute Minitab-equivalent quartiles using NumPy

python,numpy,statistics,minitab
I have a homework assignment that I was doing with Minitab to find quartiles and the interquartile range of a data set. When I tried to replicate the results using NumPy, the results were different. After doing some googling, I see that there are many different algorithms for computing quartiles:...
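The discrepancy described here usually comes down to which position rule is used. As far as I know, Minitab places the p-th quantile at position (n + 1) * p in the sorted data and interpolates linearly, which corresponds to type 6 in R's `quantile` and, in recent NumPy versions, to `method="weibull"`. Treat that mapping as an assumption to verify against your own Minitab output. A minimal pure-Python sketch of the rule, with the hypothetical name `quantile_n_plus_1` and made-up data:

```python
def quantile_n_plus_1(data, p):
    """Quantile at 1-based position (n + 1) * p with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    h = (n + 1) * p
    h = min(max(h, 1), n)   # clamp the position to the valid range [1, n]
    lower = int(h)          # 1-based index of the lower neighbour
    frac = h - lower
    if lower >= n:
        return xs[-1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [7, 15, 36, 39, 40, 41]   # illustrative values only
q1 = quantile_n_plus_1(data, 0.25)
q3 = quantile_n_plus_1(data, 0.75)
print(q1, q3, q3 - q1)           # quartiles and interquartile range
```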

Can't use scipy stats function on nested list

python,numpy,statistics,scipy,nested-lists
I've been trying to scipy.mstats.zscore a dataset that is intentionally organized into a nested list, and it gives: TypeError: unsupported operand type(s) for /: 'list' and 'long' which probably suggests that scipy.stats doesn't work for nested lists. What can I do about it? Does a for loop affect the nature...
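The TypeError suggests that a plain Python list reached an arithmetic step, and lists do not support division. If the inner lists have equal length, converting the data to a NumPy array first is the usual route. If they are ragged, one option is to score each inner list separately. The sketch below assumes that per-list scoring is what is wanted. It uses only the standard library, with the population standard deviation to match the `ddof=0` default of `scipy.stats.zscore`. The data are made up, and a constant inner list would cause a division by zero.

```python
import statistics

def zscores(values):
    """Z-scores of one flat list, using the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Illustrative ragged data: inner lists of different lengths.
nested = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0, 40.0]]

# Score every inner list on its own; the nesting is preserved.
result = [zscores(inner) for inner in nested]
print(result)
```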

Arima.sim issues in R

r,math,statistics,time-series,forecasting
I am working on making a prediction in R using time-series models. I used the auto.arima function to find a model for my dataset (which is a ts object). fit<-auto.arima(data) I can then plot the results of the prediction for the 20 following dates using the forecast function: plot(forecast(fit,h=20)) However...

R by() round stat.desc

r,statistics,rounding
I am concerned about the following question. If I apply the function by(productivity$LOC, productivity$extension, stat.desc, norm = TRUE, basic = TRUE) how can I round the output values of the by() function? ...

“Icon” (ISOTYPE) charts in R shiny with Javascript

javascript,r,plot,statistics,shiny
I'm working on a project to build several models for data analysis and reporting using R and the amazing Shiny framework for web development. I'm just getting started with R and Shiny and have had an amazing experience so far, but I'd like to get some help in case someone has...

Statistics of region of numpy array

python,numpy,statistics
I have an array that measures about 2000 elements long, and I would like to figure out the standard deviation of it centered at each pixel by sliding a make-believe window of some relatively small width over it, and computing the StDev of the elements in each region, yielding an...
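The sliding-window idea described above can be sketched directly: for each position, take the elements within half a window on either side and compute their standard deviation. This is a plain-Python illustration, not a tuned implementation. The name `rolling_std`, the truncation of windows at the edges, and the use of the population standard deviation are all assumptions. For about 2000 elements the simple loop is fast enough.

```python
import statistics

def rolling_std(values, width):
    """Population SD of a centred window at every position (edges truncated)."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # clip the window at the left edge
        hi = min(len(values), i + half + 1)   # clip the window at the right edge
        out.append(statistics.pstdev(values[lo:hi]))
    return out

signal = [1, 1, 1, 5, 1, 1, 1]   # illustrative data with one spike
stds = rolling_std(signal, width=3)
print(stds)   # zero away from the spike, positive around it
```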

Statsmodels - Wald Test for significance of trend in coefficients in Linear Regression Model (OLS)

python,statistics,linear-regression,statsmodels
I have used Statsmodels to generate an OLS linear regression model to predict a dependent variable based on about 10 independent variables. The independent variables are all categorical. I am interested in looking closer at the significance of the coefficients for one of the independent variables. There are 4 categories,...

Java: Probability distribution support

java,math,statistics
I need to implement random sampling from a number of common probability distributions (normal, binomial, gamma, ...) in my Java program. I found Random.nextGaussian() and was wondering whether there is any built-in support for distributions other than the normal. Or are my only options a third-party library or DIY?

scipy.mstats.theilslopes error in confidence limit if data have missing values

python,statistics,scipy
If one uses the scipy.mstats.theilslopes routine on a data set with missing values, the results of the lower and upper bounds for the slope estimate are incorrect. The upper bound is often/always(?) NaN, while the lower bound is simply wrong. This happens because the theilslopes routine computes an index into...

Generate random numbers with pre-defined median in Matlab?

matlab,random,statistics,median
I am looking for any function or method to create a 2D array of random numbers whose median value is predefined, like: array=generateNumbers(medianValue), which will return a 2D array with median value = medianValue. Is it possible? ...
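The question is about Matlab, but one simple approach carries over to any language: draw the numbers from any distribution, then shift all of them by the difference between the target median and the sample median. The sketch below illustrates the idea in Python. The name `random_grid_with_median`, the uniform base distribution and the `seed` parameter are assumptions. Shifting fixes only the median and leaves the spread of the draws unchanged.

```python
import random
import statistics

def random_grid_with_median(rows, cols, median_value, seed=None):
    """rows x cols nested list of random numbers whose median is median_value."""
    rng = random.Random(seed)
    flat = [rng.random() for _ in range(rows * cols)]
    # Shift every draw so that the sample median lands on the target.
    shift = median_value - statistics.median(flat)
    shifted = [x + shift for x in flat]
    # Cut the flat list back into rows.
    return [shifted[r * cols:(r + 1) * cols] for r in range(rows)]

grid = random_grid_with_median(3, 5, median_value=5.0, seed=1)
all_values = [x for row in grid for x in row]
print(statistics.median(all_values))  # equal to 5.0 up to floating-point rounding
```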

Fill in missing rows with R data.table

r,statistics,data.table
I have a data.table in R that was fetched from a database that looks like this: date,identifier,description,location,value1,value2 2014-03-01,1,foo,1,100,200 2014-03-01,1,foo,2,200,300 2014-04-01,1,foo,1,100,200 2014-04-01,1,foo,2,100,200 2014-05-01,1,foo,1,100,200 2014-05-01,1,foo,2,100,200 2014-03-01,2,bar,1,100,200 2014-04-01,2,bar,1,100,200 2014-05-01,2,bar,1,100,200 2014-03-01,3,baz,1,100,200 2014-03-01,3,baz,2,200,300 2014-04-01,3,baz,1,100,200 2014-04-01,3,baz,2,100,200 2014-05-01,3,baz,1,100,200...

Caret package for R. Which samples are held out?

r,statistics,r-caret
I'm using the caret package to play around with many classification methods. Currently I want to use a leave group out cross validation method (I know there are better methods). This is the train control I am using: train_control <- trainControl(method = "LGOCV", p = .7, number = 1) My...

Mathematica: difficulty using Multinormal Distribution and InverseCDF functions

statistics,wolfram-mathematica,normal-distribution,cdf
I'm struggling to use the functions MultinormalDistribution and InverseCDF in the MultivariateStatistics package. Essentially << MultivariateStatistics` sig = .5; u = .5; dist = MultinormalDistribution[{0, 0}, sig*IdentityMatrix[2]]; delta=InverseCDF[dist, 1 - u] The output is InverseCDF[ MultinormalDistribution[{0, 0}, {{0.5, 0}, {0, 0.5}}], {0.5}] Can someone correct the above code? If I've understood...

How to compare two products based on their ratings?

statistics
I am interested in knowing how to calculate a ranking score from the ratings of a product. For example, take the Apple App Store. There are two products, A and B. Both have the same average rating, but 100 reviewers have rated A whereas 1000 reviewers have rated B. Intuitively it seems B should...
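One common way to express the intuition that more reviews deserve more trust is a Bayesian (shrunken) average: each product's mean rating is pulled toward a prior mean, and the pull weakens as the number of ratings grows. The sketch below is one possible scoring rule, not the method any particular store uses. The ratings, the prior mean of 3.0 and the prior weight of 50 are made-up values for illustration.

```python
def bayesian_average(mean_rating, n_ratings, prior_mean, prior_weight):
    """Mean rating shrunk toward prior_mean; prior_weight acts like pseudo-ratings."""
    total = prior_weight * prior_mean + n_ratings * mean_rating
    return total / (prior_weight + n_ratings)

# Same average rating, different numbers of reviewers (illustrative numbers).
score_a = bayesian_average(4.5, 100, prior_mean=3.0, prior_weight=50)
score_b = bayesian_average(4.5, 1000, prior_mean=3.0, prior_weight=50)
print(score_a, score_b)  # B stays closer to 4.5 because it has more ratings
```

With an average above the prior mean, the product with more ratings scores higher. With an average below the prior mean the order reverses, which is the intended behaviour of the shrinkage.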