python,csv,pandas,resampling,merging-data , Resampling and merging data frame with python
Resampling and merging data frame with python
Question:
Tag: python,csv,pandas,resampling,merging-data
Hi I have created a dictionary of dataFrame with this code
import os
import pandas
import glob
path="G:\my_dir\*"
dataList={}
for files in glob.glob(path):
dataList[files]=(read_csv(files,sep=";",index_col='Date'))
The different dataframe present in the dictory have different time sample. An example of dataFrame(A) is
Date Volume Value
2014-01-04 06:00:02 6062 108000.0
2014-01-04 06:06:05 6062 107200.0
2014-01-04 06:12:07 6062 97400.0
2014-01-04 06:18:10 6062 99200.0
2014-01-04 06:24:12 6062 91300.0
2014-01-04 06:30:14 6062 84100.0
2014-01-04 06:36:17 6062 57000.0
Example of dataFrame(B) is
Date Volume Value
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
The different data frame don't have the same number of rows. I want to merge the different data frame in a single one like this:
Data Volume B A Value(DataframeN)
2014/04/01 05:52:50 6062 4.70 NaN
2014/04/01 05:58:53 6062 4.70 NaN
2014/04/01 06:04:56 6062 4.90 107465.51
2014/04/01 06:10:58 6062 5.10 100652.60
2014/04/01 06:17:01 6062 5.20 98899.57
2014/04/01 06:23:03 6062 5.20 92618.56
2014/04/01 06:29:05 6062 5.50 85301.73
2014/04/01 06:35:08 6062 5.50 61523.06
I have done this easily with Matlab using with the command
ts_A=timeseries(ValueA,datenum(DateA));
ts_B=timeseries(ValueB,datenum(DateB));
res_A=resample(ts_A,datenum(DateB));
I have to do this for several sets of csv files so I wanted to automate the process with python.
Tnx
Answer:
You can concat
the two DataFrames
, interpolate
, then reindex
on the DataFrame
you want.
I assume we have a certain number of DataFrames
, where the Date
is a DateTimeIndex
in all of them. I will use two in this example, since you used two in the question, but the code will work for any number.
df_a
:
Volume Value
Date
2014-01-04 06:00:02 6062 108000
2014-01-04 06:06:05 6062 107200
2014-01-04 06:12:07 6062 97400
2014-01-04 06:18:10 6062 99200
2014-01-04 06:24:12 6062 91300
2014-01-04 06:30:14 6062 84100
2014-01-04 06:36:17 6062 57000
df_b
:
Volume Value
Date
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
And I will put these into a dict
for the example. You read them directly into a dict
, so you don't need to do this step. I just want to show how my example dict
is formatted. The dict
keys
don't matter, any valid dict
key
will work:
dataList = {'a': df_a,
'b': df_b}
This gets us to where you currently are, with my dataList
hopefully having the same format as yours.
The first thing you need to do is to combine the DataFrames
. I use the dict
keys
as MultiIndex
column names so you can keep track of which instance of a given column came from which DataFrame
. You can do that like so:
df = pd.concat(dataList.values(), axis=1, keys=dataList.keys())
This gives you a DataFrame
like this:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.7
2014-01-04 05:58:53 NaN NaN 6062 4.7
2014-01-04 06:00:02 6062 108000 NaN NaN
2014-01-04 06:04:56 NaN NaN 6062 4.9
2014-01-04 06:06:05 6062 107200 NaN NaN
2014-01-04 06:10:58 NaN NaN 6062 5.1
2014-01-04 06:12:07 6062 97400 NaN NaN
2014-01-04 06:17:01 NaN NaN 6062 5.2
2014-01-04 06:18:10 6062 99200 NaN NaN
2014-01-04 06:23:03 NaN NaN 6062 5.2
2014-01-04 06:24:12 6062 91300 NaN NaN
2014-01-04 06:29:05 NaN NaN 6062 5.5
2014-01-04 06:30:14 6062 84100 NaN NaN
2014-01-04 06:35:08 NaN NaN 6062 5.5
2014-01-04 06:36:17 6062 57000 NaN NaN
Next, you need to interpolate to fill in the missing values. I interpolate using 'time'
mode
so it properly handles the time indexes:
df = df.interpolate('time')
This gives you a DataFrame
like this:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.700000
2014-01-04 05:58:53 NaN NaN 6062 4.700000
2014-01-04 06:00:02 6062 108000.000000 6062 4.738017
2014-01-04 06:04:56 6062 107352.066116 6062 4.900000
2014-01-04 06:06:05 6062 107200.000000 6062 4.938122
2014-01-04 06:10:58 6062 99267.955801 6062 5.100000
2014-01-04 06:12:07 6062 97400.000000 6062 5.119008
2014-01-04 06:17:01 6062 98857.851240 6062 5.200000
2014-01-04 06:18:10 6062 99200.000000 6062 5.200000
2014-01-04 06:23:03 6062 92805.801105 6062 5.200000
2014-01-04 06:24:12 6062 91300.000000 6062 5.257182
2014-01-04 06:29:05 6062 85472.375691 6062 5.500000
2014-01-04 06:30:14 6062 84100.000000 6062 5.500000
2014-01-04 06:35:08 6062 62151.239669 6062 5.500000
2014-01-04 06:36:17 6062 57000.000000 6062 5.500000
I think generally it would be best to stop here, since you keep all data from all csv
files. But you said you want only the time points from the longest csv
. To get that, you need to find the longest DataFrame
, and then get the rows corresponding to its indexes. Finding the longest DataFrame
is easy, you just find the one with the maximum length. Keeping only the time points in that index
is also easy, you just slice using that index
(you use the loc
method for slicing in this way).
longind = max(dataList.values(), key=len).index
df = df.loc[longind]
This gives you the following final DataFrame
:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.7
2014-01-04 05:58:53 NaN NaN 6062 4.7
2014-01-04 06:04:56 6062 107352.066116 6062 4.9
2014-01-04 06:10:58 6062 99267.955801 6062 5.1
2014-01-04 06:17:01 6062 98857.851240 6062 5.2
2014-01-04 06:23:03 6062 92805.801105 6062 5.2
2014-01-04 06:29:05 6062 85472.375691 6062 5.5
2014-01-04 06:35:08 6062 62151.239669 6062 5.5
This can be combined into one line if you want:
df = pd.concat(dataList.values(), axis=1, keys=dataList.keys()).interpolate('time').loc[max(dataList.values(), key=len).index]
Or, perhaps a slightly clearer 4 lines:
names = dataList.keys()
dfs = dataList.values()
longind = max(dfs, key=len).index
df = pd.concat(dfs, axis=1, keys=names).interpolate('time').loc[longind]
I am not sure why my final results are different than what you show. I ran your example in MATLAB
(R2015A) myself and got the same results as I get here, so I suspect you generated the final data with a different data set than the example.
Related:
python,regex,algorithm,python-2.7,datetime
If I knew the format in which a string represents date-time information, then I can easily use datetime.datetime.strptime(s, fmt). However, without knowing the format of the string beforehand, would it be possible to determine whether a given string contains something that could be parsed as a datetime object with the...
python,histogram,large-files
I have two arrays of data: one is a radius values and the other is a corresponding intensity reading at that intensity: e.g. a small section of the data. First column is radius and the second is the intensities. 29.77036614 0.04464427 29.70281027 0.07771409 29.63523525 0.09424901 29.3639355 1.322793 29.29596385 2.321502 29.22783249...
python,python-3.x,pyqt,pyqt4
I have the following 5 files: gui.py # -*- coding: utf-8 -*- from PyQt4 import QtCore, QtGui try: _fromUtf8 = QtCore.QString.fromUtf8 except AttributeError: def _fromUtf8(s): return s try: _encoding = QtGui.QApplication.UnicodeUTF8 def _translate(context, text, disambig): return QtGui.QApplication.translate(context, text, disambig, _encoding) except AttributeError: def _translate(context, text, disambig): return QtGui.QApplication.translate(context, text, disambig)...
python,syntax
Good afternoon, I am developing a script in python and while I am trying to compile it from the terminator/terminal i always get this error, but I cannot understand where is the syntax error? File "_case1.py", line 128 print ('########################') ^ SyntaxError: invalid syntax Then I just change the position...
python,bash,environment-variables
I can't access my env var: import subprocess, os print os.environ.get('PATH') # Works well print os.environ.get('BONSAI') # doesn't work But the env var is well added in my /home/me/.bashrc: BONSAI=/home/me/Utils/bonsai_v3.2 export BONSAI And I can access this env var from a new terminal....
python,similarity,locality-sensitive-hash
the concise python code i study for is here Question A @ line 8 i do not really understand the syntax meaning for "res = res << 1" for the purpose of "get_signature" Question B @ line 49 (SOLVED BY myself through another Q&A) "xor = r1^r2" does not really...
python,html,django,templates,django-1.4
I have the django template like below: <a href="https://example.com/url{{ mylist.0.id }}" target="_blank"><h1 class="title">{{ mylist.0.title }}</h1></a> <p> {{ mylist.0.text|truncatewords:50 }}<br> ... (the actual template is quite big) It should be used 10 times on the same page, but 'external' html elements are different: <div class="row"> <div class="col-md-12 col-lg-12 block block-color-1"> *django...
python,xml,view,odoo,add-on
I want to create a new view with a DB-view. When I try to install my app, DB-view was created then I get error: 2015-06-22 12:59:10,574 11988 ERROR odoo openerp.addons.base.ir.ir_ui_view: Das Feld `datum` existiert nicht Fehler Kontext: Ansicht `overview.tree.view` [view_id: 1532, xml_id: k. A., model: net.time.overview, parent_id: k. A.] 2015-06-22...
structure with python from this case?
python,python-2.7
How to remove "table" from HTML using python? I had case like this: paragraph = ''' <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br /> <table> <tr> <td> text title </td> <td> text title 2 </td> </tr> </table> <p> lorem ipsum</p> ''' how...
arrays,ruby,csv
I have a multidimensional array like this one : myArray = [["Alaska","Rain","3"],["Alaska","Snow","4"],["Alabama","Snow","2"],["Alabama","Hail","1"]] I would like to end up with CSV output like this. State,Snow,Rain,Hail Alaska,4,3,nil Alabama,2,nil,1 I know that to get this outputted to CSV the way I want it I have to have output array like this: outputArray =[["State","Snow","Rain","Hail"],["Alaska",4,3,nil],["Alabama",2,nil,1]]...
python,image-processing,imagej
I wrote a multilanguage 3-D image denoising ImageJ plugin that does some operations on an image and returns the denoised image as a 1-D array. The 1-D array contains NaN values (around the edges). The 1-D array is converted back into an image stack and displayed. It is simply black....
python,list
I am looking for an elegant solution for the following problem. I have a list of ints and I want to create a list of lists where the indices with the same value are grouped together in the order of the occurrences of said list. [2, 0, 1, 1, 3,...
python,python-2.7
I am making a TBRPG game using Python 2.7, and i'm currently making a quest system. I wanted to make a function that checks all of the quests in a list, in this case (quests), and tells you if any of of the quests in the list have the same...
python,spring-integration,jython
I'm trying to use Python with spring-integration and jython-standalone-2.7.0: Here is my application context: <int:inbound-channel-adapter id="in" channel="exampleChannel" > <int:poller fixed-rate="1000" /> <int-script:script lang="python" location="script/message.py" /> </int:inbound-channel-adapter> <int:channel id="exampleChannel" /> <int-ip:udp-outbound-channel-adapter id="udpOut" channel="exampleChannel" host="192.168.0.1" port="11111" /> Here is my script in Python: print "Python"...
python,amazon-web-services,boto
How can I assign a new IP address (or Elastic IP) to an already existing AWS EC2 instance using boto library.
python,python-2.7,parsing,csv
I have an email that comes in everyday and the format of the email is always the same except some of the data is different. I wrote a VBA Macro that exports the email to a text file. Now that it is a text file I want to parse the...
python,list,numpy,multidimensional-array
I have a list which contains 1000 integers. The 1000 integers represent 20X50 elements of dimensional array which I read from a file into the list. I need to walk through the list with an indicator in order to find close elements to each other. I want that my indicator...
javascript,python,ios,flask,twilio
I have created a simple twilio client application to make phone calls from Web Browser to phones. I used a sample Flask app to generate a secure Capability Token and used twilio.min.js library to handle calls from my HTML. The functionality works fine in Computer Browsers ans Android Phone Browsers,...
python,sql,matplotlib,plot
from sqlalchemy import create_engine import _mssql from matplotlib import pyplot as plt engine = create_engine('mssql+pymssql://**:****@127.0.0.1:1433/AffectV_Test') connection = engine.connect() result = connection.execute('SELECT Campaign_id, SUM(Count) AS Total_Count FROM Impressions GROUP BY Campaign_id') for row in result: print row connection.close() The above code generates an array: (54ca686d0189607081dbda85', 4174469) (551c21150189601fb08b6b64', 182) (552391ee0189601fb08b6b73', 237304) (5469f3ec0189606b1b25bcc0',...
python,automated-tests,robotframework
I have two variables: ${calculatedTotalPrice} = 42,42 ${productPrice1} = 43,15 I executed ${calculatedTotalPrice} Evaluate ${calculatedTotalPrice}+${productPrice1} I got 42,85,15 How can I resolve it?...
python,recursion
I'm trying to solve a puzzle, which is to reverse engineer this code, to get a list of possible passwords, and from those there should be one that 'stands out', and should work function checkPass(password) { var total = 0; var charlist = "abcdefghijklmnopqrstuvwxyz"; for (var i = 0; i...
python,module,python-module
I am coming from a Java background with Static variables, and I am trying to create a list of commonly used strings in my python application. I understand there are no static variables in python so I have written a module as follows: import os APP_NAME = 'Window Logger' APP_DATA_FOLDER_PATH...
python,numpy,kernel-density
Can one explain why after estimation of kernel density d = gaussian_kde(g[:,1]) And calculation of integral sum of it: x = np.linspace(0, g[:,1].max(), 1500) integral = np.trapz(d(x), x) I got resulting integral sum completely different to 1: print integral Out: 0.55618 ...
python,python-2.7,behavior
I am writing a simple function to step through a range with floating step size. To keep the output neat, I wrote a function, correct, that corrects the floating point error that is common after an arithmetic operation. That is to say: correct(0.3999999999) outputs 0.4, correct(0.1000000001) outputs 0.1, etc. Here's...
python,pandas
I have some tables where the first 11 columns are populated with data, but all columns after this are blank. I tried: df=df.dropna(axis=1,how='all') which didn't work. I then used: df = df.drop(df.columns[range(11,36)], axis=1) Which worked on the first few tables, but then some of the tables were longer or shorter...
python,sqlalchemy
I have a simple many-to-many relationship with associated table: with following data: matches: users: users_mathces: ONE user can play MANY matches and ONE match can involve up to TWO users I want to realize proper relationships in both "Match" and "User" classes users_matches_table = Table('users_matches', Base.metadata, Column('match_id', Integer, ForeignKey('matches.id', onupdate="CASCADE",...
python,c++,ctypes
I have a simple test function on C++: #include <stdio.h> #include <string.h> #include <stdlib.h> #include <locale.h> #include <wchar.h> char fun() { printf( "%i", 12 ); return 'y'; } compiling: gcc -o test.so -shared -fPIC test.cpp and using it in python with ctypes: from ctypes import cdll from ctypes import c_char_p...
python,python-2.7
Count function counting only last line of my list N = int(raw_input()) cnt = [] for i in range(N): string = raw_input() for j in range(1,len(string)): if string[j] =='K': cnt.append('R') elif string[j] =='R': cnt.append('R') if string[0] == 'k': cnt.append('k') elif string[0] == 'R': cnt.append('R') print cnt.count('R') if I am giving...
python,python-2.7,pandas,dataframes
I have the following dataframe,df: Year totalPubs ActualCitations 0 1994 71 191.002034 1 1995 77 2763.911781 2 1996 69 2022.374474 3 1997 78 3393.094951 I want to write code that would do the following: Citations of currentyear / Sum of totalPubs of the two previous years I want something to...
python,scikit-learn
I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates. The Situation I want to use logistic regression to do binary classification on a very unbalanced data set. The classes are labelled 0 (negative) and 1 (positive) and the observed data is in...
python,html,xpath,web-scraping,html-parsing
I want to get values from page: http://www.tabele-kalorii.pl/kalorie,Actimel-cytryna-miod-Danone.html I can get all values from first section, but I can't get values from table "Wartości odżywcze" I use this xpath: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]/span/text()")) But I'm not getting anything. With xpath like this: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]//text()")) I'm...
python,node.js,webserver
i'm working in a HTML5 multiplayer game, and i need a server to sync player's movement, chat, battles, etc. So I'm looking for ways to use python instead nodejs, because i have I have more familiarity with python. The server is simple: var express = require('express'); var app = express();...
python,html,css,django,url
First of all, this website that I'm trying to build is my first, so take it easy. Thanks. Anyway, I have my home page, home.html, that extends from base.html, and joke.html, that also extends base.html. The home page works just fine, but not the joke page. Here are some parts...
python,scikit-learn,tf-idf
I have code that runs basic TF-IDF vectorizer on a collection of documents, returning a sparse matrix of D X F where D is the number of documents and F is the number of terms. No problem. But how do I find the TF-IDF score of a specific term in...
python,python-2.7,error-handling,popen
Continuing from my previous question I see that to get the error code of a process I spawned via Popen in python I have to call either wait() or communicate() (which can be used to access the Popen stdout and stderr attributes): app7z = '/path/to/7z.exe' command = [app7z, 'a', dstFile.temp,...
python,windows,python-3.x
I'm attempting to learn python using the book 'a byte of python'. The code: import sys print('the command line arguments are:') for i in sys.argv: print(i) print('\n\nThe PYTHONPATH is', sys.path, '\n') outputs: the command line arguments are: C:/Users/user/PycharmProjects/helloWorld/module_using_sys.py The PYTHONPATH is ['C:\\Users\\user\\PycharmProjects\\helloWorld', 'C:\\Users\\user\\PycharmProjects\\helloWorld', 'C:\\Python34\\python34.zip', 'C:\\Python34\\DLLs', 'C:\\Python34\\lib', 'C:\\Python34', 'C:\\Python34\\lib\\site-packages']...
python,mongodb,pymongo
I want to insert a variable, say, a = {1:2,3:4} into my database with a particular id "56". It is very clear from the docs that I can do the following: db.testcol.insert({"_id": "56", 1:2, 3:4}) However, I cannot figure out any way to insert "a" itself, specifying an id. In...
python,numpy
A = np.array([[1,2,3],[3,4,5],[5,6,7]]) X = np.array([[0, 1, 0]]) for i in xrange(np.shape(X)[0]): for j in xrange(np.shape(X)[1]): if X[i,j] == 0.0: A = np.delete(A, (j), axis=0) I am trying to delete j from A if in X there is 0 at index j. I get IndexError: index 2 is out of...
python,python-3.4,cx-freeze
I have found two other articles about this problem on Stack Exchange but none of them has a clear answer: is it possible to create a .exe of a Python 3.4 script? The only solution I found was to use cx_Freeze. I used it, and it indeed created an executable...
python,user-interface,tkinter
I want to put an image in front of another one, then use this combined image as a button's background image in Tkinter. How can I do it? I am free to import Tkimage, Image. Clarify: I want to stick this on the center of this so that something like...
python,peewee
This is what I have: SomeTable.select.where(reduce(operator.or_, (SomeTable.stuff == entry for entry in big_list))) The problem arises when I have a relatively large list of elements in big_list and I get this: RuntimeError: maximum recursion depth exceeded Is there another way to approach this that doesn't involve splitting up the list...
python,scikit-learn,pipeline,feature-selection
Apologies if this is obvious but I couldn't find a clear answer to this: Say I've used a pretty typical pipeline: feat_sel = RandomizedLogisticRegression() clf = RandomForestClassifier() pl = Pipeline([ ('preprocessing', preprocessing.StandardScaler()), ('feature_selection', feat_sel), ('classification', clf)]) pl.fit(X,y) Now when I apply pl on a new set, pl.predict(X_classify); is RandomizedLogisticRegression going...
python,function,loops
I want to call the function multiple time and use it's returned argument everytime when it's called. For example: def myfunction(first, second, third): return (first+1,second+1,third+1) 1st call: myfunction(1,2,3) 2nd call is going to be pass returned variables: myfunction(2,3,4) and loop it until defined times. How can I do such loop?...
python,list
I'm using requests and the output I get from the sites API is a list, I've been stuck trying to parse it to get the data from it. I use r = requests.get(urlas, params=params) r.json() to get the data I want. Here is a snippet of the list [{'relation_type': None,...
python,collections
After reading the answers on this question How to count the frequency of the elements in a list? I was wondering how to count the frequency of something, and at the same time retreive some extra information, through something like an index. For example a = ['fruit','Item#001'] b = ['fruit','Item#002']...
python,replace,out-of-memory,large-files
I have a ~600MB Roblox type .mesh file, which reads like a text file in any text editor. I have the following code below: mesh = open("file.mesh", "r").read() mesh = mesh.replace("[", "{").replace("]", "}").replace("}{", "},{") mesh = "{"+mesh+"}" f = open("p2t.txt", "w") f.write(mesh) It returns: Traceback (most recent call last): File...
csv,hadoop,hive
I have a set of CSV files in a HDFS path and I created an external Hive table, let's say table_A, from these files. Since some of the entries are redundant, I tried creating another Hive table based on table_A, say table_B, which has distinct records. I was able to...
python,list,sorting,null
I have a list with dictionaries in which I sort them on different values. I'm doing it with these lines of code: def orderBy(self, col, dir, objlist): if dir == 'asc': sorted_objects = sorted(objlist, key=lambda k: k[col]) else: sorted_objects = sorted(objlist, key=lambda k: k[col], reverse=True) return sorted_objects Now the problem...
python,tkinter
I need to activate many entries when button is clicked please do not write class based code, modify this code only because i need to change the whole code for the project as i did my whole project without classes from Tkinter import * import ttk x='disabled' def rakhi(): global...