

Pandas append list to list of column names

python-2.7,pandas
I'm looking for a way to append a list of column names to existing column names in a DataFrame in pandas and then reorder them by col_start + col_add. The DataFrame already contains the columns from col_start. Something like: import pandas as pd df = pd.read_csv(file.csv) col_start = ["col_a", "col_b",...

multiple conditioned slicing (pandas dataframe)

python,pandas,object-slicing
I have a dataframe that has various columns and rows of data. I want to select all rows where the Year column = 2015 and Month column = 7. The following works: new_result.loc[new_result['Year'] == 2015,:].loc[new_result['Month'] == 7,:] However, is there a more elegant way to express the same thing? i.e....
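A minimal sketch of one common way to express both conditions at once, assuming the columns hold plain integers. The sample rows and the `Value` column are made up for illustration; only `new_result`, `Year` and `Month` come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the Year and Month column names come from the question.
new_result = pd.DataFrame({
    "Year": [2014, 2015, 2015, 2015],
    "Month": [7, 6, 7, 7],
    "Value": [10, 20, 30, 40],
})

# Combine both conditions into one boolean mask.
# Each comparison needs its own parentheses because & binds tighter than ==.
mask = (new_result["Year"] == 2015) & (new_result["Month"] == 7)
selected = new_result.loc[mask]

# The same selection written with DataFrame.query.
selected_q = new_result.query("Year == 2015 and Month == 7")
```

Both forms build a single mask, so the frame is indexed once instead of chaining two `.loc` calls.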

Transforming pandas data frame using stack function

python,pandas
I have the following pandas dataframe with me import pandas as pd import numpy as np pd.np.random.seed(1) N = 5 data = pd.DataFrame(pd.np.random.rand(N, 3), columns=['Monday', 'Wednesday', 'Friday']) data['State'] = 'ST' + pd.Series((pd.np.arange(N) % 19).astype(str)) print data Monday Wednesday Friday State 0 0.417022 0.720324 0.000114 ST0 1 0.302333 0.146756 0.092339 ST1...

Read a csv with numpy array using pandas

python,csv,numpy,pandas
I have a csv file with 3 columns emotion, pixels, Usage consisting of 35000 rows e.g. 0,70 23 45 178 455,Training. I used pandas.read_csv to read the csv file as pd.read_csv(filename, dtype={'emotion':np.int32, 'pixels':np.int32, 'Usage':str}). When I try the above, it says ValueError: invalid literal for long() with base 10: '70...

Pandas renumber unique occurrences

python,pandas
Given the following example: import pandas as pd data = pd.DataFrame({'ID' : [1, 1, 2, 4, 4, 4, 4, 4, 11, 11, 16, 17, 17, 19]}) >>> data ID 0 1 1 1 2 2 3 4 4 4 5 4 6 4 7 4 8 11 9 11 10...

How to split string from column to create long format dataframe

python,pandas,dataframes
If I have the dataframe shown below, how do I make a long format dataframe (I.e. one term per gene per row). I guess I will have to apply or map a split(",") to the Term column, but what do I do after that? import pandas as pd from StringIO...

Pandas Resampling error: Only valid with DatetimeIndex or PeriodIndex

python,pandas
When using pandas' resample function on a DataFrame in order to convert tick data to OHLCV, a resampling error is encountered. How should we solve the error? data = pd.read_csv('tickdata.csv', header=None, names=['Timestamp','Price','Volume']).set_index('Timestamp') data.head() # Resample data into 30min bins ticks = data.ix[:, ['Price', 'Volume']] bars = ticks.Price.resample('30min', how='ohlc') volumes =...
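The error in the title usually means the index still holds strings rather than timestamps. The sketch below assumes that is the cause here (the excerpt is truncated, so this is a guess) and uses made-up tick rows. It also uses the method-chaining form of `resample`, since the `how=` argument was removed from later pandas releases.

```python
import pandas as pd

# Hypothetical tick data; the column names follow the question.
data = pd.DataFrame({
    "Timestamp": ["2015-06-18 09:30:01", "2015-06-18 09:45:10",
                  "2015-06-18 10:05:00", "2015-06-18 10:20:30"],
    "Price": [100.0, 101.5, 99.0, 102.0],
    "Volume": [10, 20, 30, 40],
})

# Convert the string column to real timestamps before making it the index,
# so that the index becomes a DatetimeIndex.
data["Timestamp"] = pd.to_datetime(data["Timestamp"])
data = data.set_index("Timestamp")

# Resample into 30-minute bins.
bars = data["Price"].resample("30min").ohlc()
volumes = data["Volume"].resample("30min").sum()
```

When reading from a file, passing `parse_dates=['Timestamp']` to `pd.read_csv` is another way to get timestamps before calling `set_index`.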

Pandas remove null values when to_json

python,json,pandas
I have a pandas dataframe and I want to save it to JSON format. The pandas docs say: Note NaN's, NaT's and None will be converted to null and datetime objects will be converted based on the date_format and date_unit parameters. Then using the orient option records...

Resampling and merging data frame with python

python,csv,pandas,resampling,merging-data
Hi, I have created a dictionary of DataFrames with this code: import os import pandas import glob path="G:\my_dir\*" dataList={} for files in glob.glob(path): dataList[files]=(read_csv(files,sep=";",index_col='Date')) The different dataframes in the dictionary have different time samples. An example of dataFrame(A) is Date Volume Value 2014-01-04 06:00:02 6062 108000.0 2014-01-04 06:06:05 6062...

ValueError: invalid literal for float(): when inserted substring from “2015-05-21T18:11:55” into dataframe

python,pandas,dataframes
I have a key value pair in a JSON-derived dictionary that looks like this: u'local_start_time': u'2015-05-21T18:11:55.000Z' When I try to insert a portion of this string into a dataframe I get this error: File "fix_runs_prepare.py", line 63, in <module> df.set_value(i, name, str(g[name])[0:19]) File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1679, in set_value engine.set_value(series.values, index,...

Appending Boolean Column in Pandas Dataframe

python,pandas,ipython-notebook
I am learning pandas and got stuck with this problem here. I created a dataframe that tracks all users and the number of times they did something. To better understand the problem I created this example: import pandas as pd data = [ {'username': 'me', 'bought_apples': 2, 'bought_pears': 0}, {'username':...

How can I add columns in a data frame?

python,pandas,dataframes
I have the following data: Example: DRIVER_ID;TIMESTAMP;POSITION 156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346) I want to create a pandas dataframe with 4 columns that are the id, time, longitude, latitude. So far, I got: cur_cab = pd.DataFrame.from_csv( path, sep=";", header=None, parse_dates=[1]).reset_index() cur_cab.columns = ['cab_id', 'datetime', 'point'] path specifies the .txt file containing the...

Pandas: break categorical column to multiple columns

python,indexing,pandas
Imagine a Pandas dataframe of the following format: id type v1 v2 1 A 6 9 1 B 4 2 2 A 3 7 2 B 3 6 I would like to convert this dataframe into the following format: id A_v1 A_v2 B_v1 B_v2 1 6 9 4 2 2...
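One possible approach, sketched under the assumption that each `(id, type)` pair occurs only once: pivot on `type`, then flatten the resulting two-level column index into `type_value` names. The four sample rows are the ones shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "type": ["A", "B", "A", "B"],
    "v1": [6, 4, 3, 3],
    "v2": [9, 2, 7, 6],
})

# Pivot so each type becomes its own set of columns.
# The columns are now a MultiIndex of (value_name, type) pairs.
wide = df.pivot(index="id", columns="type", values=["v1", "v2"])

# Flatten the MultiIndex into names such as "A_v1".
wide.columns = [f"{t}_{v}" for v, t in wide.columns]

# Order the columns by type, then bring id back as a regular column.
wide = wide[sorted(wide.columns)].reset_index()
```

If `(id, type)` pairs can repeat, `pivot` raises an error; `pivot_table` with an explicit aggregation function would be needed instead.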

Extracting only a particular set of data from a URL with an Excel File

python,excel,pandas
I am looking to gather all the data from the penultimate worksheet in this Excel file along with all the data in the last Worksheet from "Maturity Years" of 5.5 onward. My code is below. I am currently just combining all the data from both sheets. I don't know how...

Replacing Strings in Column of Dataframe with the number in the string

python,pandas
I currently have a dataframe as follows and all I want to do is just replace the strings in Maturity with just the number within them. For example, I want to replace FZCY0D with 0 and so on. Date Maturity Yield_pct Currency 0 2009-01-02 FZCY0D 4.25 AUS 1 2009-01-05 FZCY0D...
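A minimal sketch using a regular expression to pull the digits out of each string. It assumes every `Maturity` value contains exactly one run of digits; only `FZCY0D` comes from the question, the other sample values are invented.

```python
import pandas as pd

# FZCY0D is from the question; the other values are hypothetical.
df = pd.DataFrame({"Maturity": ["FZCY0D", "FZCY1D", "FZCY10D"]})

# Capture the first run of digits in each string and convert it to an integer.
df["Maturity"] = df["Maturity"].str.extract(r"(\d+)", expand=False).astype(int)
```

Rows without any digits would produce NaN after `str.extract`, and the `astype(int)` call would then fail, so such rows need handling first.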

pandas: summing over multiple columns

pandas
Consider this dataframe: STUDENT T_1 T_2 T_3 T_4 0 A PASS FAIL PASS FAIL 1 B PASS FAIL FAIL FAIL 2 C FAIL FAIL PASS PASS 3 D PASS FAIL PASS PASS The columns T_1 -> T_4 represent tests. In this case, T_1 and T_3 are tests of type 'X',...

Principal component analysis using sklearn and pandas

python,pandas,scikit-learn,pca,principal-components
I have tried to reproduce the results from the PCA tutorial on here (PCA-tutorial) but I've got some problems. From what I understand I am following the steps to apply PCA as they should be. But my results are not similar with the ones in the tutorial (or maybe they...

Pandas add multiple empty columns to DataFrame

python,pandas
This may be a stupid question, but how do I add multiple empty columns to a DataFrame from a list? I can do: df["B"] = None df["C"] = None df["D"] = None But I can't do: df[["B", "C", "D"]] = None KeyError: "['B' 'C' 'D'] not in index" ...
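One way to add several empty columns in a single step is `reindex`, sketched below. The starting frame with column `A` is an assumption; the new column names come from the question. Note that the new columns are filled with NaN rather than None.

```python
import pandas as pd

# Hypothetical starting frame.
df = pd.DataFrame({"A": [1, 2, 3]})
new_cols = ["B", "C", "D"]

# reindex keeps existing columns and creates the missing ones filled with NaN.
df = df.reindex(columns=df.columns.tolist() + new_cols)
```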

Adding rows per group in pandas / ipython if per group a row is missing

pandas,append
I have a dataframe that contains for each group the number of observations during a certain period. Some groups don't contain all periods, and for these groups I want to append x rows with the missing periods in it. So that each group has a row for all 6 periods...

pandas: optimizing my code (groupby() / apply())

pandas
I have a dataframe of shape (RxC) 1.5M x 128. I do the following: I do groupby() based on 6 columns. This creates ~8700 sub-groups each of shape 538 x 122. On each sub-group, I run apply(). This function computes the % frequency of each categorical value PER column (i.e.,...

Python Pandas: select rows based on comparison across rows

python,indexing,pandas
In the dataframe below, the first column is the index with occasional non-unique values. | | col1 | |---|------| | A | 120 | | A | 90 | | A | 80 | | B | 80 | | B | 50 | | C | 120 | |...

Selecting Data from Last Week in Python

python,datetime,pandas,format,dataframes
I have a large database and I am looking to read only the last week for my python code. My first problem is that the column with the received date and time is not in the format for datetime in pandas. My input (Column 15) looks like this: recvd_dttm 1/1/2015...

How to avoid the error “TypeError: invalid data type for einsum” in Python

python,python-2.7,numpy,pandas,machine-learning
I am trying to load a CSV file into a numpy array and use the array in LogisticRegression etc. I am now struggling with the error shown below: import numpy as np import pandas as pd from sklearn import preprocessing from sklearn.linear_model import LogisticRegression dataset = pd.read_csv('../Bookie_test.csv').values X = dataset[1:, 32:34] y = dataset[1:,...

How to create series of pandas dataframe by iteration

python,loops,pandas
I want to create df_2008 to df_2014 from an original df by iteration. df has columns named '2008' to '2014' and I want to separate them into different dfs. I tried for i in range(2008, 2015): 'df_'+str(i)=df[str(i)] which won't work. I would really appreciate it if anyone could help me....
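Assigning to a string such as `'df_' + str(i)` is not valid Python. A common alternative, sketched here with invented data, is to keep the per-year frames in a dictionary keyed by year; the name `frames` is hypothetical.

```python
import pandas as pd

# Hypothetical frame with one column per year, named '2008' to '2014'.
df = pd.DataFrame({str(y): [y, y + 0.5] for y in range(2008, 2015)})

# Store one single-column DataFrame per year in a dictionary.
frames = {year: df[[str(year)]] for year in range(2008, 2015)}

# Look up an individual year when needed.
df_2008 = frames[2008]
```

A dictionary keeps the frames addressable by year without creating variable names dynamically.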

ValueError when converting string to integer in Dataframe

python,pandas
I am trying to replace the strings in the Years column of the Dataframe below with just the numbers in the string. For example, I would like to change ZC025YR to 025. My code is as follows: import urllib, urllib2 import csv from StringIO import StringIO import pandas as pd...

pass pandas dataframe as parameter in mysql query

python,mysql,pandas
I have a pandas dataframe df that looks something like: df = pd.DataFrame({'SEC1':['IBM','CSCO','MSFT','AMZN' ], 'SEC2':['GOOG', 'INTC', 'ABX', 'CREE'], 'HOUR':[10 ,10 ,15, 12], 'Size':[100 ,200 ,50 ,500],'Price':[300 ,25 ,150, 80] }) df = df[['SEC1', 'SEC2', 'HOUR', 'Size', 'Price']] I have a large mysql table (name=Table-B) which I want to do a...

Pandas Write CSV - Append vs. Write

python,csv,pandas
I would like to use df.to_csv to write "filename" (with headers) if "filename" doesn't exist, otherwise to append to "filename" if it exists. If I simply use the command: df.to_csv('filename.csv',mode = 'a',header ='column_names') the write or append succeeds, but it seems like the header is written every time an append takes...
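A minimal sketch of one way to write the header only once: check whether the file already exists and pass the result to `header`. The helper name `append_csv`, the temporary path and the sample frame are all illustrative assumptions.

```python
import os
import tempfile

import pandas as pd


def append_csv(df, path):
    """Append df to path, writing the header only if the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)


# Hypothetical usage with a temporary file.
path = os.path.join(tempfile.mkdtemp(), "filename.csv")
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

append_csv(df, path)  # creates the file and writes the header
append_csv(df, path)  # appends rows only

result = pd.read_csv(path)
```

The existence check and the write are separate steps, so this sketch is not safe if several processes append to the same file concurrently.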

Pandas Series of lists to one series

python,string,list,pandas,series
I have a Pandas Series of lists of strings: 0 [slim, waist, man] 1 [slim, waistline] 2 [santa] As you can see, the lists vary by length. I want an efficient way to collapse this into one series 0 slim 1 waist 2 man 3 slim 4 waistline 5 santa...
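Two possible ways to flatten such a Series, using the sample values from the question. The first is a plain list comprehension. The second relies on `Series.explode`, which only exists in pandas 0.25 and later.

```python
import pandas as pd

s = pd.Series([["slim", "waist", "man"], ["slim", "waistline"], ["santa"]])

# Option 1: flatten with a nested list comprehension.
flat = pd.Series([item for sublist in s for item in sublist])

# Option 2: explode puts each list element on its own row (pandas >= 0.25),
# then the index is rebuilt so that it runs 0..n-1.
flat_exploded = s.explode().reset_index(drop=True)
```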

Is it possible to specify the order of levels in Pandas factorize method?

python,numpy,pandas
I am using pandas to factorize an array consisting of two types of strings. I want to make sure that one of the strings "XYZ" is always coded as a 0 and the other string "ABC" is always coded as 1. Is it possible to do this? I looked up...

Using Python to find correlation pairs

python,pandas,machine-learning,data-mining
NAME PRICE SALES VIEWS AVG_RATING VOTES COMMENTS Module 1 $12.00 69 12048 5 3 26 Module 2 $24.99 12 52858 5 1 14 Module 3 $10.00 1 1381 -1 0 0 Module 4 $22.99 46 57841 5 8 24 ................. So, Let's say I have statistics of sales. I...

Converting multiple columns to categories in Pandas. apply?

python,pandas
Consider a Dataframe. I want to convert a set of columns to_convert to categories. I can certainly do the following: for col in to_convert: df[col] = df[col].astype('category') but I was surprised that the following does not return a dataframe: df[to_convert].apply(lambda x: x.astype('category'), axis=0) which of course makes the following not...
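One loop-free option is to pass a dictionary of target dtypes to `DataFrame.astype`, as sketched below. The sample frame is invented; `to_convert` is the name used in the question.

```python
import pandas as pd

# Hypothetical frame; only to_convert comes from the question.
df = pd.DataFrame({
    "a": ["x", "y", "x"],
    "b": ["u", "u", "v"],
    "n": [1, 2, 3],
})
to_convert = ["a", "b"]

# astype accepts a mapping of column name -> dtype and leaves other columns alone.
df = df.astype({col: "category" for col in to_convert})
```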

Dropping Dataframe rows based on name

python,pandas
I have the following dataframe df where I am trying to drop all rows having curv_typ as PYC_RT or YCIF_RT. curv_typ maturity bonds 2015M06D19 2015M06D18 2015M06D17 \ 0 PYC_RT Y1 GBAAA -0.24 -0.25 -0.23 1 PYC_RT Y1 GBA_AAA -0.05 -0.05 -0.05 2 PYC_RT Y10 GBAAA 0.89 0.92 0.94 My code...

Move given row to end of DataFrame

python,pandas,dataframes,concat
I would like to take a given row from a DataFrame and prepend or append to the same DataFrame. My code below does just that, but I'm not sure if I'm doing it the right way or if there is an easier, better, faster way? testdf = df.copy() #get row...

Concatenate a list of series into a uid

python,python-2.7,pandas,py.test
I have a Pandas data frame with several columns that together make up a unique identifier. I want to write a generic test case that allows me to concatenate those columns together into a single column (uid) and test that column for uniqueness. I have the following code as a...

Locating merged cell ranges in openpyxl

python,pandas,openpyxl
I'm working on extracting some data from a .xlsx file using openpyxl and Pandas. I can't find a cell property (or indeed any other information) that indicates which cells are merged in the spreadsheets. How do I know which cells are merged together?...

Restructuring Dataframe in Python

python,pandas,dataframes
I have gathered data from the penultimate worksheet in this Excel file along with all the data in the last Worksheet from "Maturity Years" of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe such that it has the following columns and...

Python Pandas filter rows based on the string value of an entry

python,pandas,bloomberg
I have an excel sheet (Bloomberg Data License output) I read in with import pandas as pd raw_data = pd.read_excel('my-file.xlsx') There is one column (START-OF-FILE) and a varying number rows, depending on the amount of data returned. I am interested in the data between two rows, specifically START-OF-DATA and END-OF-DATA....

Pandas select and write rows that contain certain text

python,pandas
I want to keep only rows in a dataframe that contain specific text in column "col". In this example either "WORD1" or "WORD2". df = df["col"].str.contains("WORD1|WORD2") df.to_csv("write.csv") This returns True or False. But how do I make it write entire rows that match these criteria, not just present the boolean?...
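`str.contains` returns a boolean Series, so it can be used as a mask to index the original frame rather than replacing it. A minimal sketch with invented rows; the `other` column is hypothetical.

```python
import pandas as pd

# Hypothetical data; the column name "col" and the search words follow the question.
df = pd.DataFrame({
    "col": ["has WORD1 here", "nothing", "WORD2 at start", None],
    "other": [1, 2, 3, 4],
})

# Build the boolean mask; na=False treats missing values as non-matches.
mask = df["col"].str.contains("WORD1|WORD2", na=False)

# Index the frame with the mask to keep whole rows.
matches = df[mask]
```

Calling `matches.to_csv("write.csv")` afterwards would write the full matching rows instead of the True/False values.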

pandas dataframe drop columns by number of nan

python,pandas
I have a dataframe with some columns containing nan. I'd like to drop those columns with certain number of nan. For example, in the following code, I'd like to drop any column with 2 or more nan. In this case, column 'C' will be dropped and only 'A' and 'B'...
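A minimal sketch that counts NaN values per column and keeps only the columns below the limit. The sample frame is an assumption built to match the description (column `C` has two NaN values), and `max_nans` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which only column C has 2 or more NaN values.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [1, np.nan, 3, 4],
    "C": [np.nan, np.nan, 3, 4],
})

max_nans = 2  # drop any column with this many NaN values or more

# isnull().sum() gives the NaN count per column; keep columns under the limit.
kept = df.loc[:, df.isnull().sum() < max_nans]
```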

pandas does not free memory?

python,pandas
I do have a dataframe with 40000 rows and three columns. If I do df.set_index('my_ind').head() repeatedly in different cells in my ipython notebook the RAM is being filled but not freed: I thought this function just returns a view of the dataframe. gc.collect() did not free any RAM. Any ideas...

How to stack data frames on top of each other in Pandas

python,pandas,dataframes
I have a dataframe with 96 columns: df.to_csv('result.csv') out (excel): Run 1 Run 2 Run 3 Run 4 Run 5 Run 6 Run 7 Run 8 Run 9 Run 10 Run 11 Run 12 Run 13 Run 14 Run 15 Run 16 Run 17 Run 18 Run 19 Run 20...

Distances between coordinate pairs in pandas

python,pandas,scipy
What is the best way to find the number of points (rows) that are within a distance of a given point in this pandas dataframe: x y 0 2 9 1 8 7 2 1 10 3 9 2 4 8 4 5 1 1 6 2 3 7 10...

ordering dataframe according to a list, matching one of the columns in pandas

python,pandas
I have a list of values. I have a dataframe where one of the columns contains the same values. How do I order the dataframe the same way the list is ordered? list example: [3, 4, ... 21, 23, 25, 26] dataframe example: 0 1 2 0 ND 3 0/1 1 ND 4...

Is there a concise way to show all rows in pandas for just the current command?

python,pandas
Sometimes I want to show all of the rows in a pandas DataFrame, but only for a single command or code-block. Of course I can set the "max_rows" display option to a large number, but then I have to repeat the command afterwards in order to revert to my preferred...
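`pd.option_context` sets an option only for the duration of a `with` block and restores the previous value afterwards. A minimal sketch with an invented 100-row frame:

```python
import pandas as pd

# Hypothetical frame that is long enough to be truncated by default.
df = pd.DataFrame({"x": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# Inside the block, max_rows=None means "show every row".
with pd.option_context("display.max_rows", None):
    full_text = repr(df)

# After the block the option is back to its previous value.
restored = pd.get_option("display.max_rows")
```

In a notebook, `print(df)` or `display(df)` inside the `with` block would show all rows for that one statement only.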

Ignoring future dates in python

python,datetime,pandas,dataframes
I have a large database and I am looking to read only the last week for my python code. However, somebody made a typo in the database so there is a date in the future that is throwing everything off. Input: recvd_dttm 6/5/2015 18:28:50 PM 6/5/2015 14:25:43 PM 9/10/2015 21:45:12...

conditional replace based off prior value in same column of pandas dataframe python

python,pandas,replace,fill,calculated-columns
Feel like I've looked just about everywhere and I know its probably something very simple. I'm working with a pandas dataframe and looking to fill/replace data in one of the columns based on data from that SAME column. I'm typically more of an excel guy and it is sooo simple...

How to efficiently extract content from an xml with python?

python,xml,python-2.7,pandas,lxml
I have the following xml: <?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23"> <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document> <document><![CDATA[Ugh ]]></document> <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document> <document><![CDATA[@username Shout out to me???? ]]></document> </author> What is the most efficient...

How to add legends and title to grouped histograms generated by Pandas

pandas,matplotlib
I am trying to plot a histogram of multiple attributes grouped by another attribute, all of them in a dataframe. With the help of this question, I am able to set a title for the plot. Is there an easy way to switch on the legend for each subplot? Here is my...

How to build custom pandas.tseries.offsets class?

python,pandas,matplotlib,datetimeoffset
I want to find a way to build a custom pandas.tseries.offsets class at 1 second frequency for trading hours. The main requirement here is that the time offset object would be smart enough to know the next second of '2015-06-18 16:00:00' would be '2015-06-19 09:30:00 or 09:30:01', and the time...

Defining a function (pandas)

python,pandas
This works already, but I want to optimize a bit: df['Total Time'] = df['Total Time'].str.split(':').apply(lambda x: (int(x[0])*60.0) + int(x[1]) + (int(x[2]) / 60.0)) I am taking a timestamp (string) in Excel which represents Hours:Minutes:Seconds and turning it into a float which represents minutes. This is easier for me to play...
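One vectorized alternative to the `split`/`apply` approach is `pd.to_timedelta`, sketched below. It assumes every value is a well-formed `Hours:Minutes:Seconds` string; the sample values are invented.

```python
import pandas as pd

# Hypothetical Hours:Minutes:Seconds strings.
df = pd.DataFrame({"Total Time": ["1:30:30", "0:05:00", "10:00:15"]})

# Parse the strings as durations, then express each duration in minutes.
df["Total Time"] = pd.to_timedelta(df["Total Time"]).dt.total_seconds() / 60.0
```

This avoids calling a Python lambda once per row, which tends to matter on large frames.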

Data type using Pandas

python,pandas
If I ftp into a database and use pandas.read_sql to read in a huge file, what data type would the variable set equal to this be? And, if applicable, what kind of format would it be in? What object type is a pandas data frame?

pandas remove observations depending on multi-index level value

python,pandas,multi-index
I have a multi-index data frame with levels 'id' and 'year': value id year 10 2001 100 2002 200 11 2001 110 12 2001 200 2002 300 13 2002 210 I want to keep the ids that have values for both years 2001 and 2002. This means I want to...
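A minimal sketch using `groupby(...).filter` to keep only the ids that have a row for every required year. The data are the values shown in the question; the name `required` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [100, 200, 110, 200, 300, 210]},
    index=pd.MultiIndex.from_tuples(
        [(10, 2001), (10, 2002), (11, 2001),
         (12, 2001), (12, 2002), (13, 2002)],
        names=["id", "year"],
    ),
)

required = {2001, 2002}

# Keep a group only if its 'year' level contains every required year.
both = df.groupby(level="id").filter(
    lambda g: required.issubset(g.index.get_level_values("year"))
)
```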

Create a pandas df column from last value in another column for which a third column is 1

pandas
Say, I have a DataFrame like this: import pandas as pd import numpy as np df = pd.DataFrame({'data' : np.arange(10), 'trigger' : np.random.randint(0,2, size=10)}) I'd like to get a third column which contains in row i that value of column 'data' with the greatest index smaller than i for which...

pandas python: adding blank columns to df

python,pandas,insert,dataframes,col
I'm trying to add x number of blank columns to a dataframe. Here is my function: def fill_with_bars(df, number=10): ''' Add blank, empty columns to dataframe, at position 0 ''' numofcols = len(df.columns) while numofcols < number: whitespace = '' df.insert(0, whitespace, whitespace, allow_duplicates=True) whitespace += whitespace return df but...

Pandas add column to df based on list of regex patterns

python,regex,pandas
I have a dataframe that looks like this: Sentence bin_class "i wanna go to sleep. too late to take seroquel." 1 "Adam and Juliana are leaving me for 43 days take me with youuuu!" 0 And I also have a list of regex patterns I want to use on these...

filtering a dataframe after groupby in pandas

python,pandas,group-by
I have the following dataframe: In [4]: df Out[4]: Symbol Date Strike C/P Bid Ask 0 GS 6/15/2015 200 c 5 72 1 GS 6/15/2015 200 p 5 72 2 GS 6/15/2015 210 c 15 0 3 GS 6/15/2015 210 p 15 54 4 GS 7/15/2015 200 c 20 50...

Dropping Columns in a Dataframe based on if they have a particular letter in the title

python,pandas
Is there a way to drop columns in a Dataframe with column names having a particular letter as I wasn't able to find any information on this? I currently have the following code, which creates a dataframe that look as follows: dates BETA0 BETA1 BETA2 BETA3 SVEN1F01 \ 0 2015-06-17...

Formatting a Pivot Table in Python

python,sorting,pandas,format,dataframes
I am trying to reformat a table based on counts in different columns. df = pd.DataFrame({'Number': [1, 2, 3, 4, 5], 'X' : ['X1', 'X2', 'X3', 'X3', 'X3'], 'Y' : ['Y2','Y1','Y1','Y1', 'Y2'], 'Z' : ['Z3','Z1','Z1','Z2','Z1']}) Number X Y Z 0 1 X1 Y2 Z3 1 2 X2 Y1 Z1 2...

Convert subset of columns to list

python,pandas
I have a pivoted Pandas DataFrame with the following columns: month | day | hour | a | b | c | d | e | f | g ... z 1 1 1 3 9 0 9 0 3 3 9 What is the most efficient way to turn...

How to change multi index into flat column names

python,pandas
I have this data frame: import pandas as pd df = pd.DataFrame(data={'Status' : ['green','green','red','blue','red','yellow','black'], 'Group' : ['A','A','B','C','A','B','C'], 'City' : ['Toronto','Montreal','Vancouver','Toronto','Edmonton','Winnipeg','Windsor'], 'Sales' : [13,6,16,8,4,3,1]}) df.drop('Status',axis=1,inplace=True) ndf = pd.pivot_table(df,values=['Sales'],index=['City'],columns=['Group'],fill_value=0,margins=False) That looks like this: In [321]: ndf Out[321]: Sales Group A B C City Edmonton 4 0 0 Montreal 6 0 0 Toronto...

How can I change the color of a grouped bar plot in Pandas?

python,pandas,matplotlib
I have this plot that you'll agree is not very pretty. Other plots I made so far had some color and grouping to them out of the box. I tried manually setting the color, but it stays black. What am I doing wrong? Ideally it'd also cluster the same tags...

pandas parse dates from csv

parsing,datetime,pandas
I am trying to read a csv file which includes dates. The csv looks like this: h1,h2,h3,h4,h5 A,B,C,D,E,20150420 A,B,C,D,E,20150420 A,B,C,D,E,20150420 For reading the csv I use this code: df = pd.read_csv(filen, index_col=None, header=0, parse_dates=[5], date_parser=lambda t:parse(t)) The parse function looks like this: def parse(t): string_ = str(t) try: return datetime.date(int(string_[:4]),...

Pandas logical indexing on a single column of a dataframe to assign values

python,r,pandas,dataframes
I am an R programmer and looking for a similar way to do something like this in R: data[data$x > value, y] <- 1 (basically, take all rows where the x column is greater than some value and assign the y column at those rows the value of 1) In...
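The closest pandas counterpart to the R expression is `.loc` with a boolean row mask and a column label. A minimal sketch with invented data; `value` is a hypothetical threshold.

```python
import pandas as pd

# Hypothetical data.
data = pd.DataFrame({"x": [1, 5, 10], "y": [0, 0, 0]})
value = 4

# Rows where x > value, column y: assign 1 in place.
data.loc[data["x"] > value, "y"] = 1
```

Using a single `.loc` call with both the row mask and the column avoids the chained-assignment pitfalls of `data[data["x"] > value]["y"] = 1`.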

Alternatives to count and know what columns have missing values in Pandas

python,pandas,missing-data
I tried this, but I'm not sure if this is the best way to get the information about columns with missing values. For example, I use the target labels to reduce information over missing values and see much better its distribution cols = dataframe.columns.values.tolist() dfnas = pd.DataFrame() for col in...

Plotting multiple time series after a groupby in pandas

python,pandas,group-by,time-series
Suppose I made a groupby on the valgdata DataFrame like below: grouped_valgdata = valgdata.groupby(['news_site','dato_uden_tid']).mean() Now I get this: sentiment news_site dato_uden_tid dr.dk 2015-06-15 54.777183 2015-06-16 54.703167 2015-06-17 54.948775 2015-06-18 54.424881 2015-06-19 53.290554 eb.dk 2015-06-15 53.279251 2015-06-16 53.285643 2015-06-17 53.558753 2015-06-18 52.854750 2015-06-19 54.415988 jp.dk 2015-06-15 56.590428 2015-06-16 55.313752 2015-06-17 53.771377...

Pandas sql update efficiently

python,database,pandas
I am using python pandas to load data from a MySQL database, change it, then update another table. There are 100,000+ rows so the UPDATE queries take some time. Is there a more efficient way to update the data in the database than to use the df.iterrows() and run an...

Pandas DataFrame: Delete specific date in all leap years

python,select,pandas,leap-year
The following sequence is an extract of the pandas DataFrame that I've got: >>> df_t value 2011-01-31 -5.575000 2011-03-31 7.700000 2011-05-31 15.966667 2011-07-31 10.683333 2011-08-31 10.454167 2011-10-31 9.320833 2011-12-31 -0.358333 2012-01-31 -11.550000 2012-03-31 1.700000 2012-05-31 12.333333 2012-07-31 12.816667 2012-08-31 11.837500 2012-10-31 2.733333 2012-12-31 4.075000 2013-01-31 2.450000 2013-03-31 -4.262500 2013-05-31 11.491667...

pandas find max value in groupby and apply function

python,pandas
I've got a dataframe df like the following: H,Nu,City 1,15,Madrid 3,15,Madrid 3,1600,Madrid 5,17615,Madrid 2,55,Dublin 4,5706,Dublin 2,68,Dublin 1,68,Dublin I would like to find the max value / city of the Nu column. Then find the corresponding values of H and add a new column df['H2'] = df['H']/max(H/city). So far I tried:...

Pandas write variable number of new rows from list in Series

python,pandas
I'm using Pandas as a way to write data from Selenium. Two example results from a search box ac_results on a webpage: #Search for product_id = "01" ac_results = "Orange (10)" #Search for product_id = "02" ac_result = ["Banana (10)", "Banana (20)", "Banana (30)"] Orange returns only one price ($10)...

Using a subset of Pandas dataframe with Scipy Kmeans?

python,pandas,scipy
I have a data frame that I import using df = pd.read_csv('my.csv',sep=','). In that CSV file, the first row is the column name, and the first column is the observation name. I know how to select a subset of the Panda dataframe, using: df.iloc[:,1::] which gives me only the numeric...

Best Way to add group totals to a dataframe in Pandas

python,pandas
I have a simple task that I'm wondering if there is a better / more efficient way to do. I have a dataframe that looks like this: Group Score Count 0 A 5 100 1 A 1 50 2 A 3 5 3 B 1 40 4 B 2 20...

Merge multiple pandas columns into new column

python,pandas,analysis
I have a dataframe where some of the columns indicate whether or not a set of survey questions was seen. For example: Q1_Seen Q2_Seen Q3_Seen Q4_Seen Q1a nan nan nan nan Q2a nan nan nan nan Q3d nan nan Q2c nan nan I would like to collapse these columns into...

Beatbox: How do I add a WHERE clause when pulling data from SFDC?

python,pandas,salesforce,beatbox
In Pandas, I am creating a dataframe that merges data from two different Beatbox queries. First, I pull all my Opportunity data, then I pull all my Account data, and then I merge. However I would like to optimize this process by only pulling data for account['ID'] that exists in...

Iteratively change every cell in a column of a Pandas dataframe

python,pandas,dataframes
I am trying to change the value of every cell in a Pandas data frame. Expecting .loc to allow me to identify a cell with the paradigm df.loc[row_index, column_name] = cell value, I've used the following loop: table["field"] = 6 #placeholder value used only to create the column for field,...

matplotlib mean interval plot

python,pandas,matplotlib,plot
I'm transitioning from R to python, and was looking to plot the mean line of two variables. In this plot the x variable is split into intervals for the x axis, and the mean of the y variable is shown on the y axis. For example, if I have 1000...

Python: Using pandas to import csv. Trying to plot a column but gives me an error saying “no numerical data to plot”

python,csv,pandas,plot
I'm trying to read the following csv file with pandas and plot a column of it: data type,approved mining area,mined area,coal content,earth rate,coal rate,waste ratio unit,ha,ha,Mt,Mm3/a,Mt/a, Garzweiler,11400,3096,1246,140,37.5,4.4 Hambach,8500,4224,1500,275,40,5.2 Inden,4500,1655,358,87.5,22.5,3.6 which gives me the following (only a part so it fits here): data type approved mining area mined area coal content earth rate...

Performing arithmetic on partially known columns names

python,pandas
I would like to perform some arithmetic calculations on columns, where I know only first character (number), which is common for some columns. As an output I would need to create another data frame with the names that include the same character (number). For example. I have a df1 with...

Pandas/Python Combine two data frames with duplicate rows

python,pandas
OK, this seems like it should be easy to do with merge or concatenate operations but I can't crack it. I'm working in pandas. I have two dataframes with duplicate rows between them and I want to combine them in a manner where no rows or columns are duplicated....

getting all corresponding max values in pandas pivot table

python,pandas
I have the following dataframe (pandas version 0.13.1) >>> import pandas as pd >>> DF = pd.DataFrame({'Group':['G1','G1','G2','G2'],'Start':['10','10','12','13'],'End':['13','13','14','15'],'Sample':['S1','S2','S3','S3'],'Status':['yes','yes','no','yes'],'pValue':[0.13,0.12,0.96,0.76],'pValueString':['13/100','12/100','96/100','76/100'],'desc':['aaaaaa','bbbbbb','aaaaaa','cccccc']}) >>> DF End Group Sample Start Status pValue pValueString desc 0 13 G1 S1 10 yes 0.13 13/100 aaaaaa 1 13 G1 S2 10 no 0.12 12/100...

How do I copy a row from one pandas dataframe to another pandas dataframe?

python,python-2.7,pandas,dataframes
I have a dataframe of data that I am trying to append to another dataframe. I have tried various ways with .append() and there has been no successful way. When I print the data from iterrows. I provide 2 possible ways I tried to solve the issue below, one creates...

KeyError when using melt to restructure Dataframe

python,pandas
I have a dataframe that currently looks as follows and has 2628 rows and 101 columns. I want to convert the years row which is associated with the numbers 0.08333 0.16666 0.249999 and so on, into a column: years Currency 0.08333333 0.16666666 0.24999999 0.33333332 \ 2005-01-04 GBP 4.709456 4.633861 4.586271...

Beatbox: Possible to add condition to query when pulling SFDC data?

python,pandas,salesforce,beatbox
In Pandas, I want to pull Opportunity data with CreatedDate >= 1/1/2015. Currently, I am extracting all Opportunity data before filtering for CreatedDate. Is it possible to optimize this process by adding the CreatedDate condition to the query? Current State: query_result = service.query("SELECT ID, CreatedDate FROM Opportunity") records = query_result['records']...

Pandas groupby category, rating, get top value from each category?

python,pandas,dataframes
First question on SO, very new to pandas and still a little shaky on the terminology: I'm trying to figure out the proper syntax/sequence of operations on a dataframe to be able to group by column B, find the max (or min) corresponding value for each group in column C,...

Capping values after a trigger level in a different variable _after GroupBy

pandas,triggers,group-by
There was an elegant answer to a question almost like this provided by EdChum. The difference between that question and this is that now the capping needs to be applied to data that had had "GroupBy" performed. Original Data: Symbol DTE Spot Strike Vol AAPL 30.00 100.00 80.00 14.58 AAPL...

Trying to create a new dataframe based on internal sums of a column from another dataframe using Python/pandas

python,indexing,pandas,sum,dataframes
Let's assume I have a pandas dataframe df as follow: df = DataFrame({'Col1':[1,2,3,4], 'Col2':[5,6,7,8]}) Col1 Col2 0 1 5 1 2 6 2 3 7 3 4 8 Is there a way for me to change a column into the sum of all the following elements in the column? For...

What's the fastest way to compare datetime in pandas?

python,python-3.x,numpy,pandas,datetime64
I have two big csv files with different number of rows which I am importing as follows: tdata = pd.read_csv(tfilepath, sep=',', parse_dates=['date_1']) print(tdata.iloc[:, [0,3]]) TBA date_1 0 0 2010-01-04 1 9 2010-01-05 2 0 2010-01-06 3 8 2010-01-07 4 0 2010-01-08 5 0 2010-01-09 pdata = pd.read_csv(pfilepath, sep=',', parse_dates=['date_2']) print(pdata.iloc[:,...

pandas looking at next row and swapping values

python,pandas
If I see A&T consecutively, I will set found=True for A, set remove=True for T, set the value of T to A (copy A's value to T), and set found=True for T. If I see G&C consecutively, I will set found=True for G, set remove=True for C, and swap the G & C values...

IndexError obstructing code from working with larger csv file

python,csv,indexing,pandas
I have code that sorts a csv using groupby and then plots the information. I used a small sample of the data to write the code. It ran smoothly, so I then tried running it on the huge file of data. I am pretty new at Python and this...

Restructuring Dataframe

python,pandas
I have a dataframe with 262800 rows and 3 columns that currently looks as follows: Currency Maturity value 0 GBP 0.08333333 4.709456 1 GBP 0.08333333 4.713099 2 GBP 0.08333333 4.707237 3 GBP 0.08333333 4.705043 4 GBP 0.08333333 4.697150 5 GBP 0.08333333 4.710647 6...

Concat Columns produces NAN even though axis is the same for all datasets

python,pandas,dataframes,concat
I am trying to concat columns from multiple dataframes. `AUD = OHLC_AUDUSD['bid']['close'];` `AUD = AUD.dropna()` `CAD = OHLC_USDCAD['bid']['close'];` `CAD = CAD.dropna()` `print AUD` symbol timestamp AUDUSD 2015-01-05 0.8096 2015-01-06 0.8077 2015-01-07 0.8074 2015-01-08 0.8112 2015-01-09 0.8200 Name: close, dtype: float64 `print CAD` symbol timestamp USDCAD 2015-01-05 1.1756 2015-01-06 1.1838 2015-01-07...
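
A hedged sketch of one likely cause, assuming (from the printed output) that each Series is indexed by a two-level `(symbol, timestamp)` MultiIndex. In that case no index entries match across the two Series, because the `symbol` level differs, so a column-wise concat fills the gaps with NaN. Dropping the `symbol` level first lets the timestamps align. Only the first two dates from the excerpt are used.

```python
import pandas as pd

dates = pd.to_datetime(["2015-01-05", "2015-01-06"])

# Rebuild the two Series with the assumed (symbol, timestamp) MultiIndex.
AUD = pd.Series(
    [0.8096, 0.8077],
    index=pd.MultiIndex.from_product([["AUDUSD"], dates],
                                     names=["symbol", "timestamp"]),
    name="close",
)
CAD = pd.Series(
    [1.1756, 1.1838],
    index=pd.MultiIndex.from_product([["USDCAD"], dates],
                                     names=["symbol", "timestamp"]),
    name="close",
)

# Index entries differ in the symbol level, so nothing lines up.
naive = pd.concat([AUD, CAD], axis=1, keys=["AUD", "CAD"])

# Dropping the symbol level leaves a plain timestamp index on both sides.
aligned = pd.concat(
    [AUD.droplevel("symbol"), CAD.droplevel("symbol")],
    axis=1,
    keys=["AUD", "CAD"],
)
print(aligned)
```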

Pandas difference between dataframes on column values

python,pandas,dataframes,difference
I couldn't find a way to have a dataframe that has the difference of 2 dataframes based on a column. So basically: dfA = ID, val 1, test 2, other test dfB = ID, val 2, other test I want to have a dfC that holds the difference dfA -...
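
A minimal sketch, assuming "difference" means the rows of dfA whose `ID` does not appear in dfB. The frames are rebuilt from the values in the excerpt.

```python
import pandas as pd

dfA = pd.DataFrame({"ID": [1, 2], "val": ["test", "other test"]})
dfB = pd.DataFrame({"ID": [2], "val": ["other test"]})

# isin flags the dfA rows whose ID also occurs in dfB; ~ inverts the mask
# so that only the rows unique to dfA are kept.
dfC = dfA[~dfA["ID"].isin(dfB["ID"])]
print(dfC)
```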

Failing to convert column in pandas dataframe to integer data type

python,pandas
I have this code which manipulates a data set to create a new column by pulling info from an existing column. In order to match the data properly using a pd.merge function with another data set, I would like to convert the 'Channel ID' column to integers. Despite the current...

what is the best method to extract highly correlated variables within the given threshold

python,numpy,pandas,scipy
I have one data frame and pairwise correlations were calculated: >>> df1 = pd.read_csv("/home/zebrafish/Desktop/stack.csv") >>> df1.corr() GA PN PC MBP GR AP GA 1.000000 0.070541 0.259937 -0.452661 0.115722 0.268014 PN 0.070541 1.000000 0.512536 0.447831 -0.042238 0.263601 PC 0.259937 0.512536 1.000000 0.331354 -0.254312 0.958877 MBP -0.452661 0.447831 0.331354 1.000000 -0.467683 0.229870...
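
One way this could be sketched, assuming the goal is a list of variable pairs whose absolute correlation exceeds a cutoff. The data, the column names and the `threshold` value are made up for illustration; only the upper triangle of the matrix is kept so that each pair appears once and the diagonal is ignored.

```python
import numpy as np
import pandas as pd

# Made-up data: b is a multiple of a, c is only weakly related to both.
df1 = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, 1, -1],
})
corr = df1.corr()
threshold = 0.9  # hypothetical cutoff

# Boolean mask that is True strictly above the diagonal.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Blank out everything else, flatten to (row, column) pairs, then filter.
pairs = corr.where(upper_mask).stack()
strong = pairs[pairs.abs() > threshold]
print(strong)
```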

Find the common values in columns in Pandas dataframe

python,pandas
I have the data frame of the style: animal animal A Dog Dog B Cat Cat C Pig Pig D Cat Dog The different entries in row D tell me there is an error. I need to remove all rows where the animals are not the same. The columns do...
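
A minimal sketch, assuming the two columns really do share the name `animal`, as shown in the excerpt. With duplicate column names, selecting by position with `iloc` avoids the ambiguity of selecting by label.

```python
import pandas as pd

df = pd.DataFrame(
    [["Dog", "Dog"], ["Cat", "Cat"], ["Pig", "Pig"], ["Cat", "Dog"]],
    index=["A", "B", "C", "D"],
    columns=["animal", "animal"],
)

# Compare the first and second columns by position, row by row.
same = df.iloc[:, 0] == df.iloc[:, 1]

# Keep only the rows where both entries agree.
matching = df.loc[same]
print(matching)
```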

Pandas Dataframe Complex Calculation

python,python-2.7,pandas,dataframes
I have the following dataframe, df: Year totalPubs ActualCitations 0 1994 71 191.002034 1 1995 77 2763.911781 2 1996 69 2022.374474 3 1997 78 3393.094951 I want to write code that would do the following: Citations of current year / Sum of totalPubs of the two previous years I want something to...
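
A minimal sketch of the stated formula using `shift`, assuming the rows are sorted by year with no gaps. The output column name `ratio` is made up for illustration. The first two years have no two preceding years, so they come out as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1994, 1995, 1996, 1997],
    "totalPubs": [71, 77, 69, 78],
    "ActualCitations": [191.002034, 2763.911781, 2022.374474, 3393.094951],
})

# shift(1) and shift(2) move totalPubs down by one and two rows, so their
# sum is the publication count of the two previous years for each row.
prev_two = df["totalPubs"].shift(1) + df["totalPubs"].shift(2)

df["ratio"] = df["ActualCitations"] / prev_two
print(df)
```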

AttributeError when scraping data from URL via Python

python,pandas,beautifulsoup
I am using the code below to try to extract the data from the table in this URL. I asked the same question here and got an Answer for it. However, despite the code from the Answer working at that time I've now come to realize that data in the...

Pandas: Excel subheading

python,pandas
I'm trying to read in an excel file that has a sub-header. So far, I'm doing the following: link = 'http://www.bea.gov/industry/xls/io-annual/GDPbyInd_GO_NAICS_1997-2013.xlsx' xd = pd.read_excel(link, sheetname='07NAICS_GO_A_Gross Output', skiprows=3) Unfortunately, the data has a second sub header in row 4 (0-indexed) that only gives the unit of measurement, as follows. Can I...

Joining two Pandas DataFrames does not work anymore?

join,pandas,merge,dataframes
I have 2 Pandas Dataframes. The first one looks like this: date rank id points 2010-01-04 1 100001 10550 2010-01-04 2 100002 9205 The second one like this: id name 100001 A 100002 B I want to join both dataframes via the id column. So the result should look like:...
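
A minimal sketch of the join itself, rebuilt from the values in the excerpt. The frame names `ranks` and `names` are made up. The excerpt does not say why the join stopped working; one thing that may be worth checking is whether `id` has the same dtype in both frames, since an integer column and a string column will not match.

```python
import pandas as pd

ranks = pd.DataFrame({
    "date": ["2010-01-04", "2010-01-04"],
    "rank": [1, 2],
    "id": [100001, 100002],
    "points": [10550, 9205],
})
names = pd.DataFrame({"id": [100001, 100002], "name": ["A", "B"]})

# Compare the key dtypes first; a mismatch is a common source of trouble.
print(ranks["id"].dtype, names["id"].dtype)

# A left join keeps every row of ranks and adds the matching name.
merged = ranks.merge(names, on="id", how="left")
print(merged)
```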

Pandas - Dropping multiple empty columns

python,pandas
I have some tables where the first 11 columns are populated with data, but all columns after this are blank. I tried: df=df.dropna(axis=1,how='all') which didn't work. I then used: df = df.drop(df.columns[range(11,36)], axis=1) Which worked on the first few tables, but then some of the tables were longer or shorter...
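
A hedged sketch of one possible explanation: `dropna(how='all')` only removes columns that are entirely NaN, so it leaves columns alone if the "blank" cells are actually empty or whitespace-only strings. That this is the cause here is an assumption. The example below treats both kinds of cell as blank and keeps only the columns that hold at least one real value; the data and column names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2],
    "b": ["x", "y"],
    "blank1": ["", " "],          # empty / whitespace-only strings
    "blank2": [np.nan, np.nan],   # genuinely missing values
})

# A cell counts as blank if it is NaN or if its text is empty once stripped.
is_blank = df.isna() | df.astype(str).apply(lambda s: s.str.strip().eq(""))

# Keep the columns that are not blank in every row.
keep = ~is_blank.all(axis=0)
cleaned = df.loc[:, keep]
print(cleaned)
```

Because the selection is driven by content rather than by position, it does not depend on how many trailing columns a given table has.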

Filter pandas DataFrame by column time value

python,pandas
I have a pandas DataFrame with a 'date' column, which uses this format: 2015-01-01 04:00:00 2015-01-01 05:00:00 2015-01-01 06:00:00 2015-01-01 07:00:00 ... 2015-01-02 04:00:00 2015-01-02 05:00:00 2015-01-02 06:00:00 2015-01-02 07:00:00 I want to filter the DataFrame so I only keep the rows with a stated time, e.g. 06:00:00 2015-01-01 06:00:00...
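
A minimal sketch, assuming the `date` column is (or can be converted to) a datetime dtype. The time-of-day part of each timestamp is compared against a `datetime.time` value; the variable name `at_six` is made up.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-01-01 04:00:00", "2015-01-01 05:00:00",
        "2015-01-01 06:00:00", "2015-01-01 07:00:00",
        "2015-01-02 04:00:00", "2015-01-02 05:00:00",
        "2015-01-02 06:00:00", "2015-01-02 07:00:00",
    ])
})

# .dt.time extracts the time of day from each timestamp.
at_six = df[df["date"].dt.time == datetime.time(6, 0)]
print(at_six)
```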