Python pandas - add multiple columns to dataframe where rows and columns are co-dependent

I have managed to get this working with for-loops, but these are very slow on the large datasets I'm working with, so I am trying to find a way to do this using pandas groupby, apply and lambda functions instead.

import pandas as pd
example_df = pd.DataFrame({"scen": [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "cusip": ['031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2', '031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2'],
    "wal": [50, 55, 60, 65, 40, 50, 60, 70, 40, 45, 50, 55, 30, 40, 50, 60],
    "par_val": [900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000],
    "prin_cf": [0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000],
    "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0, 208.33, 0, 0, 0, 277.78, 0, 0, 0],
    "book_val": [1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0]})

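# book_val starts as integers; cast to float so fractional book values can be stored
example_df['book_val'] = example_df['book_val'].astype(float)

for x in range(1, len(example_df)):
    # Only roll forward within the same bond (same cusip as the previous row)
    if example_df.loc[x, 'cusip'] == example_df.loc[x-1, 'cusip']:
        # If bond matures, don't report book value
        if example_df.loc[x, 'par_val'] == 0:
            example_df.loc[x, 'book_val'] = 0
        else:
            example_df.loc[x, 'book_val'] = example_df.loc[x-1, 'book_val'] - example_df.loc[x-1, 'amortization'] - example_df.loc[x-1, 'prin_cf']

        example_df.loc[x, 'amortization'] = (example_df.loc[x, 'book_val'] - example_df.loc[x, 'par_val']) / example_df.loc[x, 'wal'] / 12

example_df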

The tricky part is that each row's book value depends on the previous row's amortization value, while each amortization value depends on the book value in the same row. Looking at the responses to a similar question here, I think there may be a way to do this using global variables that keep track of the previous values.

Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?

Something like:

def calc_bv(prin_cf, par_val, wal):
    # Running state carried from one row to the next
    global bvalue, amort
    bvalue = bvalue - amort - prin_cf
    amort = (bvalue - par_val) / wal / 12
    return bvalue, amort

bvalue = example_df.loc[0, 'book_val']
amort = example_df.loc[0, 'amortization']
# example_df2 is not defined above; it is assumed to be example_df plus a
# 'prev_prin_cf' column holding the previous row's prin_cf
example_df.loc[1:, ['book_val', 'amortization']] = example_df2[1:].apply(lambda row: calc_bv(row['prev_prin_cf'], row['par_val'], row['wal']), axis=1, result_type="expand").values
example_df
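
To make the co-dependence explicit, here is what the for-loop above computes for consecutive rows x-1 and x of the same bond. The symbols are shorthand introduced here, not names from the code: B = book_val, A = amortization, P = par_val, C = prin_cf, W = wal.

```latex
\begin{aligned}
B_x &= B_{x-1} - A_{x-1} - C_{x-1} \\
A_x &= \frac{B_x - P_x}{12\, W_x}
\end{aligned}
```

If A_{x-1} itself follows the second formula (the seeded first-row values in the example do, up to rounding), and leaving aside the par_val == 0 case, substituting it into the first line gives a first-order linear recurrence in B alone:

```latex
B_x = \left(1 - \frac{1}{12\, W_{x-1}}\right) B_{x-1} + \frac{P_{x-1}}{12\, W_{x-1}} - C_{x-1}
```

Each B_x still depends on B_{x-1}, which is why a plain column-wise shift cannot produce the result in one pass.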

edited Nov 13 '18 at 13:20 by Malik Asad

asked Nov 13 '18 at 11:48 by M.Hope

python pandas performance dataframe


1 Answer

There will, no doubt, be a smart pandas solution based on groupby. But you can get a decent performance improvement, roughly 1000x in the timings below, just by rewriting your loop with numba.

# Python 3.6.0, Pandas 0.19.2

assert jpp(df).equals(original(df))

%timeit jpp(df) # 929 µs per loop
%timeit original(df) # 1.05 s per loop

Benchmarking code

Original:

def original(example_df):
    for x in range(1, len(example_df['scen'])):

        if (example_df['cusip'][x] == example_df['cusip'][x-1]):

            # If bond matures, don't report book value
            if(example_df['par_val'][x] == 0):
                example_df['book_val'][x] = 0
            else:
                example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]

            example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12
    return example_df

Numba:

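from numba import njit

@njit
def calculator(cusip, par, book, amort, prin_cf, wal):
    n = len(par)
    for i in range(1, n):
        # Only roll forward within the same bond
        if cusip[i] == cusip[i-1]:
            if par[i] == 0:
                book[i] = 0  # matured bond: no book value
            else:
                book[i] = book[i-1] - amort[i-1] - prin_cf[i-1]
            amort[i] = (book[i] - par[i]) / wal[i] / 12
    return book, amort


def jpp(df):
    # factorize turns the cusip strings into integer codes that numba can compare;
    # book_val and amortization are passed as float copies so fractional values are kept
    df['book_val'], df['amortization'] = calculator(pd.factorize(df['cusip'])[0], df['par_val'].values,
                                                    df['book_val'].values.astype(float), df['amortization'].values.astype(float),
                                                    df['prin_cf'].values, df['wal'].values)

    return df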

answered Nov 13 '18 at 12:11 by jpp

Thanks! Although unfortunately numba is not on the list of libraries I have access to at work... any ideas on how the pandas/groupby solution might look?

– M.Hope
Dec 10 '18 at 15:29

@M.Hope, Sorry, can't think of anything. Get your work to install numba :)

– jpp
Dec 10 '18 at 15:35
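
For the case where numba is not available, one possible fallback is sketched below. It is not a fully vectorized groupby solution: it still loops row by row, but over plain NumPy arrays per bond rather than over DataFrame cells, which usually avoids most of the per-element indexing overhead. The function names roll_forward and fill_book_values are hypothetical, and the sketch assumes that the rows of each (scen, cusip) pair are contiguous and already in time order, as in the question's example.

```python
import pandas as pd


def roll_forward(group):
    """Fill book_val and amortization for one bond, row by row, on NumPy arrays."""
    par = group["par_val"].to_numpy(dtype=float, copy=True)
    prin = group["prin_cf"].to_numpy(dtype=float, copy=True)
    wal = group["wal"].to_numpy(dtype=float, copy=True)
    book = group["book_val"].to_numpy(dtype=float, copy=True)
    amort = group["amortization"].to_numpy(dtype=float, copy=True)
    for i in range(1, len(book)):
        # If the bond has matured, don't report a book value
        if par[i] == 0:
            book[i] = 0.0
        else:
            book[i] = book[i - 1] - amort[i - 1] - prin[i - 1]
        amort[i] = (book[i] - par[i]) / wal[i] / 12
    return pd.DataFrame({"book_val": book, "amortization": amort}, index=group.index)


def fill_book_values(df):
    """Return a copy of df with book_val and amortization rolled forward per bond."""
    out = df.copy()
    # Assumption: each (scen, cusip) pair identifies one bond's run of rows
    pieces = [roll_forward(g) for _, g in out.groupby(["scen", "cusip"], sort=False)]
    filled = pd.concat(pieces)
    # Column assignment aligns on the original index
    out["book_val"] = filled["book_val"]
    out["amortization"] = filled["amortization"]
    return out


# First scenario of the question's example data (two bonds, four rows each)
example = pd.DataFrame({
    "scen": [1] * 8,
    "cusip": ["031162CF5"] * 4 + ["38141GWM2"] * 4,
    "wal": [50, 55, 60, 65, 40, 50, 60, 70],
    "par_val": [900000, 800000, 700000, 600000] * 2,
    "prin_cf": [0, 100000, 100000, 100000] * 2,
    "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0],
    "book_val": [1000000, 0, 0, 0, 1000000, 0, 0, 0],
})
result = fill_book_values(example)
```

The input frame is left untouched and a filled copy is returned. Whether this is fast enough depends on the number of groups and rows, so it is worth timing against the original loop on real data.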


