Python pandas - add multiple columns to dataframe where rows and columns are co-dependent
I have managed to get this working with for-loops, but these are very slow on the large datasets I'm working with, so I am trying to find a way to do this using pandas groupby, apply and lambda functions instead.
import pandas as pd

example_df = pd.DataFrame({
    "scen": [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "cusip": ['031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2', '031162CF5', '031162CF5', '031162CF5', '031162CF5', '38141GWM2', '38141GWM2', '38141GWM2', '38141GWM2'],
    "wal": [50, 55, 60, 65, 40, 50, 60, 70, 40, 45, 50, 55, 30, 40, 50, 60],
    "par_val": [900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000, 900000, 800000, 700000, 600000],
    "prin_cf": [0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000, 0, 100000, 100000, 100000],
    "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0, 208.33, 0, 0, 0, 277.78, 0, 0, 0],
    "book_val": [1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0, 1000000, 0, 0, 0],
})
# book_val starts out as integers; cast to float so fractional values are kept
example_df["book_val"] = example_df["book_val"].astype(float)

for x in range(1, len(example_df)):
    # Only carry values forward while the cusip stays the same
    if example_df.loc[x, "cusip"] == example_df.loc[x - 1, "cusip"]:
        # If bond matures, don't report book value
        if example_df.loc[x, "par_val"] == 0:
            example_df.loc[x, "book_val"] = 0
        else:
            example_df.loc[x, "book_val"] = example_df.loc[x - 1, "book_val"] - example_df.loc[x - 1, "amortization"] - example_df.loc[x - 1, "prin_cf"]
        # Amortization uses the book value just computed for this row
        example_df.loc[x, "amortization"] = (example_df.loc[x, "book_val"] - example_df.loc[x, "par_val"]) / example_df.loc[x, "wal"] / 12

example_df
The tricky part is that each row's book value depends on the previous row's amortization value, while each amortization value depends on the book value in the same row. Looking at the responses to a similar question, "Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?", I think there may be a way to do this using global variables that keep track of the previous values.
Something like:
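def calc_bv(prev_prin_cf, par_val, wal):
    # bvalue and amort carry the previous row's results between calls
    global bvalue, amort
    bvalue = bvalue - amort - prev_prin_cf
    amort = (bvalue - par_val) / wal / 12
    return bvalue, amort

bvalue = example_df.loc[0, 'book_val']
amort = example_df.loc[0, 'amortization']
# Assumed: example_df2 is example_df plus the previous row's prin_cf as 'prev_prin_cf'
example_df2 = example_df.assign(prev_prin_cf=example_df['prin_cf'].shift(1))
# Note: this sketch does not yet reset the running values when the cusip changes
example_df.loc[1:, ['book_val', 'amortization']] = example_df2[1:].apply(lambda row: calc_bv(row['prev_prin_cf'], row['par_val'], row['wal']), axis=1, result_type="expand").to_numpy()
example_df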
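The co-dependence described above can be written as a recurrence. The symbols are shorthand introduced here, not names from the code: within one cusip, $b_t$ is book_val, $a_t$ is amortization, $c_t$ is prin_cf, $p_t$ is par_val and $w_t$ is wal for row $t$, and the matured-bond case (par_val equal to 0) is left out.

```latex
\begin{aligned}
b_t &= b_{t-1} - a_{t-1} - c_{t-1}, \\
a_t &= \frac{b_t - p_t}{12\, w_t},
\end{aligned}
\qquad t \ge 1 .
```

Here $b_0$ and $a_0$ are the values given in the first row of each cusip. If $a_{t-1}$ itself follows the second formula, substituting it into the first gives $b_t = \left(1 - \frac{1}{12 w_{t-1}}\right) b_{t-1} + \frac{p_{t-1}}{12 w_{t-1}} - c_{t-1}$, which is a linear first-order recurrence in $b_t$ alone.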
python pandas performance dataframe
edited Nov 13 '18 at 13:20 by Malik Asad
asked Nov 13 '18 at 11:48 by M.Hope
1 Answer
There will, no doubt, be a smart Pandas solution based on groupby. But you can get a performance improvement of roughly 1000x just by rewriting your loop with numba.
# Python 3.6.0, Pandas 0.19.2
assert jpp(df).equals(original(df))
%timeit jpp(df) # 929 µs per loop
%timeit original(df) # 1.05 s per loop
Benchmarking code
Original:
def original(example_df):
    for x in range(1, len(example_df['scen'])):
        if (example_df['cusip'][x] == example_df['cusip'][x-1]):
            # If bond matures, don't report book value
            if(example_df['par_val'][x] == 0):
                example_df['book_val'][x] = 0
            else:
                example_df['book_val'][x] = example_df['book_val'][x-1] - example_df['amortization'][x-1] - example_df['prin_cf'][x-1]
            example_df['amortization'][x] = (example_df['book_val'][x] - example_df['par_val'][x]) / example_df['wal'][x] / 12
    return example_df
Numba:
from numba import njit

@njit
def calculator(cusip, par, book, amort, prin_cf, wal):
    n = len(par)
    for i in range(1, n):
        # Only carry values forward while the cusip code stays the same
        if cusip[i] == cusip[i-1]:
            if par[i] == 0:
                book[i] = 0
            else:
                book[i] = book[i-1] - amort[i-1] - prin_cf[i-1]
            amort[i] = (book[i] - par[i]) / wal[i] / 12
    return book, amort

def jpp(df):
    # factorize turns the cusip strings into integer codes that numba can compare;
    # book_val is cast to float so fractional book values are not truncated
    df['book_val'], df['amortization'] = calculator(pd.factorize(df['cusip'])[0], df['par_val'].values,
                                                    df['book_val'].values.astype(float), df['amortization'].values,
                                                    df['prin_cf'].values, df['wal'].values)
    return df
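Where numba cannot be installed, the same row-by-row function should still run as plain Python over NumPy arrays. That is typically much faster than indexing the DataFrame element by element, though well short of the compiled version. The following is a minimal sketch under that assumption; the fallback decorator and the names `calc_book_amort`, `run` and `example` are hypothetical and not part of the answer above.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # compiled path, used only if numba is installed
except ImportError:
    def njit(func):
        """Fallback (assumption): leave the function as plain Python."""
        return func


@njit
def calc_book_amort(cusip, par, book, amort, prin_cf, wal):
    # Same recurrence as the loop: values carry forward within a run of one cusip
    for i in range(1, len(par)):
        if cusip[i] == cusip[i - 1]:
            if par[i] == 0:
                book[i] = 0.0
            else:
                book[i] = book[i - 1] - amort[i - 1] - prin_cf[i - 1]
            amort[i] = (book[i] - par[i]) / wal[i] / 12
    return book, amort


def run(df):
    # Copy the two output columns as float arrays so the input frame is not mutated
    book, amort = calc_book_amort(
        pd.factorize(df["cusip"])[0],
        df["par_val"].to_numpy(dtype=np.float64),
        df["book_val"].to_numpy(dtype=np.float64, copy=True),
        df["amortization"].to_numpy(dtype=np.float64, copy=True),
        df["prin_cf"].to_numpy(dtype=np.float64),
        df["wal"].to_numpy(dtype=np.float64),
    )
    return df.assign(book_val=book, amortization=amort)


# Small illustrative frame: the first scenario of the question's data
example = pd.DataFrame({
    "cusip": ["031162CF5"] * 4 + ["38141GWM2"] * 4,
    "wal": [50, 55, 60, 65, 40, 50, 60, 70],
    "par_val": [900000, 800000, 700000, 600000] * 2,
    "prin_cf": [0, 100000, 100000, 100000] * 2,
    "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0],
    "book_val": [1000000, 0, 0, 0] * 2,
})
out = run(example)
```

Because `run` returns a new frame via `assign`, the original `example` keeps its placeholder zeros, which makes it easier to compare against other implementations.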
Thanks! Although unfortunately numba is not on the list of libraries I have access to at work... any ideas on how the pandas/groupby solution might look?
– M.Hope
Dec 10 '18 at 15:29
@M.Hope, Sorry, can't think of anything. Get your work to install numba :)
– jpp
Dec 10 '18 at 15:35
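One possible shape for a pandas-only version, offered as a sketch rather than a tested replacement: substituting the amortization formula into the book value formula gives a linear recurrence, book[t+1] = g[t] * book[t] + d[t], which can be unrolled with grouped `cumprod` and `cumsum`. This rests on assumptions not stated in the question: par_val and wal are never 0 inside a group, and the first row's amortization already follows the formula (the question's 166.67 is a rounded 166.666..., so results differ from the loop at that rounding level). The names `book_values_vectorized`, `run_id` and `example` are hypothetical.

```python
import numpy as np
import pandas as pd


def book_values_vectorized(df):
    """Closed-form book value and amortization for each run of equal cusip."""
    # One id per run of consecutive rows sharing a cusip (the loop's reset rule)
    run_id = (df["cusip"] != df["cusip"].shift()).cumsum()

    k = 1.0 / (12.0 * df["wal"])            # per-row amortization rate
    g = 1.0 - k                             # multiplier on the previous book value
    d = k * df["par_val"] - df["prin_cf"]   # additive term of the recurrence

    cum_g = g.groupby(run_id).cumprod()     # product of g over rows 0..t of the run
    partial = (d / cum_g).groupby(run_id).cumsum()

    # Shift by one row inside each run so that row t only sees rows 0..t-1
    G = cum_g.groupby(run_id).shift(1).fillna(1.0)
    S = partial.groupby(run_id).shift(1).fillna(0.0)

    b0 = df.groupby(run_id)["book_val"].transform("first").astype(float)
    book = G * (b0 + S)
    amort = (book - df["par_val"]) * k
    return df.assign(book_val=book, amortization=amort)


# Small illustrative frame: the first scenario of the question's data
example = pd.DataFrame({
    "cusip": ["031162CF5"] * 4 + ["38141GWM2"] * 4,
    "wal": [50, 55, 60, 65, 40, 50, 60, 70],
    "par_val": [900000, 800000, 700000, 600000] * 2,
    "prin_cf": [0, 100000, 100000, 100000] * 2,
    "amortization": [166.67, 0, 0, 0, 208.33, 0, 0, 0],
    "book_val": [1000000, 0, 0, 0] * 2,
})
result = book_values_vectorized(example)
```

Dividing by the cumulative product is fine for short runs like these, but for very long runs the product can become small and lose precision, in which case a plain NumPy loop per group may be the safer choice.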
answered Nov 13 '18 at 12:11 by jpp