Calculate statistics on slices of a dataframe
I have the following dataframe:
df= pd.DataFrame('A': np.random.randn(10),
'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
'C': np.random.randn(10),
'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))
that looks something like this:
A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15
I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:
MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15
It other words, I want to keep the non-NA values from columns B and D and calculate the mean and others of A and C of the values above/below the non-NA values from B and D (in this case, let's include -1.786725 in both means, similarly for 0.282747).
NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().
Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.
python pandas dataframe
add a comment |
I have the following dataframe:
df= pd.DataFrame('A': np.random.randn(10),
'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
'C': np.random.randn(10),
'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))
that looks something like this:
A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15
I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:
MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15
It other words, I want to keep the non-NA values from columns B and D and calculate the mean and others of A and C of the values above/below the non-NA values from B and D (in this case, let's include -1.786725 in both means, similarly for 0.282747).
NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().
Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.
python pandas dataframe
is B has a value in a particular row, is D also guaranteed to have a value in the same row?
– Simon
Nov 13 '18 at 11:07
@Simon: Yes, that's correct.
– user270199
Nov 13 '18 at 11:09
What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?
– Ken Syme
Nov 13 '18 at 11:51
add a comment |
I have the following dataframe:
df= pd.DataFrame('A': np.random.randn(10),
'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
'C': np.random.randn(10),
'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))
that looks something like this:
A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15
I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:
MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15
It other words, I want to keep the non-NA values from columns B and D and calculate the mean and others of A and C of the values above/below the non-NA values from B and D (in this case, let's include -1.786725 in both means, similarly for 0.282747).
NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().
Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.
python pandas dataframe
I have the following dataframe:
df= pd.DataFrame('A': np.random.randn(10),
'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
'C': np.random.randn(10),
'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))
that looks something like this:
A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15
I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:
MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15
It other words, I want to keep the non-NA values from columns B and D and calculate the mean and others of A and C of the values above/below the non-NA values from B and D (in this case, let's include -1.786725 in both means, similarly for 0.282747).
NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().
Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.
python pandas dataframe
python pandas dataframe
asked Nov 13 '18 at 10:57
user270199user270199
274
274
is B has a value in a particular row, is D also guaranteed to have a value in the same row?
– Simon
Nov 13 '18 at 11:07
@Simon: Yes, that's correct.
– user270199
Nov 13 '18 at 11:09
What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?
– Ken Syme
Nov 13 '18 at 11:51
add a comment |
is B has a value in a particular row, is D also guaranteed to have a value in the same row?
– Simon
Nov 13 '18 at 11:07
@Simon: Yes, that's correct.
– user270199
Nov 13 '18 at 11:09
What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?
– Ken Syme
Nov 13 '18 at 11:51
is B has a value in a particular row, is D also guaranteed to have a value in the same row?
– Simon
Nov 13 '18 at 11:07
is B has a value in a particular row, is D also guaranteed to have a value in the same row?
– Simon
Nov 13 '18 at 11:07
@Simon: Yes, that's correct.
– user270199
Nov 13 '18 at 11:09
@Simon: Yes, that's correct.
– user270199
Nov 13 '18 at 11:09
What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?
– Ken Syme
Nov 13 '18 at 11:51
What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?
– Ken Syme
Nov 13 '18 at 11:51
add a comment |
4 Answers
4
active
oldest
votes
This should do:
b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])
for col_stat, col_nan in zip(cols_stats, cols_nans):
df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
i_prev = 0
for k,i in enumerate(b_nans):
df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
i_prev = i
print(df2)
mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15
add a comment |
One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by
ind = np.array(df.dropna().index)
Now that you know the indices you can manually perform the operations.
add a comment |
You can approach the problem with below steps:
Find out the index where you encounter 1st non-Nan value by following code:
index = df['B'].first_valid_index()## This will return 4Now slice df with these index value
subset_df = df.iloc[0:index+1:,]take mean of the subset df for all the columns
upper_half_mean = subset_df .mean(axis=0)
Do the steps 2 & 3 for the below part of the df and then just concat.
add a comment |
Here's a pretty simple way. The idea is to find the rows (index) where every column contains a value (the breakpoints). Iterate over those breakpoints and use those indices to get a sub slice of the dataframe. Calculate the mean for each column in that slice, which will return a single series for that slice. Then just concatenate those slices (series) together
np.random.seed(123)
# generate random data same way as in your question
# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]
# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices =
for i in range(len(idx)-1):
slices.append(df[idx[i]:idx[i+1]+1].mean())
# concat the separate series together as rows
output = pd.concat(slices, axis=1).T
# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values
Which gives:
A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279501%2fcalculate-statistics-on-slices-of-a-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
This should do:
b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])
for col_stat, col_nan in zip(cols_stats, cols_nans):
df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
i_prev = 0
for k,i in enumerate(b_nans):
df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
i_prev = i
print(df2)
mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15
add a comment |
This should do:
b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])
for col_stat, col_nan in zip(cols_stats, cols_nans):
df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
i_prev = 0
for k,i in enumerate(b_nans):
df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
i_prev = i
print(df2)
mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15
add a comment |
This should do:
b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])
for col_stat, col_nan in zip(cols_stats, cols_nans):
df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
i_prev = 0
for k,i in enumerate(b_nans):
df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
i_prev = i
print(df2)
mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15
This should do:
b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])
for col_stat, col_nan in zip(cols_stats, cols_nans):
df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
i_prev = 0
for k,i in enumerate(b_nans):
df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
i_prev = i
print(df2)
mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15
answered Nov 13 '18 at 11:43
yatuyatu
8,6311926
8,6311926
add a comment |
add a comment |
One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by
ind = np.array(df.dropna().index)
Now that you know the indices you can manually perform the operations.
add a comment |
One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by
ind = np.array(df.dropna().index)
Now that you know the indices you can manually perform the operations.
add a comment |
One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by
ind = np.array(df.dropna().index)
Now that you know the indices you can manually perform the operations.
One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by
ind = np.array(df.dropna().index)
Now that you know the indices you can manually perform the operations.
answered Nov 13 '18 at 11:20
randomwalkerrandomwalker
2828
2828
add a comment |
add a comment |
You can approach the problem with below steps:
Find out the index where you encounter 1st non-Nan value by following code:
index = df['B'].first_valid_index()## This will return 4Now slice df with these index value
subset_df = df.iloc[0:index+1:,]take mean of the subset df for all the columns
upper_half_mean = subset_df .mean(axis=0)
Do the steps 2 & 3 for the below part of the df and then just concat.
add a comment |
You can approach the problem with below steps:
Find out the index where you encounter 1st non-Nan value by following code:
index = df['B'].first_valid_index()## This will return 4Now slice df with these index value
subset_df = df.iloc[0:index+1:,]take mean of the subset df for all the columns
upper_half_mean = subset_df .mean(axis=0)
Do the steps 2 & 3 for the below part of the df and then just concat.
add a comment |
You can approach the problem with below steps:
Find out the index where you encounter 1st non-Nan value by following code:
index = df['B'].first_valid_index()## This will return 4Now slice df with these index value
subset_df = df.iloc[0:index+1:,]take mean of the subset df for all the columns
upper_half_mean = subset_df .mean(axis=0)
Do the steps 2 & 3 for the below part of the df and then just concat.
You can approach the problem with below steps:
Find out the index where you encounter 1st non-Nan value by following code:
index = df['B'].first_valid_index()## This will return 4Now slice df with these index value
subset_df = df.iloc[0:index+1:,]take mean of the subset df for all the columns
upper_half_mean = subset_df .mean(axis=0)
Do the steps 2 & 3 for the below part of the df and then just concat.
answered Nov 13 '18 at 11:34
Rahul AgarwalRahul Agarwal
2,19451027
2,19451027
add a comment |
add a comment |
Here's a pretty simple way. The idea is to find the rows (index) where every column contains a value (the breakpoints). Iterate over those breakpoints and use those indices to get a sub slice of the dataframe. Calculate the mean for each column in that slice, which will return a single series for that slice. Then just concatenate those slices (series) together
np.random.seed(123)
# generate random data same way as in your question
# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]
# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices =
for i in range(len(idx)-1):
slices.append(df[idx[i]:idx[i+1]+1].mean())
# concat the separate series together as rows
output = pd.concat(slices, axis=1).T
# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values
Which gives:
A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15
add a comment |
Here's a pretty simple way. The idea is to find the rows (index) where every column contains a value (the breakpoints). Iterate over those breakpoints and use those indices to get a sub slice of the dataframe. Calculate the mean for each column in that slice, which will return a single series for that slice. Then just concatenate those slices (series) together
np.random.seed(123)
# generate random data same way as in your question
# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]
# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices =
for i in range(len(idx)-1):
slices.append(df[idx[i]:idx[i+1]+1].mean())
# concat the separate series together as rows
output = pd.concat(slices, axis=1).T
# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values
Which gives:
A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15
add a comment |
Here's a pretty simple way. The idea is to find the rows (index) where every column contains a value (the breakpoints). Iterate over those breakpoints and use those indices to get a sub slice of the dataframe. Calculate the mean for each column in that slice, which will return a single series for that slice. Then just concatenate those slices (series) together
np.random.seed(123)
# generate random data same way as in your question
# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]
# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices =
for i in range(len(idx)-1):
slices.append(df[idx[i]:idx[i+1]+1].mean())
# concat the separate series together as rows
output = pd.concat(slices, axis=1).T
# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values
Which gives:
A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15
Here's a pretty simple way. The idea is to find the rows (index) where every column contains a value (the breakpoints). Iterate over those breakpoints and use those indices to get a sub slice of the dataframe. Calculate the mean for each column in that slice, which will return a single series for that slice. Then just concatenate those slices (series) together
np.random.seed(123)
# generate random data same way as in your question
# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]
# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices =
for i in range(len(idx)-1):
slices.append(df[idx[i]:idx[i+1]+1].mean())
# concat the separate series together as rows
output = pd.concat(slices, axis=1).T
# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values
Which gives:
A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15
edited Nov 13 '18 at 12:15
answered Nov 13 '18 at 11:57
SimonSimon
4,401103273
4,401103273
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279501%2fcalculate-statistics-on-slices-of-a-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
is B has a value in a particular row, is D also guaranteed to have a value in the same row?
– Simon
Nov 13 '18 at 11:07
@Simon: Yes, that's correct.
– user270199
Nov 13 '18 at 11:09
What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?
– Ken Syme
Nov 13 '18 at 11:51