Calculate statistics on slices of a dataframe

I have the following dataframe:

df= pd.DataFrame('A': np.random.randn(10),
 'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
 'C': np.random.randn(10),
 'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))

that looks something like this:

 A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15

I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:

 MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15

It other words, I want to keep the non-NA values from columns B and D and calculate the mean and others of A and C of the values above/below the non-NA values from B and D (in this case, let's include -1.786725 in both means, similarly for 0.282747).

NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().

Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.

asked Nov 13 '18 at 10:57

user270199

274

is B has a value in a particular row, is D also guaranteed to have a value in the same row?

– Simon
Nov 13 '18 at 11:07

@Simon: Yes, that's correct.

– user270199
Nov 13 '18 at 11:09

What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?

– Ken Syme
Nov 13 '18 at 11:51

add a comment |

I have the following dataframe:

df= pd.DataFrame('A': np.random.randn(10),
 'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
 'C': np.random.randn(10),
 'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))

that looks something like this:

 A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15

I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:

 MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15

NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().

Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.

asked Nov 13 '18 at 10:57

user270199

274

is B has a value in a particular row, is D also guaranteed to have a value in the same row?

– Simon
Nov 13 '18 at 11:07

@Simon: Yes, that's correct.

– user270199
Nov 13 '18 at 11:09

What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?

– Ken Syme
Nov 13 '18 at 11:51

add a comment |

I have the following dataframe:

df= pd.DataFrame('A': np.random.randn(10),
 'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
 'C': np.random.randn(10),
 'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))

that looks something like this:

 A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15

I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:

 MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15

NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().

Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.

asked Nov 13 '18 at 10:57

user270199

274

I have the following dataframe:

df= pd.DataFrame('A': np.random.randn(10),
 'B': np.concatenate((np.repeat(np.nan, 4), np.array([0.7]), np.repeat(np.nan, 4), np.array([0.5]))),
 'C': np.random.randn(10),
 'D': np.concatenate((np.repeat(np.nan, 4), np.array([0.2]), np.repeat(np.nan, 4), np.array([0.15]))))

that looks something like this:

 A B C D
0 0.537109 NaN -0.204046 NaN
1 0.744488 NaN 0.903607 NaN
2 0.640995 NaN 0.712210 NaN
3 0.212758 NaN 0.293984 NaN
4 -1.786725 0.7 0.282747 0.20
5 1.486037 NaN -0.199659 NaN
6 0.267356 NaN 0.890397 NaN
7 0.697408 NaN -0.771626 NaN
8 0.044247 NaN -1.157836 NaN
9 0.345658 0.5 1.556390 0.15

I would like to calculate some statistics such as mean on slices of this dataframe, for example, I would like the result to look something like:

 MeanA B MeanC D
0 2.1 0.7 1.3 0.20
1 1.4 0.5 5.6 0.15

NOTE: The values from MeanA and MeanC are made up because of the use of np.random.randn().

Now of course, my real dataframe contains more columns and more rows, so it would be nice to generalise this.

python pandas dataframe

asked Nov 13 '18 at 10:57

user270199

274

asked Nov 13 '18 at 10:57

user270199

274

asked Nov 13 '18 at 10:57

user270199

274

asked Nov 13 '18 at 10:57

user270199

274

asked Nov 13 '18 at 10:57

user270199

274

is B has a value in a particular row, is D also guaranteed to have a value in the same row?

– Simon
Nov 13 '18 at 11:07

@Simon: Yes, that's correct.

– user270199
Nov 13 '18 at 11:09

What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?

– Ken Syme
Nov 13 '18 at 11:51

add a comment |

is B has a value in a particular row, is D also guaranteed to have a value in the same row?

– Simon
Nov 13 '18 at 11:07

@Simon: Yes, that's correct.

– user270199
Nov 13 '18 at 11:09

What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?

– Ken Syme
Nov 13 '18 at 11:51

is B has a value in a particular row, is D also guaranteed to have a value in the same row?

– Simon
Nov 13 '18 at 11:07

@Simon: Yes, that's correct.

– user270199
Nov 13 '18 at 11:09

What exact values do you want to be included in the mean calculation? How far above/below the non-NA values would you go?

– Ken Syme
Nov 13 '18 at 11:51

add a comment |

4 Answers
4

active

oldest

votes

This should do:

b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])

for col_stat, col_nan in zip(cols_stats, cols_nans):
 df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
 i_prev = 0
 for k,i in enumerate(b_nans):
 df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
 i_prev = i

print(df2)

 mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15

answered Nov 13 '18 at 11:43

yatu

8,6311926

add a comment |

One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by

ind = np.array(df.dropna().index)

Now that you know the indices you can manually perform the operations.

answered Nov 13 '18 at 11:20

randomwalker

2828

add a comment |

You can approach the problem with below steps:

Find out the index where you encounter 1st non-Nan value by following code:

index = df['B'].first_valid_index() ## This will return 4

Now slice df with these index value

subset_df = df.iloc[0:index+1:,]

take mean of the subset df for all the columns

upper_half_mean = subset_df .mean(axis=0)

Do the steps 2 & 3 for the below part of the df and then just concat.

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

add a comment |

Here's a pretty simple way. The idea is to find the rows (index) where every column contains a value (the breakpoints). Iterate over those breakpoints and use those indices to get a sub slice of the dataframe. Calculate the mean for each column in that slice, which will return a single series for that slice. Then just concatenate those slices (series) together

np.random.seed(123)
# generate random data same way as in your question

# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]

# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices = 

for i in range(len(idx)-1):
 slices.append(df[idx[i]:idx[i+1]+1].mean())

# concat the separate series together as rows
output = pd.concat(slices, axis=1).T

# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values

Which gives:

 A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15

edited Nov 13 '18 at 12:15

answered Nov 13 '18 at 11:57

Simon

4,401103273

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279501%2fcalculate-statistics-on-slices-of-a-dataframe%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

This should do:

b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])

for col_stat, col_nan in zip(cols_stats, cols_nans):
 df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
 i_prev = 0
 for k,i in enumerate(b_nans):
 df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
 i_prev = i

print(df2)

 mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15

answered Nov 13 '18 at 11:43

yatu

8,6311926

add a comment |

This should do:

b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])

for col_stat, col_nan in zip(cols_stats, cols_nans):
 df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
 i_prev = 0
 for k,i in enumerate(b_nans):
 df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
 i_prev = i

print(df2)

 mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15

answered Nov 13 '18 at 11:43

yatu

8,6311926

add a comment |

This should do:

b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])

for col_stat, col_nan in zip(cols_stats, cols_nans):
 df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
 i_prev = 0
 for k,i in enumerate(b_nans):
 df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
 i_prev = i

print(df2)

 mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15

answered Nov 13 '18 at 11:43

yatu

8,6311926

This should do:

b_nans = df[~df.B.isnull()].index.values
cols_stats = ['A','C']
cols_nans = ['B','D']
df2 = pd.DataFrame(columns = [['mean_'+ col for col in cols] + cols_nans])

for col_stat, col_nan in zip(cols_stats, cols_nans):
 df2[col_nan] = df[~df[col_nan].isnull()][col_nan].reset_index(drop=True)
 i_prev = 0
 for k,i in enumerate(b_nans):
 df2.loc[k, 'mean_'+ col_stat] = df.loc[i_prev:i,col_stat].mean()
 i_prev = i

print(df2)

 mean_A mean_C B D
0 -0.059588 0.367446 0.7 0.20
1 0.037202 0.373243 0.5 0.15

answered Nov 13 '18 at 11:43

yatu

8,6311926

answered Nov 13 '18 at 11:43

yatu

8,6311926

answered Nov 13 '18 at 11:43

yatu

8,6311926

answered Nov 13 '18 at 11:43

yatu

8,6311926

add a comment |

One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by

ind = np.array(df.dropna().index)

Now that you know the indices you can manually perform the operations.

answered Nov 13 '18 at 11:20

randomwalker

2828

add a comment |

One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by

ind = np.array(df.dropna().index)

Now that you know the indices you can manually perform the operations.

answered Nov 13 '18 at 11:20

randomwalker

2828

add a comment |

One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by

ind = np.array(df.dropna().index)

Now that you know the indices you can manually perform the operations.

answered Nov 13 '18 at 11:20

randomwalker

2828

One way of doing this is to manually work with the index.
Lets get the indices of the non-NaN rows, e.g. by

ind = np.array(df.dropna().index)

Now that you know the indices you can manually perform the operations.

answered Nov 13 '18 at 11:20

randomwalker

2828

answered Nov 13 '18 at 11:20

randomwalker

2828

answered Nov 13 '18 at 11:20

randomwalker

2828

answered Nov 13 '18 at 11:20

randomwalker

2828

add a comment |

You can approach the problem with below steps:

Find out the index where you encounter 1st non-Nan value by following code:

index = df['B'].first_valid_index() ## This will return 4

Now slice df with these index value

subset_df = df.iloc[0:index+1:,]

take mean of the subset df for all the columns

upper_half_mean = subset_df .mean(axis=0)

Do the steps 2 & 3 for the below part of the df and then just concat.

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

add a comment |

You can approach the problem with below steps:

Find out the index where you encounter 1st non-Nan value by following code:

index = df['B'].first_valid_index() ## This will return 4

Now slice df with these index value

subset_df = df.iloc[0:index+1:,]

take mean of the subset df for all the columns

upper_half_mean = subset_df .mean(axis=0)

Do the steps 2 & 3 for the below part of the df and then just concat.

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

add a comment |

You can approach the problem with below steps:

Find out the index where you encounter 1st non-Nan value by following code:

index = df['B'].first_valid_index() ## This will return 4

Now slice df with these index value

subset_df = df.iloc[0:index+1:,]

take mean of the subset df for all the columns

upper_half_mean = subset_df .mean(axis=0)

Do the steps 2 & 3 for the below part of the df and then just concat.

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

You can approach the problem with below steps:

Find out the index where you encounter 1st non-Nan value by following code:

index = df['B'].first_valid_index() ## This will return 4

Now slice df with these index value

subset_df = df.iloc[0:index+1:,]

take mean of the subset df for all the columns

upper_half_mean = subset_df .mean(axis=0)

Do the steps 2 & 3 for the below part of the df and then just concat.

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

answered Nov 13 '18 at 11:34

Rahul Agarwal

2,19451027

add a comment |

np.random.seed(123)
# generate random data same way as in your question

# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]

# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices = 

for i in range(len(idx)-1):
 slices.append(df[idx[i]:idx[i+1]+1].mean())

# concat the separate series together as rows
output = pd.concat(slices, axis=1).T

# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values

Which gives:

 A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15

edited Nov 13 '18 at 12:15

answered Nov 13 '18 at 11:57

Simon

4,401103273

add a comment |

np.random.seed(123)
# generate random data same way as in your question

# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]

# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices = 

for i in range(len(idx)-1):
 slices.append(df[idx[i]:idx[i+1]+1].mean())

# concat the separate series together as rows
output = pd.concat(slices, axis=1).T

# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values

Which gives:

 A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15

edited Nov 13 '18 at 12:15

answered Nov 13 '18 at 11:57

Simon

4,401103273

add a comment |

np.random.seed(123)
# generate random data same way as in your question

# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]

# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices = 

for i in range(len(idx)-1):
 slices.append(df[idx[i]:idx[i+1]+1].mean())

# concat the separate series together as rows
output = pd.concat(slices, axis=1).T

# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values

Which gives:

 A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15

edited Nov 13 '18 at 12:15

answered Nov 13 '18 at 11:57

Simon

4,401103273

np.random.seed(123)
# generate random data same way as in your question

# get indices of the breakpoints
idx = df[-df.isnull().any(axis=1)].index
idx = np.insert(idx, 0, 0) # idx will be [0, 4, 9]

# calculate mean across columns for each slice of the dataframe
# this creates a list of pandas series
slices = 

for i in range(len(idx)-1):
 slices.append(df[idx[i]:idx[i+1]+1].mean())

# concat the separate series together as rows
output = pd.concat(slices, axis=1).T

# this is needed to get the correct values of the nan columns
output[['B', 'D']] = df[['B', 'D']].dropna().values

Which gives:

 A B C D
0 -0.378040 0.7 -0.073018 0.20
1 -0.230593 0.5 0.817437 0.15

edited Nov 13 '18 at 12:15

answered Nov 13 '18 at 11:57

Simon

4,401103273

edited Nov 13 '18 at 12:15

answered Nov 13 '18 at 11:57

Simon

4,401103273

answered Nov 13 '18 at 11:57

Simon

4,401103273

answered Nov 13 '18 at 11:57

Simon

4,401103273

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb