R Summarise dplyr grouped data with certain rows excluded based on another column
I would like to summarize data across multiple columns, based on all rows other than rows with a certain value in a separate grouping variable column. For example, in the df below, I want to get the medians of A, B, C, D, and E based on values from rows that are not assigned to the cluster matching a given row.
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
df %>%
group_by(cluster) %>%
summarise_at(toupper(letters[1:5]), funs(m = fun_i_need_help_with(.)))
fun_i_need_help_with would give the equivalent of:
first row: median(df[which(df$cluster != 1), "A"])
second row: median(df[which(df$cluster != 2), "A"])
and so on...
I can do it with nested for loops, but it's pretty slow to run and doesn't seem like a good R-like solution.
for(col in toupper(letters[1:5]))
for(clust in unique(df$cluster))
df[which(df$cluster == clust), col] <-
median(df[which(df$cluster != clust), col])
r dplyr
add a comment |
I would like to summarize data across multiple columns, based on all rows other than rows with a certain value in a separate grouping variable column. For example, in the df below, I want to get the medians of A, B, C, D, and E based on values from rows that are not assigned to the cluster matching a given row.
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
df %>%
group_by(cluster) %>%
summarise_at(toupper(letters[1:5]), funs(m = fun_i_need_help_with(.)))
fun_i_need_help_with would give the equivalent of:
first row: median(df[which(df$cluster != 1), "A"])
second row: median(df[which(df$cluster != 2), "A"])
and so on...
I can do it with nested for loops, but it's pretty slow to run and doesn't seem like a good R-like solution.
for(col in toupper(letters[1:5]))
for(clust in unique(df$cluster))
df[which(df$cluster == clust), col] <-
median(df[which(df$cluster != clust), col])
r dplyr
1
Random note:LETTERS[1:5]
is the same astoupper(letters[1:5])
– Taiki Sakai
Nov 15 '18 at 1:28
Thanks, that is good to know!
– lblum
Nov 15 '18 at 17:51
add a comment |
I would like to summarize data across multiple columns, based on all rows other than rows with a certain value in a separate grouping variable column. For example, in the df below, I want to get the medians of A, B, C, D, and E based on values from rows that are not assigned to the cluster matching a given row.
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
df %>%
group_by(cluster) %>%
summarise_at(toupper(letters[1:5]), funs(m = fun_i_need_help_with(.)))
fun_i_need_help_with would give the equivalent of:
first row: median(df[which(df$cluster != 1), "A"])
second row: median(df[which(df$cluster != 2), "A"])
and so on...
I can do it with nested for loops, but it's pretty slow to run and doesn't seem like a good R-like solution.
for(col in toupper(letters[1:5]))
for(clust in unique(df$cluster))
df[which(df$cluster == clust), col] <-
median(df[which(df$cluster != clust), col])
r dplyr
I would like to summarize data across multiple columns, based on all rows other than rows with a certain value in a separate grouping variable column. For example, in the df below, I want to get the medians of A, B, C, D, and E based on values from rows that are not assigned to the cluster matching a given row.
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
df %>%
group_by(cluster) %>%
summarise_at(toupper(letters[1:5]), funs(m = fun_i_need_help_with(.)))
fun_i_need_help_with would give the equivalent of:
first row: median(df[which(df$cluster != 1), "A"])
second row: median(df[which(df$cluster != 2), "A"])
and so on...
I can do it with nested for loops, but it's pretty slow to run and doesn't seem like a good R-like solution.
for(col in toupper(letters[1:5]))
for(clust in unique(df$cluster))
df[which(df$cluster == clust), col] <-
median(df[which(df$cluster != clust), col])
r dplyr
r dplyr
asked Nov 15 '18 at 1:07
lblumlblum
8314
8314
1
Random note:LETTERS[1:5]
is the same astoupper(letters[1:5])
– Taiki Sakai
Nov 15 '18 at 1:28
Thanks, that is good to know!
– lblum
Nov 15 '18 at 17:51
add a comment |
1
Random note:LETTERS[1:5]
is the same astoupper(letters[1:5])
– Taiki Sakai
Nov 15 '18 at 1:28
Thanks, that is good to know!
– lblum
Nov 15 '18 at 17:51
1
1
Random note:
LETTERS[1:5]
is the same as toupper(letters[1:5])
– Taiki Sakai
Nov 15 '18 at 1:28
Random note:
LETTERS[1:5]
is the same as toupper(letters[1:5])
– Taiki Sakai
Nov 15 '18 at 1:28
Thanks, that is good to know!
– lblum
Nov 15 '18 at 17:51
Thanks, that is good to know!
– lblum
Nov 15 '18 at 17:51
add a comment |
1 Answer
1
active
oldest
votes
A solution using the tidyverse
.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
library(tidyverse)
df2 <- map_dfr(unique(df$cluster),
~df %>%
filter(cluster != .x) %>%
summarize_at(vars(-cluster), funs(median(.))) %>%
# Add a label to show the content of this row is not from a certain cluster number
mutate(not_cluster = .x))
df2
# A B C D E not_cluster
# 1 2.070508 5.110683 0.1820251 3.553918 0.7920827 1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446 2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568 3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152 4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152 5
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53311031%2fr-summarise-dplyr-grouped-data-with-certain-rows-excluded-based-on-another-colum%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
A solution using the tidyverse
.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
library(tidyverse)
df2 <- map_dfr(unique(df$cluster),
~df %>%
filter(cluster != .x) %>%
summarize_at(vars(-cluster), funs(median(.))) %>%
# Add a label to show the content of this row is not from a certain cluster number
mutate(not_cluster = .x))
df2
# A B C D E not_cluster
# 1 2.070508 5.110683 0.1820251 3.553918 0.7920827 1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446 2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568 3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152 4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152 5
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
add a comment |
A solution using the tidyverse
.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
library(tidyverse)
df2 <- map_dfr(unique(df$cluster),
~df %>%
filter(cluster != .x) %>%
summarize_at(vars(-cluster), funs(median(.))) %>%
# Add a label to show the content of this row is not from a certain cluster number
mutate(not_cluster = .x))
df2
# A B C D E not_cluster
# 1 2.070508 5.110683 0.1820251 3.553918 0.7920827 1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446 2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568 3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152 4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152 5
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
add a comment |
A solution using the tidyverse
.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
library(tidyverse)
df2 <- map_dfr(unique(df$cluster),
~df %>%
filter(cluster != .x) %>%
summarize_at(vars(-cluster), funs(median(.))) %>%
# Add a label to show the content of this row is not from a certain cluster number
mutate(not_cluster = .x))
df2
# A B C D E not_cluster
# 1 2.070508 5.110683 0.1820251 3.553918 0.7920827 1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446 2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568 3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152 4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152 5
A solution using the tidyverse
.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
A = rnorm(10, 2),
B = rnorm(10, 5),
C = rnorm(10, 0.4),
D = rnorm(10, 3),
E = rnorm(10, 1))
library(tidyverse)
df2 <- map_dfr(unique(df$cluster),
~df %>%
filter(cluster != .x) %>%
summarize_at(vars(-cluster), funs(median(.))) %>%
# Add a label to show the content of this row is not from a certain cluster number
mutate(not_cluster = .x))
df2
# A B C D E not_cluster
# 1 2.070508 5.110683 0.1820251 3.553918 0.7920827 1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446 2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568 3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152 4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152 5
answered Nov 15 '18 at 1:31
wwwwww
28.4k112344
28.4k112344
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
add a comment |
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
Thanks, this works exactly as I hoped. I'll have to read a bit about map_dfr so that I really understand what's happening there.
– lblum
Nov 15 '18 at 17:54
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53311031%2fr-summarise-dplyr-grouped-data-with-certain-rows-excluded-based-on-another-colum%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Random note:
LETTERS[1:5]
is the same astoupper(letters[1:5])
– Taiki Sakai
Nov 15 '18 at 1:28
Thanks, that is good to know!
– lblum
Nov 15 '18 at 17:51