Using dplyr to get multiple max values of a dataframe
I have a dataframe that looks something like this:
Name Variable1 Value1 Variable2 Value2
1 Michael A 2 <NA> <NA>
2 Michael B 3 <NA> <NA>
3 Michael <NA> <NA> X 10
4 John B 5 <NA> <NA>
5 John E 3 <NA> <NA>
6 John <NA> <NA> Y 20
7 John <NA> <NA> Z 40
8 Paul C 10 <NA> <NA>
9 Paul D 5 <NA> <NA>
In all rows, there is either a Variable1/Value1 pair, or a Variable2/Value2 pair.
I'm trying to use dplyr
with a combination of group_by
and filter
functions to generate a dataframe that has (1) only one row for each name and (2) the max-Value pairs for both 1 and 2. In this case, my final table would look something like:
Name Variable1 Value1 Variable2 Value2
1 Michael B 3 X 10
2 John B 5 Z 40
3 Paul C 10 <NA> <NA>
I can't seem to get the correct flow of grouping to execute this properly. One solution I've thought of is to split the dataframe into two for each pair, filter, using group_by
and filter
on each separately, and then joining them back together based on Name, but I was wondering if anybody had a more straightforward solution to offer. Maybe something with summarise
?
Thank you!
EDIT: Corrected missing NA values in table.
r dplyr
add a comment |
I have a dataframe that looks something like this:
Name Variable1 Value1 Variable2 Value2
1 Michael A 2 <NA> <NA>
2 Michael B 3 <NA> <NA>
3 Michael <NA> <NA> X 10
4 John B 5 <NA> <NA>
5 John E 3 <NA> <NA>
6 John <NA> <NA> Y 20
7 John <NA> <NA> Z 40
8 Paul C 10 <NA> <NA>
9 Paul D 5 <NA> <NA>
In all rows, there is either a Variable1/Value1 pair, or a Variable2/Value2 pair.
I'm trying to use dplyr
with a combination of group_by
and filter
functions to generate a dataframe that has (1) only one row for each name and (2) the max-Value pairs for both 1 and 2. In this case, my final table would look something like:
Name Variable1 Value1 Variable2 Value2
1 Michael B 3 X 10
2 John B 5 Z 40
3 Paul C 10 <NA> <NA>
I can't seem to get the correct flow of grouping to execute this properly. One solution I've thought of is to split the dataframe into two for each pair, filter, using group_by
and filter
on each separately, and then joining them back together based on Name, but I was wondering if anybody had a more straightforward solution to offer. Maybe something with summarise
?
Thank you!
EDIT: Corrected missing NA values in table.
r dplyr
aggregate(df, by=list(Name), + FUN=max, na.rm=TRUE)
– W-B
Nov 13 '18 at 1:55
In adplyr
method:group_by(x, Name) %>% summarise_all(max)
. Indata.table
-speak:x[,lapply(.SD, max),by="Name"]
.
– r2evans
Nov 13 '18 at 2:06
These solutions don't seem to work -- both give me errors regarding dealing with factors, presumably because max doesn't work with the columns Variable1 and Variable2. I want to be able to get the max of Value1 and Value2, and then get the associated Variable as well.
– TactfulCactus
Nov 13 '18 at 2:23
Are the missing value for Variable 2 in rows 4 and 5 blank, or NA?
– neilfws
Nov 13 '18 at 2:34
They were meant to be NA. Thanks! Fixed.
– TactfulCactus
Nov 13 '18 at 2:50
add a comment |
I have a dataframe that looks something like this:
Name Variable1 Value1 Variable2 Value2
1 Michael A 2 <NA> <NA>
2 Michael B 3 <NA> <NA>
3 Michael <NA> <NA> X 10
4 John B 5 <NA> <NA>
5 John E 3 <NA> <NA>
6 John <NA> <NA> Y 20
7 John <NA> <NA> Z 40
8 Paul C 10 <NA> <NA>
9 Paul D 5 <NA> <NA>
In all rows, there is either a Variable1/Value1 pair, or a Variable2/Value2 pair.
I'm trying to use dplyr
with a combination of group_by
and filter
functions to generate a dataframe that has (1) only one row for each name and (2) the max-Value pairs for both 1 and 2. In this case, my final table would look something like:
Name Variable1 Value1 Variable2 Value2
1 Michael B 3 X 10
2 John B 5 Z 40
3 Paul C 10 <NA> <NA>
I can't seem to get the correct flow of grouping to execute this properly. One solution I've thought of is to split the dataframe into two for each pair, filter, using group_by
and filter
on each separately, and then joining them back together based on Name, but I was wondering if anybody had a more straightforward solution to offer. Maybe something with summarise
?
Thank you!
EDIT: Corrected missing NA values in table.
r dplyr
I have a dataframe that looks something like this:
Name Variable1 Value1 Variable2 Value2
1 Michael A 2 <NA> <NA>
2 Michael B 3 <NA> <NA>
3 Michael <NA> <NA> X 10
4 John B 5 <NA> <NA>
5 John E 3 <NA> <NA>
6 John <NA> <NA> Y 20
7 John <NA> <NA> Z 40
8 Paul C 10 <NA> <NA>
9 Paul D 5 <NA> <NA>
In all rows, there is either a Variable1/Value1 pair, or a Variable2/Value2 pair.
I'm trying to use dplyr
with a combination of group_by
and filter
functions to generate a dataframe that has (1) only one row for each name and (2) the max-Value pairs for both 1 and 2. In this case, my final table would look something like:
Name Variable1 Value1 Variable2 Value2
1 Michael B 3 X 10
2 John B 5 Z 40
3 Paul C 10 <NA> <NA>
I can't seem to get the correct flow of grouping to execute this properly. One solution I've thought of is to split the dataframe into two for each pair, filter, using group_by
and filter
on each separately, and then joining them back together based on Name, but I was wondering if anybody had a more straightforward solution to offer. Maybe something with summarise
?
Thank you!
EDIT: Corrected missing NA values in table.
r dplyr
r dplyr
edited Nov 13 '18 at 2:49
TactfulCactus
asked Nov 13 '18 at 1:43
TactfulCactusTactfulCactus
233
233
aggregate(df, by=list(Name), + FUN=max, na.rm=TRUE)
– W-B
Nov 13 '18 at 1:55
In adplyr
method:group_by(x, Name) %>% summarise_all(max)
. Indata.table
-speak:x[,lapply(.SD, max),by="Name"]
.
– r2evans
Nov 13 '18 at 2:06
These solutions don't seem to work -- both give me errors regarding dealing with factors, presumably because max doesn't work with the columns Variable1 and Variable2. I want to be able to get the max of Value1 and Value2, and then get the associated Variable as well.
– TactfulCactus
Nov 13 '18 at 2:23
Are the missing value for Variable 2 in rows 4 and 5 blank, or NA?
– neilfws
Nov 13 '18 at 2:34
They were meant to be NA. Thanks! Fixed.
– TactfulCactus
Nov 13 '18 at 2:50
add a comment |
aggregate(df, by=list(Name), + FUN=max, na.rm=TRUE)
– W-B
Nov 13 '18 at 1:55
In adplyr
method:group_by(x, Name) %>% summarise_all(max)
. Indata.table
-speak:x[,lapply(.SD, max),by="Name"]
.
– r2evans
Nov 13 '18 at 2:06
These solutions don't seem to work -- both give me errors regarding dealing with factors, presumably because max doesn't work with the columns Variable1 and Variable2. I want to be able to get the max of Value1 and Value2, and then get the associated Variable as well.
– TactfulCactus
Nov 13 '18 at 2:23
Are the missing value for Variable 2 in rows 4 and 5 blank, or NA?
– neilfws
Nov 13 '18 at 2:34
They were meant to be NA. Thanks! Fixed.
– TactfulCactus
Nov 13 '18 at 2:50
aggregate(df, by=list(Name), + FUN=max, na.rm=TRUE)
– W-B
Nov 13 '18 at 1:55
aggregate(df, by=list(Name), + FUN=max, na.rm=TRUE)
– W-B
Nov 13 '18 at 1:55
In a
dplyr
method: group_by(x, Name) %>% summarise_all(max)
. In data.table
-speak: x[,lapply(.SD, max),by="Name"]
.– r2evans
Nov 13 '18 at 2:06
In a
dplyr
method: group_by(x, Name) %>% summarise_all(max)
. In data.table
-speak: x[,lapply(.SD, max),by="Name"]
.– r2evans
Nov 13 '18 at 2:06
These solutions don't seem to work -- both give me errors regarding dealing with factors, presumably because max doesn't work with the columns Variable1 and Variable2. I want to be able to get the max of Value1 and Value2, and then get the associated Variable as well.
– TactfulCactus
Nov 13 '18 at 2:23
These solutions don't seem to work -- both give me errors regarding dealing with factors, presumably because max doesn't work with the columns Variable1 and Variable2. I want to be able to get the max of Value1 and Value2, and then get the associated Variable as well.
– TactfulCactus
Nov 13 '18 at 2:23
Are the missing value for Variable 2 in rows 4 and 5 blank, or NA?
– neilfws
Nov 13 '18 at 2:34
Are the missing value for Variable 2 in rows 4 and 5 blank, or NA?
– neilfws
Nov 13 '18 at 2:34
They were meant to be NA. Thanks! Fixed.
– TactfulCactus
Nov 13 '18 at 2:50
They were meant to be NA. Thanks! Fixed.
– TactfulCactus
Nov 13 '18 at 2:50
add a comment |
1 Answer
1
active
oldest
votes
library(dplyr)
df[is.na(df)] <- 0
df1 <-
df %>%
select(1:3)
df1_max <-
df %>%
group_by(Name) %>%
summarise(Value1 = max(Value1))
df2 <-
df %>%
select(c(1, 4:5))
df2_max <-
df %>%
group_by(Name) %>%
summarise(Value2 = max(Value2))
result <- left_join(
left_join(df1_max, df1),
left_join(df2_max, df2) %>%
distinct()
) %>%
select(c(1, 3, 2, 5, 4))
result[result == 0] <- NA
Result:
result
# A tibble: 3 x 5
Name Variable1 Value1 Variable2 Value2
<chr> <chr> <dbl> <chr> <dbl>
1 John B 5 Z 40
2 Michael B 3 X 10
3 Paul C 10 <NA> NA
Notes:
In data frames, the elements in a row are not independent from each other. So you shouldn't be able to dissociate them from each other. But this is what you are trying to do since the max value for
Variable1
and the max value forVariable2
are not on the same row, yet you want your result to have the max for each (thus creating a row that does not exist in your original data frame). This is why I split your data frame into 2, then re-associated them with a join.The max of
NA
and a value isNA
(we don't know what valueNA
is since it is missing, so we cannot know what the max is). Yet, you are consideringNA
to be inferior to any value. Since this is not how R works, I had to assign0
to yourNA
to get the max the way you consider it, then reassignNA
to0
at the end.You are missing two values in your data and I considered them to be
NA
.
1
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53272580%2fusing-dplyr-to-get-multiple-max-values-of-a-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
library(dplyr)
df[is.na(df)] <- 0
df1 <-
df %>%
select(1:3)
df1_max <-
df %>%
group_by(Name) %>%
summarise(Value1 = max(Value1))
df2 <-
df %>%
select(c(1, 4:5))
df2_max <-
df %>%
group_by(Name) %>%
summarise(Value2 = max(Value2))
result <- left_join(
left_join(df1_max, df1),
left_join(df2_max, df2) %>%
distinct()
) %>%
select(c(1, 3, 2, 5, 4))
result[result == 0] <- NA
Result:
result
# A tibble: 3 x 5
Name Variable1 Value1 Variable2 Value2
<chr> <chr> <dbl> <chr> <dbl>
1 John B 5 Z 40
2 Michael B 3 X 10
3 Paul C 10 <NA> NA
Notes:
In data frames, the elements in a row are not independent from each other. So you shouldn't be able to dissociate them from each other. But this is what you are trying to do since the max value for
Variable1
and the max value forVariable2
are not on the same row, yet you want your result to have the max for each (thus creating a row that does not exist in your original data frame). This is why I split your data frame into 2, then re-associated them with a join.The max of
NA
and a value isNA
(we don't know what valueNA
is since it is missing, so we cannot know what the max is). Yet, you are consideringNA
to be inferior to any value. Since this is not how R works, I had to assign0
to yourNA
to get the max the way you consider it, then reassignNA
to0
at the end.You are missing two values in your data and I considered them to be
NA
.
1
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
add a comment |
library(dplyr)
df[is.na(df)] <- 0
df1 <-
df %>%
select(1:3)
df1_max <-
df %>%
group_by(Name) %>%
summarise(Value1 = max(Value1))
df2 <-
df %>%
select(c(1, 4:5))
df2_max <-
df %>%
group_by(Name) %>%
summarise(Value2 = max(Value2))
result <- left_join(
left_join(df1_max, df1),
left_join(df2_max, df2) %>%
distinct()
) %>%
select(c(1, 3, 2, 5, 4))
result[result == 0] <- NA
Result:
result
# A tibble: 3 x 5
Name Variable1 Value1 Variable2 Value2
<chr> <chr> <dbl> <chr> <dbl>
1 John B 5 Z 40
2 Michael B 3 X 10
3 Paul C 10 <NA> NA
Notes:
In data frames, the elements in a row are not independent from each other. So you shouldn't be able to dissociate them from each other. But this is what you are trying to do since the max value for
Variable1
and the max value forVariable2
are not on the same row, yet you want your result to have the max for each (thus creating a row that does not exist in your original data frame). This is why I split your data frame into 2, then re-associated them with a join.The max of
NA
and a value isNA
(we don't know what valueNA
is since it is missing, so we cannot know what the max is). Yet, you are consideringNA
to be inferior to any value. Since this is not how R works, I had to assign0
to yourNA
to get the max the way you consider it, then reassignNA
to0
at the end.You are missing two values in your data and I considered them to be
NA
.
1
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
add a comment |
library(dplyr)
df[is.na(df)] <- 0
df1 <-
df %>%
select(1:3)
df1_max <-
df %>%
group_by(Name) %>%
summarise(Value1 = max(Value1))
df2 <-
df %>%
select(c(1, 4:5))
df2_max <-
df %>%
group_by(Name) %>%
summarise(Value2 = max(Value2))
result <- left_join(
left_join(df1_max, df1),
left_join(df2_max, df2) %>%
distinct()
) %>%
select(c(1, 3, 2, 5, 4))
result[result == 0] <- NA
Result:
result
# A tibble: 3 x 5
Name Variable1 Value1 Variable2 Value2
<chr> <chr> <dbl> <chr> <dbl>
1 John B 5 Z 40
2 Michael B 3 X 10
3 Paul C 10 <NA> NA
Notes:
In data frames, the elements in a row are not independent from each other. So you shouldn't be able to dissociate them from each other. But this is what you are trying to do since the max value for
Variable1
and the max value forVariable2
are not on the same row, yet you want your result to have the max for each (thus creating a row that does not exist in your original data frame). This is why I split your data frame into 2, then re-associated them with a join.The max of
NA
and a value isNA
(we don't know what valueNA
is since it is missing, so we cannot know what the max is). Yet, you are consideringNA
to be inferior to any value. Since this is not how R works, I had to assign0
to yourNA
to get the max the way you consider it, then reassignNA
to0
at the end.You are missing two values in your data and I considered them to be
NA
.
library(dplyr)
df[is.na(df)] <- 0
df1 <-
df %>%
select(1:3)
df1_max <-
df %>%
group_by(Name) %>%
summarise(Value1 = max(Value1))
df2 <-
df %>%
select(c(1, 4:5))
df2_max <-
df %>%
group_by(Name) %>%
summarise(Value2 = max(Value2))
result <- left_join(
left_join(df1_max, df1),
left_join(df2_max, df2) %>%
distinct()
) %>%
select(c(1, 3, 2, 5, 4))
result[result == 0] <- NA
Result:
result
# A tibble: 3 x 5
Name Variable1 Value1 Variable2 Value2
<chr> <chr> <dbl> <chr> <dbl>
1 John B 5 Z 40
2 Michael B 3 X 10
3 Paul C 10 <NA> NA
Notes:
In data frames, the elements in a row are not independent from each other. So you shouldn't be able to dissociate them from each other. But this is what you are trying to do since the max value for
Variable1
and the max value forVariable2
are not on the same row, yet you want your result to have the max for each (thus creating a row that does not exist in your original data frame). This is why I split your data frame into 2, then re-associated them with a join.The max of
NA
and a value isNA
(we don't know what valueNA
is since it is missing, so we cannot know what the max is). Yet, you are consideringNA
to be inferior to any value. Since this is not how R works, I had to assign0
to yourNA
to get the max the way you consider it, then reassignNA
to0
at the end.You are missing two values in your data and I considered them to be
NA
.
edited Nov 13 '18 at 2:57
answered Nov 13 '18 at 2:34
prosoitosprosoitos
935419
935419
1
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
add a comment |
1
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
1
1
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
Okay! That all makes sense to me and was what I was thinking, though I didn't know about the max function with NA. Thanks for the explanation!
– TactfulCactus
Nov 13 '18 at 2:51
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53272580%2fusing-dplyr-to-get-multiple-max-values-of-a-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
aggregate(df, by=list(Name), + FUN=max, na.rm=TRUE)
– W-B
Nov 13 '18 at 1:55
In a
dplyr
method:group_by(x, Name) %>% summarise_all(max)
. Indata.table
-speak:x[,lapply(.SD, max),by="Name"]
.– r2evans
Nov 13 '18 at 2:06
These solutions don't seem to work -- both give me errors regarding dealing with factors, presumably because max doesn't work with the columns Variable1 and Variable2. I want to be able to get the max of Value1 and Value2, and then get the associated Variable as well.
– TactfulCactus
Nov 13 '18 at 2:23
Are the missing value for Variable 2 in rows 4 and 5 blank, or NA?
– neilfws
Nov 13 '18 at 2:34
They were meant to be NA. Thanks! Fixed.
– TactfulCactus
Nov 13 '18 at 2:50