Compare two dataframes and print out updated rows in R
up vote
1
down vote
favorite
I'm quite new to R, and I'm trying to solve this problem that seems quite simple, but I'm not sure how to go about it. I'm trying to compare two dataframes, and print out the rows that are in one but not the other, and also print another list/dataframe with the rows where only one cell has been updated.
df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com
df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com
The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:
require(sqldf)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
This gave me the output:
`firstname lastname email`
`Frederick Sam sammic@gmail.com`
Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.
So, I want a way to print out:
firstname lastname email
Grace Holly rickyoaks@yahoo.com
I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.
r dataframe compare
add a comment |
up vote
1
down vote
favorite
I'm quite new to R, and I'm trying to solve this problem that seems quite simple, but I'm not sure how to go about it. I'm trying to compare two dataframes, and print out the rows that are in one but not the other, and also print another list/dataframe with the rows where only one cell has been updated.
df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com
df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com
The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:
require(sqldf)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
This gave me the output:
`firstname lastname email`
`Frederick Sam sammic@gmail.com`
Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.
So, I want a way to print out:
firstname lastname email
Grace Holly rickyoaks@yahoo.com
I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.
r dataframe compare
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I'm quite new to R, and I'm trying to solve this problem that seems quite simple, but I'm not sure how to go about it. I'm trying to compare two dataframes, and print out the rows that are in one but not the other, and also print another list/dataframe with the rows where only one cell has been updated.
df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com
df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com
The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:
require(sqldf)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
This gave me the output:
`firstname lastname email`
`Frederick Sam sammic@gmail.com`
Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.
So, I want a way to print out:
firstname lastname email
Grace Holly rickyoaks@yahoo.com
I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.
r dataframe compare
I'm quite new to R, and I'm trying to solve this problem that seems quite simple, but I'm not sure how to go about it. I'm trying to compare two dataframes, and print out the rows that are in one but not the other, and also print another list/dataframe with the rows where only one cell has been updated.
df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com
df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com
The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:
require(sqldf)
df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')
This gave me the output:
`firstname lastname email`
`Frederick Sam sammic@gmail.com`
Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.
So, I want a way to print out:
firstname lastname email
Grace Holly rickyoaks@yahoo.com
I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.
r dataframe compare
r dataframe compare
asked Nov 9 at 12:53
Phoenix
154
154
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
up vote
1
down vote
accepted
1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:
sqldf("select df1.*, df2.email email2
from df1
join df2 on df1.firstname = df2.firstname and
df1.lastname = df2.lastname and
df1.email <> df2.email")
giving the following which shows the df1 record and the differing email from df2.
firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com
2) or a base solution would be:
subset(merge(df1, df2, by = 1:2), email.x != email.y)
Note
The input used in reproducible form is:
Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"
Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"
df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)
add a comment |
up vote
2
down vote
First, I create the data frames.
# Create data frames
df1 <- read.table(text = "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com", ,
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com",
header = TRUE, stringsAsFactors = FALSE)
Next, I load dplyr
.
# Load libraries
library(dplyr)
Here, I perform an antijoin to find the rows in df2
that are not in df1
.
# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))
# firstname lastname email
# 1 Frederick Sam sammic@gmail.com
Then, I bind together the original two data frames, remove the row identified earlier as only appearing in df2
, then I check for duplicates using all columns except one. If there are duplicates in all column except for one, I keep those rows.
# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>%
bind_rows(df2) %>%
anti_join(df3) %>%
filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)
# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com
I'm currently thinking about a more succinct way of writing the filter
line that generalises to an arbitrary number of columns.
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
I have also tried another solution, based on a similar question I spotted:email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52
add a comment |
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:
sqldf("select df1.*, df2.email email2
from df1
join df2 on df1.firstname = df2.firstname and
df1.lastname = df2.lastname and
df1.email <> df2.email")
giving the following which shows the df1 record and the differing email from df2.
firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com
2) or a base solution would be:
subset(merge(df1, df2, by = 1:2), email.x != email.y)
Note
The input used in reproducible form is:
Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"
Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"
df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)
add a comment |
up vote
1
down vote
accepted
1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:
sqldf("select df1.*, df2.email email2
from df1
join df2 on df1.firstname = df2.firstname and
df1.lastname = df2.lastname and
df1.email <> df2.email")
giving the following which shows the df1 record and the differing email from df2.
firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com
2) or a base solution would be:
subset(merge(df1, df2, by = 1:2), email.x != email.y)
Note
The input used in reproducible form is:
Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"
Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"
df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:
sqldf("select df1.*, df2.email email2
from df1
join df2 on df1.firstname = df2.firstname and
df1.lastname = df2.lastname and
df1.email <> df2.email")
giving the following which shows the df1 record and the differing email from df2.
firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com
2) or a base solution would be:
subset(merge(df1, df2, by = 1:2), email.x != email.y)
Note
The input used in reproducible form is:
Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"
Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"
df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)
1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:
sqldf("select df1.*, df2.email email2
from df1
join df2 on df1.firstname = df2.firstname and
df1.lastname = df2.lastname and
df1.email <> df2.email")
giving the following which shows the df1 record and the differing email from df2.
firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com
2) or a base solution would be:
subset(merge(df1, df2, by = 1:2), email.x != email.y)
Note
The input used in reproducible form is:
Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"
Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"
df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)
edited 2 days ago
answered 2 days ago
G. Grothendieck
141k9123227
141k9123227
add a comment |
add a comment |
up vote
2
down vote
First, I create the data frames.
# Create data frames
df1 <- read.table(text = "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com", ,
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com",
header = TRUE, stringsAsFactors = FALSE)
Next, I load dplyr
.
# Load libraries
library(dplyr)
Here, I perform an antijoin to find the rows in df2
that are not in df1
.
# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))
# firstname lastname email
# 1 Frederick Sam sammic@gmail.com
Then, I bind together the original two data frames, remove the row identified earlier as only appearing in df2
, then I check for duplicates using all columns except one. If there are duplicates in all column except for one, I keep those rows.
# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>%
bind_rows(df2) %>%
anti_join(df3) %>%
filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)
# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com
I'm currently thinking about a more succinct way of writing the filter
line that generalises to an arbitrary number of columns.
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
I have also tried another solution, based on a similar question I spotted:email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52
add a comment |
up vote
2
down vote
First, I create the data frames.
# Create data frames
df1 <- read.table(text = "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com", ,
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com",
header = TRUE, stringsAsFactors = FALSE)
Next, I load dplyr
.
# Load libraries
library(dplyr)
Here, I perform an antijoin to find the rows in df2
that are not in df1
.
# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))
# firstname lastname email
# 1 Frederick Sam sammic@gmail.com
Then, I bind together the original two data frames, remove the row identified earlier as only appearing in df2
, then I check for duplicates using all columns except one. If there are duplicates in all column except for one, I keep those rows.
# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>%
bind_rows(df2) %>%
anti_join(df3) %>%
filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)
# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com
I'm currently thinking about a more succinct way of writing the filter
line that generalises to an arbitrary number of columns.
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
I have also tried another solution, based on a similar question I spotted:email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52
add a comment |
up vote
2
down vote
up vote
2
down vote
First, I create the data frames.
# Create data frames
df1 <- read.table(text = "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com", ,
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com",
header = TRUE, stringsAsFactors = FALSE)
Next, I load dplyr
.
# Load libraries
library(dplyr)
Here, I perform an antijoin to find the rows in df2
that are not in df1
.
# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))
# firstname lastname email
# 1 Frederick Sam sammic@gmail.com
Then, I bind together the original two data frames, remove the row identified earlier as only appearing in df2
, then I check for duplicates using all columns except one. If there are duplicates in all column except for one, I keep those rows.
# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>%
bind_rows(df2) %>%
anti_join(df3) %>%
filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)
# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com
I'm currently thinking about a more succinct way of writing the filter
line that generalises to an arbitrary number of columns.
First, I create the data frames.
# Create data frames
df1 <- read.table(text = "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com", ,
header = TRUE, stringsAsFactors = FALSE)
df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com",
header = TRUE, stringsAsFactors = FALSE)
Next, I load dplyr
.
# Load libraries
library(dplyr)
Here, I perform an antijoin to find the rows in df2
that are not in df1
.
# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))
# firstname lastname email
# 1 Frederick Sam sammic@gmail.com
Then, I bind together the original two data frames, remove the row identified earlier as only appearing in df2
, then I check for duplicates using all columns except one. If there are duplicates in all column except for one, I keep those rows.
# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>%
bind_rows(df2) %>%
anti_join(df3) %>%
filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)
# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com
I'm currently thinking about a more succinct way of writing the filter
line that generalises to an arbitrary number of columns.
edited Nov 9 at 13:41
answered Nov 9 at 13:36
Lyngbakr
3,82911224
3,82911224
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
I have also tried another solution, based on a similar question I spotted:email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52
add a comment |
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
I have also tried another solution, based on a similar question I spotted:email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41
I have also tried another solution, based on a similar question I spotted:
email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)– Phoenix
Nov 9 at 14:52
I have also tried another solution, based on a similar question I spotted:
email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)
It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)– Phoenix
Nov 9 at 14:52
add a comment |
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53226082%2fcompare-two-dataframes-and-print-out-updated-rows-in-r%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password