Compare two dataframes and print out updated rows in R

up vote
1
down vote

favorite

I'm quite new to R, and I'm trying to solve this problem that seems quite simple, but I'm not sure how to go about it. I'm trying to compare two dataframes, and print out the rows that are in one but not the other, and also print another list/dataframe with the rows where only one cell has been updated.

df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com

df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com

The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:

require(sqldf)

df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')

This gave me the output:

`firstname lastname email`
`Frederick Sam sammic@gmail.com`

Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.

So, I want a way to print out:

firstname lastname email

Grace Holly rickyoaks@yahoo.com

I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.

asked Nov 9 at 12:53

Phoenix

154

add a comment |

up vote
1
down vote

favorite

df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com

df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com

The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:

require(sqldf)

df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')

This gave me the output:

`firstname lastname email`
`Frederick Sam sammic@gmail.com`

Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.

So, I want a way to print out:

firstname lastname email

Grace Holly rickyoaks@yahoo.com

I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.

asked Nov 9 at 12:53

Phoenix

154

add a comment |

up vote
1
down vote

favorite

df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com

df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com

The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:

require(sqldf)

df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')

This gave me the output:

`firstname lastname email`
`Frederick Sam sammic@gmail.com`

Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.

So, I want a way to print out:

firstname lastname email

Grace Holly rickyoaks@yahoo.com

I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.

asked Nov 9 at 12:53

Phoenix

154

df1
firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com

df2
firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com

The first thing I wanted to do was get the rows that were in df2 and not in df1, and this is how I did it:

require(sqldf)

df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1')

This gave me the output:

`firstname lastname email`
`Frederick Sam sammic@gmail.com`

Now, what I want is a way to get the first row as an output of it's own, by noting that the first and last name are the same, but the email is different.

So, I want a way to print out:

firstname lastname email

Grace Holly rickyoaks@yahoo.com

I've looked at the compare () function, and the merge and other functions, but they seem to be comparing to spot different rows, instead of different cells.

r dataframe compare

asked Nov 9 at 12:53

Phoenix

154

asked Nov 9 at 12:53

Phoenix

154

asked Nov 9 at 12:53

Phoenix

154

asked Nov 9 at 12:53

Phoenix

154

asked Nov 9 at 12:53

Phoenix

154

add a comment |

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:

sqldf("select df1.*, df2.email email2 
 from df1 
 join df2 on df1.firstname = df2.firstname and 
 df1.lastname = df2.lastname and 
 df1.email <> df2.email")

giving the following which shows the df1 record and the differing email from df2.

 firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com

2) or a base solution would be:

subset(merge(df1, df2, by = 1:2), email.x != email.y)

Note

The input used in reproducible form is:

Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"

Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"

df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)

edited 2 days ago

answered 2 days ago

G. Grothendieck

141k9123227

add a comment |

up vote
2
down vote

First, I create the data frames.

# Create data frames
df1 <- read.table(text = "firstname lastname email
 Grace Holly hollyoaks@yahoo.com
 Trish Edison edisontrish@gmail.com", ,
 header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com", 
 header = TRUE, stringsAsFactors = FALSE)

Next, I load dplyr.

# Load libraries
library(dplyr)

Here, I perform an antijoin to find the rows in df2 that are not in df1.

# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))

# firstname lastname email
# 1 Frederick Sam sammic@gmail.com

Then, I bind together the original two data frames, remove the row identified earlier as only appearing in df2, then I check for duplicates using all columns except one. If there are duplicates in all column except for one, I keep those rows.

# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>% 
 bind_rows(df2) %>% 
 anti_join(df3) %>% 
 filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)

# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com

I'm currently thinking about a more succinct way of writing the filter line that generalises to an arbitrary number of columns.

edited Nov 9 at 13:41

answered Nov 9 at 13:36

Lyngbakr

3,82911224

This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41

I have also tried another solution, based on a similar question I spotted: email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1) It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53226082%2fcompare-two-dataframes-and-print-out-updated-rows-in-r%23new-answer', 'question_page');

);

Post as a guest

Name

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:

sqldf("select df1.*, df2.email email2 
 from df1 
 join df2 on df1.firstname = df2.firstname and 
 df1.lastname = df2.lastname and 
 df1.email <> df2.email")

giving the following which shows the df1 record and the differing email from df2.

 firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com

2) or a base solution would be:

subset(merge(df1, df2, by = 1:2), email.x != email.y)

Note

The input used in reproducible form is:

Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"

Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"

df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)

edited 2 days ago

answered 2 days ago

G. Grothendieck

141k9123227

add a comment |

up vote
1
down vote

accepted

1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:

sqldf("select df1.*, df2.email email2 
 from df1 
 join df2 on df1.firstname = df2.firstname and 
 df1.lastname = df2.lastname and 
 df1.email <> df2.email")

giving the following which shows the df1 record and the differing email from df2.

 firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com

2) or a base solution would be:

subset(merge(df1, df2, by = 1:2), email.x != email.y)

Note

The input used in reproducible form is:

Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"

Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"

df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)

edited 2 days ago

answered 2 days ago

G. Grothendieck

141k9123227

add a comment |

up vote
1
down vote

accepted

1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:

sqldf("select df1.*, df2.email email2 
 from df1 
 join df2 on df1.firstname = df2.firstname and 
 df1.lastname = df2.lastname and 
 df1.email <> df2.email")

giving the following which shows the df1 record and the differing email from df2.

 firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com

2) or a base solution would be:

subset(merge(df1, df2, by = 1:2), email.x != email.y)

Note

The input used in reproducible form is:

Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"

Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"

df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)

edited 2 days ago

answered 2 days ago

G. Grothendieck

141k9123227

1) If you are looking for a way to list those people whose firstname and lastname are in both df1 and df2 but whose email is different then:

sqldf("select df1.*, df2.email email2 
 from df1 
 join df2 on df1.firstname = df2.firstname and 
 df1.lastname = df2.lastname and 
 df1.email <> df2.email")

giving the following which shows the df1 record and the differing email from df2.

 firstname lastname email email2
1 Grace Holly hollyoaks@yahoo.com rickyoaks@yahoo.com

2) or a base solution would be:

subset(merge(df1, df2, by = 1:2), email.x != email.y)

Note

The input used in reproducible form is:

Lines1 <- "firstname lastname email
Grace Holly hollyoaks@yahoo.com
Trish Edison edisontrish@gmail.com"

Lines2 <- "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com"

df1 <- read.table(text = Lines1, header = TRUE, as.is = TRUE, strip.white = TRUE)
df2 <- read.table(text = Lines2, header = TRUE, as.is = TRUE, strip.white = TRUE)

edited 2 days ago

answered 2 days ago

G. Grothendieck

141k9123227

edited 2 days ago

answered 2 days ago

G. Grothendieck

141k9123227

answered 2 days ago

G. Grothendieck

141k9123227

answered 2 days ago

G. Grothendieck

141k9123227

add a comment |

up vote
2
down vote

First, I create the data frames.

# Create data frames
df1 <- read.table(text = "firstname lastname email
 Grace Holly hollyoaks@yahoo.com
 Trish Edison edisontrish@gmail.com", ,
 header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com", 
 header = TRUE, stringsAsFactors = FALSE)

Next, I load dplyr.

# Load libraries
library(dplyr)

Here, I perform an antijoin to find the rows in df2 that are not in df1.

# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))

# firstname lastname email
# 1 Frederick Sam sammic@gmail.com

# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>% 
 bind_rows(df2) %>% 
 anti_join(df3) %>% 
 filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)

# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com

I'm currently thinking about a more succinct way of writing the filter line that generalises to an arbitrary number of columns.

edited Nov 9 at 13:41

answered Nov 9 at 13:36

Lyngbakr

3,82911224

This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41

I have also tried another solution, based on a similar question I spotted: email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1) It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52

add a comment |

up vote
2
down vote

First, I create the data frames.

# Create data frames
df1 <- read.table(text = "firstname lastname email
 Grace Holly hollyoaks@yahoo.com
 Trish Edison edisontrish@gmail.com", ,
 header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com", 
 header = TRUE, stringsAsFactors = FALSE)

Next, I load dplyr.

# Load libraries
library(dplyr)

Here, I perform an antijoin to find the rows in df2 that are not in df1.

# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))

# firstname lastname email
# 1 Frederick Sam sammic@gmail.com

# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>% 
 bind_rows(df2) %>% 
 anti_join(df3) %>% 
 filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)

# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com

I'm currently thinking about a more succinct way of writing the filter line that generalises to an arbitrary number of columns.

edited Nov 9 at 13:41

answered Nov 9 at 13:36

Lyngbakr

3,82911224

This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41

I have also tried another solution, based on a similar question I spotted: email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1) It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52

add a comment |

up vote
2
down vote

First, I create the data frames.

# Create data frames
df1 <- read.table(text = "firstname lastname email
 Grace Holly hollyoaks@yahoo.com
 Trish Edison edisontrish@gmail.com", ,
 header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com", 
 header = TRUE, stringsAsFactors = FALSE)

Next, I load dplyr.

# Load libraries
library(dplyr)

Here, I perform an antijoin to find the rows in df2 that are not in df1.

# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))

# firstname lastname email
# 1 Frederick Sam sammic@gmail.com

# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>% 
 bind_rows(df2) %>% 
 anti_join(df3) %>% 
 filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)

# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com

I'm currently thinking about a more succinct way of writing the filter line that generalises to an arbitrary number of columns.

edited Nov 9 at 13:41

answered Nov 9 at 13:36

Lyngbakr

3,82911224

First, I create the data frames.

# Create data frames
df1 <- read.table(text = "firstname lastname email
 Grace Holly hollyoaks@yahoo.com
 Trish Edison edisontrish@gmail.com", ,
 header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = "firstname lastname email
Grace Holly rickyoaks@yahoo.com
Frederick Sam sammic@gmail.com", 
 header = TRUE, stringsAsFactors = FALSE)

Next, I load dplyr.

# Load libraries
library(dplyr)

Here, I perform an antijoin to find the rows in df2 that are not in df1.

# Perform antijoin
df3 <- df2 %>% anti_join(df1, by = c("firstname", "lastname"))

# firstname lastname email
# 1 Frederick Sam sammic@gmail.com

# Bind two data frames together
# Remove those only appearing in df2
# Filter to those with duplicates in all but one column
df1 %>% 
 bind_rows(df2) %>% 
 anti_join(df3) %>% 
 filter((duplicated(firstname, lastname) + duplicated(email, lastname) + duplicated(firstname, email)) == ncol(df1) - 1)

# firstname lastname email
# 1 Grace Holly rickyoaks@yahoo.com

I'm currently thinking about a more succinct way of writing the filter line that generalises to an arbitrary number of columns.

edited Nov 9 at 13:41

answered Nov 9 at 13:36

Lyngbakr

3,82911224

edited Nov 9 at 13:41

answered Nov 9 at 13:36

Lyngbakr

3,82911224

answered Nov 9 at 13:36

Lyngbakr

3,82911224

answered Nov 9 at 13:36

Lyngbakr

3,82911224

This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41

I have also tried another solution, based on a similar question I spotted: email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1) It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52

add a comment |

This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41

I have also tried another solution, based on a similar question I spotted: email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1) It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52

This is great, thanks. I have also tried another solution, based on a similar question I spotted:
– Phoenix
Nov 9 at 14:41

I have also tried another solution, based on a similar question I spotted:

email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)

It solved the two questions, about finding the rows not in the other dataframe, as well as picking the row with only the email as as the difference. Your solution is neater though :)
– Phoenix
Nov 9 at 14:52

I have also tried another solution, based on a similar question I spotted:

email_change <- anti_join(df2, df1) email_update_only <- inner_join(email_change, df2, by = c("firstname", "lastname"), suffix = c(".df1", ".df2")) %>% filter(email.df2 != email.df1)

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

This page is only for reference, If you need detailed information, please check here

A2g Ep,SDz7mP6VxTVGBTumy

搜尋此網誌

Pfthb