Memory leakage when using `ggplot` on large binned datasets
I am making various ggplots from a very large dataset (much larger than the example below). To make plotting feasible, I wrote a function that bins the data on both the x- and y-axes.



In the following example, memory.size() is recorded at the start. Then the large dataset is simulated as dt, and x2 is plotted against x1 with binning. The plotting is repeated on different subsets of dt. The size of the plot objects is checked with object.size() and stored. After the plot objects have been created, rm(dt) is executed, followed by a double gc(). At this point, memory.size() is recorded again and compared with the value from the beginning.



Given the small size of the plot objects, I expected memory.size() at the end to be similar to that at the beginning. But it is not: memory.size() does not go down again until I restart the R session.
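(Side note for reproduction: memory.size() is Windows-only. On other platforms a rough substitute can be built from gc()'s report; a minimal sketch, where mem_used_mb is my own helper name:)

# Cross-platform substitute for memory.size(), based on gc()'s report.
# gc() returns a matrix whose second column is memory used in MB
# (one row for Ncells, one for Vcells); summing gives the total MB in use.
mem_used_mb <- function() sum(gc(verbose = FALSE)[, 2])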




REPRODUCIBLE EXAMPLE



library(data.table)
library(ggplot2)
library(magrittr)

# The binning function
# dt = the data.table to bin
# x = column name for the x-axis (character)
# y = column name for the y-axis (character)
# xNItv = number of bins on the x-axis
# yNItv = number of bins on the y-axis
# Value: a binned data.table, one row per (x, y) bin, with midpoints and counts
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv) {
  # Binning: equal-width breaks, then bin codes and bin midpoints
  xBreaks = dt[, seq(min(get(x), na.rm = T), max(get(x), na.rm = T), length.out = xNItv + 1)]
  yBreaks = dt[, seq(min(get(y), na.rm = T), max(get(y), na.rm = T), length.out = yNItv + 1)]
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = T)]
  xbinMid = sapply(seq(xNItv), function(i) mean(xBreaks[c(i, i + 1)]))[xbinCode]
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = T)]
  ybinMid = sapply(seq(yNItv), function(i) mean(yBreaks[c(i, i + 1)]))[ybinCode]
  # Creating the table: join against all bin combinations to get the count N per bin
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N),
    keyby = .EACHI, on = c("xbinCode", "ybinCode")
  ]
  # Returning the binned table
  return(tab_plot)
}


before.mem.size <- memory.size()

# Simulation of dataset
nrow <- 6e5
ncol <- 60
dt <- do.call(data.table, lapply(seq(ncol), function(i) return(runif(nrow))) %>% set_names(paste0("x", seq(ncol))))

# Graph plotting
dummyEnv <- new.env()
with(dummyEnv, {
  fcn <- function(tab) {
    binned.dt <- tab_by_bin_idxy(dt = tab, x = "x1", y = "x2", xNItv = 50, yNItv = 50)
    plot <- ggplot(binned.dt, aes(x = xbinMid, y = ybinMid)) + geom_point(aes(size = N))
    return(plot)
  }

  lst_plots <- list(
    plot1 = fcn(dt),
    plot2 = fcn(dt[x1 <= 0.7]),
    plot3 = fcn(dt[x5 <= 0.3])
  )
  assign("size.of.plots", object.size(lst_plots), envir = .GlobalEnv)
})
rm(dummyEnv)

# After use, remove the dataset and clean up
rm(dt)
gc();gc()
after.mem.size <- memory.size()

# Memory reports (memory.size() is in MB; object.size() is in bytes, hence the division)
print(paste0("before.mem.size = ", before.mem.size))
print(paste0("after.mem.size = ", after.mem.size))
print(paste0("plot.objs.size = ", size.of.plots / 1e6))



I have tried the following modifications to the code:



  • Inside fcn, removing the ggplot call and returning NULL instead of a plot object: the memory growth disappears entirely. But this is not a solution; I need the plot. (A sketch of this variant follows the list.)

  • The fewer plots requested, and the fewer columns or rows passed to fcn, the smaller the memory growth.

  • The growth also occurs if I take no subset and create only one plot object (the example above creates 3).

  • Even after I call rm(list = ls()) at the end, the memory is still not recovered.
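
For reference, the variant from the first bullet looks like this (a sketch; fcn_null is a hypothetical name, not part of the original script):

# Same binning, but no ggplot object is built or returned. With this
# variant the memory growth disappears, which points at the plot objects
# (or what they capture) rather than at the binning step.
fcn_null <- function(tab) {
  binned.dt <- tab_by_bin_idxy(dt = tab, x = "x1", y = "x2", xNItv = 50, yNItv = 50)
  NULL
}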

I would like to know why this happens and how to get rid of it, without giving up binned plots of different subsets of dt.



Thanks for your attention!

r memory ggplot2 memory-leaks data.table

asked Nov 15 '18 at 5:16
Matthew Hui

  • Add with(dummyEnv, rm(list = ls())) before removing the environment.

    – Roland
    Nov 15 '18 at 7:49
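
    Applied to the script above, the suggested cleanup would run just before rm(dummyEnv); a sketch:

    with(dummyEnv, rm(list = ls()))  # clear the objects held inside the environment first
    rm(dummyEnv)                     # then drop the environment itself
    gc()                             # and ask R to release the freed memory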











  • Thank you for your comment. Yes, the suggestion does help to mitigate the growth in memory used, but substantial memory still fails to be released, especially when the data size is big. Are there any other possible sources of leakage?

    – Matthew Hui
    Nov 15 '18 at 9:36











  • I would be more careful with calling something "memory leak". It's not easy to investigate this as you have (i) use of environments, (ii) a package object that has some special behavior regarding memory, (iii) two other packages in your example. My suggestion would be to not create dummyEnv.

    – Roland
    Nov 15 '18 at 11:18











  • It might well be that you have found a bug in R or data.table but right now I can't confirm this and can't see which one.

    – Roland
    Nov 15 '18 at 11:19











  • Agreed that it is not necessarily a memory leak. Not creating dummyEnv does not help; it intensifies the increase in memory usage. I suppose it is something in how ggplot handles the dataset... Thank you for helping

    – Matthew Hui
    Nov 15 '18 at 16:27