Incremental upload/update to PostgreSQL table using Pentaho DI

I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.

The first time I run this flow everything goes ok and the table gets populated. However, if later I want to do an incremental update on the same table, I need to truncate it and run the flow again. Is there any method that allows me to only load new/updated rows?

In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.

See my implementation:

enter image description here

Thanks in advance!!

asked Nov 11 at 18:20

Gabriela Martinez

14514

add a comment |

I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.

In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.

See my implementation:

enter image description here

Thanks in advance!!

asked Nov 11 at 18:20

Gabriela Martinez

14514

add a comment |

I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.

In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.

See my implementation:

enter image description here

Thanks in advance!!

asked Nov 11 at 18:20

Gabriela Martinez

14514

I have the following flow in Pentaho Data Integration to read a txt file and map it to a PostgreSQL table.

In the PostgreSQL Bulk Load operator, I can only see "Truncate/Insert" options and this is very inefficient, as my tables are really large.

See my implementation:

enter image description here

Thanks in advance!!

postgresql pentaho-data-integration incremental-build

asked Nov 11 at 18:20

Gabriela Martinez

14514

asked Nov 11 at 18:20

Gabriela Martinez

14514

asked Nov 11 at 18:20

Gabriela Martinez

14514

asked Nov 11 at 18:20

Gabriela Martinez

14514

asked Nov 11 at 18:20

Gabriela Martinez

14514

add a comment |

2 Answers
2

active

oldest

votes

Looking around for possibilities, some users say that the only advantage of Bulk Loader is performance with very large batch of rows (upwards of millions). But there ways of countering this.

Try using the Table output step, with Batch size("Commit size" in the step) of 5000, and altering the number of copies executing the step (depends on the number of cores your processor has) to say, 4 copies (Dual core CPU with 2 logical cores ea.). You can alter the number of copies by right clicking the step in the GUI and setting the desired number.

This will parallelize the output into 4 groups of Inserts, of 5000 rows per 'cycle' each. If this cause memory overload in the JVM, you can easily adapt that and increase the memory usage in the option PENTAHO_DI_JAVA_OPTIONS, simply double the amount that's set on Xms(minimum) and XmX(maximum), mine is set to "-Xms2048m" "-Xmx4096m".

The only peculiarity i found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have the exact same layout as the table.

answered Nov 13 at 18:02

Cristian Curti

32828

1

I think requirement is for incremental load, not for batch load. please find my answer. do you think it is right approach ?
– KP M
Nov 14 at 15:52

Ah yes, that missed my sight, for this the 'Synchronize After Merge' step is better, i'll edit my answer as soon as i can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well, it's just that Synchronize after merge handles Inserts/Updates/Deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56

add a comment |

you are looking for an incremental load. you can do it in two ways.

There is a step called "Insert/Update" , this will be used to do incremental load.

you will have option to specify key columns to compare. then under fields section select "Y" for update. Please select "N" for those columns you are selecting under key comparison.

Use table output and uncheck "Truncate table" option. While retrieving the data from source table, use variable in where clause. first get the max value from your target table and set this value to a variable and include in the where clause of your query.

Editing here..
if your data source is a flat file, then as I told get the max value(date/int) from target table and join with your data. after that use filter rows to have incremental data.

Hope this will help.

edited Nov 14 at 15:59

answered Nov 14 at 15:51

KP M

8910

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53251792%2fincremental-upload-update-to-postgresql-table-using-pentaho-di%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

Looking around for possibilities, some users say that the only advantage of Bulk Loader is performance with very large batch of rows (upwards of millions). But there ways of countering this.

The only peculiarity i found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have the exact same layout as the table.

answered Nov 13 at 18:02

Cristian Curti

32828

1

I think requirement is for incremental load, not for batch load. please find my answer. do you think it is right approach ?
– KP M
Nov 14 at 15:52

Ah yes, that missed my sight, for this the 'Synchronize After Merge' step is better, i'll edit my answer as soon as i can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well, it's just that Synchronize after merge handles Inserts/Updates/Deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56

add a comment |

Looking around for possibilities, some users say that the only advantage of Bulk Loader is performance with very large batch of rows (upwards of millions). But there ways of countering this.

The only peculiarity i found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have the exact same layout as the table.

answered Nov 13 at 18:02

Cristian Curti

32828

1

I think requirement is for incremental load, not for batch load. please find my answer. do you think it is right approach ?
– KP M
Nov 14 at 15:52

Ah yes, that missed my sight, for this the 'Synchronize After Merge' step is better, i'll edit my answer as soon as i can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well, it's just that Synchronize after merge handles Inserts/Updates/Deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56

add a comment |

Looking around for possibilities, some users say that the only advantage of Bulk Loader is performance with very large batch of rows (upwards of millions). But there ways of countering this.

The only peculiarity i found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have the exact same layout as the table.

answered Nov 13 at 18:02

Cristian Curti

32828

Looking around for possibilities, some users say that the only advantage of Bulk Loader is performance with very large batch of rows (upwards of millions). But there ways of countering this.

The only peculiarity i found with this step and PostgreSQL is that you need to specify the Database Fields in the step, even if the incoming rows have the exact same layout as the table.

answered Nov 13 at 18:02

Cristian Curti

32828

answered Nov 13 at 18:02

Cristian Curti

32828

answered Nov 13 at 18:02

Cristian Curti

32828

answered Nov 13 at 18:02

Cristian Curti

32828

1

I think requirement is for incremental load, not for batch load. please find my answer. do you think it is right approach ?
– KP M
Nov 14 at 15:52

Ah yes, that missed my sight, for this the 'Synchronize After Merge' step is better, i'll edit my answer as soon as i can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well, it's just that Synchronize after merge handles Inserts/Updates/Deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56

add a comment |

1

I think requirement is for incremental load, not for batch load. please find my answer. do you think it is right approach ?
– KP M
Nov 14 at 15:52

Ah yes, that missed my sight, for this the 'Synchronize After Merge' step is better, i'll edit my answer as soon as i can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well, it's just that Synchronize after merge handles Inserts/Updates/Deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56

I think requirement is for incremental load, not for batch load. please find my answer. do you think it is right approach ?
– KP M
Nov 14 at 15:52

Ah yes, that missed my sight, for this the 'Synchronize After Merge' step is better, i'll edit my answer as soon as i can, or you can edit yours if you know how the step works. Otherwise, the Insert/Update step works fine as well, it's just that Synchronize after merge handles Inserts/Updates/Deletes. If deletes are not needed, Insert/Update will work all the same.
– Cristian Curti
Nov 14 at 17:56

add a comment |

you are looking for an incremental load. you can do it in two ways.

There is a step called "Insert/Update" , this will be used to do incremental load.

you will have option to specify key columns to compare. then under fields section select "Y" for update. Please select "N" for those columns you are selecting under key comparison.

Use table output and uncheck "Truncate table" option. While retrieving the data from source table, use variable in where clause. first get the max value from your target table and set this value to a variable and include in the where clause of your query.

Editing here..
if your data source is a flat file, then as I told get the max value(date/int) from target table and join with your data. after that use filter rows to have incremental data.

Hope this will help.

edited Nov 14 at 15:59

answered Nov 14 at 15:51

KP M

8910

add a comment |

you are looking for an incremental load. you can do it in two ways.

There is a step called "Insert/Update" , this will be used to do incremental load.

you will have option to specify key columns to compare. then under fields section select "Y" for update. Please select "N" for those columns you are selecting under key comparison.

Use table output and uncheck "Truncate table" option. While retrieving the data from source table, use variable in where clause. first get the max value from your target table and set this value to a variable and include in the where clause of your query.

Editing here..
if your data source is a flat file, then as I told get the max value(date/int) from target table and join with your data. after that use filter rows to have incremental data.

Hope this will help.

edited Nov 14 at 15:59

answered Nov 14 at 15:51

KP M

8910

add a comment |

you are looking for an incremental load. you can do it in two ways.

There is a step called "Insert/Update" , this will be used to do incremental load.

you will have option to specify key columns to compare. then under fields section select "Y" for update. Please select "N" for those columns you are selecting under key comparison.

Use table output and uncheck "Truncate table" option. While retrieving the data from source table, use variable in where clause. first get the max value from your target table and set this value to a variable and include in the where clause of your query.

Editing here..
if your data source is a flat file, then as I told get the max value(date/int) from target table and join with your data. after that use filter rows to have incremental data.

Hope this will help.

edited Nov 14 at 15:59

answered Nov 14 at 15:51

KP M

8910

you are looking for an incremental load. you can do it in two ways.

There is a step called "Insert/Update" , this will be used to do incremental load.

you will have option to specify key columns to compare. then under fields section select "Y" for update. Please select "N" for those columns you are selecting under key comparison.

Use table output and uncheck "Truncate table" option. While retrieving the data from source table, use variable in where clause. first get the max value from your target table and set this value to a variable and include in the where clause of your query.

Editing here..
if your data source is a flat file, then as I told get the max value(date/int) from target table and join with your data. after that use filter rows to have incremental data.

Hope this will help.

edited Nov 14 at 15:59

answered Nov 14 at 15:51

KP M

8910

edited Nov 14 at 15:59

answered Nov 14 at 15:51

KP M

8910

answered Nov 14 at 15:51

KP M

8910

answered Nov 14 at 15:51

KP M

8910

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb