Raw Time Series Data storing POST CREATION
I am designing a system that, among other things, will store time-series data (until it expires under some retention policy).



Data comes from different sensors that sample vibrations at very high frequencies (10+ kHz). My system is designed for analytics purposes, so reads will heavily dominate, especially range queries. Inserts will be rare: for each event, another middleware system records metrics and saves them to separate .csv files, produced as output only after the event ends (so there is no possibility of "online" data ingestion).



Right now I am a little confused.



I first tried a NoSQL solution (Cassandra), using Pentaho DI as the ETL tool (I also have to add some extra info to each record/row, such as a timestamp and a sensor ID).
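To make that enrichment step concrete, here is a minimal sketch of what the transformation does, independent of PDI: each raw sample from an event's .csv gets the sensor ID and event timestamp appended before loading. The function name, file layout, and metadata values are hypothetical, not something specified above.

```python
import csv
import io

def enrich_rows(csv_text, sensor_id, event_ts):
    """Append sensor ID and event timestamp to every raw sample row.

    csv_text: raw event file contents (one vibration sample per line).
    sensor_id, event_ts: metadata known at load time (hypothetical --
    in practice these might come from the middleware's file naming).
    """
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        yield row + [sensor_id, event_ts]

# Tiny demo with fake data
raw = "0.0021\n0.0034\n-0.0012\n"
rows = list(enrich_rows(raw, "sensor-07", "2018-11-09T10:00:00Z"))
print(rows[0])  # ['0.0021', 'sensor-07', '2018-11-09T10:00:00Z']
```

Doing this row-by-row in the ETL layer is exactly the kind of per-record work that can dominate load time, so it is worth profiling separately from the database writes.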



For 1000M rows, the expected finish time was 32 (thirty-two) hours.



I am sure I messed something up while transforming the data, but I am nevertheless ready to bet that ETL + Cassandra is not the best fit for my problem (probably high-latency queries on the read side, too).



Cassandra is the best bet when the insertion rate is very high, and that is not my case.



What system should I prefer, then?



Just a quick recap of my constraints:



  • Large amount of data to store (100M rows per day)

  • An ETL tool is needed (preferably open source)

  • High read rate

  • Low insertion rate (but in large batches, loaded during no-event periods)

  • Not "powerful only when scaled" (I have only a single node available for this system)

  • Strong data aging/retention policies

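The last constraint can be expressed independently of the storage engine. Here is a minimal in-memory sketch of day-partitioned storage with a drop-old-partitions policy; the partition-by-day scheme and the class/field names are my assumptions for illustration, not part of the question.

```python
from datetime import date, timedelta

class DayPartitionedStore:
    """Toy store: rows grouped by day, old days dropped wholesale.

    Dropping an entire partition is one cheap dict delete, instead of
    row-by-row deletion -- the same idea that TTL/partition-drop
    features in real time-series databases rely on.
    """
    def __init__(self, retention_days):
        self.retention_days = retention_days
        self.partitions = {}  # date -> list of rows

    def insert(self, day, row):
        self.partitions.setdefault(day, []).append(row)

    def expire(self, today):
        cutoff = today - timedelta(days=self.retention_days)
        for day in [d for d in self.partitions if d < cutoff]:
            del self.partitions[day]

store = DayPartitionedStore(retention_days=7)
store.insert(date(2018, 11, 1), {"sensor": "s1", "value": 0.002})
store.insert(date(2018, 11, 9), {"sensor": "s1", "value": 0.003})
store.expire(today=date(2018, 11, 9))
print(sorted(store.partitions))  # only days within the 7-day window remain
```

Whatever system you pick, check that it supports this kind of cheap whole-partition (or TTL-based) expiry rather than forcing bulk deletes.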
EDIT, for whoever downvoted my question:



I certainly appreciate advice and corrections, but please don't just downvote a question without saying why!
  • What exactly did you try with PDI? Could you share the KTRs/KJBs without exposing sensitive data? That could help a lot.
    – Cristian Curti
    14 hours ago
Tags: database, performance, cassandra, time-series, etl
edited 15 hours ago
asked 19 hours ago









LucaF (32)