Storing raw time-series data post-creation

I am designing a system that, among other things, will store time-series data (until it expires under some retention policy).
The data comes from different sensors that sample vibrations at very high frequencies (10+ kHz). The system is meant for analytics, so it will be extremely read-heavy, especially for range queries. Inserts will be rare: for each event, another middleware system records the metrics and saves them to separate .csv files, which are produced only after the event ends (so there is no possibility of "online" ingestion).
Right now I am a little bit confused.
I first tried a NoSQL solution (Cassandra), using Pentaho Data Integration as the ETL tool (I also have to enrich each record/row with additional fields such as timestamp, sensor ID, etc.).
For 1000M rows, the estimated completion time was 32 (thirty-two!) hours.
I am sure I misconfigured something while transforming the data, but I would still bet that ETL + Cassandra is not the best fit for my problem (and reads would probably suffer from high latency anyway).
Cassandra shines when the insertion rate is very high, and that is not my case.
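To make the ingestion concrete, here is a minimal sketch (not my actual PDI transformation) of the kind of load I am attempting, written with the DataStax Python driver; the keyspace, table, column names and the two-column CSV layout are placeholders. The point is only to show the per-row enrichment (sensor/event metadata) that the ETL has to perform.

    # Rough sketch of the bulk load; all names and the CSV layout are hypothetical.
    import csv
    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("vibrations")   # hypothetical keyspace

    insert = session.prepare(
        "INSERT INTO raw_samples (sensor_id, event_id, ts, value) "
        "VALUES (?, ?, ?, ?)"
    )

    def load_csv(path, sensor_id, event_id):
        """Enrich every raw CSV row with per-event metadata and write it."""
        with open(path, newline="") as f:
            rows = [(sensor_id, event_id, int(ts), float(value))
                    for ts, value in csv.reader(f)]
        # Concurrent execution of a prepared statement is far faster than
        # issuing one synchronous INSERT per row.
        execute_concurrent_with_args(session, insert, rows, concurrency=200)

    load_csv("event_0001_sensorA.csv", "sensorA", "event_0001")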
What system should I prefer, then?
Just a quick recap of my constraints:
- Large amount of data to store (100M rows per day)
- ETL tool needed (preferably open source)
- High read rate
- Low insertion rate (but large bulk loads, arriving during no-event periods)
- Must perform well on a single node, not only when scaled out (I have only one node available for this system)
- Strong data aging/retention policies (a sketch follows this list)
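To make the retention constraint concrete, here is a hedged sketch of how data aging could be expressed in the Cassandra schema I already experimented with (a default TTL plus time-window compaction); the 30-day window and all names are purely illustrative:

    # Illustrative only: retention via table-level TTL and TWCS in Cassandra.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("vibrations")

    session.execute("""
        CREATE TABLE IF NOT EXISTS raw_samples (
            sensor_id text,
            event_id  text,
            ts        bigint,
            value     double,
            PRIMARY KEY ((sensor_id, event_id), ts)
        ) WITH default_time_to_live = 2592000   -- rows expire after 30 days
          AND compaction = {
              'class': 'TimeWindowCompactionStrategy',
              'compaction_window_unit': 'DAYS',
              'compaction_window_size': 1
          }
    """)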
EDIT FOR THE FUNNY GUY WHO DOWNVOTED MY QUESTION
Dude, I certainly appreciate advice and corrections, but please: don't just downvote questions without saying why!
database performance cassandra time-series etl
asked 19 hours ago by LucaF (edited 15 hours ago)
What exactly did you try with PDI? Could you share the KTRs/KJBs without exposing sensitive data? That could help a lot.
– Cristian Curti
14 hours ago