Raw time series data storage after creation
I am designing a system that, among other things, will store Time Series data (until they somehow expire).
Data comes from different sensors that read vibrations at very high frequencies (10+ kHz). The system is designed for analytics, so reads will heavily dominate, especially range queries. Inserts will be rare: for each event, another middleware system records the metrics and saves them to separate .csv files, produced as output only after the event ends (so there is no possibility of "online" data ingestion).
Right now I am a little bit confused.
I first tried a NoSQL solution (Cassandra), using Pentaho DI as the ETL tool (I also have to add some extra info to each record/row, such as timestamp, sensor, etc.).
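To make the enrichment step concrete, here is a minimal Python sketch of what the transform has to do per row; the file contents, sensor ID, and event timestamp are made-up examples, not the actual data:

```python
import csv
import io

def enrich_rows(csv_text, sensor_id, event_ts):
    """Stream rows from one event's CSV, appending sensor id and event timestamp."""
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        # Each output row carries the metadata needed for later range queries.
        yield row + [sensor_id, event_ts]

# Hypothetical two-row vibration CSV produced by the middleware.
raw = "0.001,0.12\n0.002,0.15\n"
rows = list(enrich_rows(raw, "sensor-07", "2018-11-09T10:00:00Z"))
```

Streaming row by row like this keeps memory flat regardless of file size, which matters at 100M+ rows per day.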
For 1000M rows, the expected finish time was 32 (thirty-two) hours.
I am sure I messed something up while transforming the data, but I am still ready to bet that ETL + Cassandra is not the best fit for my problem (probably high-latency queries when reading).
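For scale, the quoted load time works out to a throughput in the single-digit thousands of rows per second, which is a level typical of row-at-a-time inserts rather than bulk loading. The arithmetic:

```python
# Back-of-the-envelope: 1000M rows in 32 hours.
rows = 1_000_000_000
seconds = 32 * 3600      # 115200 seconds
rate = rows / seconds    # roughly 8700 rows/s
```

A rate this low usually points at the pipeline (per-row round trips, no batching) rather than the storage engine itself.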
Cassandra is the best bet when the insertion rate is very high, and that is not my case.
What system should I prefer, then?
Just a quick recap of my constraints:
- High volume of data to be stored (100M rows per day)
- An ETL tool is needed (preferably open source)
- High read rate
- Low insertion rate (but large bulk loads during no-event periods)
- Not "powerful only when scaled" (I have only a single node available for this system)
- Strong data aging/retention policies
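As a sanity check on those constraints, a single-node, read-heavy workload with rare bulk inserts and time-based retention can be sketched against even a plain embedded store; this is not a recommendation, just a minimal illustration using SQLite with an index on the timestamp. The table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts REAL, sensor TEXT, value REAL)")
# An index on ts makes the read-heavy range queries cheap on a single node.
conn.execute("CREATE INDEX idx_ts ON readings (ts)")

# Rare, bulk insert: load a whole event's rows in one batch.
batch = [(t / 10000.0, "sensor-07", 0.1 * t) for t in range(1000)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", batch)

# Range query over a time window.
hits = conn.execute(
    "SELECT COUNT(*) FROM readings WHERE ts >= ? AND ts < ?", (0.01, 0.02)
).fetchone()[0]

# Retention: age out everything older than a cutoff timestamp.
conn.execute("DELETE FROM readings WHERE ts < ?", (0.05,))
remaining = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
```

At real scale, partitioning by day (one table or file per day) makes retention a cheap drop of whole partitions instead of a large DELETE.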
EDIT FOR WHOEVER DOWNVOTED MY QUESTION
I certainly appreciate advice and corrections, but please don't just downvote questions without saying why!
database performance cassandra time-series etl
What exactly did you try with PDI? Could you share the KTRs/KJBs without exposing sensitive data? That could help a lot.
– Cristian Curti
14 hours ago
edited 15 hours ago
asked 19 hours ago
LucaF