How to properly do conditional aggregation in Hive
Assume I have this data:
player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]
And I want to display it as such:
player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98
The original data above is currently thrown into an ORC table using:
SELECT p.player_id
, s.position
, s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s
Which gets me:
player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98
Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.
In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.
Here would be the primary way this data is queried:
SELECT player_id
, max(case when position = 'offense' then wins end) as offense_wins
, max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id
Which would result in :
player_id offense_wins defense_wins
100 35 17
Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.
The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.
Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.
hive aggregation
add a comment |
Assume I have this data:
player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]
And I want to display it as such:
player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98
The original data above is currently thrown into an ORC table using:
SELECT p.player_id
, s.position
, s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s
Which gets me:
player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98
Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.
In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.
Here would be the primary way this data is queried:
SELECT player_id
, max(case when position = 'offense' then wins end) as offense_wins
, max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id
Which would result in :
player_id offense_wins defense_wins
100 35 17
Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.
The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.
Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.
hive aggregation
add a comment |
Assume I have this data:
player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]
And I want to display it as such:
player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98
The original data above is currently thrown into an ORC table using:
SELECT p.player_id
, s.position
, s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s
Which gets me:
player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98
Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.
In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.
Here would be the primary way this data is queried:
SELECT player_id
, max(case when position = 'offense' then wins end) as offense_wins
, max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id
Which would result in :
player_id offense_wins defense_wins
100 35 17
Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.
The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.
Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.
hive aggregation
Assume I have this data:
player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]
And I want to display it as such:
player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98
The original data above is currently thrown into an ORC table using:
SELECT p.player_id
, s.position
, s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s
Which gets me:
player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98
Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.
In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.
Here would be the primary way this data is queried:
SELECT player_id
, max(case when position = 'offense' then wins end) as offense_wins
, max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id
Which would result in :
player_id offense_wins defense_wins
100 35 17
Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.
The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.
Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.
hive aggregation
hive aggregation
asked Nov 13 '18 at 21:51
squirrelchewsquirrelchew
186
186
add a comment |
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53290055%2fhow-to-properly-do-conditional-aggregation-in-hive%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53290055%2fhow-to-properly-do-conditional-aggregation-in-hive%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown