How to properly do conditional aggregation in Hive

Assume I have this data:

player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]

And I want to display it as such:

player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98

The original data above is currently thrown into an ORC table using:

SELECT p.player_id
 , s.position
 , s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s

Which gets me:

player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98

Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.

In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.

Here would be the primary way this data is queried:

SELECT player_id
 , max(case when position = 'offense' then wins end) as offense_wins
 , max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id

Which would result in :

player_id offense_wins defense_wins
100 35 17

Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.

The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.

Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.

asked Nov 13 '18 at 21:51

squirrelchew

186

add a comment |

Assume I have this data:

player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]

And I want to display it as such:

player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98

The original data above is currently thrown into an ORC table using:

SELECT p.player_id
 , s.position
 , s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s

Which gets me:

player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98

In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.

Here would be the primary way this data is queried:

SELECT player_id
 , max(case when position = 'offense' then wins end) as offense_wins
 , max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id

Which would result in :

player_id offense_wins defense_wins
100 35 17

The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.

asked Nov 13 '18 at 21:51

squirrelchew

186

add a comment |

Assume I have this data:

player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]

And I want to display it as such:

player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98

The original data above is currently thrown into an ORC table using:

SELECT p.player_id
 , s.position
 , s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s

Which gets me:

player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98

In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.

Here would be the primary way this data is queried:

SELECT player_id
 , max(case when position = 'offense' then wins end) as offense_wins
 , max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id

Which would result in :

player_id offense_wins defense_wins
100 35 17

The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.

asked Nov 13 '18 at 21:51

squirrelchew

186

Assume I have this data:

player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]

And I want to display it as such:

player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98

The original data above is currently thrown into an ORC table using:

SELECT p.player_id
 , s.position
 , s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s

Which gets me:

player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98

In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.

Here would be the primary way this data is queried:

SELECT player_id
 , max(case when position = 'offense' then wins end) as offense_wins
 , max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id

Which would result in :

player_id offense_wins defense_wins
100 35 17

The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.

hive aggregation

asked Nov 13 '18 at 21:51

squirrelchew

186

asked Nov 13 '18 at 21:51

squirrelchew

186

asked Nov 13 '18 at 21:51

squirrelchew

186

asked Nov 13 '18 at 21:51

squirrelchew

186

asked Nov 13 '18 at 21:51

squirrelchew

186

add a comment |

0

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53290055%2fhow-to-properly-do-conditional-aggregation-in-hive%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb