How to properly do conditional aggregation in Hive










1















Assume I have this data:



player_id stats
100 ["position":"offense","wins":35,"position":"defense","wins":17]
200 ["position":"offense","wins":85,"position":"defense","wins":52]
300 ["position":"offense","wins":12,"position":"defense","wins":98]


And I want to display it as such:



player_id offense_wins defense_wins
100 35 17
200 85 52
300 12 98


The original data above is currently thrown into an ORC table using:



SELECT p.player_id
, s.position
, s.wins
FROM player_stats p
LATERAL VIEW EXPLODE(p.stats) sTable as s


Which gets me:



player_id position wins
100 offense 35
100 defense 17
200 offense 85
200 defense 52
300 offense 12
300 defense 98


Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.



In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.



Here would be the primary way this data is queried:



SELECT player_id
, max(case when position = 'offense' then wins end) as offense_wins
, max(case when position = 'defense' then wins end) as defense_wins
FROM orctable
WHERE player_id = 100
GROUP BY player_id


Which would result in :



player_id offense_wins defense_wins
100 35 17


Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.



The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.



Question 1 - is this the only and/or proper way to transform the data into the desired end result?
Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.










share|improve this question


























    1















    Assume I have this data:



    player_id stats
    100 ["position":"offense","wins":35,"position":"defense","wins":17]
    200 ["position":"offense","wins":85,"position":"defense","wins":52]
    300 ["position":"offense","wins":12,"position":"defense","wins":98]


    And I want to display it as such:



    player_id offense_wins defense_wins
    100 35 17
    200 85 52
    300 12 98


    The original data above is currently thrown into an ORC table using:



    SELECT p.player_id
    , s.position
    , s.wins
    FROM player_stats p
    LATERAL VIEW EXPLODE(p.stats) sTable as s


    Which gets me:



    player_id position wins
    100 offense 35
    100 defense 17
    200 offense 85
    200 defense 52
    300 offense 12
    300 defense 98


    Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.



    In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.



    Here would be the primary way this data is queried:



    SELECT player_id
    , max(case when position = 'offense' then wins end) as offense_wins
    , max(case when position = 'defense' then wins end) as defense_wins
    FROM orctable
    WHERE player_id = 100
    GROUP BY player_id


    Which would result in :



    player_id offense_wins defense_wins
    100 35 17


    Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.



    The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.



    Question 1 - is this the only and/or proper way to transform the data into the desired end result?
    Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.










    share|improve this question
























      1












      1








      1


      0






      Assume I have this data:



      player_id stats
      100 ["position":"offense","wins":35,"position":"defense","wins":17]
      200 ["position":"offense","wins":85,"position":"defense","wins":52]
      300 ["position":"offense","wins":12,"position":"defense","wins":98]


      And I want to display it as such:



      player_id offense_wins defense_wins
      100 35 17
      200 85 52
      300 12 98


      The original data above is currently thrown into an ORC table using:



      SELECT p.player_id
      , s.position
      , s.wins
      FROM player_stats p
      LATERAL VIEW EXPLODE(p.stats) sTable as s


      Which gets me:



      player_id position wins
      100 offense 35
      100 defense 17
      200 offense 85
      200 defense 52
      300 offense 12
      300 defense 98


      Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.



      In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.



      Here would be the primary way this data is queried:



      SELECT player_id
      , max(case when position = 'offense' then wins end) as offense_wins
      , max(case when position = 'defense' then wins end) as defense_wins
      FROM orctable
      WHERE player_id = 100
      GROUP BY player_id


      Which would result in :



      player_id offense_wins defense_wins
      100 35 17


      Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.



      The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.



      Question 1 - is this the only and/or proper way to transform the data into the desired end result?
      Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.










      share|improve this question














      Assume I have this data:



      player_id stats
      100 ["position":"offense","wins":35,"position":"defense","wins":17]
      200 ["position":"offense","wins":85,"position":"defense","wins":52]
      300 ["position":"offense","wins":12,"position":"defense","wins":98]


      And I want to display it as such:



      player_id offense_wins defense_wins
      100 35 17
      200 85 52
      300 12 98


      The original data above is currently thrown into an ORC table using:



      SELECT p.player_id
      , s.position
      , s.wins
      FROM player_stats p
      LATERAL VIEW EXPLODE(p.stats) sTable as s


      Which gets me:



      player_id position wins
      100 offense 35
      100 defense 17
      200 offense 85
      200 defense 52
      300 offense 12
      300 defense 98


      Now from this point in MySQL I can just group_by the player_id then case the position, pulling in the associated wins value when it = 'offense' or 'defense' into their own columns, then wrap each case in a COALESCE() to prevent the nulls from coming through. Super fast.



      In Hive, instead of COALESCE I have to use MIN or MAX, but the result will be the same regardless.



      Here would be the primary way this data is queried:



      SELECT player_id
      , max(case when position = 'offense' then wins end) as offense_wins
      , max(case when position = 'defense' then wins end) as defense_wins
      FROM orctable
      WHERE player_id = 100
      GROUP BY player_id


      Which would result in :



      player_id offense_wins defense_wins
      100 35 17


      Now, in my real world situation the original dataset has six instances of that 'stats' array, each containing a map of 3-5 pairs. Because of this, the ORC table has player_id listed some 700 times from the repeated lateral views.



      The entire table is 300k rows, and the player_id in the real world example is duplicated on this table just over 700 times.



      Question 1 - is this the only and/or proper way to transform the data into the desired end result?
      Question 2 - should this query be taking between 5 and 10 seconds to complete? The same dataset on a small MySQL server would do this in milliseconds.







      hive aggregation






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 13 '18 at 21:51









      squirrelchewsquirrelchew

      186




      186






















          0






          active

          oldest

          votes











          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53290055%2fhow-to-properly-do-conditional-aggregation-in-hive%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          0






          active

          oldest

          votes








          0






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53290055%2fhow-to-properly-do-conditional-aggregation-in-hive%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Darth Vader #20

          How to how show current date and time by default on contact form 7 in WordPress without taking input from user in datetimepicker

          Ondo