Struggling to handle deduplication after aggregation in Spark Streaming


























1. Streaming data is coming from Kafka.
2. I am consuming it through Spark Streaming.
3. Each record contains firstname, lastname, userid and member names; I use the member names to compute the member count.
   E.g. mark,tyson,2,chris,lisa,iwanka: here the member count is 3.
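To make the record layout concrete, here is a minimal sketch in plain Scala (no Spark) of how the member count could be derived from one line. It assumes the layout firstname,lastname,userid followed by any number of member names, which is only inferred from the example above; `UserRecord` and `RecordParser` are hypothetical names used for illustration.

```scala
import scala.util.Try

// Hypothetical record type; the field layout is an assumption based on the example line.
final case class UserRecord(firstname: String, lastname: String, userid: Int, memberCount: Int)

object RecordParser {
  // Parses "firstname,lastname,userid,member1,member2,...".
  // Returns None when there are fewer than three fields or the userid is not an integer.
  def parse(line: String): Option[UserRecord] = {
    val fields = line.split(",").map(_.trim).toList
    fields match {
      case first :: last :: id :: members =>
        // Every non-empty field after the userid is counted as one member.
        Try(id.toInt).toOption.map { uid =>
          UserRecord(first, last, uid, members.count(_.nonEmpty))
        }
      case _ => None
    }
  }
}
```

With the example line, `RecordParser.parse("mark,tyson,2,chris,lisa,iwanka")` yields a record with userid 2 and a member count of 3.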



I have to do the count; it is a requirement. But how can I remove duplicates after the aggregation? That is my concern.



    val df2 = df.select("firstname", "lastname", "membercount", "userid")
    df2.writeStream.format("console").start().awaitTermination()

or

    df3.select("*").where("membercount >= 3").dropDuplicates("userid")

The second version is not working, but I need to do the same thing after the count, so that the same userid does not come up again in later batches. I only want the first entry for each userid.


Batch 1 output:

    firstname  lastname  member-count       userid
    john       smith     5                  1
    mark       boucher   8                  2
    shawn      pollock   3                  3

Batch 2 output:

    firstname  lastname  member-count       userid
    john       smith     7 (prev. count 5)  1
    shawn      pollock   12 (prev. count 8) 3
    chris      jordan    6                  4



But this is the output I want for batch 2:

The counts for john smith and shawn pollock may increase again in later batches, but I don't want to show or keep them in the output of those batches. That is, based on userid, I want each user to appear only once in the batch output, and the same user should be ignored in every later batch.

    firstname  lastname  member-count       userid
    chris      jordan    6                  4
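The behaviour asked for can be sketched in plain Scala, outside Spark, to pin down the semantics: remember every userid that has already been emitted and let a row through only the first time its userid meets the threshold. `UserCount` and `FirstSeenFilter` are hypothetical names, and the in-memory set is an assumption made for illustration; in a real streaming job this state would have to be kept by the engine or by an external store so that it survives restarts.

```scala
import scala.collection.mutable

// Hypothetical row type mirroring the columns shown in the batch output.
final case class UserCount(firstname: String, lastname: String, memberCount: Long, userid: Int)

// Keeps the set of userids that were already emitted in earlier batches.
final class FirstSeenFilter(minCount: Long) {
  private val emitted = mutable.Set.empty[Int]

  // Returns the rows that reach the threshold and whose userid was never emitted before.
  // mutable.Set.add returns true only when the element was not yet in the set,
  // so a userid is recorded and let through exactly once.
  def apply(batch: Seq[UserCount]): Seq[UserCount] =
    batch.filter(row => row.memberCount >= minCount && emitted.add(row.userid))
}
```

Applied to the two batches above with a threshold of 3, the first call returns all three rows of batch 1, and the second call returns only the row for chris jordan, because userids 1 and 3 were already emitted.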

















      scala apache-spark duplicates spark-streaming spark-structured-streaming














      edited Dec 31 '18 at 5:42







      stackoverflow

















      asked Nov 15 '18 at 15:09









      stackoverflow






















          1 Answer
































Your question is hard to read, but as I understand it, you want a while loop with a condition?

    var a = 10
    while (a < 20) {
      println("Value of a: " + a)
      a = a + 1
    }

For example, this will print:

    Value of a: 10
    Value of a: 11
    Value of a: 12
    Value of a: 13
    Value of a: 14
    Value of a: 15
    Value of a: 16
    Value of a: 17
    Value of a: 18
    Value of a: 19














                answered Nov 15 '18 at 15:14









                elken




























