How to use subprocess.run() to run a Hive query?









So I'm trying to execute a Hive query using the subprocess module and save the output into a file data.txt, as well as the logs (into log.txt), but I seem to be having a bit of trouble. I've looked at this gist as well as this SO question, but neither seems to give me what I need.



Here's what I'm running:



import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive", "-e", '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)

log_buff.close()
data_buff.close()


I've also looked into this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I'd like the process to block until finished.
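For context, here's a minimal sketch of the blocking behavior I'm expecting from subprocess.run() (a trivial echo command stands in for hive here, and the file handles mirror the ones above):

import subprocess

# subprocess.run() starts the child and waits for it to exit before returning,
# so the CompletedProcess it hands back already has the final exit code.
with open("data.txt", "w") as out, open("log.txt", "a") as log:
    result = subprocess.run(["echo", "hello"], stdout=out, stderr=log)

print(result.returncode)  # 0 on success; run() only returns after the child exits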



The final output should be a file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the hive job. Any help would be wonderful.



Update:



With the above code, I'm currently getting the following output:



log.txt



[ralston@tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties


data.txt



[ralston@tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston@tpsci-gw01-vm tmp]$


And I can verify the java/hive process did run:



[ralston@tpsci-gw01-vm tmp]$ ps -u ralston
  PID TTY          TIME CMD
14096 pts/0    00:00:00 hive
14141 pts/0    00:00:07 java
14259 pts/0    00:00:00 ps
16275 ?        00:00:00 sshd
16276 pts/0    00:00:00 bash


But it looks like it's not finishing and not logging everything that I'd like.










hive subprocess python-3.6

edited Nov 9 at 22:09
asked Nov 9 at 21:40
ralston3


1 Answer

          So I managed to get this working with the following setup:



import subprocess
import time

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive", "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)

# keep track of job runtime and set a limit (in minutes)
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= limit:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue

print("done")

log_buff.close()
data_buff.close()


This produced the output I needed. I knew about proc.communicate(), but it previously didn't work; I believe the issue was related to not adding an output file with > $outfile to the hive query.



Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I might be doing something wrong.
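One thing I'm unsure about: without shell=True, I believe the ">" and the filename in the argument list are handed to hive as literal arguments rather than being treated as shell redirection, so it may actually be the stdout=data_buff handle that's capturing the results. For comparison, here's a simpler sketch of what I think an equivalent blocking version looks like, using subprocess.run() with its built-in timeout (assuming hive is on the PATH; the 3600-second limit mirrors the 60-minute cap above):

import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"

# run() blocks until hive exits, writing query results to data.txt and the
# Hive/Hadoop logging (which goes to stderr) to log.txt. If the job runs
# longer than an hour, run() kills the child and raises subprocess.TimeoutExpired.
with open("data.txt", "w") as data_buff, open("log.txt", "a") as log_buff:
    result = subprocess.run(["hive", "-e", query],
                            stdout=data_buff,
                            stderr=log_buff,
                            timeout=3600)

print("hive exited with code", result.returncode)

Note that without a shell involved, the query string shouldn't need the extra layer of quotes around it.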






answered Nov 9 at 22:21
ralston3