How to use subprocess.run() to run a Hive query?









So I'm trying to execute a Hive query using the subprocess module and save the output into a file data.txt, as well as the logs (into log.txt), but I seem to be having a bit of trouble. I've looked at this gist as well as this SO question, but neither seems to give me what I need.



Here's what I'm running:



import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive", "-e", '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)

log_buff.close()
data_buff.close()


I've also looked into this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I'd like the process to block until finished.
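For context, here's a minimal sketch of the blocking behavior I'm expecting from subprocess.run() (a trivial echo command stands in for hive here, and the file handles mirror the ones above):

import subprocess

# subprocess.run() starts the child and waits for it to exit before returning,
# so the CompletedProcess it hands back already has the final exit code.
with open("data.txt", "w") as out, open("log.txt", "a") as log:
    result = subprocess.run(["echo", "hello"], stdout=out, stderr=log)

print(result.returncode)  # 0 on success; run() only returns after the child exits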



The final output should be a file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the hive job. Any help would be wonderful.



Update:



With the above code, I'm currently getting the following output:



log.txt



[ralston@tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties


data.txt



[ralston@tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston@tpsci-gw01-vm tmp]$


And I can verify the java/hive process did run:



[ralston@tpsci-gw01-vm tmp]$ ps -u ralston
  PID TTY          TIME CMD
14096 pts/0    00:00:00 hive
14141 pts/0    00:00:07 java
14259 pts/0    00:00:00 ps
16275 ?        00:00:00 sshd
16276 pts/0    00:00:00 bash


But it looks like it's not finishing and not logging everything that I'd like.










hive subprocess python-3.6

edited Nov 9 at 22:09
asked Nov 9 at 21:40
ralston3


1 Answer

          So I managed to get this working with the following setup:



import subprocess
import time

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"

log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")

# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive", "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)

# keep track of job runtime and set a limit (in minutes)
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= limit:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue

print("done")

log_buff.close()
data_buff.close()


This produced the output I needed. I knew about proc.communicate(), but it previously didn't work; I believe the issue was related to not adding an output file with > $outfile to the hive query.



Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I might be doing something wrong.
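One thing I'm unsure about: without shell=True, I believe the ">" and the filename in the argument list are handed to hive as literal arguments rather than being treated as shell redirection, so it may actually be the stdout=data_buff handle that's capturing the results. For comparison, here's a simpler sketch of what I think an equivalent blocking version looks like, using subprocess.run() with its built-in timeout (assuming hive is on the PATH; the 3600-second limit mirrors the 60-minute cap above):

import subprocess

query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"

# run() blocks until hive exits, writing query results to data.txt and the
# Hive/Hadoop logging (which goes to stderr) to log.txt. If the job runs
# longer than an hour, run() kills the child and raises subprocess.TimeoutExpired.
with open("data.txt", "w") as data_buff, open("log.txt", "a") as log_buff:
    result = subprocess.run(["hive", "-e", query],
                            stdout=data_buff,
                            stderr=log_buff,
                            timeout=3600)

print("hive exited with code", result.returncode)

Note that without a shell involved, the query string shouldn't need the extra layer of quotes around it.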






answered Nov 9 at 22:21
ralston3