How to use subprocess.run() to run Hive query?
So I'm trying to execute a Hive query using the subprocess module and save the output into a file data.txt, as well as the logs (into log.txt), but I seem to be having a bit of trouble. I've looked at this gist as well as this SO question, but neither seems to give me what I need.
Here's what I'm running:
import subprocess
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# note - "hive -e [query]" would normally just print all the results
# to the console after finishing
proc = subprocess.run(["hive", "-e", '"{}"'.format(query)],
                      stdin=subprocess.PIPE,
                      stdout=data_buff,
                      stderr=log_buff,
                      shell=True)
log_buff.close()
data_buff.close()
I've also looked into this SO question regarding subprocess.run() vs subprocess.Popen, and I believe I want .run() because I'd like the process to block until it's finished.
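As a point of reference, here's a minimal sketch of the kind of blocking call I have in mind, without shell=True (as far as I understand, when shell=True is combined with a list of arguments, only the first element is treated as the command and the rest become arguments to the shell itself, so the query may never reach hive at all):
import subprocess
# Sketch only: run hive directly with an argument list; run() blocks
# until hive exits, results land in data.txt and logging in log.txt.
# "query" is the query string defined above.
with open("log.txt", "a") as log_buff, open("data.txt", "w") as data_buff:
    proc = subprocess.run(["hive", "-e", query],
                          stdout=data_buff,
                          stderr=log_buff)
print("hive exited with code", proc.returncode)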
The final output should be a file data.txt with the tab-delimited results of the query, and log.txt with all of the logging produced by the Hive job. Any help would be wonderful.
Update:
With the above code, I'm currently getting the following output:
log.txt
[ralston@tpsci-gw01-vm tmp]$ cat log.txt
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/y/share/hadoop-2.8.3.0.1802131730/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/y/libexec/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Logging initialized using configuration in file:/home/y/libexec/hive/conf/hive-log4j.properties
data.txt
[ralston@tpsci-gw01-vm tmp]$ cat data.txt
hive> [ralston@tpsci-gw01-vm tmp]$
And I can verify the java/hive process did run:
[ralston@tpsci-gw01-vm tmp]$ ps -u ralston
PID TTY TIME CMD
14096 pts/0 00:00:00 hive
14141 pts/0 00:00:07 java
14259 pts/0 00:00:00 ps
16275 ? 00:00:00 sshd
16276 pts/0 00:00:00 bash
But it looks like it's not finishing and not logging everything that I'd like.
hive subprocess python-3.6
edited Nov 9 at 22:09
asked Nov 9 at 21:40
ralston3
1 Answer
So I managed to get this working with the following setup:
import subprocess
import time
query = "select user, sum(revenue) as revenue from my_table where user = 'dave' group by user;"
outfile = "data.txt"
logfile = "log.txt"
log_buff = open("log.txt", "a")
data_buff = open("data.txt", "w")
# Remove shell=True from proc, and add "> outfile.txt" to the command
proc = subprocess.Popen(["hive", "-e", '"{}"'.format(query), ">", "{}".format(outfile)],
                        stdin=subprocess.PIPE,
                        stdout=data_buff,
                        stderr=log_buff)
# keep track of job runtime and set limit
start, elapsed, finished, limit = time.time(), 0, False, 60
while not finished:
    try:
        outs, errs = proc.communicate(timeout=10)
        print("job finished")
        finished = True
    except subprocess.TimeoutExpired:
        elapsed = abs(time.time() - start) / 60.
        if elapsed >= limit:
            print("Job took over 60 mins")
            break
        print("Comm timed out. Continuing")
        continue
print("done")
log_buff.close()
data_buff.close()
This produced the output as needed. I knew about proc.communicate(), but it previously didn't work. I believe the issue was related to not adding an output file with > $outfile to the hive query.
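One caveat I'm not certain about: with a plain argument list and no shell=True, the ">" and the filename are passed to hive as literal arguments rather than being interpreted as a shell redirection. A variant that performs a real redirection would be a single command string with shell=True, sketched below (assumes hive is on the PATH):
# Sketch only: let an actual shell handle the > redirection, so
# stdout= is no longer needed; stderr still goes to the log file.
proc = subprocess.Popen('hive -e "{}" > {}'.format(query, outfile),
                        stderr=log_buff,
                        shell=True)
proc.communicate()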
Feel free to add any details. I've never seen anyone have to loop over proc.communicate(), so I suspect I might be doing something wrong.
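If the loop turns out to be unnecessary, a simpler sketch would be to go back to subprocess.run() with a single overall timeout, since it blocks until the child exits, and on expiry kills the child and raises subprocess.TimeoutExpired on its own (the 3600-second value is just an assumption matching the 60-minute cap above):
import subprocess
# Sketch only: one blocking call with an overall timeout instead of
# polling communicate() in a loop.
with open("log.txt", "a") as log_buff, open("data.txt", "w") as data_buff:
    try:
        proc = subprocess.run(["hive", "-e", query],
                              stdout=data_buff,
                              stderr=log_buff,
                              timeout=3600)  # give up after 60 minutes
        print("job finished with code", proc.returncode)
    except subprocess.TimeoutExpired:
        print("Job took over 60 mins")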
answered Nov 9 at 22:21
ralston3