Parquet binary UTF8 as string on hive
There is a parquet file with a binary (UTF8)
column named abc
in it.
Is it possible to create an external
table on hive that will contain the same column abc
, but casted to string
?
The structure of the parquet file:
$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);
hive parquet sqldatatypes
add a comment |
There is a parquet file with a binary (UTF8)
column named abc
in it.
Is it possible to create an external
table on hive that will contain the same column abc
, but casted to string
?
The structure of the parquet file:
$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);
hive parquet sqldatatypes
Tryselect cast(abc as string) from mytable
or similar? There's alsodecode
function orbase64
. If you explain what is inside binary data that would be more useful.
– Hitobat
Nov 12 '18 at 18:24
@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40
Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00
add a comment |
There is a parquet file with a binary (UTF8)
column named abc
in it.
Is it possible to create an external
table on hive that will contain the same column abc
, but casted to string
?
The structure of the parquet file:
$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);
hive parquet sqldatatypes
There is a parquet file with a binary (UTF8)
column named abc
in it.
Is it possible to create an external
table on hive that will contain the same column abc
, but casted to string
?
The structure of the parquet file:
$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema
optional binary abc (UTF8);
hive parquet sqldatatypes
hive parquet sqldatatypes
edited Nov 13 '18 at 14:23
dmigo
asked Nov 12 '18 at 18:06
dmigodmigo
5571036
5571036
Tryselect cast(abc as string) from mytable
or similar? There's alsodecode
function orbase64
. If you explain what is inside binary data that would be more useful.
– Hitobat
Nov 12 '18 at 18:24
@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40
Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00
add a comment |
Tryselect cast(abc as string) from mytable
or similar? There's alsodecode
function orbase64
. If you explain what is inside binary data that would be more useful.
– Hitobat
Nov 12 '18 at 18:24
@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40
Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00
Try
select cast(abc as string) from mytable
or similar? There's also decode
function or base64
. If you explain what is inside binary data that would be more useful.– Hitobat
Nov 12 '18 at 18:24
Try
select cast(abc as string) from mytable
or similar? There's also decode
function or base64
. If you explain what is inside binary data that would be more useful.– Hitobat
Nov 12 '18 at 18:24
@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40
@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40
Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00
Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00
add a comment |
2 Answers
2
active
oldest
votes
There are three different types involved:
- There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.
- Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:
- The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
- The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.
The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.
add a comment |
Apparently, you can simply specify string
as the type of the column and it will be taken care of.
CREATE EXTERNAL TABLE `dataset`(
`abc` string)
STORED AS parquet
LOCATION
'...';
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53267746%2fparquet-binary-utf8-as-string-on-hive%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
There are three different types involved:
- There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.
- Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:
- The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
- The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.
The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.
add a comment |
There are three different types involved:
- There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.
- Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:
- The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
- The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.
The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.
add a comment |
There are three different types involved:
- There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.
- Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:
- The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
- The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.
The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.
There are three different types involved:
- There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or partition in case it's a partitioned table) consist of multiple files that must be of the same file format, for example PLAINTEXT, AVRO or PARQUET.
- Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:
- The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
- The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.
The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.
edited Nov 14 '18 at 18:33
answered Nov 14 '18 at 17:32
ZoltanZoltan
1,164313
1,164313
add a comment |
add a comment |
Apparently, you can simply specify string
as the type of the column and it will be taken care of.
CREATE EXTERNAL TABLE `dataset`(
`abc` string)
STORED AS parquet
LOCATION
'...';
add a comment |
Apparently, you can simply specify string
as the type of the column and it will be taken care of.
CREATE EXTERNAL TABLE `dataset`(
`abc` string)
STORED AS parquet
LOCATION
'...';
add a comment |
Apparently, you can simply specify string
as the type of the column and it will be taken care of.
CREATE EXTERNAL TABLE `dataset`(
`abc` string)
STORED AS parquet
LOCATION
'...';
Apparently, you can simply specify string
as the type of the column and it will be taken care of.
CREATE EXTERNAL TABLE `dataset`(
`abc` string)
STORED AS parquet
LOCATION
'...';
edited Nov 13 '18 at 14:36
answered Nov 13 '18 at 14:24
dmigodmigo
5571036
5571036
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53267746%2fparquet-binary-utf8-as-string-on-hive%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Try
select cast(abc as string) from mytable
or similar? There's alsodecode
function orbase64
. If you explain what is inside binary data that would be more useful.– Hitobat
Nov 12 '18 at 18:24
@Hitobat I want to remove the burden of casting types from the users. Do you think it might be possible?
– dmigo
Nov 13 '18 at 12:40
Oh, I understand what you mean now. In this case, you can have an external-table to read the data, and then define a view based on the table for your users to query from.
– Hitobat
Nov 13 '18 at 14:00