Create float from exponent and significand
Given integers exp and 0<=sig<2^52, how can I create the float64 with exp as exponent and whose significand bits are the same as the binary representation of sig (in Go)?
go floating-point binary
asked Nov 13 '18 at 8:32
Ted
1 Answer
The IEEE 754 standard defines the floating-point arithmetic that Go uses for its floating-point types float32 and float64 (as do most other languages).
Since your significand may be up to 52 bits wide, it only fits in a float64, whose fraction field is exactly 52 bits (a float32 has only 23).
The memory layout (bits) of a float64 value is described in Double-precision floating-point format.
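The bits of a float64 value are laid out as follows (figure from Wikipedia, not reproduced here): 1 sign bit (bit 63), 11 exponent bits (bits 62 to 52), and 52 fraction bits (bits 51 to 0).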
You say you have the exponent value and the significand (that is, the fraction part).
You can use simple bitwise arithmetic to construct the 64-bit pattern of the floating-point number like this:
bits := exp<<52 | sig
(Note: exp and sig should be of type uint64. If not, use a type conversion.)
Once you have the bits, you may use the math.Float64frombits() function to get it as a float64 value:
f := math.Float64frombits(bits)
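The two steps above can be put together as a minimal sketch. The helper name fromFields is hypothetical, and the sketch assumes that the exponent passed in is the raw 11-bit field as stored in memory and that the sign bit is 0. The masks are an extra precaution, not part of the one-liner above.

```go
package main

import (
	"fmt"
	"math"
)

// fromFields assembles a float64 from raw IEEE 754 fields.
// biasedExp is the 11-bit exponent field exactly as stored in memory;
// sig is the 52-bit fraction field. The sign bit is left at 0.
// (fromFields is a hypothetical name chosen for this sketch.)
func fromFields(biasedExp, sig uint64) float64 {
	const fracBits = 52
	// Mask both fields so stray high bits cannot spill into a neighbouring field.
	bits := (biasedExp&0x7FF)<<fracBits | sig&(1<<fracBits-1)
	return math.Float64frombits(bits)
}

func main() {
	fmt.Println(fromFields(1023, 0))     // prints 1
	fmt.Println(fromFields(1024, 1<<51)) // prints 3
}
```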
Note that the exponent field stored in the memory layout is not the exponent you use directly when calculating the value of the number; it is the actual exponent plus a bias:
The double-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 1023; also known as exponent bias in the IEEE 754 standard.
So the number encoded in the above double-precision format is calculated like:
$(-1)^{\text{sign}} \times 2^{e-1023} \times 1.\text{fraction}$
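If exp is meant to be the actual (unbiased) exponent, the bias has to be added before shifting. The following is a sketch under stated assumptions: the function name FromExpSig and its error handling are hypothetical, and it only covers positive normal numbers. It rejects exponents outside [-1022, 1023], because the exponent fields 0 and 2047 are reserved for zero/subnormals and for infinities/NaNs.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// FromExpSig returns the positive normal float64 whose value is
// 2^exp * 1.fraction, where the 52 fraction bits are taken from sig.
// (FromExpSig is a hypothetical name chosen for this sketch.)
func FromExpSig(exp int, sig uint64) (float64, error) {
	const (
		bias     = 1023
		fracBits = 52
	)
	if exp < -1022 || exp > 1023 {
		return 0, errors.New("exponent outside the normal range [-1022, 1023]")
	}
	if sig >= 1<<fracBits {
		return 0, errors.New("significand does not fit in 52 bits")
	}
	// Store exp+bias in the exponent field and sig in the fraction field.
	bits := uint64(exp+bias)<<fracBits | sig
	return math.Float64frombits(bits), nil
}

func main() {
	f, err := FromExpSig(3, 1<<50) // 2^3 * 1.25
	fmt.Println(f, err)            // prints 10 <nil>
}
```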
edited Nov 13 '18 at 14:58
answered Nov 13 '18 at 11:51
icza
Would math.Ldexp help here? func Ldexp(frac float64, exp int) float64
– aMike
Nov 13 '18 at 13:57
@aMike I was considering it, but it takes the fraction as a float64 value, and it does something similar under the hood.
– icza
Nov 13 '18 at 13:58
I see, so if I want the actual exponent to be exp, I have to do bits := (exp+1023)<<52 | sig, correct?
– Ted
Nov 13 '18 at 14:54
@Ted Yes, that's right.
– icza
Nov 13 '18 at 14:57
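As a cross-check of the two approaches discussed in these comments, here is a small sketch comparing the bit-assembly formula with math.Ldexp. The helper names viaBits and viaLdexp are hypothetical, and the comparison assumes a positive normal number with exp in [-1022, 1023] and sig below 2^52, in which case the fraction 1 + sig/2^52 is exactly representable as a float64.

```go
package main

import (
	"fmt"
	"math"
)

// viaBits assembles the bit pattern directly (positive normal numbers only).
func viaBits(exp int, sig uint64) float64 {
	return math.Float64frombits(uint64(exp+1023)<<52 | sig)
}

// viaLdexp computes the same value arithmetically as (1 + sig/2^52) * 2^exp.
func viaLdexp(exp int, sig uint64) float64 {
	frac := 1 + float64(sig)/(1<<52) // exact for sig < 2^52
	return math.Ldexp(frac, exp)
}

func main() {
	cases := []struct {
		exp int
		sig uint64
	}{{0, 0}, {3, 1 << 50}, {-1022, 1}, {1023, 1<<52 - 1}}
	for _, c := range cases {
		a, b := viaBits(c.exp, c.sig), viaLdexp(c.exp, c.sig)
		fmt.Println(c.exp, c.sig, a == b) // reports whether both approaches agree
	}
}
```

The bitwise version avoids the detour through a float64 fraction, which is why it is a natural fit when sig is already an integer.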