Adding and accessing auxiliary tf.Dataset attributes with Keras

I use a tf.py_func call to parse data (features, labels, and sample_weights) from file into a tf.Dataset:



dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
    lambda filename, label, sample_weight: tuple(tf.py_func(
        self._my_parse_function,
        [filename, label, sample_weight],  # the per-element sample_weight, not the full sample_weights list
        [tf.float32, label.dtype, tf.float32])))


The data consists of variable-length 1-D sequences, so I also pad each sequence to a fixed length inside _my_parse_function.
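
For context, here is a minimal sketch of what such a parse function might look like, assuming the records are raw float32 files and a fixed pad length MAX_LEN (both assumptions; the actual parsing logic is not shown above):

import numpy as np

MAX_LEN = 1000  # assumed fixed padding length

def _my_parse_function(self, filename, label, sample_weight):
    # tf.py_func hands string tensors to Python as bytes.
    path = filename.decode('utf-8')
    # Hypothetical loader; substitute whatever format the records actually use.
    sequence = np.fromfile(path, dtype=np.float32)
    # Pad (or truncate) to the fixed length the model expects.
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    n = min(len(sequence), MAX_LEN)
    padded[:n] = sequence[:n]
    # Return dtypes must match the Tout list given to tf.py_func.
    return padded, label, np.float32(sample_weight)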



I use tensorflow.python.keras.models.Sequential.fit(...) to train the model (fit now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to generate predictions.
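
For illustration, a hedged sketch of that flow (the architecture, batch size, and step counts are assumptions, not taken from the question; in graph-mode TF 1.x, fit and predict on a dataset need explicit step counts):

from tensorflow.python.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(MAX_LEN,)),  # MAX_LEN assumed
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Note: tf.py_func drops static shape information, so a follow-up map that
# calls set_shape on the features may be needed before fit will accept it.
batched = dataset.batch(32).repeat()
model.fit(batched, epochs=5, steps_per_epoch=len(records) // 32)

# If predict objects to the (features, label, weight) tuples, map the
# dataset down to features only before calling it.
predictions = model.predict(dataset.batch(32), steps=len(records) // 32)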



Once I have predictions, I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. I'd also like to know for sure which file each prediction came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-worker setups are involved (I hope to try both). Even if order were guaranteed, this is a decent sanity check.



This information, the filename (i.e., a string) and the sequence length (i.e., an integer), is not currently conveniently accessible, so I'd like to attach these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.



What is the best approach to do this?



Thanks

keras tensorflow-datasets

asked Nov 9 at 23:32 by PhilAW, edited Nov 10 at 9:15 by Joel


1 Answer

As a workaround, I store this auxiliary information in a 'global' dictionary inside my_parse_fn, so it is stored (and re-stored) on every iteration through the tf.Dataset. This is OK for now, since there are only about 1000 examples in the training set, and storing 1000 strings and integers is not a problem. But if this auxiliary information, or the training set itself, were larger, the approach would not scale well. In my case the input data for each training example is large, about 50 MB, which is why reading the tf.Dataset from file (i.e., on every epoch) is important.
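
A minimal sketch of that workaround, assuming the same hypothetical loader as in the sketch above (the names AUX_INFO and my_parse_fn are illustrative, not from the original code):

import numpy as np

AUX_INFO = {}  # 'global' store: filename -> original sequence length

def my_parse_fn(filename, label, sample_weight):
    path = filename.decode('utf-8')   # py_func passes strings as bytes
    sequence = np.fromfile(path, dtype=np.float32)  # hypothetical loader
    AUX_INFO[path] = len(sequence)    # stored (and re-stored) every epoch
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    n = min(len(sequence), MAX_LEN)
    padded[:n] = sequence[:n]
    return padded, label, np.float32(sample_weight)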



I still think it would be helpful to be able to extend a tf.Dataset with this kind of information more conveniently. I also noticed that when I added a field to a tf.Dataset to identify it, say dataset.tag = 'training', dataset.tag = 'validation' or dataset.tag = 'test', the field did not survive the iterations of training.
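
The likely cause is that every tf.data transformation returns a new Dataset object, so a plain Python attribute set on one object never propagates to its successors:

dataset.tag = 'training'       # plain Python attribute on this object only
dataset = dataset.batch(32)    # batch() returns a brand-new Dataset
# hasattr(dataset, 'tag') is now False: the attribute stayed on the old object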



So again in this case I'm wondering how a tf.Dataset can be extended.



On the other question: it looks like the order of tf.Dataset elements is preserved across iterations, so predictions from tensorflow.python.keras.models.Sequential.predict(...) come back in the same order the file ids were presented to my_parse_fn (at least batching respects this ordering; I still don't know whether a multi-GPU scenario would as well).
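
Relying on that ordering (and on not calling shuffle), a hedged post-processing sketch that pairs predictions back with their source filenames and truncates the padding, using the hypothetical AUX_INFO store from the sketch above (assumes per-timestep model outputs):

preds = model.predict(dataset.batch(32), steps=len(records) // 32)

for path, pred in zip(records, preds):
    true_len = AUX_INFO[path]      # actual (unpadded) sequence length
    trimmed = pred[:true_len]      # drop the padded tail
    # ... downstream post-processing on trimmed ...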



Thanks for any insights.

answered Nov 15 at 18:28 by PhilAW, edited Nov 15 at 18:38
• The order is deterministic if no dataset.shuffle() is called. See @mrry's comment here: stackoverflow.com/a/47781670/4095551 – PhilAW, Nov 24 at 20:56