Adding and accessing auxiliary tf.Dataset attributes with Keras
I use a tf.py_func call to parse data (features, labels and sample_weights) from file into a tf.Dataset:
dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
    lambda filename, label, sample_weight: tuple(tf.py_func(
        self._my_parse_function,
        [filename, label, sample_weight],  # per-element values, not the full sample_weights list
        [tf.float32, label.dtype, tf.float32])))
The data consists of variable-length 1-D sequences, so I also pad the sequences to a fixed length in my_parse_function.
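For reference, a minimal sketch of what that parse function might look like (shown as a free function rather than a method); the use of np.load and the fixed length MAX_LEN are illustrative assumptions, not my actual code:

import numpy as np

MAX_LEN = 1000  # assumed fixed padding length

def my_parse_function(filename, label, sample_weight):
    # tf.py_func passes string tensors in as bytes, so decode before using as a path.
    data = np.load(filename.decode('utf-8')).astype(np.float32)
    # Pad (or truncate) the variable-length 1-D sequence to a fixed length.
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    length = min(len(data), MAX_LEN)
    padded[:length] = data[:length]
    # Return types must match the Tout list given to tf.py_func.
    return padded, label, np.float32(sample_weight)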
I use tensorflow.python.keras.models.Sequential.fit(...) to train on this data (fit now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to generate outputs.
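Roughly, the training and prediction calls look like this; the model architecture, batch size, num_examples and step counts below are placeholders, not my actual setup:

# Placeholder model; the real architecture is not shown here.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(MAX_LEN,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

batched = dataset.batch(32)
model.fit(batched, epochs=10, steps_per_epoch=num_examples // 32)
predictions = model.predict(batched, steps=num_examples // 32)

In practice the shapes coming out of tf.py_func are unknown, so they may need to be set explicitly (e.g. with set_shape in a second map) before batching and feeding the dataset to Keras.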
Once I have predictions I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. Also, I'd like to know for sure which file each prediction came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multiple GPUs or workers are involved (I hope to try those scenarios). Even if ordering were 'guaranteed', this would be a decent sanity check.
This information, the filename (i.e., a string) and the sequence length (i.e., an integer), is not currently conveniently accessible, so I'd like to add these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.
What is the best approach to do this?
Thanks
Tags: keras, tensorflow-datasets
asked Nov 9 at 23:32 by PhilAW, edited Nov 10 at 9:15 by Joel
1 Answer
As a workaround, I store this auxiliary information in a 'global' dictionary in my_parse_fn, so it is stored (and re-stored) on every iteration through the tf.Dataset. This is OK for now since there are only about 1000 examples in the training set, so storing 1000 strings and integers is not a problem. But if this auxiliary information or the training set were larger, this approach would not be very scalable. In my case, the input data for each training example is quite large, about 50 MB, which is why reading the tf.Dataset from file (i.e., on every epoch) is important.
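A minimal sketch of that workaround; the dictionary name and keying scheme are just illustrative, and np.load and MAX_LEN are placeholders as in the earlier sketch:

# Global side-channel, keyed by filename; re-populated on every pass through the dataset.
aux_info = {}

def my_parse_fn(filename, label, sample_weight):
    name = filename.decode('utf-8')
    data = np.load(name).astype(np.float32)
    # Record the auxiliary attributes I want back after predict().
    aux_info[name] = {'seq_len': len(data)}
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    length = min(len(data), MAX_LEN)
    padded[:length] = data[:length]
    return padded, label, np.float32(sample_weight)

After predict(), aux_info[name]['seq_len'] can then be used to truncate the padded output for the example that came from that file.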
I still think it would be helpful to be able to extend a tf.Dataset with this information more conveniently. I also noticed that when I added a field to a tf.Dataset, e.g. dataset.tag = 'training', dataset.tag = 'validation' or dataset.tag = 'test' to identify the set, the field did not survive the iterations of training. So again, in this case I'm wondering how a tf.Dataset can be extended.
On the other question, it looks like the order of tf.Dataset elements is preserved across iterations, so predictions, say, from tensorflow.python.keras.models.Sequential.predict(...) come out in the same order in which the file ids were presented to my_parse_fn (at least batching respects this ordering, but I still don't know whether a multi-GPU scenario would as well).
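One way to sanity-check this is to iterate the (unshuffled) filename dataset once, collect the names in order, and line them up with the prediction rows. Here records is the same list of filenames used in from_tensor_slices above, and the session-based iteration assumes TF 1.x graph mode:

# Collect filenames in dataset order (no shuffle) to pair with prediction rows.
filename_ds = tf.data.Dataset.from_tensor_slices(records)
next_name = filename_ds.make_one_shot_iterator().get_next()

ordered_names = []
with tf.Session() as sess:
    try:
        while True:
            ordered_names.append(sess.run(next_name))
    except tf.errors.OutOfRangeError:
        pass

# predictions[i] should then correspond to ordered_names[i].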
Thanks for any insights.
The order is deterministic if no dataset.shuffle() is called. See @mrry's comment here: stackoverflow.com/a/47781670/4095551
– PhilAW, Nov 24 at 20:56