Adding and accessing auxiliary tf.Dataset attributes with Keras

I use a tf.py_func call to parse data (features, labels, and sample_weights) from file into a tf.Dataset:



dataset = tf.data.Dataset.from_tensor_slices((records, labels, sample_weights))
dataset = dataset.map(
    lambda filename, label, sample_weight: tuple(tf.py_func(
        self._my_parse_function,
        [filename, label, sample_weight],  # the per-element sample_weight, not the full sample_weights list
        [tf.float32, label.dtype, tf.float32])))


The data consists of variable-length 1-D sequences, so I also pad each sequence to a fixed length inside _my_parse_function.
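
For context, here is a minimal sketch of what such a parse function might look like, assuming the records are raw float32 files and a fixed pad length MAX_LEN (both assumptions; the actual parsing logic is not shown above):

import numpy as np

MAX_LEN = 1000  # assumed fixed padding length

def _my_parse_function(self, filename, label, sample_weight):
    # tf.py_func hands string tensors to Python as bytes.
    path = filename.decode('utf-8')
    # Hypothetical loader; substitute whatever format the records actually use.
    sequence = np.fromfile(path, dtype=np.float32)
    # Pad (or truncate) to the fixed length the model expects.
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    n = min(len(sequence), MAX_LEN)
    padded[:n] = sequence[:n]
    # Return dtypes must match the Tout list given to tf.py_func.
    return padded, label, np.float32(sample_weight)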



I use tensorflow.python.keras.models.Sequential.fit(...) to train the model (fit now accepts datasets as input, including datasets with sample_weights) and tensorflow.python.keras.models.Sequential.predict to generate predictions.
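
For illustration, a hedged sketch of that flow (the architecture, batch size, and step counts are assumptions, not taken from the question; in graph-mode TF 1.x, fit and predict on a dataset need explicit step counts):

from tensorflow.python.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(MAX_LEN,)),  # MAX_LEN assumed
    layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Note: tf.py_func drops static shape information, so a follow-up map that
# calls set_shape on the features may be needed before fit will accept it.
batched = dataset.batch(32).repeat()
model.fit(batched, epochs=5, steps_per_epoch=len(records) // 32)

# If predict objects to the (features, label, weight) tuples, map the
# dataset down to features only before calling it.
predictions = model.predict(dataset.batch(32), steps=len(records) // 32)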



Once I have predictions, I would like to do some post-processing to make sense of the outputs. For example, I'd like to truncate the padded data to the actual sequence length. I'd also like to know for sure which file each prediction came from, since I am not sure that ordering is guaranteed with dataset iterators, especially if batching is used (I do batch the dataset as well) or multi-GPU or multi-worker setups are involved (I hope to try both). Even if order were guaranteed, this is a decent sanity check.



This information, the filename (i.e., a string) and the sequence length (i.e., an integer), is not currently conveniently accessible, so I'd like to attach these two attributes to the dataset elements and be able to retrieve them during/after the call to predict.



What is the best approach to do this?



Thanks

keras tensorflow-datasets

asked Nov 9 at 23:32 by PhilAW, edited Nov 10 at 9:15 by Joel


1 Answer

As a workaround, I store this auxiliary information in a 'global' dictionary inside my_parse_fn, so it is stored (and re-stored) on every iteration through the tf.Dataset. This is OK for now, since there are only about 1000 examples in the training set, and storing 1000 strings and integers is not a problem. But if this auxiliary information, or the training set itself, were larger, the approach would not scale well. In my case the input data for each training example is large, about 50 MB, which is why reading the tf.Dataset from file (i.e., on every epoch) is important.
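
A minimal sketch of that workaround, assuming the same hypothetical loader as in the sketch above (the names AUX_INFO and my_parse_fn are illustrative, not from the original code):

import numpy as np

AUX_INFO = {}  # 'global' store: filename -> original sequence length

def my_parse_fn(filename, label, sample_weight):
    path = filename.decode('utf-8')   # py_func passes strings as bytes
    sequence = np.fromfile(path, dtype=np.float32)  # hypothetical loader
    AUX_INFO[path] = len(sequence)    # stored (and re-stored) every epoch
    padded = np.zeros(MAX_LEN, dtype=np.float32)
    n = min(len(sequence), MAX_LEN)
    padded[:n] = sequence[:n]
    return padded, label, np.float32(sample_weight)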



I still think it would be helpful to be able to extend a tf.Dataset with this kind of information more conveniently. I also noticed that when I added a field to a tf.Dataset to identify it, say dataset.tag = 'training', dataset.tag = 'validation' or dataset.tag = 'test', the field did not survive the iterations of training.
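
The likely cause is that every tf.data transformation returns a new Dataset object, so a plain Python attribute set on one object never propagates to its successors:

dataset.tag = 'training'       # plain Python attribute on this object only
dataset = dataset.batch(32)    # batch() returns a brand-new Dataset
# hasattr(dataset, 'tag') is now False: the attribute stayed on the old object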



So again in this case I'm wondering how a tf.Dataset can be extended.



On the other question: it looks like the order of tf.Dataset elements is preserved across iterations, so predictions from tensorflow.python.keras.models.Sequential.predict(...) come back in the same order the file ids were presented to my_parse_fn (at least batching respects this ordering; I still don't know whether a multi-GPU scenario would as well).
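
Relying on that ordering (and on not calling shuffle), a hedged post-processing sketch that pairs predictions back with their source filenames and truncates the padding, using the hypothetical AUX_INFO store from the sketch above (assumes per-timestep model outputs):

preds = model.predict(dataset.batch(32), steps=len(records) // 32)

for path, pred in zip(records, preds):
    true_len = AUX_INFO[path]      # actual (unpadded) sequence length
    trimmed = pred[:true_len]      # drop the padded tail
    # ... downstream post-processing on trimmed ...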



Thanks for any insights.

answered Nov 15 at 18:28 by PhilAW, edited Nov 15 at 18:38
• The order is deterministic if no dataset.shuffle() is called. See @mrry's comment here: stackoverflow.com/a/47781670/4095551 – PhilAW, Nov 24 at 20:56