Hadoop HDFS: Read/Write parallelism?



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;








0















Couldn't find enough information on internet so asking here:



Assuming I'm writing a huge file to disk, hundreds of Terabytes, which is a result of mapreduce (or spark or whatever). How would mapreduce write such a file to HDFS efficiently (potentially parallel?) which could be read later in a parallel way as well?



My understanding is that HDFS is simply block based (128MB e.g.). so in order to write the second block, you must have wrote the first block (or at least determine what content will go to block 1). Let's say it's a CSV file, it is quite possible that a line in the file will span two blocks -- how could we read such CSV to different mapper in mapreduce? Does it have to do some smart logic to read two blocks, concat them and read the proper line?










share|improve this question




























    0















    Couldn't find enough information on internet so asking here:



    Assuming I'm writing a huge file to disk, hundreds of Terabytes, which is a result of mapreduce (or spark or whatever). How would mapreduce write such a file to HDFS efficiently (potentially parallel?) which could be read later in a parallel way as well?



    My understanding is that HDFS is simply block based (128MB e.g.). so in order to write the second block, you must have wrote the first block (or at least determine what content will go to block 1). Let's say it's a CSV file, it is quite possible that a line in the file will span two blocks -- how could we read such CSV to different mapper in mapreduce? Does it have to do some smart logic to read two blocks, concat them and read the proper line?










    share|improve this question
























      0












      0








      0








      Couldn't find enough information on internet so asking here:



      Assuming I'm writing a huge file to disk, hundreds of Terabytes, which is a result of mapreduce (or spark or whatever). How would mapreduce write such a file to HDFS efficiently (potentially parallel?) which could be read later in a parallel way as well?



      My understanding is that HDFS is simply block based (128MB e.g.). so in order to write the second block, you must have wrote the first block (or at least determine what content will go to block 1). Let's say it's a CSV file, it is quite possible that a line in the file will span two blocks -- how could we read such CSV to different mapper in mapreduce? Does it have to do some smart logic to read two blocks, concat them and read the proper line?










      share|improve this question














      Couldn't find enough information on internet so asking here:



      Assuming I'm writing a huge file to disk, hundreds of Terabytes, which is a result of mapreduce (or spark or whatever). How would mapreduce write such a file to HDFS efficiently (potentially parallel?) which could be read later in a parallel way as well?



      My understanding is that HDFS is simply block based (128MB e.g.). so in order to write the second block, you must have wrote the first block (or at least determine what content will go to block 1). Let's say it's a CSV file, it is quite possible that a line in the file will span two blocks -- how could we read such CSV to different mapper in mapreduce? Does it have to do some smart logic to read two blocks, concat them and read the proper line?







      hadoop hdfs






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 15 '18 at 7:14









      ShawnShawn

      8931817




      8931817






















          1 Answer
          1






          active

          oldest

          votes


















          1














          Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.



          By default, in Hadoop MapReduce each record ends on a new line with TextInputFormat, and for the scenario where just one line crosses the end of a block, the next block must be read, even if it's just literally the rn characters



          Writing data is done from reduce tasks, or Spark executors, etc, in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files






          share|improve this answer


















          • 1





            And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

            – cricket_007
            Nov 15 '18 at 7:26












          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314185%2fhadoop-hdfs-read-write-parallelism%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.



          By default, in Hadoop MapReduce each record ends on a new line with TextInputFormat, and for the scenario where just one line crosses the end of a block, the next block must be read, even if it's just literally the rn characters



          Writing data is done from reduce tasks, or Spark executors, etc, in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files






          share|improve this answer


















          • 1





            And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

            – cricket_007
            Nov 15 '18 at 7:26
















          1














          Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.



          By default, in Hadoop MapReduce each record ends on a new line with TextInputFormat, and for the scenario where just one line crosses the end of a block, the next block must be read, even if it's just literally the rn characters



          Writing data is done from reduce tasks, or Spark executors, etc, in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files






          share|improve this answer


















          • 1





            And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

            – cricket_007
            Nov 15 '18 at 7:26














          1












          1








          1







          Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.



          By default, in Hadoop MapReduce each record ends on a new line with TextInputFormat, and for the scenario where just one line crosses the end of a block, the next block must be read, even if it's just literally the rn characters



          Writing data is done from reduce tasks, or Spark executors, etc, in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files






          share|improve this answer













          Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.



          By default, in Hadoop MapReduce each record ends on a new line with TextInputFormat, and for the scenario where just one line crosses the end of a block, the next block must be read, even if it's just literally the rn characters



          Writing data is done from reduce tasks, or Spark executors, etc, in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 15 '18 at 7:24









          cricket_007cricket_007

          84.1k1147119




          84.1k1147119







          • 1





            And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

            – cricket_007
            Nov 15 '18 at 7:26













          • 1





            And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

            – cricket_007
            Nov 15 '18 at 7:26








          1




          1





          And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

          – cricket_007
          Nov 15 '18 at 7:26






          And CSV or plaintext is a horribly inefficient format for Hadoop processing compared to the alternative columnar formats

          – cricket_007
          Nov 15 '18 at 7:26




















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314185%2fhadoop-hdfs-read-write-parallelism%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          How to how show current date and time by default on contact form 7 in WordPress without taking input from user in datetimepicker

          Syphilis

          Darth Vader #20