Command to perform full outer join with duplicate entries in key/join column



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;








0















I have three files. I need to join them based on one column and perform some transformations.



file1.dat (Column 1 is used for joining)



123,is1,ric1,col1,smbc1 
123,is2,ric1,col1,smbc1
234,is3,ric3,col3,smbc2
345,is4,ric4,,smbc2
345,is4,,col5,smbc2


file2.dat (Column 1 is used for joining)



123,abc 
234,bcd


file3.dat (Column 4 is used for joining)



r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8
r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8
r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8


Expected Output (output.dat)



123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3


I wrote the following awk command.



awk '
BEGIN FS=OFS=","
FILENAME == ARGV[1] temp_join_one[$1] = $2"
FILENAME == ARGV[2] exchtbunload[$1] = $2; next
FILENAME == ARGV[3] ");
v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
v11=($2=""?"N/A":$2);
print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
"output.dat"
' file1.dat file2.dat file3.dat


I need to join all three files.



The final output file should have all the values from file3 irrespective of whether they are in other two files and the corresponding columns should be empty(or N/A) if it is not present in other two files. (The order of the columns is not a very big problem. I can use awk to rearrange them.)



But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.



I tried to apply the solution suggested using join condition. It works with smaller files. But the files I have are close to 3-5 GB in size. And they are in numerical order and not lexicographical order. Sorting them looks like would take lot of time.



Any suggestion would be helpful.



Thanks in advance.










share|improve this question






























    0















    I have three files. I need to join them based on one column and perform some transformations.



    file1.dat (Column 1 is used for joining)



    123,is1,ric1,col1,smbc1 
    123,is2,ric1,col1,smbc1
    234,is3,ric3,col3,smbc2
    345,is4,ric4,,smbc2
    345,is4,,col5,smbc2


    file2.dat (Column 1 is used for joining)



    123,abc 
    234,bcd


    file3.dat (Column 4 is used for joining)



    r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
    r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8
    r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8
    r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8


    Expected Output (output.dat)



    123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
    123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3
    123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
    123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
    234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3
    345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3
    345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3


    I wrote the following awk command.



    awk '
    BEGIN FS=OFS=","
    FILENAME == ARGV[1] temp_join_one[$1] = $2"
    FILENAME == ARGV[2] exchtbunload[$1] = $2; next
    FILENAME == ARGV[3] ");
    v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
    v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
    v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
    v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
    v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
    v11=($2=""?"N/A":$2);
    print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
    "output.dat"
    ' file1.dat file2.dat file3.dat


    I need to join all three files.



    The final output file should have all the values from file3 irrespective of whether they are in other two files and the corresponding columns should be empty(or N/A) if it is not present in other two files. (The order of the columns is not a very big problem. I can use awk to rearrange them.)



    But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.



    I tried to apply the solution suggested using join condition. It works with smaller files. But the files I have are close to 3-5 GB in size. And they are in numerical order and not lexicographical order. Sorting them looks like would take lot of time.



    Any suggestion would be helpful.



    Thanks in advance.










    share|improve this question


























      0












      0








      0


      1






      I have three files. I need to join them based on one column and perform some transformations.



      file1.dat (Column 1 is used for joining)



      123,is1,ric1,col1,smbc1 
      123,is2,ric1,col1,smbc1
      234,is3,ric3,col3,smbc2
      345,is4,ric4,,smbc2
      345,is4,,col5,smbc2


      file2.dat (Column 1 is used for joining)



      123,abc 
      234,bcd


      file3.dat (Column 4 is used for joining)



      r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
      r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8
      r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8
      r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8


      Expected Output (output.dat)



      123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
      123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3
      123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
      123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
      234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3
      345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3
      345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3


      I wrote the following awk command.



      awk '
      BEGIN FS=OFS=","
      FILENAME == ARGV[1] temp_join_one[$1] = $2"
      FILENAME == ARGV[2] exchtbunload[$1] = $2; next
      FILENAME == ARGV[3] ");
      v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
      v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
      v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
      v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
      v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
      v11=($2=""?"N/A":$2);
      print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
      "output.dat"
      ' file1.dat file2.dat file3.dat


      I need to join all three files.



      The final output file should have all the values from file3 irrespective of whether they are in other two files and the corresponding columns should be empty(or N/A) if it is not present in other two files. (The order of the columns is not a very big problem. I can use awk to rearrange them.)



      But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.



      I tried to apply the solution suggested using join condition. It works with smaller files. But the files I have are close to 3-5 GB in size. And they are in numerical order and not lexicographical order. Sorting them looks like would take lot of time.



      Any suggestion would be helpful.



      Thanks in advance.










      share|improve this question
















      I have three files. I need to join them based on one column and perform some transformations.



      file1.dat (Column 1 is used for joining)



      123,is1,ric1,col1,smbc1 
      123,is2,ric1,col1,smbc1
      234,is3,ric3,col3,smbc2
      345,is4,ric4,,smbc2
      345,is4,,col5,smbc2


      file2.dat (Column 1 is used for joining)



      123,abc 
      234,bcd


      file3.dat (Column 4 is used for joining)



      r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
      r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8
      r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8
      r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8


      Expected Output (output.dat)



      123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
      123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3
      123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
      123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3
      234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3
      345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3
      345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3


      I wrote the following awk command.



      awk '
      BEGIN FS=OFS=","
      FILENAME == ARGV[1] temp_join_one[$1] = $2"
      FILENAME == ARGV[2] exchtbunload[$1] = $2; next
      FILENAME == ARGV[3] ");
      v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
      v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
      v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
      v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
      v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
      v11=($2=""?"N/A":$2);
      print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
      "output.dat"
      ' file1.dat file2.dat file3.dat


      I need to join all three files.



      The final output file should have all the values from file3 irrespective of whether they are in other two files and the corresponding columns should be empty(or N/A) if it is not present in other two files. (The order of the columns is not a very big problem. I can use awk to rearrange them.)



      But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.



      I tried to apply the solution suggested using join condition. It works with smaller files. But the files I have are close to 3-5 GB in size. And they are in numerical order and not lexicographical order. Sorting them looks like would take lot of time.



      Any suggestion would be helpful.



      Thanks in advance.







      unix awk scripting






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 16 '18 at 11:29







      Arun Aranganathan

















      asked Nov 15 '18 at 16:10









      Arun AranganathanArun Aranganathan

      575




      575






















          2 Answers
          2






          active

          oldest

          votes


















          4














          with join, assuming files are sorted by the key.



          $ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
          file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

          123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
          345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
          345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3





          share|improve this answer























          • This blows my mind

            – kvantour
            Nov 15 '18 at 21:42











          • easier than awk, not?

            – karakfa
            Nov 15 '18 at 22:15











          • @karakfa Not really easier, I had to look it up in the man page ;-)

            – kvantour
            Nov 16 '18 at 9:41











          • @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

            – Arun Aranganathan
            Nov 16 '18 at 11:43












          • with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

            – karakfa
            Nov 16 '18 at 18:39


















          1














          I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:



          (edit: Revised version based on new information about the data)



          #!/bin/sh
          # Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
          file1=$1
          file2=$2
          file3=$3

          rm -f scratch.db
          sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
          CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
          CREATE TABLE file2(f2_1 INTEGER, f2_2);
          CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
          .import $file1 file1
          .import $file2 file2
          .import $file3 file3
          -- Build indexes to speed up joining and sorting gigs of data.
          CREATE INDEX file1_idx ON file1(f1_1);
          CREATE INDEX file2_idx ON file2(f2_1);
          CREATE INDEX file3_idx ON file3(f3_4);
          SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
          , f3_2, f3_3
          FROM file3
          LEFT JOIN file1 ON f1_1 = f3_4
          LEFT JOIN file2 ON f2_1 = f3_4
          ORDER BY f3_4;
          EOF
          rm -f scratch.db


          Note: This will use a temporary database file that's going to be the size of all your data and then some because of indexes. If you're space constrained, I have an idea for doing it without temporary files, given the information that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.






          share|improve this answer

























          • Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

            – Arun Aranganathan
            Mar 4 at 2:10











          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53323516%2fcommand-to-perform-full-outer-join-with-duplicate-entries-in-key-join-column%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          4














          with join, assuming files are sorted by the key.



          $ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
          file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

          123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
          345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
          345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3





          share|improve this answer























          • This blows my mind

            – kvantour
            Nov 15 '18 at 21:42











          • easier than awk, not?

            – karakfa
            Nov 15 '18 at 22:15











          • @karakfa Not really easier, I had to look it up in the man page ;-)

            – kvantour
            Nov 16 '18 at 9:41











          • @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

            – Arun Aranganathan
            Nov 16 '18 at 11:43












          • with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

            – karakfa
            Nov 16 '18 at 18:39















          4














          with join, assuming files are sorted by the key.



          $ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
          file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

          123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
          345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
          345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3





          share|improve this answer























          • This blows my mind

            – kvantour
            Nov 15 '18 at 21:42











          • easier than awk, not?

            – karakfa
            Nov 15 '18 at 22:15











          • @karakfa Not really easier, I had to look it up in the man page ;-)

            – kvantour
            Nov 16 '18 at 9:41











          • @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

            – Arun Aranganathan
            Nov 16 '18 at 11:43












          • with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

            – karakfa
            Nov 16 '18 at 18:39













          4












          4








          4







          with join, assuming files are sorted by the key.



          $ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
          file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

          123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
          345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
          345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3





          share|improve this answer













          with join, assuming files are sorted by the key.



          $ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
          file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

          123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
          123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
          234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
          345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
          345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 15 '18 at 21:29









          karakfakarakfa

          50.8k52940




          50.8k52940












          • This blows my mind

            – kvantour
            Nov 15 '18 at 21:42











          • easier than awk, not?

            – karakfa
            Nov 15 '18 at 22:15











          • @karakfa Not really easier, I had to look it up in the man page ;-)

            – kvantour
            Nov 16 '18 at 9:41











          • @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

            – Arun Aranganathan
            Nov 16 '18 at 11:43












          • with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

            – karakfa
            Nov 16 '18 at 18:39

















          • This blows my mind

            – kvantour
            Nov 15 '18 at 21:42











          • easier than awk, not?

            – karakfa
            Nov 15 '18 at 22:15











          • @karakfa Not really easier, I had to look it up in the man page ;-)

            – kvantour
            Nov 16 '18 at 9:41











          • @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

            – Arun Aranganathan
            Nov 16 '18 at 11:43












          • with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

            – karakfa
            Nov 16 '18 at 18:39
















          This blows my mind

          – kvantour
          Nov 15 '18 at 21:42





          This blows my mind

          – kvantour
          Nov 15 '18 at 21:42













          easier than awk, not?

          – karakfa
          Nov 15 '18 at 22:15





          easier than awk, not?

          – karakfa
          Nov 15 '18 at 22:15













          @karakfa Not really easier, I had to look it up in the man page ;-)

          – kvantour
          Nov 16 '18 at 9:41





          @karakfa Not really easier, I had to look it up in the man page ;-)

          – kvantour
          Nov 16 '18 at 9:41













          @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

          – Arun Aranganathan
          Nov 16 '18 at 11:43






          @karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

          – Arun Aranganathan
          Nov 16 '18 at 11:43














          with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

          – karakfa
          Nov 16 '18 at 18:39





          with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

          – karakfa
          Nov 16 '18 at 18:39













          1














          I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:



          (edit: Revised version based on new information about the data)



          #!/bin/sh
          # Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
          file1=$1
          file2=$2
          file3=$3

          rm -f scratch.db
          sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
          CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
          CREATE TABLE file2(f2_1 INTEGER, f2_2);
          CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
          .import $file1 file1
          .import $file2 file2
          .import $file3 file3
          -- Build indexes to speed up joining and sorting gigs of data.
          CREATE INDEX file1_idx ON file1(f1_1);
          CREATE INDEX file2_idx ON file2(f2_1);
          CREATE INDEX file3_idx ON file3(f3_4);
          SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
          , f3_2, f3_3
          FROM file3
          LEFT JOIN file1 ON f1_1 = f3_4
          LEFT JOIN file2 ON f2_1 = f3_4
          ORDER BY f3_4;
          EOF
          rm -f scratch.db


          Note: This will use a temporary database file that's going to be the size of all your data and then some because of indexes. If you're space constrained, I have an idea for doing it without temporary files, given the information that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.






          share|improve this answer

























          • Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

            – Arun Aranganathan
            Mar 4 at 2:10















          1














          I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:



          (edit: Revised version based on new information about the data)



          #!/bin/sh
          # Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
          file1=$1
          file2=$2
          file3=$3

          rm -f scratch.db
          sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
          CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
          CREATE TABLE file2(f2_1 INTEGER, f2_2);
          CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
          .import $file1 file1
          .import $file2 file2
          .import $file3 file3
          -- Build indexes to speed up joining and sorting gigs of data.
          CREATE INDEX file1_idx ON file1(f1_1);
          CREATE INDEX file2_idx ON file2(f2_1);
          CREATE INDEX file3_idx ON file3(f3_4);
          SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
          , f3_2, f3_3
          FROM file3
          LEFT JOIN file1 ON f1_1 = f3_4
          LEFT JOIN file2 ON f2_1 = f3_4
          ORDER BY f3_4;
          EOF
          rm -f scratch.db


          Note: This will use a temporary database file that's going to be the size of all your data and then some because of indexes. If you're space constrained, I have an idea for doing it without temporary files, given the information that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.






          share|improve this answer

























          • Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

            – Arun Aranganathan
            Mar 4 at 2:10













          1












          1








          1







          I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:



          (edit: Revised version based on new information about the data)



          #!/bin/sh
          # Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
          file1=$1
          file2=$2
          file3=$3

          rm -f scratch.db
          sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
          CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
          CREATE TABLE file2(f2_1 INTEGER, f2_2);
          CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
          .import $file1 file1
          .import $file2 file2
          .import $file3 file3
          -- Build indexes to speed up joining and sorting gigs of data.
          CREATE INDEX file1_idx ON file1(f1_1);
          CREATE INDEX file2_idx ON file2(f2_1);
          CREATE INDEX file3_idx ON file3(f3_4);
          SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
          , f3_2, f3_3
          FROM file3
          LEFT JOIN file1 ON f1_1 = f3_4
          LEFT JOIN file2 ON f2_1 = f3_4
          ORDER BY f3_4;
          EOF
          rm -f scratch.db


          Note: This will use a temporary database file that's going to be the size of all your data and then some because of indexes. If you're space constrained, I have an idea for doing it without temporary files, given the information that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.






          share|improve this answer















          I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:



          (edit: Revised version based on new information about the data)



          #!/bin/sh
          # Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
          file1=$1
          file2=$2
          file3=$3

          rm -f scratch.db
          sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
          CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
          CREATE TABLE file2(f2_1 INTEGER, f2_2);
          CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
          .import $file1 file1
          .import $file2 file2
          .import $file3 file3
          -- Build indexes to speed up joining and sorting gigs of data.
          CREATE INDEX file1_idx ON file1(f1_1);
          CREATE INDEX file2_idx ON file2(f2_1);
          CREATE INDEX file3_idx ON file3(f3_4);
          SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
          , f3_2, f3_3
          FROM file3
          LEFT JOIN file1 ON f1_1 = f3_4
          LEFT JOIN file2 ON f2_1 = f3_4
          ORDER BY f3_4;
          EOF
          rm -f scratch.db


          Note: This will use a temporary database file that's going to be the size of all your data and then some because of indexes. If you're space constrained, I have an idea for doing it without temporary files, given the information that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 16 '18 at 19:19

























          answered Nov 16 '18 at 5:35









          ShawnShawn

          5,5412615




          5,5412615












          • Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

            – Arun Aranganathan
            Mar 4 at 2:10

















          • Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

            – Arun Aranganathan
            Mar 4 at 2:10
















          Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

          – Arun Aranganathan
          Mar 4 at 2:10





          Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

          – Arun Aranganathan
          Mar 4 at 2:10

















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53323516%2fcommand-to-perform-full-outer-join-with-duplicate-entries-in-key-join-column%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          Use pre created SQLite database for Android project in kotlin

          Darth Vader #20

          Ondo