Command to perform full outer join with duplicate entries in key/join column

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;

I have three files. I need to join them based on one column and perform some transformations.

file1.dat (Column 1 is used for joining)

123,is1,ric1,col1,smbc1 
123,is2,ric1,col1,smbc1 
234,is3,ric3,col3,smbc2 
345,is4,ric4,,smbc2 
345,is4,,col5,smbc2

file2.dat (Column 1 is used for joining)

123,abc 
234,bcd

file3.dat (Column 4 is used for joining)

r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8 
r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8 
r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8

Expected Output (output.dat)

123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3 
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3 
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3

I wrote the following awk command.

awk '
BEGIN FS=OFS=","
FILENAME == ARGV[1] temp_join_one[$1] = $2"
FILENAME == ARGV[2] exchtbunload[$1] = $2; next
FILENAME == ARGV[3] ");
v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
v11=($2=""?"N/A":$2);
print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
"output.dat" 
' file1.dat file2.dat file3.dat

I need to join all three files.

The final output file should have all the values from file3 irrespective of whether they are in other two files and the corresponding columns should be empty(or N/A) if it is not present in other two files. (The order of the columns is not a very big problem. I can use awk to rearrange them.)

But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.

I tried to apply the solution suggested using join condition. It works with smaller files. But the files I have are close to 3-5 GB in size. And they are in numerical order and not lexicographical order. Sorting them looks like would take lot of time.

Any suggestion would be helpful.

Thanks in advance.

edited Nov 16 '18 at 11:29

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

add a comment |

I have three files. I need to join them based on one column and perform some transformations.

file1.dat (Column 1 is used for joining)

123,is1,ric1,col1,smbc1 
123,is2,ric1,col1,smbc1 
234,is3,ric3,col3,smbc2 
345,is4,ric4,,smbc2 
345,is4,,col5,smbc2

file2.dat (Column 1 is used for joining)

123,abc 
234,bcd

file3.dat (Column 4 is used for joining)

r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8 
r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8 
r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8

Expected Output (output.dat)

123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3 
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3 
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3

I wrote the following awk command.

awk '
BEGIN FS=OFS=","
FILENAME == ARGV[1] temp_join_one[$1] = $2"
FILENAME == ARGV[2] exchtbunload[$1] = $2; next
FILENAME == ARGV[3] ");
v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
v11=($2=""?"N/A":$2);
print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
"output.dat" 
' file1.dat file2.dat file3.dat

I need to join all three files.

But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.

Any suggestion would be helpful.

Thanks in advance.

edited Nov 16 '18 at 11:29

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

add a comment |

I have three files. I need to join them based on one column and perform some transformations.

file1.dat (Column 1 is used for joining)

123,is1,ric1,col1,smbc1 
123,is2,ric1,col1,smbc1 
234,is3,ric3,col3,smbc2 
345,is4,ric4,,smbc2 
345,is4,,col5,smbc2

file2.dat (Column 1 is used for joining)

123,abc 
234,bcd

file3.dat (Column 4 is used for joining)

r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8 
r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8 
r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8

Expected Output (output.dat)

123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3 
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3 
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3

I wrote the following awk command.

awk '
BEGIN FS=OFS=","
FILENAME == ARGV[1] temp_join_one[$1] = $2"
FILENAME == ARGV[2] exchtbunload[$1] = $2; next
FILENAME == ARGV[3] ");
v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
v11=($2=""?"N/A":$2);
print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
"output.dat" 
' file1.dat file2.dat file3.dat

I need to join all three files.

But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.

Any suggestion would be helpful.

Thanks in advance.

edited Nov 16 '18 at 11:29

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

I have three files. I need to join them based on one column and perform some transformations.

file1.dat (Column 1 is used for joining)

123,is1,ric1,col1,smbc1 
123,is2,ric1,col1,smbc1 
234,is3,ric3,col3,smbc2 
345,is4,ric4,,smbc2 
345,is4,,col5,smbc2

file2.dat (Column 1 is used for joining)

123,abc 
234,bcd

file3.dat (Column 4 is used for joining)

r0c1,r0c2,r0c3,123,r0c5,r0c6,r0c7,r0c8 
r2c1,r2c2,r2c3,123,r2c5,r2c6,r2c7,r2c8 
r3c1,r3c2,r3c3,234,r3c5,r3c6,r3c7,r3c8 
r4c1,r4c2,r4c3,345,r4c5,r4c6,r4c7,r4c8

Expected Output (output.dat)

123,r0c5,is1,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r0c5,is2,ric1,smbc1,abc,r0c8,r0c6,col1,r0c7,r0c1,r0c2,r0c3 
123,r2c5,is1,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
123,r2c5,is2,ric1,smbc1,abc,r2c8,r2c6,col1,r2c7,r2c1,r2c2,r2c3 
234,r3c5,is3,ric3,smbc2,bcd,r3c8,r3c6,col3,r3c7,r3c1,r3c2,r3c3 
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c1,r4c2,r4c3 
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c1,r4c2,r4c3

I wrote the following awk command.

awk '
BEGIN FS=OFS=","
FILENAME == ARGV[1] temp_join_one[$1] = $2"
FILENAME == ARGV[2] exchtbunload[$1] = $2; next
FILENAME == ARGV[3] ");
v3=(array_temp_join_one[1]==""?"N/A":array_temp_join_one[1]);
v4=(array_temp_join_one[2]==""?"N/A":array_temp_join_one[2]);
v5=(array_temp_join_one[4]==""?"N/A":array_temp_join_one[4]);
v6=(exchtbunload[$4]==""?"N/A":exchtbunload[$4]);
v9=(array_temp_join_one[3]==""?"N/A":array_temp_join_one[3]);
v11=($2=""?"N/A":$2);
print $4, $5, v3, v4, v5, v6, $8, $6, v9, $7, $1, v11, $3 >
"output.dat" 
' file1.dat file2.dat file3.dat

I need to join all three files.

But my problem is, as the key is not unique, I am not getting the expected output. My output has only three lines.

Any suggestion would be helpful.

Thanks in advance.

unix awk scripting

edited Nov 16 '18 at 11:29

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

edited Nov 16 '18 at 11:29

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

edited Nov 16 '18 at 11:29

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

asked Nov 15 '18 at 16:10

Arun Aranganathan

575

add a comment |

2 Answers
2

active

oldest

votes

with join, assuming files are sorted by the key.

$ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
 file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

This blows my mind

– kvantour
Nov 15 '18 at 21:42

easier than awk, not?

– karakfa
Nov 15 '18 at 22:15

@karakfa Not really easier, I had to look it up in the man page ;-)

– kvantour
Nov 16 '18 at 9:41

@karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

– Arun Aranganathan
Nov 16 '18 at 11:43

with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

– karakfa
Nov 16 '18 at 18:39

add a comment |

I really like the answer using join, but it does require that the files are sorted by the key column. Here's a version that doesn't have that restriction. Working under the theory that the best tool for doing database-like things is a database, it imports the CSV files into tables of a temporary SQLite database and then runs a SELECT on them to get your desired output:

(edit: Revised version based on new information about the data)

#!/bin/sh
# Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
file1=$1
file2=$2
file3=$3

rm -f scratch.db
sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
CREATE TABLE file2(f2_1 INTEGER, f2_2);
CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
.import $file1 file1
.import $file2 file2
.import $file3 file3
-- Build indexes to speed up joining and sorting gigs of data.
CREATE INDEX file1_idx ON file1(f1_1);
CREATE INDEX file2_idx ON file2(f2_1);
CREATE INDEX file3_idx ON file3(f3_4);
SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
 , f3_2, f3_3
FROM file3
LEFT JOIN file1 ON f1_1 = f3_4
LEFT JOIN file2 ON f2_1 = f3_4
ORDER BY f3_4;
EOF
rm -f scratch.db

Note: This will use a temporary database file that's going to be the size of all your data and then some because of indexes. If you're space constrained, I have an idea for doing it without temporary files, given the information that the join columns are sorted numerically, but it's enough work that I'm not going to bother unless asked.

edited Nov 16 '18 at 19:19

answered Nov 16 '18 at 5:35

Shawn

5,5412615

Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

– Arun Aranganathan
Mar 4 at 2:10

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53323516%2fcommand-to-perform-full-outer-join-with-duplicate-entries-in-key-join-column%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

with join, assuming files are sorted by the key.

$ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
 file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

This blows my mind

– kvantour
Nov 15 '18 at 21:42

easier than awk, not?

– karakfa
Nov 15 '18 at 22:15

@karakfa Not really easier, I had to look it up in the man page ;-)

– kvantour
Nov 16 '18 at 9:41

@karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

– Arun Aranganathan
Nov 16 '18 at 11:43

with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

– karakfa
Nov 16 '18 at 18:39

add a comment |

with join, assuming files are sorted by the key.

$ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
 file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

This blows my mind

– kvantour
Nov 15 '18 at 21:42

easier than awk, not?

– karakfa
Nov 15 '18 at 22:15

@karakfa Not really easier, I had to look it up in the man page ;-)

– kvantour
Nov 16 '18 at 9:41

@karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

– Arun Aranganathan
Nov 16 '18 at 11:43

with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

– karakfa
Nov 16 '18 at 18:39

add a comment |

with join, assuming files are sorted by the key.

$ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
 file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

with join, assuming files are sorted by the key.

$ join -t, -1 1 -2 4 <(join -t, -a1 -a2 -e "N/A" -o1.1,1.2,1.3,1.4,1.5,2.1 file1 file2) 
 file3 -o1.1,2.5,1.2,1.3,1.5,1.6,2.8,2.6,1.4,2.7,2.2,2.3

123,r0c5,is1,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is1,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
123,r0c5,is2,ric1,smbc1,123,r0c8,r0c6,col1,r0c7,r0c2,r0c3
123,r2c5,is2,ric1,smbc1,123,r2c8,r2c6,col1,r2c7,r2c2,r2c3
234,r3c5,is3,ric3,smbc2,234,r3c8,r3c6,col3,r3c7,r3c2,r3c3
345,r4c5,is4,ric4,smbc2,N/A,r4c8,r4c6,N/A,r4c7,r4c2,r4c3
345,r4c5,is4,N/A,smbc2,N/A,r4c8,r4c6,col5,r4c7,r4c2,r4c3

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

answered Nov 15 '18 at 21:29

karakfa

50.8k52940

This blows my mind

– kvantour
Nov 15 '18 at 21:42

easier than awk, not?

– karakfa
Nov 15 '18 at 22:15

@karakfa Not really easier, I had to look it up in the man page ;-)

– kvantour
Nov 16 '18 at 9:41

@karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

– Arun Aranganathan
Nov 16 '18 at 11:43

with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

– karakfa
Nov 16 '18 at 18:39

add a comment |

This blows my mind

– kvantour
Nov 15 '18 at 21:42

easier than awk, not?

– karakfa
Nov 15 '18 at 22:15

@karakfa Not really easier, I had to look it up in the man page ;-)

– kvantour
Nov 16 '18 at 9:41

@karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

– Arun Aranganathan
Nov 16 '18 at 11:43

with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

– karakfa
Nov 16 '18 at 18:39

This blows my mind

– kvantour
Nov 15 '18 at 21:42

easier than awk, not?

– karakfa
Nov 15 '18 at 22:15

@karakfa Not really easier, I had to look it up in the man page ;-)

– kvantour
Nov 16 '18 at 9:41

@karakfa I like the solution. But I have a small problem. Your solution needs the files to be sorted. I have three different set of file where i have to apply this logic. Each files are close to 3-5GB in size. So can you suggest on how to tackle this issue. Note: My files are numerically sorted.

– Arun Aranganathan
Nov 16 '18 at 11:43

with join sorting is required. I wish they add a "numerically sorted" option but it expects lexical order unfortunately. If sorting is not practical this won't work.

– karakfa
Nov 16 '18 at 18:39

add a comment |

(edit: Revised version based on new information about the data)

#!/bin/sh
# Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
file1=$1
file2=$2
file3=$3

rm -f scratch.db
sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
CREATE TABLE file2(f2_1 INTEGER, f2_2);
CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
.import $file1 file1
.import $file2 file2
.import $file3 file3
-- Build indexes to speed up joining and sorting gigs of data.
CREATE INDEX file1_idx ON file1(f1_1);
CREATE INDEX file2_idx ON file2(f2_1);
CREATE INDEX file3_idx ON file3(f3_4);
SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
 , f3_2, f3_3
FROM file3
LEFT JOIN file1 ON f1_1 = f3_4
LEFT JOIN file2 ON f2_1 = f3_4
ORDER BY f3_4;
EOF
rm -f scratch.db

edited Nov 16 '18 at 19:19

answered Nov 16 '18 at 5:35

Shawn

5,5412615

Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

– Arun Aranganathan
Mar 4 at 2:10

add a comment |

(edit: Revised version based on new information about the data)

#!/bin/sh
# Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
file1=$1
file2=$2
file3=$3

rm -f scratch.db
sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
CREATE TABLE file2(f2_1 INTEGER, f2_2);
CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
.import $file1 file1
.import $file2 file2
.import $file3 file3
-- Build indexes to speed up joining and sorting gigs of data.
CREATE INDEX file1_idx ON file1(f1_1);
CREATE INDEX file2_idx ON file2(f2_1);
CREATE INDEX file3_idx ON file3(f3_4);
SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
 , f3_2, f3_3
FROM file3
LEFT JOIN file1 ON f1_1 = f3_4
LEFT JOIN file2 ON f2_1 = f3_4
ORDER BY f3_4;
EOF
rm -f scratch.db

edited Nov 16 '18 at 19:19

answered Nov 16 '18 at 5:35

Shawn

5,5412615

Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

– Arun Aranganathan
Mar 4 at 2:10

add a comment |

(edit: Revised version based on new information about the data)

#!/bin/sh
# Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
file1=$1
file2=$2
file3=$3

rm -f scratch.db
sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
CREATE TABLE file2(f2_1 INTEGER, f2_2);
CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
.import $file1 file1
.import $file2 file2
.import $file3 file3
-- Build indexes to speed up joining and sorting gigs of data.
CREATE INDEX file1_idx ON file1(f1_1);
CREATE INDEX file2_idx ON file2(f2_1);
CREATE INDEX file3_idx ON file3(f3_4);
SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
 , f3_2, f3_3
FROM file3
LEFT JOIN file1 ON f1_1 = f3_4
LEFT JOIN file2 ON f2_1 = f3_4
ORDER BY f3_4;
EOF
rm -f scratch.db

edited Nov 16 '18 at 19:19

answered Nov 16 '18 at 5:35

Shawn

5,5412615

(edit: Revised version based on new information about the data)

#!/bin/sh
# Usage: ./merge.sh file1.dat file2.dat file3.dat > output.dat
file1=$1
file2=$2
file3=$3

rm -f scratch.db
sqlite3 -batch -noheader -csv -nullvalue "N/A" scratch.db <<EOF | perl -pe 's#(?:^|,)K""(?=,|$)#N/A#g'
CREATE TABLE file1(f1_1 INTEGER, f1_2, f1_3, f1_4, f1_5);
CREATE TABLE file2(f2_1 INTEGER, f2_2);
CREATE TABLE file3(f3_1, f3_2, f3_3, f3_4 INTEGER, f3_5, f3_6, f3_7, f3_8);
.import $file1 file1
.import $file2 file2
.import $file3 file3
-- Build indexes to speed up joining and sorting gigs of data.
CREATE INDEX file1_idx ON file1(f1_1);
CREATE INDEX file2_idx ON file2(f2_1);
CREATE INDEX file3_idx ON file3(f3_4);
SELECT f3_4, f3_5, f1_2, f1_3, f1_5, f2_2, f3_8, f3_6, f1_4, f3_7, f3_1
 , f3_2, f3_3
FROM file3
LEFT JOIN file1 ON f1_1 = f3_4
LEFT JOIN file2 ON f2_1 = f3_4
ORDER BY f3_4;
EOF
rm -f scratch.db

edited Nov 16 '18 at 19:19

answered Nov 16 '18 at 5:35

Shawn

5,5412615

edited Nov 16 '18 at 19:19

answered Nov 16 '18 at 5:35

Shawn

5,5412615

answered Nov 16 '18 at 5:35

Shawn

5,5412615

answered Nov 16 '18 at 5:35

Shawn

5,5412615

Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

– Arun Aranganathan
Mar 4 at 2:10

add a comment |

Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

– Arun Aranganathan
Mar 4 at 2:10

Hi @Shawn, Thanks for the help. I used the join command suggested by karakfa

– Arun Aranganathan
Mar 4 at 2:10

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb