spark drop multiple duplicated columns after join

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;

I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema

root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
 |-- test: string (nullable = true)
 |-- details: string (nullable = true)
 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

now I want to drop the last two columns

 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

I tried with df..dropDuplicates() but it dropping all

how to drop the duplicated columns which comes in the last ?

asked Nov 15 '18 at 10:08

user3607698

388117

add a comment |

I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema

root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
 |-- test: string (nullable = true)
 |-- details: string (nullable = true)
 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

now I want to drop the last two columns

 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

I tried with df..dropDuplicates() but it dropping all

how to drop the duplicated columns which comes in the last ?

asked Nov 15 '18 at 10:08

user3607698

388117

add a comment |

I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema

root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
 |-- test: string (nullable = true)
 |-- details: string (nullable = true)
 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

now I want to drop the last two columns

 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

I tried with df..dropDuplicates() but it dropping all

how to drop the duplicated columns which comes in the last ?

asked Nov 15 '18 at 10:08

user3607698

388117

I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema

root
 |-- id: string (nullable = true)
 |-- value: string (nullable = true)
 |-- test: string (nullable = true)
 |-- details: string (nullable = true)
 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

now I want to drop the last two columns

 |-- test: string (nullable = true)
 |-- value: string (nullable = true)

I tried with df..dropDuplicates() but it dropping all

how to drop the duplicated columns which comes in the last ?

apache-spark apache-spark-sql

asked Nov 15 '18 at 10:08

user3607698

388117

asked Nov 15 '18 at 10:08

user3607698

388117

asked Nov 15 '18 at 10:08

user3607698

388117

asked Nov 15 '18 at 10:08

user3607698

388117

asked Nov 15 '18 at 10:08

user3607698

388117

add a comment |

3 Answers
3

active

oldest

votes

You have to use a vararg syntax to get the column names from an array and drop it.
Check below:

scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update1:

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the below solution.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName)) 
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update3

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

edited Nov 15 '18 at 14:24

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

this solution drops both the duplicated columns, my requirement is to drop one and keep other

– user3607698
Nov 15 '18 at 13:04

in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

– stack0114106
Nov 15 '18 at 13:36

check my update2 on how to approach it dynamically. let me know if it helps

– stack0114106
Nov 15 '18 at 14:14

Is it possible to do alias for list of columns in one go?

– user3607698
Nov 15 '18 at 14:16

yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

– stack0114106
Nov 15 '18 at 14:23

|
show 1 more comment

Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

– user3607698
Nov 15 '18 at 10:30

what do you mean by list of columns ? are they dynamically generated ?

– Sundeep Pidugu
Nov 15 '18 at 11:10

val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

– user3607698
Nov 15 '18 at 11:17

i never tried this are you getting any any error while running this ? please do let me know

– Sundeep Pidugu
Nov 15 '18 at 11:33

this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

– user3607698
Nov 15 '18 at 12:26

add a comment |

df.dropDuplicates() works only for rows.

You can df1.drop(df2.column("value"))

You can specify columns you want to select, for example, with df.select(Seq of columns)

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

– user3607698
Nov 15 '18 at 12:55

Try df2.column("value") instead of new Column("value")

– Serge Harnyk
Nov 15 '18 at 13:06

getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

– user3607698
Nov 15 '18 at 13:09

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53316976%2fspark-drop-multiple-duplicated-columns-after-join%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

You have to use a vararg syntax to get the column names from an array and drop it.
Check below:

scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update1:

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the below solution.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName)) 
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update3

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

edited Nov 15 '18 at 14:24

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

this solution drops both the duplicated columns, my requirement is to drop one and keep other

– user3607698
Nov 15 '18 at 13:04

in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

– stack0114106
Nov 15 '18 at 13:36

check my update2 on how to approach it dynamically. let me know if it helps

– stack0114106
Nov 15 '18 at 14:14

Is it possible to do alias for list of columns in one go?

– user3607698
Nov 15 '18 at 14:16

yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

– stack0114106
Nov 15 '18 at 14:23

|
show 1 more comment

You have to use a vararg syntax to get the column names from an array and drop it.
Check below:

scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update1:

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the below solution.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName)) 
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update3

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

edited Nov 15 '18 at 14:24

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

this solution drops both the duplicated columns, my requirement is to drop one and keep other

– user3607698
Nov 15 '18 at 13:04

in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

– stack0114106
Nov 15 '18 at 13:36

check my update2 on how to approach it dynamically. let me know if it helps

– stack0114106
Nov 15 '18 at 14:14

Is it possible to do alias for list of columns in one go?

– user3607698
Nov 15 '18 at 14:16

yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

– stack0114106
Nov 15 '18 at 14:23

|
show 1 more comment

You have to use a vararg syntax to get the column names from an array and drop it.
Check below:

scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update1:

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the below solution.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName)) 
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update3

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

edited Nov 15 '18 at 14:24

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

You have to use a vararg syntax to get the column names from an array and drop it.
Check below:

scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+

scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)

scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)

scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update1:

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+


scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update2:

To remove the columns dynamically, check the below solution.

scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+

scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)

scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)

scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName)) 
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

Update3

Renaming/aliasing in one go.

scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+


scala>

edited Nov 15 '18 at 14:24

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

edited Nov 15 '18 at 14:24

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

answered Nov 15 '18 at 13:01

stack0114106

5,0202423

this solution drops both the duplicated columns, my requirement is to drop one and keep other

– user3607698
Nov 15 '18 at 13:04

in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

– stack0114106
Nov 15 '18 at 13:36

check my update2 on how to approach it dynamically. let me know if it helps

– stack0114106
Nov 15 '18 at 14:14

Is it possible to do alias for list of columns in one go?

– user3607698
Nov 15 '18 at 14:16

yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

– stack0114106
Nov 15 '18 at 14:23

|
show 1 more comment

this solution drops both the duplicated columns, my requirement is to drop one and keep other

– user3607698
Nov 15 '18 at 13:04

in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

– stack0114106
Nov 15 '18 at 13:36

check my update2 on how to approach it dynamically. let me know if it helps

– stack0114106
Nov 15 '18 at 14:14

Is it possible to do alias for list of columns in one go?

– user3607698
Nov 15 '18 at 14:16

yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

– stack0114106
Nov 15 '18 at 14:23

this solution drops both the duplicated columns, my requirement is to drop one and keep other

– user3607698
Nov 15 '18 at 13:04

in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

– stack0114106
Nov 15 '18 at 13:36

check my update2 on how to approach it dynamically. let me know if it helps

– stack0114106
Nov 15 '18 at 14:14

Is it possible to do alias for list of columns in one go?

– user3607698
Nov 15 '18 at 14:16

yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

– stack0114106
Nov 15 '18 at 14:23

|
show 1 more comment

Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

– user3607698
Nov 15 '18 at 10:30

what do you mean by list of columns ? are they dynamically generated ?

– Sundeep Pidugu
Nov 15 '18 at 11:10

val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

– user3607698
Nov 15 '18 at 11:17

i never tried this are you getting any any error while running this ? please do let me know

– Sundeep Pidugu
Nov 15 '18 at 11:33

this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

– user3607698
Nov 15 '18 at 12:26

add a comment |

Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

– user3607698
Nov 15 '18 at 10:30

what do you mean by list of columns ? are they dynamically generated ?

– Sundeep Pidugu
Nov 15 '18 at 11:10

val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

– user3607698
Nov 15 '18 at 11:17

i never tried this are you getting any any error while running this ? please do let me know

– Sundeep Pidugu
Nov 15 '18 at 11:33

this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

– user3607698
Nov 15 '18 at 12:26

add a comment |

Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column

 1. DF1.join(DF2,Seq("column1","column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))

So to drop the duplicate columns you can use

 1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
 2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))

In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

answered Nov 15 '18 at 10:19

Sundeep Pidugu

407115

second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

– user3607698
Nov 15 '18 at 10:30

what do you mean by list of columns ? are they dynamically generated ?

– Sundeep Pidugu
Nov 15 '18 at 11:10

val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

– user3607698
Nov 15 '18 at 11:17

i never tried this are you getting any any error while running this ? please do let me know

– Sundeep Pidugu
Nov 15 '18 at 11:33

this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

– user3607698
Nov 15 '18 at 12:26

add a comment |

second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

– user3607698
Nov 15 '18 at 10:30

what do you mean by list of columns ? are they dynamically generated ?

– Sundeep Pidugu
Nov 15 '18 at 11:10

val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

– user3607698
Nov 15 '18 at 11:17

i never tried this are you getting any any error while running this ? please do let me know

– Sundeep Pidugu
Nov 15 '18 at 11:33

this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

– user3607698
Nov 15 '18 at 12:26

second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

– user3607698
Nov 15 '18 at 10:30

what do you mean by list of columns ? are they dynamically generated ?

– Sundeep Pidugu
Nov 15 '18 at 11:10

val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

– user3607698
Nov 15 '18 at 11:17

i never tried this are you getting any any error while running this ? please do let me know

– Sundeep Pidugu
Nov 15 '18 at 11:33

this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

– user3607698
Nov 15 '18 at 12:26

add a comment |

df.dropDuplicates() works only for rows.

You can df1.drop(df2.column("value"))

You can specify columns you want to select, for example, with df.select(Seq of columns)

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

– user3607698
Nov 15 '18 at 12:55

Try df2.column("value") instead of new Column("value")

– Serge Harnyk
Nov 15 '18 at 13:06

getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

– user3607698
Nov 15 '18 at 13:09

add a comment |

df.dropDuplicates() works only for rows.

You can df1.drop(df2.column("value"))

You can specify columns you want to select, for example, with df.select(Seq of columns)

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

– user3607698
Nov 15 '18 at 12:55

Try df2.column("value") instead of new Column("value")

– Serge Harnyk
Nov 15 '18 at 13:06

getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

– user3607698
Nov 15 '18 at 13:09

add a comment |

df.dropDuplicates() works only for rows.

You can df1.drop(df2.column("value"))

You can specify columns you want to select, for example, with df.select(Seq of columns)

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

df.dropDuplicates() works only for rows.

You can df1.drop(df2.column("value"))

You can specify columns you want to select, for example, with df.select(Seq of columns)

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

answered Nov 15 '18 at 12:52

Serge Harnyk

402211

df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

– user3607698
Nov 15 '18 at 12:55

Try df2.column("value") instead of new Column("value")

– Serge Harnyk
Nov 15 '18 at 13:06

getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

– user3607698
Nov 15 '18 at 13:09

add a comment |

df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

– user3607698
Nov 15 '18 at 12:55

Try df2.column("value") instead of new Column("value")

– Serge Harnyk
Nov 15 '18 at 13:06

getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

– user3607698
Nov 15 '18 at 13:09

df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

– user3607698
Nov 15 '18 at 12:55

Try df2.column("value") instead of new Column("value")

– Serge Harnyk
Nov 15 '18 at 13:06

getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

– user3607698
Nov 15 '18 at 13:09

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

NMeuPuypXH E56oM t5Ynv a,MeDAj0zu,cBxIHbtty1VFr,flYieG7Gwuu79YgSPwl

搜尋此網誌

Pfthb