How can I sum multiple columns in a Spark DataFrame in PySpark?

I have a list of column names that I want to sum:

columns = ['col1', 'col2', 'col3']

How can I add the three columns and put the result in a new column? I'd like to do it automatically, so that I can change the column list and get new results.

The DataFrame I want as a result:

col1  col2  col3  result
   1     2     3       6

Thanks!

edited Nov 14 '18 at 17:24

asked Nov 14 '18 at 10:21

Manrique

Possible duplicate of How do I add a new column to a Spark DataFrame (using PySpark)?

– Prasad Khode
Nov 14 '18 at 10:23

Thanks for answering! I know how to add columns; I just want an efficient way to add them based on a list of column names.

– Manrique
Nov 14 '18 at 10:33

python apache-spark pyspark

2 Answers

Try this:

df = df.withColumn('result', sum(df[col] for col in df.columns))

df.columns is the list of all column names in df. To sum only some of them, iterate over your own list (for example, columns) instead.
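As a rough illustration of why the built-in sum works here, the sketch below uses a small stand-in class instead of a real Spark Column, so it runs without Spark. The class Expr and the helper _text are hypothetical and exist only for this example. The built-in sum starts from 0, so the first step is 0 + column, which relies on the right-hand addition method. Note that a star import from pyspark.sql.functions rebinds the name sum to Spark's aggregate function; spelling out builtins.sum is one way to make sure the built-in is the one being called.

```python
import builtins


class Expr:
    """Hypothetical stand-in for a Spark Column; not part of PySpark."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Expr(f"({self.text} + {_text(other)})")

    def __radd__(self, other):
        # Called for `0 + Expr`, which is how the built-in sum() starts.
        return Expr(f"({_text(other)} + {self.text})")


def _text(value):
    """Render either an Expr or a plain number as text."""
    return value.text if isinstance(value, Expr) else str(value)


columns = ["col1", "col2"]  # the caller's own list, not every column

# builtins.sum is spelled out so a rebound name `sum` cannot interfere.
total = builtins.sum(Expr(name) for name in columns)
print(total.text)  # ((0 + col1) + col2)
```

With a real DataFrame the same pattern would read sum(df[c] for c in columns), assuming sum still refers to the built-in.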

answered Nov 14 '18 at 10:25

Mayank Porwal

I replicated this with the DataFrame below and got an error:

listA = [(10,20,40,60),(10,10,10,40)]
df = spark.createDataFrame(listA, ['M1','M2','M3','M4'])
newdf = df.withColumn('result', sum(df[col] for col in df.columns))

The error is: TypeError: 'Column' object is not callable. Am I missing something?

– vikrant rana
Dec 4 '18 at 14:38

[Edited to explain each step]

If you have a static list of columns, you can do this:

df.withColumn("result", col("col1") + col("col2") + col("col3"))

But if you don't want to type out the whole column list, you need to generate the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use reduce with the add function:

reduce(add, [col(x) for x in df.columns])

The columns are added two at a time, so you get (col("col1") + col("col2")) + col("col3") rather than col("col1") + col("col2") + col("col3"). The effect is the same.

Wrapping each name in col(x) ensures that you are adding Column objects, giving (col("col1") + col("col2")) + col("col3"), instead of concatenating plain strings (which would produce col1col2col3).
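The two behaviours described above can be checked in plain Python without Spark. This is only a sketch: Expr is a hypothetical stand-in for a Spark Column whose + builds a textual expression, so the nesting produced by reduce becomes visible.

```python
from functools import reduce
from operator import add


class Expr:
    """Hypothetical stand-in for a Spark Column: '+' builds a new expression."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Expr(f"({self.text} + {other.text})")


columns = ["col1", "col2", "col3"]

# Reducing the raw names concatenates the strings.
joined = reduce(add, columns)

# Reducing wrapped names nests the additions from the left.
nested = reduce(add, [Expr(name) for name in columns])

print(joined)       # col1col2col3
print(nested.text)  # ((col1 + col2) + col3)
```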

[TL;DR]

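Combining the above steps, you can do this:

from functools import reduce
from operator import add
from pyspark.sql.functions import col

df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))

The df.na.fill(0) portion handles nulls in your data: in Spark, adding null to a number yields null, so a single null would make the whole result null for that row. If you don't have any nulls, you can skip it and do this instead:

df.withColumn("result", reduce(add, [col(x) for x in df.columns]))

To sum only the columns in your list rather than every column, iterate over columns instead of df.columns.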
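To see why the null handling matters, here is a plain-Python model of SQL null propagation. It is not Spark code: sql_add and the sample row are hypothetical, and None plays the role of null.

```python
from functools import reduce


def sql_add(a, b):
    """Model of SQL '+': the result is null (None) if either side is null."""
    if a is None or b is None:
        return None
    return a + b


columns = ["col1", "col2", "col3"]
row = {"col1": 1, "col2": None, "col3": 3}  # hypothetical row with one null

# Without filling, one null makes the whole sum null.
with_null = reduce(sql_add, [row[c] for c in columns])

# Replacing nulls with 0 first, as df.na.fill(0) does, gives a number.
filled = reduce(sql_add, [0 if row[c] is None else row[c] for c in columns])

print(with_null)  # None
print(filled)     # 4
```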

edited Jan 22 at 5:45

answered Jan 21 at 5:36

Dileep Kumar Patchigolla
