PySpark: generate all combinations of unique values

I am trying to generate all combinations of the unique values within my Spark dataframe.
The only solution that comes to mind requires itertools.product and a pandas dataframe, so it is not efficient enough.
Here is my code:

import itertools
import pandas as pd

# collect the distinct values of each column to the driver
all_date = [i.Date for i in df.select("Date").distinct().collect()]
all_stores_id = [i.ID for i in df.select("ID").distinct().collect()]
all_category = [i.CATEGORY for i in df.select("CATEGORY").distinct().collect()]
combined = [all_date, all_stores_id, all_category]
# build every (Date, ID, CATEGORY) combination in pandas
all_combination_pdf = pd.DataFrame(columns=['Date', 'ID', 'CATEGORY'], data=list(itertools.product(*combined)))
# convert pandas dataframe to spark
all_combination_df = sqlContext.createDataFrame(all_combination_pdf)
# left join, so combinations that do not occur in df get nulls
joined = all_combination_df.join(df, ["Date", "ID", "CATEGORY"], how="left")

Is there any way to make this code more sparkonic?

======EDIT======

I've also tried to implement this using the crossJoin function.
Here is the code:

test_df = ((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
test_df.show(10)

which, for some unknown reason, raises the following exception:

An error occurred while calling o305.showString.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.lang.Integer.valueOf(Integer.java:832)

edited Nov 13 '18 at 12:08

asked Nov 13 '18 at 10:37

user1877600


"some unknown reason" - that error is pretty clear. You're running out of memory. How many distinct values do you have?

– pault
Nov 13 '18 at 15:47

The word "unknown" was a poor choice. I understand that the error is due to a memory limitation, but I don't know why it happens. The data sample generates about 1M distinct values and, more importantly, the code implemented using pandas works fine. Do you have any idea how to reimplement the pandas code as efficient PySpark code?

– user1877600
Nov 13 '18 at 20:09

pandas pyspark itertools


1 Answer

You can generate the dataframe with the line below. It builds a dataframe of the distinct values of each column and cross joins them (a Cartesian product).

((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())

With some work, it can be put inside a loop to automate it for other dataframes.
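As a rough sketch of that loop, the cross joins can be folded over a list of column names. The helper name `all_combinations` is hypothetical (it is not part of PySpark); it relies only on the `select`, `distinct` and `crossJoin` DataFrame methods already used above, and it assumes `df` is an existing Spark dataframe.

```python
from functools import reduce


def all_combinations(df, columns):
    """Cross join the distinct values of each column named in `columns`.

    `df` is assumed to be a pyspark.sql.DataFrame; only its select(),
    distinct() and crossJoin() methods are used.
    """
    if not columns:
        raise ValueError("columns must contain at least one column name")
    # One single-column dataframe of distinct values per requested column.
    distinct_frames = [df.select(name).distinct() for name in columns]
    # Fold the frames left to right: ((a x b) x c) x ...
    return reduce(lambda left, right: left.crossJoin(right), distinct_frames)


# Usage with the dataframe from the question (assumes `df` already exists):
# grid = all_combinations(df, ["Date", "ID", "CATEGORY"])
# joined = grid.join(df, ["Date", "ID", "CATEGORY"], how="left")
```

This only removes the hard-coded column names. The result is still the full Cartesian product, so it has the same size and memory cost as the one-liner.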

Hope this helps

answered Nov 13 '18 at 11:20

Manrique


Thank you for your answer. Unfortunately, for some reason unknown to me, I am not able to execute your line followed by test_df.show(10). The error I get is java.lang.OutOfMemoryError: GC overhead limit exceeded. The project is developed on the Azure platform, so this is definitely not a hardware problem.

– user1877600
Nov 13 '18 at 11:45

Sorry it didn't help. A memory error on Azure is certainly strange, but think about the huge dataframe that is going to be created. If you have, for example, 3 columns, each with 5 different values, you will end up with 5^3 = 125 rows. Imagine that with bigger values.

– Manrique
Nov 13 '18 at 12:13
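The arithmetic in the comment above can be checked before building anything: the row count of the cross join is the product of the per-column distinct counts. A minimal sketch follows, where `grid_size` is a hypothetical helper and the second set of counts is made up for illustration (it is not taken from the question's data).

```python
from math import prod  # available in Python 3.8+


def grid_size(distinct_counts):
    """Row count of the cross join of columns with these distinct counts."""
    return prod(distinct_counts)


# The example from the comment: 3 columns with 5 distinct values each.
print(grid_size([5, 5, 5]))  # 125

# Hypothetical counts: 365 dates, 1000 store IDs, 50 categories.
print(grid_size([365, 1000, 50]))  # 18250000

# In PySpark the counts could be obtained per column with
# df.select(name).distinct().count(), assuming `df` exists.
```

Running this estimate first shows whether the full grid is small enough to materialize at all.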
