pyspark generate all combinations of unique values
I am trying to generate all combinations of the unique values within my Spark dataframe.
The only solution that comes to mind requires itertools.product and a pandas dataframe, and is therefore not efficient enough.
Here is my code:
import itertools
import pandas as pd
# collect the distinct values of each column to the driver
all_date = [i.Date for i in df.select("Date").distinct().collect()]
all_stores_id = [i.ID for i in df.select("ID").distinct().collect()]
all_category = [i.CATEGORY for i in df.select("CATEGORY").distinct().collect()]
combined = [all_date, all_stores_id, all_category]
# build every combination in pandas
all_combination_pdf = pd.DataFrame(columns=['Date', 'ID', 'CATEGORY'], data=list(itertools.product(*combined)))
# convert pandas dataframe to spark
all_combination_df = sqlContext.createDataFrame(all_combination_pdf)
# left join so that combinations missing from df appear with nulls
joined = all_combination_df.join(df, ["Date", "ID", "CATEGORY"], how="left")
Is there any way to rewrite this code in a more "sparkonic" (idiomatic Spark) way?
======EDIT======
I've also tried to implement this using the crossJoin function.
Here is the code:
test_df = ((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
test_df.show(10)
which for some unknown reason raises the following exception:
An error occurred while calling o305.showString.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Integer.valueOf(Integer.java:832)
pandas pyspark itertools
"some unknown reason" - that error is pretty clear. You're running out of memory. How many distinct values do you have?
– pault
Nov 13 '18 at 15:47
The choice of the word "unknown" was poor. I understand that the error is due to a memory limitation, but I don't know why it happens. The data sample generates about 1M distinct values and, more importantly, the code implemented using pandas works fine. Do you have any idea how to reimplement the pandas code as efficient PySpark code?
– user1877600
Nov 13 '18 at 20:09
edited Nov 13 '18 at 12:08
asked Nov 13 '18 at 10:37
user1877600
1 Answer
You can generate the dataframe with this. It creates a dataframe with the unique values of each column and performs a cross join (Cartesian product) with the others.
((df.select('Date').distinct()).crossJoin(df.select('ID').distinct())).crossJoin(df.select('CATEGORY').distinct())
With some work, it can be put inside a for loop to automate it for other dataframes.
Hope this helps
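The loop mentioned in the answer could look roughly like the sketch below. It is only an illustration: `all_combinations` is a made-up helper name, and it assumes `df` is a Spark dataframe that has the listed columns. It builds the same chain of cross joins as the one-liner, just folded over a list of column names with `functools.reduce`.

```python
from functools import reduce


def all_combinations(df, columns):
    """Cross join the distinct values of each column in `columns`.

    Hypothetical helper: `df` is assumed to be a Spark dataframe.
    """
    # One single-column dataframe of distinct values per column.
    distinct_frames = [df.select(c).distinct() for c in columns]
    # Fold the list with crossJoin: ((a x b) x c) x ...
    return reduce(lambda left, right: left.crossJoin(right), distinct_frames)


# Usage with the question's dataframe (assumes `df` already exists):
# grid = all_combinations(df, ["Date", "ID", "CATEGORY"])
# joined = grid.join(df, ["Date", "ID", "CATEGORY"], how="left")
```

Because the result is the same Cartesian product, this makes the code reusable but should not be expected to change its memory behaviour.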
Thank you for your answer. Unfortunately, for some reason unknown to me, I am not able to execute your line followed by test_df.show(10). The error that I get is java.lang.OutOfMemoryError: GC overhead limit exceeded. The project is developed on the Azure platform, so this is definitely not a hardware problem.
– user1877600
Nov 13 '18 at 11:45
Sorry it didn't help. A memory error in Azure is certainly strange, but think about the huge dataframe that is going to be created. If you have, for example, 3 columns, each of them with 5 different values, you will end up with 5^3 = 125 rows. Imagine that with bigger values.
– Manrique
Nov 13 '18 at 12:13
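To make the arithmetic in the comment above concrete, here is a small sketch that multiplies per-column distinct counts to estimate the size of the cross join before building it. `grid_size` is a hypothetical helper, and the second set of counts is invented purely for illustration; it is not the asker's data. In Spark, each count could be obtained with something like `df.select(c).distinct().count()`.

```python
from math import prod


def grid_size(distinct_counts):
    """Number of rows in the cross join of columns with these distinct counts."""
    return prod(distinct_counts)


# The example from the comment: 3 columns with 5 distinct values each.
print(grid_size([5, 5, 5]))  # 125

# Hypothetical counts, for illustration only (not the asker's data).
print(grid_size([1000, 500, 50]))  # 25000000
```

The row count grows multiplicatively with every column added, which is why a modest number of distinct values per column can still produce a very large dataframe.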
answered Nov 13 '18 at 11:20
Manrique