Apache Spark column with collected info vs rows union









I have the following DataFrame:



+------+------------------+--------------+-------------+
| name| email| phone| country|
+------+------------------+--------------+-------------+
| Mike| mike@example.com|+91-9999999999| Italy|
| Alex| alex@example.com|+91-9999999998| France|
| John| john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
| Dan| dan@example.com|+91-9999444999| Poland|
| Scott| scott@example.com|+91-9111999998| Spain|
| Rob| rob@example.com|+91-9114444998| Italy|
+------+------------------+--------------+-------------+


After applying the following transformation:



import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions, with any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String =
  s"FROM $table WHERE $tagCondition"

val userTags = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
}

// Note: seeding foldLeft with userTags.head while also folding over all of
// userTags unions the head DataFrame twice (see the accepted answer).
val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}



I receive the following DataFrame:



+------+------------------+--------------+-------------+-------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+-------+
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Dan| dan@example.com|+91-9999444999| Poland| medium|
| Scott| scott@example.com|+91-9111999998| Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+


This duplicates each original record, one copy per matching tag. Instead, I need the original rows without duplication, with all matching tags collected into the tag column, like this:



+------+------------------+--------------+-------------+--------------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+--------------+
| Mike| mike@example.com|+91-9999999999| Italy| [big]|
| Alex| alex@example.com|+91-9999999998| France| [big]|
| John| john@example.com| +1-1111111111|United States| [big]|
|Donald|donald@example.com| +1-2222222222|United States| [big,sometag]|
| Dan| dan@example.com|+91-9999444999| Poland| [medium]|
| Scott| scott@example.com|+91-9111999998| Spain| [big,medium]|
| Rob| rob@example.com|+91-9114444998| Italy| [big]|
+------+------------------+--------------+-------------+--------------+


Right now I don't know how to change my transformation to produce the tag column as an ArrayType without duplicating the original rows.
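One direction, floated later in a comment under the answer, is to keep the union as-is and then group by every non-tag column, collecting the tags into a set (a sketch; collect_set also deduplicates the tags):

import org.apache.spark.sql.functions.{col, collect_set}

// Group by all original columns and gather the matching tags into one array.
// collect_set (rather than collect_list) also removes duplicate tag values.
val taggedUsers = unionDf
  .groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*)
  .agg(collect_set("tag").alias("tags"))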










scala apache-spark apache-spark-sql

asked Nov 9 at 19:29 by alexanoid, edited Nov 10 at 11:29

  • Can you please post the code buildTagQuery(tag, tagCondition, "tagged_users")? Maybe we can create a UDF that produces the tag column, which would let us avoid the union, map and foldLeft on DataFrames. Or add all the columns in the SQL query so that the union can be avoided.
    – Pavithran Ramachandran
    Nov 9 at 21:07











  • @PavithranRamachandran please see the updated question. I have added the info.
    – alexanoid
    Nov 10 at 6:47














1 Answer
















Accepted answer (answered Nov 10 at 15:21 by user238607):
Here is one possible method without changing too much of your logic.



First, you will have to assign a unique id to the users table, as shown below:



import org.apache.spark.sql.functions._

val userstable = spark.sql("select * from users")
val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())
userswithId.createOrReplaceTempView("users")
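One caveat worth knowing: monotonically_increasing_id guarantees unique and increasing ids, but not consecutive ones — the value encodes the partition id, so gaps between ids are normal.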


Your tags map and buildTagQuery definition remain the same as above.



val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String =
  s"FROM $table WHERE $tagCondition"



Here, we select only the UniqueID and tag columns:



val userTags = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
      .select("UniqueID", "tag")
}




This step is important: there was a subtle bug in your original foldLeft — the head of the collection served both as the initial accumulator and as an element of the traversal, so it was unioned twice. Here the head is taken into a separate variable and dropped from userTags; the folding logic is otherwise the same, but the head element is no longer folded twice.



val headhere = userTags.head
val userResults = userTags.drop(1)
val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
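An equivalent and slightly shorter formulation, assuming userTags is non-empty, skips the head/tail bookkeeping entirely:

val unionDf2 = userTags.reduce(_ union _)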



Now we group by the UniqueID column, aggregating the tags for each user into a list.



val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))

println("Printing the unionDf3 result")
unionDf3.show(25)
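The aggregate column is named collect_list(tag) by default, as seen in the output below; aliasing it, e.g. .agg(collect_list("tag").alias("tags")), would give it a friendlier name.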


Finally, we join userswithId (the users table with the UniqueID assigned earlier) with the previous DataFrame to get the final result.



val finalResult = userswithId.join(unionDf3, "UniqueID")

println("Printing the final result")
finalResult.show(25)


The final result is as below:



+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID| name| email| phone| country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
| 0| Alex| alex@example.com|+91-9999999998| France| [big]|
| 1| John| john@example.com| +1-1111111111|United States| [big]|
| 2|Donald|donald@example.com| +1-2222222222|United States| [big, sometag]|
| 4| Scott| scott@example.com|+91-9111999998| Spain| [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
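Note that join here is an inner join, so users that matched no tag at all are dropped from the result. If they should be kept with a null tag list, a left join would do it (a small variation on the answer's code):

val finalResult = userswithId.join(unionDf3, Seq("UniqueID"), "left")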





  • Thanks for your answer! Could you please compare, from a performance point of view, your solution with a groupBy over all columns, something like this: val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags"))?
    – alexanoid
    Nov 10 at 15:31










  • I think you will have to benchmark it. In my solution there is unique-id generation, a groupBy on a smaller table over a single numeric column, and a join to get the final result. In your case you are grouping by all columns except the tag column and then aggregating the tags list. I can't really tell which would be faster in practice.
    – user238607
    Nov 10 at 16:00






  • Thanks, I feel a simple join (your solution) will work much faster. Thanks again!
    – alexanoid
    Nov 10 at 16:02






  • @alexanoid: In case the dataframe unionDf3 is small enough, you can use a broadcast join to make it slightly faster. stackoverflow.com/questions/32435263/…
    – user238607
    Nov 10 at 16:05
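For reference, the broadcast variant suggested in that last comment might look like this (a sketch, assuming unionDf3 fits comfortably in executor memory):

import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship the small tag table to every executor
// instead of shuffling both sides of the join.
val finalResult = userswithId.join(broadcast(unionDf3), "UniqueID")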










