Apache Spark column with collected info vs rows union
I have the following DataFrame:
+------+------------------+--------------+-------------+
| name| email| phone| country|
+------+------------------+--------------+-------------+
| Mike| mike@example.com|+91-9999999999| Italy|
| Alex| alex@example.com|+91-9999999998| France|
| John| john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
| Dan| dan@example.com|+91-9999444999| Poland|
| Scott| scott@example.com|+91-9111999998| Spain|
| Rob| rob@example.com|+91-9114444998| Italy|
+------+------------------+--------------+-------------+
after applying the following transformation:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions; any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String =
  f"FROM $table WHERE $tagCondition"

val userTags = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
}

val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}
I receive the following DataFrame:
+------+------------------+--------------+-------------+-------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+-------+
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Dan| dan@example.com|+91-9999444999| Poland| medium|
| Scott| scott@example.com|+91-9111999998| Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+
which duplicates each original DataFrame record with the additional information in the tag column, but I need something like this (no duplicated records from the original DataFrame, and a collection of tags in the tag column):
+------+------------------+--------------+-------------+--------------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+--------------+
| Mike| mike@example.com|+91-9999999999| Italy| [big]|
| Alex| alex@example.com|+91-9999999998| France| [big]|
| John| john@example.com| +1-1111111111|United States| [big]|
|Donald|donald@example.com| +1-2222222222|United States| [big,sometag]|
| Dan| dan@example.com|+91-9999444999| Poland| [medium]|
| Scott| scott@example.com|+91-9111999998| Spain| [big,medium]|
| Rob| rob@example.com|+91-9114444998| Italy| [big]|
+------+------------------+--------------+-------------+--------------+
Right now I don't know how to change my transformation in order to receive such a structure, with the tag column as an ArrayType, without duplicating the original rows.
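One candidate, raised in the comments below, would be to group the unioned DataFrame by all non-tag columns and collect the tags into a set; a sketch of that idea:

import org.apache.spark.sql.functions.{col, collect_set}

// One row per user, with all matching tags gathered into an array column.
val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*)
  .agg(collect_set("tag").alias("tags"))

but I am not sure how this behaves with thousands of tag conditions.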
scala apache-spark apache-spark-sql
edited Nov 10 at 11:29
asked Nov 9 at 19:29
alexanoid
Can you please post the code for buildTagQuery(tag, tagCondition, "tagged_users")? Maybe we can create a UDF that produces the tag column, by which we can avoid the union, map and foldLeft on DataFrames. Or add all the columns in the SQL query so that the union can be avoided.
– Pavithran Ramachandran
Nov 9 at 21:07
@PavithranRamachandran please see the updated question. I have added the info.
– alexanoid
Nov 10 at 6:47
1 Answer
accepted
Here is one possible method without changing too much of your logic.
First you will have to assign a unique ID to the users table, as shown below:
import org.apache.spark.sql.functions._
val userstable = spark.sql("select * from users")
val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())
userswithId.createOrReplaceTempView("users")
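One caveat worth noting about this step: monotonically_increasing_id() guarantees IDs that are unique and monotonically increasing, but not consecutive (the partition ID is encoded in the upper bits of the value), which is fine here since the ID is only used as a grouping and join key.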
Now your tags map and buildTagQuery stay the same as above:
val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions; any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String =
  f"FROM $table WHERE $tagCondition"
Here we select only the UniqueID and tag columns:
import org.apache.spark.sql.types.StringType

val userTags = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
      .select("UniqueID", "tag")
}
This part is important. There was a subtle bug in your original code: because userTags.head was passed as the initial value while the fold still ran over all of userTags, the head of the list was folded twice. Here the head is taken into a separate variable and then dropped from userTags. The folding logic is the same as before, but the head element is no longer unioned twice.
val headhere = userTags.head
val userResults = userTags.drop(1)
val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
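As an aside, here is a minimal sketch of an equivalent union that sidesteps the head-twice pitfall altogether, assuming userTags is non-empty:

// reduce uses the first element as the seed automatically,
// so nothing is unioned twice; it fails on an empty collection.
val unionDf2Alt = userTags.reduce(_ union _)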
Now we group by the UniqueID column while aggregating the tags into a list:
val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))
println("Printing the unionDf3 result")
unionDf3.show(25)
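If the generated column name collect_list(tag) is awkward downstream, the aggregate can be aliased; a small sketch (the alias name tags is my own choice):

val unionDf3Named = unionDf2.groupBy("UniqueID")
  .agg(collect_list("tag").alias("tags"))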
Finally we join the users table carrying the UniqueID we assigned earlier (i.e. userswithId) with the previous DataFrame to get the final result:
val finalResult = userswithId.join(unionDf3, "UniqueID")
println("Printing the final result")
finalResult.show(25)

The final result is as below:
+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID| name| email| phone| country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
| 0| Alex| alex@example.com|+91-9999999998| France| [big]|
| 1| John| john@example.com| +1-1111111111|United States| [big]|
| 2|Donald|donald@example.com| +1-2222222222|United States| [big, sometag]|
| 4| Scott| scott@example.com|+91-9111999998| Spain| [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
answered Nov 10 at 15:21
user238607

Thanks for your answer! Could you please compare, from a performance point of view, your solution and a groupBy over all columns, something like: val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags")) ?
– alexanoid
Nov 10 at 15:31
I think you will have to benchmark it. In my solution, there is unique ID generation, a groupBy on a smaller table over a single numeric column, and a join to get the final result. In your case, you are grouping by all the columns except the tag column and then aggregating the tags list. I can't really tell which solution would be faster in practice.
– user238607
Nov 10 at 16:00
Thanks, I feel the simple join (your solution) will work much faster. Thanks again!
– alexanoid
Nov 10 at 16:02
@alexanoid: In case the DataFrame unionDf3 is small enough, you can use a broadcast join to make it slightly faster. stackoverflow.com/questions/32435263/…
– user238607
Nov 10 at 16:05
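A minimal sketch of that suggestion, assuming unionDf3 comfortably fits in memory on the executors:

import org.apache.spark.sql.functions.broadcast

// Hint the planner to ship the small tag table to every executor
// instead of shuffling both sides of the join.
val finalResultFast = userswithId.join(broadcast(unionDf3), "UniqueID")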