Apache Spark column with collected info vs rows union
I have the following DataFrame:
+------+------------------+--------------+-------------+
| name| email| phone| country|
+------+------------------+--------------+-------------+
| Mike| mike@example.com|+91-9999999999| Italy|
| Alex| alex@example.com|+91-9999999998| France|
| John| john@example.com| +1-1111111111|United States|
|Donald|donald@example.com| +1-2222222222|United States|
| Dan| dan@example.com|+91-9999444999| Poland|
| Scott| scott@example.com|+91-9111999998| Spain|
| Rob| rob@example.com|+91-9114444998| Italy|
+------+------------------+--------------+-------------+
after applying the following transformation:
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions; any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String =
  f"FROM $table WHERE $tagCondition"

val userTags = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
}

val unionDf = userTags.foldLeft(userTags.head) {
  case (acc, df) => acc.union(df)
}
I receive the following DataFrame:
+------+------------------+--------------+-------------+-------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+-------+
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Mike| mike@example.com|+91-9999999999| Italy| big|
| Alex| alex@example.com|+91-9999999998| France| big|
| John| john@example.com| +1-1111111111|United States| big|
|Donald|donald@example.com| +1-2222222222|United States| big|
| Scott| scott@example.com|+91-9111999998| Spain| big|
| Rob| rob@example.com|+91-9114444998| Italy| big|
| Dan| dan@example.com|+91-9999444999| Poland| medium|
| Scott| scott@example.com|+91-9111999998| Spain| medium|
|Donald|donald@example.com| +1-2222222222|United States|sometag|
+------+------------------+--------------+-------------+-------+
which duplicates each original DataFrame record with the additional information in the tag column, but I need something like this (no duplicated records from the original DataFrame, and a collection of tags in the tag column):
+------+------------------+--------------+-------------+--------------+
| name| email| phone| country| tag|
+------+------------------+--------------+-------------+--------------+
| Mike| mike@example.com|+91-9999999999| Italy| [big]|
| Alex| alex@example.com|+91-9999999998| France| [big]|
| John| john@example.com| +1-1111111111|United States| [big]|
|Donald|donald@example.com| +1-2222222222|United States| [big,sometag]|
| Dan| dan@example.com|+91-9999444999| Poland| [medium]|
| Scott| scott@example.com|+91-9111999998| Spain| [big,medium]|
| Rob| rob@example.com|+91-9114444998| Italy| [big]|
+------+------------------+--------------+-------------+--------------+
Right now I don't know how to change my transformation in order to receive such a structure, with the tag column as an ArrayType, without duplicating the original rows.
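One candidate, raised in the comments below, would be to group the unioned DataFrame by all non-tag columns and collect the tags into a set; a sketch of that idea:

import org.apache.spark.sql.functions.{col, collect_set}

// One row per user, with all matching tags gathered into an array column.
val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*)
  .agg(collect_set("tag").alias("tags"))

but I am not sure how this behaves with thousands of tag conditions.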
scala apache-spark apache-spark-sql
edited Nov 10 at 11:29
asked Nov 9 at 19:29
alexanoid
Can you please post the code for buildTagQuery(tag, tagCondition, "tagged_users")? Maybe we can create a UDF that produces the tag column, by which we can avoid the union, map and foldLeft on DataFrames. Or add all the columns in the SQL query so that the union can be avoided.
– Pavithran Ramachandran
Nov 9 at 21:07
@PavithranRamachandran please see the updated question. I have added the info.
– alexanoid
Nov 10 at 6:47
1 Answer
accepted
Here is one possible method without changing too much of your logic.
First you will have to assign a unique ID to the users table, as shown below:
import org.apache.spark.sql.functions._
val userstable = spark.sql("select * from users")
val userswithId = userstable.withColumn("UniqueID", monotonically_increasing_id())
userswithId.createOrReplaceTempView("users")
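One caveat worth noting about this step: monotonically_increasing_id() guarantees IDs that are unique and monotonically increasing, but not consecutive (the partition ID is encoded in the upper bits of the value), which is fine here since the ID is only used as a grouping and join key.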
Now your tags map and buildTagQuery stay the same as above:
val tags = Map(
  "big" -> "country IN (FROM big_countries)",
  "medium" -> "country IN (FROM medium_countries)",
  // a few thousand other tag keys and conditions; any SQL allowed in a WHERE clause (users create them in the application UI)
  "sometag" -> "name = 'Donald' AND email = 'donald@example.com' AND phone = '+1-2222222222'")

def buildTagQuery(tag: String, tagCondition: String, table: String): String =
  f"FROM $table WHERE $tagCondition"
Here we select only the UniqueID and tag columns:
import org.apache.spark.sql.types.StringType

val userTags = tags.map {
  case (tag, tagCondition) =>
    spark.sql(buildTagQuery(tag, tagCondition, "users"))
      .withColumn("tag", lit(tag).cast(StringType))
      .select("UniqueID", "tag")
}
This part is important. There was a subtle bug in your original code: because userTags.head was passed as the initial value while the fold still ran over all of userTags, the head of the list was folded twice. Here the head is taken into a separate variable and then dropped from userTags. The folding logic is the same as before, but the head element is no longer unioned twice.
val headhere = userTags.head
val userResults = userTags.drop(1)
val unionDf2 = userResults.foldLeft(headhere) {
  case (acc, df) => acc.union(df)
}
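As an aside, here is a minimal sketch of an equivalent union that sidesteps the head-twice pitfall altogether, assuming userTags is non-empty:

// reduce uses the first element as the seed automatically,
// so nothing is unioned twice; it fails on an empty collection.
val unionDf2Alt = userTags.reduce(_ union _)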
Now we group by the UniqueID column while aggregating the tags into a list:
val unionDf3 = unionDf2.groupBy("UniqueID").agg(collect_list("tag"))
println("Printing the unionDf3 result")
unionDf3.show(25)
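If the generated column name collect_list(tag) is awkward downstream, the aggregate can be aliased; a small sketch (the alias name tags is my own choice):

val unionDf3Named = unionDf2.groupBy("UniqueID")
  .agg(collect_list("tag").alias("tags"))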
Finally we join the users table carrying the UniqueID we assigned earlier (i.e. userswithId) with the previous DataFrame to get the final result:
val finalResult = userswithId.join(unionDf3, "UniqueID")
println("Printing the final result")
finalResult.show(25)

The final result is as below:
+--------+------+------------------+--------------+-------------+-----------------+
|UniqueID| name| email| phone| country|collect_list(tag)|
+--------+------+------------------+--------------+-------------+-----------------+
| 0| Alex| alex@example.com|+91-9999999998| France| [big]|
| 1| John| john@example.com| +1-1111111111|United States| [big]|
| 2|Donald|donald@example.com| +1-2222222222|United States| [big, sometag]|
| 4| Scott| scott@example.com|+91-9111999998| Spain| [big, medium]|
+--------+------+------------------+--------------+-------------+-----------------+
answered Nov 10 at 15:21
user238607

Thanks for your answer! Could you please compare, from a performance point of view, your solution and a groupBy over all columns, something like: val taggedUsers = unionDf.groupBy(unionDf.columns.diff(Seq("tag")).map(col): _*).agg(collect_set("tag").alias("tags")) ?
– alexanoid
Nov 10 at 15:31
I think you will have to benchmark it. In my solution, there is unique ID generation, a groupBy on a smaller table over a single numeric column, and a join to get the final result. In your case, you are grouping by all the columns except the tag column and then aggregating the tags list. I can't really tell which solution would be faster in practice.
– user238607
Nov 10 at 16:00
Thanks, I feel the simple join (your solution) will work much faster. Thanks again!
– alexanoid
Nov 10 at 16:02
@alexanoid: In case the DataFrame unionDf3 is small enough, you can use a broadcast join to make it slightly faster. stackoverflow.com/questions/32435263/…
– user238607
Nov 10 at 16:05
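A minimal sketch of that suggestion, assuming unionDf3 comfortably fits in memory on the executors:

import org.apache.spark.sql.functions.broadcast

// Hint the planner to ship the small tag table to every executor
// instead of shuffling both sides of the join.
val finalResultFast = userswithId.join(broadcast(unionDf3), "UniqueID")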