Find a collection of sets such that there is maximum intersection of elements among the selected sets
I have roughly 300,000 (300K) sets, each containing 0-100 elements.
s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y
My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?
i.e.
Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.
For example : From the sets shown above,
Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]
Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]
Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.
300K choose 5000 = O(10^11041)
Are there any smart data structures and algorithms that I can use to get the desired collection of sets?
Also, is there are any available python library that I can use for this?
python algorithm data-structures set set-intersection
|
show 4 more comments
I have roughly 300,000 (300K) sets, each containing 0-100 elements.
s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y
My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?
i.e.
Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.
For example : From the sets shown above,
Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]
Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]
Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.
300K choose 5000 = O(10^11041)
Are there any smart data structures and algorithms that I can use to get the desired collection of sets?
Also, is there are any available python library that I can use for this?
python algorithm data-structures set set-intersection
What have you tried already?
– Yakov Dan
Nov 13 '18 at 10:38
1
I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.
– jaggi
Nov 13 '18 at 10:41
I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000
– Yakov Dan
Nov 13 '18 at 11:11
1
If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.
– migron
Nov 13 '18 at 11:22
There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well
– juvian
Nov 13 '18 at 14:50
|
show 4 more comments
I have roughly 300,000 (300K) sets, each containing 0-100 elements.
s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y
My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?
i.e.
Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.
For example : From the sets shown above,
Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]
Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]
Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.
300K choose 5000 = O(10^11041)
Are there any smart data structures and algorithms that I can use to get the desired collection of sets?
Also, is there are any available python library that I can use for this?
python algorithm data-structures set set-intersection
I have roughly 300,000 (300K) sets, each containing 0-100 elements.
s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y
My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?
i.e.
Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.
For example : From the sets shown above,
Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]
Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]
Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.
300K choose 5000 = O(10^11041)
Are there any smart data structures and algorithms that I can use to get the desired collection of sets?
Also, is there are any available python library that I can use for this?
python algorithm data-structures set set-intersection
python algorithm data-structures set set-intersection
edited Nov 13 '18 at 10:07
jaggi
asked Nov 13 '18 at 9:54
jaggijaggi
104112
104112
What have you tried already?
– Yakov Dan
Nov 13 '18 at 10:38
1
I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.
– jaggi
Nov 13 '18 at 10:41
I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000
– Yakov Dan
Nov 13 '18 at 11:11
1
If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.
– migron
Nov 13 '18 at 11:22
There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well
– juvian
Nov 13 '18 at 14:50
|
show 4 more comments
What have you tried already?
– Yakov Dan
Nov 13 '18 at 10:38
1
I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.
– jaggi
Nov 13 '18 at 10:41
I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000
– Yakov Dan
Nov 13 '18 at 11:11
1
If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.
– migron
Nov 13 '18 at 11:22
There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well
– juvian
Nov 13 '18 at 14:50
What have you tried already?
– Yakov Dan
Nov 13 '18 at 10:38
What have you tried already?
– Yakov Dan
Nov 13 '18 at 10:38
1
1
I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.
– jaggi
Nov 13 '18 at 10:41
I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.
– jaggi
Nov 13 '18 at 10:41
I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000
– Yakov Dan
Nov 13 '18 at 11:11
I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000
– Yakov Dan
Nov 13 '18 at 11:11
1
1
If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.
– migron
Nov 13 '18 at 11:22
If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.
– migron
Nov 13 '18 at 11:22
There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well
– juvian
Nov 13 '18 at 14:50
There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well
– juvian
Nov 13 '18 at 14:50
|
show 4 more comments
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278278%2ffind-a-collection-of-sets-such-that-there-is-maximum-intersection-of-elements-am%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278278%2ffind-a-collection-of-sets-such-that-there-is-maximum-intersection-of-elements-am%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What have you tried already?
– Yakov Dan
Nov 13 '18 at 10:38
1
I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.
– jaggi
Nov 13 '18 at 10:41
I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000
– Yakov Dan
Nov 13 '18 at 11:11
1
If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.
– migron
Nov 13 '18 at 11:22
There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well
– juvian
Nov 13 '18 at 14:50