Find a collection of sets such that there is maximum intersection of elements among the selected sets










0















I have roughly 300,000 (300K) sets, each containing 0-100 elements.



s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y


My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?

i.e.

Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.



For example : From the sets shown above,



  • Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]


  • Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]


Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.



300K choose 5000 = O(10^11041)


Are there any smart data structures and algorithms that I can use to get the desired collection of sets?



Also, is there are any available python library that I can use for this?










share|improve this question
























  • What have you tried already?

    – Yakov Dan
    Nov 13 '18 at 10:38






  • 1





    I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.

    – jaggi
    Nov 13 '18 at 10:41












  • I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000

    – Yakov Dan
    Nov 13 '18 at 11:11







  • 1





    If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.

    – migron
    Nov 13 '18 at 11:22












  • There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well

    – juvian
    Nov 13 '18 at 14:50















0















I have roughly 300,000 (300K) sets, each containing 0-100 elements.



s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y


My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?

i.e.

Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.



For example : From the sets shown above,



  • Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]


  • Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]


Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.



300K choose 5000 = O(10^11041)


Are there any smart data structures and algorithms that I can use to get the desired collection of sets?



Also, is there are any available python library that I can use for this?










share|improve this question
























  • What have you tried already?

    – Yakov Dan
    Nov 13 '18 at 10:38






  • 1





    I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.

    – jaggi
    Nov 13 '18 at 10:41












  • I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000

    – Yakov Dan
    Nov 13 '18 at 11:11







  • 1





    If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.

    – migron
    Nov 13 '18 at 11:22












  • There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well

    – juvian
    Nov 13 '18 at 14:50













0












0








0


0






I have roughly 300,000 (300K) sets, each containing 0-100 elements.



s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y


My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?

i.e.

Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.



For example : From the sets shown above,



  • Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]


  • Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]


Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.



300K choose 5000 = O(10^11041)


Are there any smart data structures and algorithms that I can use to get the desired collection of sets?



Also, is there are any available python library that I can use for this?










share|improve this question
















I have roughly 300,000 (300K) sets, each containing 0-100 elements.



s1=a,b,x,y
s2=a
s3=a,x,y
s4=x,y


My question is, How do I efficiently find a collection of sets (say I need a collection 5000 sets from 300K sets) where there is maximum intersection of elements among those selected sets?

i.e.

Among all possible combinations of 5000 sets that can be picked from 300K sets, I need that one collection of 5000 sets such that intersection(number of common elements) among it's 5000 sets is greater(>=) than any other combination of 5000 sets that are possible from 300K sets.



For example : From the sets shown above,



  • Say I need 2 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3 with [common elements=a,x,y, common elements count=3]


  • Say I need 3 sets where there is maximum intersection of elements among them.The resulting collection would be -> C = s1,s3,s4 with [common elements=x,y, common elements count=2]


Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.



300K choose 5000 = O(10^11041)


Are there any smart data structures and algorithms that I can use to get the desired collection of sets?



Also, is there are any available python library that I can use for this?







python algorithm data-structures set set-intersection






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 13 '18 at 10:07







jaggi

















asked Nov 13 '18 at 9:54









jaggijaggi

104112




104112












  • What have you tried already?

    – Yakov Dan
    Nov 13 '18 at 10:38






  • 1





    I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.

    – jaggi
    Nov 13 '18 at 10:41












  • I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000

    – Yakov Dan
    Nov 13 '18 at 11:11







  • 1





    If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.

    – migron
    Nov 13 '18 at 11:22












  • There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well

    – juvian
    Nov 13 '18 at 14:50

















  • What have you tried already?

    – Yakov Dan
    Nov 13 '18 at 10:38






  • 1





    I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.

    – jaggi
    Nov 13 '18 at 10:41












  • I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000

    – Yakov Dan
    Nov 13 '18 at 11:11







  • 1





    If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.

    – migron
    Nov 13 '18 at 11:22












  • There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well

    – juvian
    Nov 13 '18 at 14:50
















What have you tried already?

– Yakov Dan
Nov 13 '18 at 10:38





What have you tried already?

– Yakov Dan
Nov 13 '18 at 10:38




1




1





I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.

– jaggi
Nov 13 '18 at 10:41






I can't do brute force as I mentioned in the question itself. I had been searching online & on Google scholar , hadn't found useful literature to go on further, hence asking here.

– jaggi
Nov 13 '18 at 10:41














I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000

– Yakov Dan
Nov 13 '18 at 11:11






I dont think this is solvable much faster than brute force. This problem is at least beyond NP because not only is it difficult to find a solution, it is difficult to even verify a solution. It would be solvable quickly if you were willing to accept a collection of any size, not judt exactly 5000

– Yakov Dan
Nov 13 '18 at 11:11





1




1





If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.

– migron
Nov 13 '18 at 11:22






If you were able to sort them by size and start checking the intersection size of the larger ones, when you get the first 5000 intersections, you'd be in a position to discard all the sets that are smaller than the smallest intersection. Worst case scenario would still be = to brute force, though.

– migron
Nov 13 '18 at 11:22














There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well

– juvian
Nov 13 '18 at 14:50





There is no efficient solution for this. You can make some optimizations like removing elements that do not appear at least 5000 times, and keep track of best solution so far and discard sets with lower size than that, but would still take a long time to run. You could try to model it as an integer programming problem, which may be better than brute force but will probably take too long as well

– juvian
Nov 13 '18 at 14:50












0






active

oldest

votes











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278278%2ffind-a-collection-of-sets-such-that-there-is-maximum-intersection-of-elements-am%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























0






active

oldest

votes








0






active

oldest

votes









active

oldest

votes






active

oldest

votes















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53278278%2ffind-a-collection-of-sets-such-that-there-is-maximum-intersection-of-elements-am%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Use pre created SQLite database for Android project in kotlin

Darth Vader #20

Ondo