How do I use HashSet to remove duplicates from a text file? (C#)

-3

I've decided to create a program that does quite a few things. One part of it is a section called "text tools", which takes a text file (via one button) and has additional buttons that perform other operations on it: removing whitespace and empty lines, removing duplicates, and removing lines that match a certain pattern, e.g. 123 or abc.

I'm able to import the file and print the list using a foreach loop, and I believe I'm along the right lines, but I need to remove duplicates. I've decided to use HashSet thanks to this thread, which says it's the simplest and fastest method (my file will contain millions of lines).

The problem is that I can't figure out what I'm doing wrong. I've got the event handler for the button click, created a list of strings in memory, looped through each line in the file (adding it to the list), then created another collection and set that to be the HashSet of the list. (Sorry if that's convoluted; it doesn't work for a reason.)

I've looked at every Stack Overflow question similar to this but I can't find a solution. I've also looked into HashSet in general, to no avail.

Here's my code so far:

    private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
    {
        List<string> list = new List<string>();

        foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
        {
            list.Add(line);
        }

        var DuplicatesRemoved = new HashSet<String>(list);
    }

edited Nov 15 '18 at 2:28

asked Nov 15 '18 at 2:11

College Ameteur

stackoverflow.com/questions/31052953/…

– Mitch Wheat
Nov 15 '18 at 2:13

docs.microsoft.com/en-us/dotnet/api/…

– mjwills
Nov 15 '18 at 2:19

cannot convert from 'System.Collections.Generic.List<string>' to 'System.Collections.Generic.IEqualityComparer<System.Windows.Documents.List>'

– College Ameteur
Nov 15 '18 at 2:20

2

"Respectfully I didn't open the question to ask for links that I've already found." If you are going to be snarky, at least provide the links that you have read. We aren't mind readers. :)

– mjwills
Nov 15 '18 at 2:24

3

I'd suggest dropping the List<string> altogether and using a HashSet<string> instead. You don't need the List. Note that HashSet could, in theory, return data in a different order than in the file (it won't with the current implementation, but it could in future).

– mjwills
Nov 15 '18 at 2:36
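
A minimal sketch of the suggestion above, assuming the goal is only to collect the unique lines. The set is built straight from File.ReadLines with string as its type argument, so no intermediate List is needed. The class and method names are hypothetical and not from the thread. The error quoted earlier mentions IEqualityComparer<System.Windows.Documents.List>, which suggests the set's type argument had resolved to a WPF type rather than string at that point; that reading is an inference, not something the thread confirms.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class UniqueLines
{
    // Hypothetical helper (not from the thread): reads the file lazily and
    // keeps one copy of each distinct line. The enumeration order of the
    // returned set is not guaranteed to match the order in the file.
    public static HashSet<string> Load(string path)
    {
        // The type argument is string, matching what File.ReadLines yields.
        return new HashSet<string>(File.ReadLines(path, Encoding.UTF8));
    }
}
```

Calling `UniqueLines.Load(FilePath)` inside the click handler would then replace both the List and the foreach loop.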


2 Answers

To be specific to your question, and to get my last 3 points.

    var lines = File.ReadAllLines("somepath");
    var hashSet = new HashSet<string>(lines);
    File.WriteAllLines("somepath", hashSet.ToList());

Note that there are other, and maybe more performant, ways of doing this. It depends on the number of duplicates and the size of the file.
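
One such alternative, sketched here under assumptions rather than taken from the answer, is LINQ's Distinct over the lazy File.ReadLines sequence, which avoids holding a separate array of all lines. The class, method, and parameter names are hypothetical. The two paths are assumed to be different files, since the input stays open while the output is written. Current implementations of Distinct yield elements in first-seen order, but the documentation describes the result as unordered, so line order should not be relied on.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class DistinctLines
{
    // Hypothetical helper: inputPath and outputPath are assumed to differ,
    // because File.ReadLines keeps the input open while lines are written.
    public static void Write(string inputPath, string outputPath)
    {
        // Distinct tracks seen lines internally and skips repeats.
        File.WriteAllLines(outputPath, File.ReadLines(inputPath, Encoding.UTF8).Distinct());
    }
}
```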

answered Nov 15 '18 at 2:29

Michael Randall


2 things: 1) Would this write the files to the same path as it took them from? (just to clarify) 2) I used ReadLines above because people said it was faster, would there be any impact on performance between the two methods using a file that has millions of lines?

– College Ameteur
Nov 15 '18 at 2:32

1

@CollegeAmeteur millions of lines is a completely different optimization, and there may be several things involved in making this more efficient than ReadAllLines or ReadLines. What I suggest you do is download a benchmark tool and see what works for you.

– Michael Randall
Nov 15 '18 at 2:36


It is preferable to process the file as a stream if possible. I would not even call it an optimization; I would rather call it not being wasteful. If you can use the streaming approach, the ReadAllLines approach is somewhere between almost good and very bad, depending on the situation. It is also a good idea to preserve line order. HashSet generally does not preserve order: if you store everything in it and then read it back, the lines can come out shuffled.

    using (var outFile = new StreamWriter(outFilePath))
    {
        HashSet<string> seen = new HashSet<string>();
        foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
        {
            // Add returns false when the line is already in the set
            if (seen.Add(line))
            {
                outFile.WriteLine(line);
            }
        }
    }
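
The loop above writes to a separate output path. If the result has to end up back in the original file, as asked in the comments on the other answer, one possible approach is to stream into a temporary file and swap it in afterwards. The sketch below is only an illustration under that assumption: the class name, method name, and the ".tmp" suffix are hypothetical, and the delete-then-move step is not atomic.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class InPlaceDedup
{
    // Hypothetical helper, not part of the answer: streams unique lines to a
    // temporary file next to the original, then replaces the original.
    public static void RemoveDuplicateLines(string path)
    {
        string tempPath = path + ".tmp"; // assumed temp-file naming
        var seen = new HashSet<string>();

        using (var outFile = new StreamWriter(tempPath))
        {
            foreach (string line in File.ReadLines(path, Encoding.UTF8))
            {
                // Add returns false when the line was already seen,
                // so only the first occurrence is written, in file order.
                if (seen.Add(line))
                {
                    outFile.WriteLine(line);
                }
            }
        }

        // Both files are closed at this point, so the original can be replaced.
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Because each line is written the first time it is seen, the output keeps the original line order regardless of how HashSet enumerates its contents.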

answered Nov 15 '18 at 3:24

Antonín Lejsek

