I need someone to write a simple program that does the following -
Scan a .txt file that contains one URL per line. Check whether the domain name on any line is duplicated, and if it is, remove every line whose URL contains that domain.
For example - the program would check each line for domain.tld (tld being the domain extension, such as .com, .org, .net, .info etc). If two or more lines contain the same domain.tld, then the ENTIRE LINE is deleted for each of them (not just the domain.tld).
Each URL will consist of more than just the domain and tld. For example, a URL may look like http://domain.tld/forum2/page1.html. It's important that if domain.tld is found on 2 or more lines, the ENTIRE line is removed each time, not just the domain.tld portion.
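To make the "domain.tld" part concrete, here is a rough Python sketch of pulling it out of a full URL. The function name domain_key is made up for illustration, and it assumes the domain is simply the last two dot-separated parts of the host name, so www.domain.tld and domain.tld count as the same domain. That assumption would treat an extension like .co.uk as the domain itself, so it may need adjusting.

```python
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Return the last two labels of the host, e.g. 'domain.tld' (hypothetical helper)."""
    url = url.strip()
    # urlsplit only recognises the host when a scheme or a leading // is present.
    if "://" not in url:
        url = "//" + url
    # hostname is already lowercased; it is None when no host could be found.
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Assumption: the registered domain is the last two labels of the host.
    return ".".join(labels[-2:])
```

Called on http://domain.tld/forum2/page1.html, this returns domain.tld, and the rest of the line is left untouched for later comparison.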
Quite simply, I want a program that will check a text file with a list of URLs, find any duplicate domains, and then remove the ENTIRE line for every URL containing that domain and tld.
It should be something simple like: "if word1.word2 exists in 2 or more lines then remove all lines containing word1.word2".
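The rule above could be sketched in Python roughly as follows. This is only one possible approach, not a finished program: the names key, drop_duplicate_domains and clean_file are invented for the example, the host pattern is a simplification, and "word1.word2" is assumed to mean the last two dot-separated parts of the host name.

```python
import re
from collections import Counter
from pathlib import Path

# Simplified pattern (an assumption): optional scheme, then the host up to / : ? or #
_HOST = re.compile(r"^(?:[a-z][a-z0-9+.-]*://)?([^/:?#\s]+)", re.IGNORECASE)


def key(line: str) -> str:
    """Return 'word1.word2' for a line, or '' if no host is found."""
    match = _HOST.match(line.strip())
    host = match.group(1).lower() if match else ""
    return ".".join(host.split(".")[-2:])


def drop_duplicate_domains(lines: list[str]) -> list[str]:
    """Keep only the lines whose domain appears exactly once."""
    keys = [key(line) for line in lines]
    counts = Counter(keys)
    # Lines with no recognisable host (e.g. blank lines) are kept as they are.
    return [line for line, k in zip(lines, keys) if not k or counts[k] < 2]


def clean_file(src: str, dst: str) -> None:
    """Read URLs from src (one per line) and write the surviving lines to dst."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    kept = drop_duplicate_domains(lines)
    Path(dst).write_text("\n".join(kept) + "\n", encoding="utf-8")
```

Counting first and filtering second means every line of a duplicated domain is dropped, including the first one seen, which is what separates this from ordinary de-duplication.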
Let's say the .txt file looks like the following -
This program should remove line 1 & line 2, since the domain and tld are the same. The text file should be left looking like -
Notice how domain1.net and domain2.com were NOT removed since there was only ONE instance of each domain and tld. If a domain and TLD only exist ONCE, then they should NOT be removed.
If you need further instructions please message me.