I have a data set of news headlines extracted from various online news sources.
Since the data comes from different sources, it contains duplicates that convey the same information, so I would like help building an algorithm that clusters them into one.
For example, data from website A: Narendra modi favorite is tajmahal.
Data from website B: 7th world wonder Taj mahal is Narendra modi's favorite.
Both sentences mean the same thing, but they show up as two separate records. I want to cluster them into one.
I tried k-means, but I don't want to sit and analyse the required number of clusters every time; I want the number of clusters to be decided automatically.
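
To make the goal concrete, here is a rough sketch of one possible direction, not a definitive solution: compare headlines by how much their word sets overlap and merge any pair whose similarity passes a threshold, so the number of clusters falls out of the data instead of being chosen up front as in k-means. The names `tokens`, `similarity` and `cluster_headlines`, the stop-word list, and the 0.6 threshold are all hypothetical choices made for illustration.

```python
import re
from itertools import combinations

# Assumption: a tiny hand-picked stop-word list; a real one would be larger.
STOPWORDS = {"a", "an", "the", "is", "of", "s", "in", "to"}


def tokens(headline):
    """Lowercase the headline and return its set of non-stop-word tokens."""
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return {w for w in words if w not in STOPWORDS}


def similarity(a, b):
    """Overlap coefficient: shared tokens divided by the smaller set's size."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def cluster_headlines(headlines, threshold=0.6):
    """Group headlines whose pairwise similarity reaches the threshold.

    Uses union-find, so groups are merged transitively and the number of
    clusters is never specified in advance.
    """
    token_sets = [tokens(h) for h in headlines]
    parent = list(range(len(headlines)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(headlines)), 2):
        if similarity(token_sets[i], token_sets[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, headline in enumerate(headlines):
        clusters.setdefault(find(i), []).append(headline)
    return list(clusters.values())


headlines = [
    "Narendra modi favorite is tajmahal.",
    "7th world wonder Taj mahal is Narendra modi's favorite..",
    "Stock markets close higher on Friday",  # made-up unrelated headline
]
clusters = cluster_headlines(headlines)
for group in clusters:
    print(group)
```

The threshold replaces the choice of k: a higher value keeps only near-identical headlines together, a lower one merges more loosely related ones. This sketch compares every pair of headlines, so it scales quadratically, and transitive merging can chain loosely related headlines into one group. If scikit-learn is available, `AgglomerativeClustering` with `n_clusters=None` and a `distance_threshold`, or `DBSCAN`, applied to TF-IDF vectors follow the same idea of not fixing the number of clusters in advance.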