We have several collections storing documents that contain company entity information. Each collection records a different aspect of a company entity, such as executives, accountants, products, or investments. Because they store different kinds of information, the collections differ in field names and structure, but they also share some overlapping fields, such as company name, geographic location, contact information, industry keywords, and official website. We now need to link documents that refer to the same entity across all of these collections. The obstacles we have encountered are:

1. Because the data comes from different sources, names belonging to the same company appear differently across collections: some are full legal names, some are abbreviations, some are Pinyin transliterations, and some are just the initials of the English name. Exact matching on company name therefore cannot link all documents for an entity.

2. Different collections contain different fields, and not every collection has contact information or a website. Company name may be the only field shared by all collections, which makes it hard to establish a unified matching rule.
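For obstacle 1, a common first step is to normalize company names and then compare them with a token- or n-gram-based similarity rather than exact equality. The sketch below is a minimal, standard-library-only illustration of that idea; the suffix list and the 3-gram/Jaccard choices are illustrative assumptions, not tuned values, and a production pipeline would need a suffix list matching the actual data (including Chinese corporate suffixes).

```python
import re

# Illustrative suffix list; extend it with the suffixes present in your data.
CORP_SUFFIXES = ("inc", "ltd", "llc", "co", "corp", "corporation", "company")

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and strip common corporate suffixes."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    for suffix in CORP_SUFFIXES:
        name = re.sub(rf"\b{suffix}\b", " ", name)
    return " ".join(name.split())

def char_ngrams(s: str, n: int = 3) -> set:
    """Character n-grams tolerate abbreviations and transliterations
    better than whole-word tokens."""
    s = s.replace(" ", "")
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 1.0 means identical."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two surface forms of the same company normalize to the same key.
sim = jaccard(char_ngrams(normalize("Acme Corporation Ltd.")),
              char_ngrams(normalize("ACME corp")))
```

Normalization alone handles suffix and casing variants; the n-gram similarity then gives a graded score for partial matches (abbreviations, Pinyin) that a fixed rule would miss. Initials-only names generally need a separate acronym-generation step, since they share almost no n-grams with the full name.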
If we use the Apache Spark framework to solve this entity resolution problem, which algorithms offer the best balance of precision and feasibility? The largest collection contains around 20,000,000 documents.
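At 20M documents, all-pairs comparison is infeasible, so the standard Spark approach is blocking: Spark MLlib ships `MinHashLSH` (usable via `approxSimilarityJoin` in `pyspark.ml.feature`) to generate candidate pairs cheaply, after which only candidates get a full similarity comparison. As a language-agnostic illustration of the mechanism that implementation relies on, here is a pure-Python sketch; the hash-salting scheme and the 64-hash/16-band parameters are illustrative choices, not recommendations.

```python
import zlib

def minhash_signature(tokens: set, num_hashes: int = 64) -> tuple:
    """For each of num_hashes salted hash functions, keep the minimum hash
    value over the token set; the fraction of positions on which two
    signatures agree estimates the Jaccard similarity of the sets."""
    def h(salt: int, token: str) -> int:
        return zlib.crc32(f"{salt}:{token}".encode())
    return tuple(min(h(salt, t) for t in tokens) for salt in range(num_hashes))

def band_buckets(sig: tuple, bands: int = 16) -> set:
    """LSH banding: split the signature into bands; two records become a
    candidate pair if they collide on at least one band, so only those
    pairs need the expensive full comparison."""
    rows = len(sig) // bands
    return {(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)}

a = minhash_signature({"acme", "holdings", "beijing"})
b = minhash_signature({"acme", "holdings", "shanghai"})
shared = band_buckets(a) & band_buckets(b)  # any shared bucket -> candidate pair
```

In Spark the banding step becomes a groupBy on bucket keys, which distributes naturally across the cluster. Precision then comes from the second stage: scoring candidate pairs with a weighted combination of whatever fields the two collections happen to share, since company name is the only field guaranteed to be common.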
We need to find an outsourcing specialist who has:
1. Project experience in big data entity resolution on NoSQL databases
2. Over two years of experience with Apache Spark and MongoDB