Python function: identify duplicate strings from a DB by normalizing and comparing them, then return a list of duplicate "collections"

$30-100 USD

In Progress
Posted about 13 years ago

Paid on delivery
==================== BACKGROUND ====================

We are working with a DB table that contains events (theater plays, concerts, exhibits, etc.). This data is full of duplicates: the same event may be listed more than once, with either an identical title or a slightly different one. The objective of this small project is to implement a very simple mechanism (defined in detail below) that identifies some of these duplicates and populates them into another pre-defined DB table.

==================== SPECIFICS ====================

The goal of this project is, using Python, to find collections of records with the same title in a SQL table named "event", and to return lists of these collections of records. We would like to call this function get_touring_events, and it should take as input a variable to_file with default value True:

def get_touring_events(to_file = True):

What it should do:

**1.** Query our SQL DB for "active" event_ids and names and store the result as a list of (event_id, name) tuples. The query is as follows:

"select event_id, name from event where status_id = 'Active'"

Connection details will be provided below. Call the resulting list "events".

**2.** For each tuple (event_id, name) in "events", run the helper function "normalize" on the string name:

normalized_name = normalize(name)

The helper function normalize is described below. In the list "events", replace each name with the normalized_name that has just been calculated.

**3.** Sort the resulting list of tuples "events" alphabetically by name.

(SEE THE REST IN DETAILED REQUIREMENTS SECTION)

## Deliverables

**4.** Create an empty list named "touring_events", and go through the "events" list in order to find the event_ids with identical names. Since the "events" list is sorted alphabetically by name, this can be done in a single loop. We would like the "touring_events" list to contain lists of event_ids with the same name.

For example, the output

touring_events = [ [145,2,3], [4,56] ]

would indicate that event ids 145, 2, and 3 have the same (normalized) name, and ids 4 and 56 have the same normalized name.

**5.** If the variable "to_file" is True, create a file named "[login to view URL]" and write each list in touring_events on a separate line, with the ids separated by spaces. For example, in the above case, the output to [login to view URL] would look like:

145 2 3
4 56

**6.** If the variable "to_file" is False, insert this data back into the DB in the following way. For each list "tour_list" in touring_events, run the SQL code:

"insert into event_duplicates (event_ids,type,priority) VALUES ('@EIDS','touring',@PRIORITY)"

where:
- @EIDS is replaced by a string containing the elements of tour_list separated by commas,
- @PRIORITY is replaced by the length of tour_list.

**HELPER FUNCTION:**

The helper function "normalize" takes as input a string, and returns a string.

def normalize(input_string):

We would like this function to do the following:
- read in a provided file called "[login to view URL]" of words to exclude, each of which will be on a separate line; store the result in a list named "exclude_list"
- go through input_string and remove all punctuation, as well as extraneous whitespace from the beginning and end
- convert this input_string into a list of words; call this word_list
- make each word in word_list lowercase
- remove all occurrences of any element of "exclude_list" from word_list
- sort "word_list" alphabetically

RETURN ' '.join(word_list) (this is the words in word_list joined together with a single-space separator)

**SQL DB CONNECTION DETAILS:**

server name: will be provided shortly
username: will be provided shortly
password: will be provided shortly
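The normalization and single-loop grouping described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the required deliverable: the exclusion words are passed in as an argument because the file name is not shown here, the DB query and connection are left out because the connection details are not yet provided, and the names group_touring_events, to_file_lines and to_insert_params are hypothetical. It also assumes that only names shared by two or more events are reported, and that "punctuation" means ASCII punctuation.

```python
import string


def normalize(input_string, exclude_list):
    """Lowercase, strip punctuation, drop excluded words, sort, and re-join."""
    # Assumption: "punctuation" means the ASCII set in string.punctuation.
    table = str.maketrans("", "", string.punctuation)
    cleaned = input_string.translate(table).strip()
    excluded = {word.lower() for word in exclude_list}
    word_list = [word.lower() for word in cleaned.split()]
    word_list = [word for word in word_list if word not in excluded]
    word_list.sort()
    return " ".join(word_list)


def group_touring_events(events, exclude_list):
    """Group event_ids whose normalized names match.

    events is a list of (event_id, name) tuples, as returned by the query.
    """
    # Steps 2 and 3: normalize every name, then sort by normalized name.
    normalized = sorted(
        ((normalize(name, exclude_list), event_id) for event_id, name in events),
        key=lambda pair: pair[0],
    )
    # Step 4: one pass over the sorted list, collecting runs of equal names.
    touring_events = []
    current_name, current_ids = None, []
    for name, event_id in normalized:
        if name == current_name:
            current_ids.append(event_id)
        else:
            if len(current_ids) > 1:  # assumption: singletons are not duplicates
                touring_events.append(current_ids)
            current_name, current_ids = name, [event_id]
    if len(current_ids) > 1:
        touring_events.append(current_ids)
    return touring_events


def to_file_lines(touring_events):
    """Step 5: one line per group, ids separated by spaces."""
    return [" ".join(str(i) for i in tour_list) for tour_list in touring_events]


def to_insert_params(touring_events):
    """Step 6: (event_ids, type, priority) values for each insert."""
    return [
        (",".join(str(i) for i in tour_list), "touring", len(tour_list))
        for tour_list in touring_events
    ]


# Made-up sample data, for illustration only.
sample_events = [
    (145, "The Lion King"), (4, "Hamlet!"), (2, "Lion King, The"),
    (56, "hamlet"), (3, "lion king"), (7, "Cats"),
]
sample_groups = group_touring_events(sample_events, ["the"])
# sample_groups == [[4, 56], [145, 2, 3]]
```

Because the groups come out in alphabetical order of the normalized name, "hamlet" precedes "king lion" in the sample. Passing the values from to_insert_params to the DB driver as query parameters, rather than substituting them into the SQL string by hand, would avoid quoting problems, though which driver applies depends on the server.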
Project ID: 3230632

About the project

13 proposals
Remote project
Active 13 yrs ago

Awarded to:
$9 USD in 3 days, 5.0 (24 reviews), 4.5
13 freelancers are bidding on average $74 USD for this job
$102 USD in 3 days, 4.8 (455 reviews), 7.5
$212.50 USD in 3 days, 4.3 (56 reviews), 6.1
$68 USD in 3 days, 4.9 (28 reviews), 5.4
$84.15 USD in 3 days, 3.8 (23 reviews), 5.3
$76.50 USD in 3 days, 5.0 (9 reviews), 4.3
$38.25 USD in 3 days, 5.0 (15 reviews), 4.2
$84.15 USD in 3 days, 5.0 (6 reviews), 4.1
$41.17 USD in 3 days, 5.0 (3 reviews), 2.1
$34.85 USD in 3 days, 5.0 (4 reviews), 2.0
$51 USD in 3 days, 5.0 (1 review), 1.3
$76.50 USD in 3 days, 0.0 (0 reviews), 0.0
$84.15 USD in 3 days, 0.0 (0 reviews), 1.4

About the client

United States
4.9
48
Member since Jan 30, 2008
