Closed

Python pdf/web scraping script

This project received 7 bids from talented freelancers with an average bid price of $120 USD.

Project Budget: $30 - $250 USD
Total Bids: 7
Project Description

I need a script written in Python 2 to extract snap counts for NFL players from American football games.

The script, when given a URL to a PDF file, will scrape data from the PDF and [url removed, login to view] and insert the data into a PostgreSQL database.

The script needs to accept two string arguments:

scrape_data(url, game_id)

Data source example:

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

The specific data that needs to be scraped from the PDF is "Playtime Percentage", which is typically located on the last page(s) of the PDF. In addition to scraping the PDF, each player needs to be searched on [url removed, login to view] and their unique GSIS ID needs to be scraped from their [url removed, login to view] player page.

For example the GSIS ID for Cam Newton is: 00-0027939

As found in the HTML here: [url removed, login to view]
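A minimal sketch of pulling the GSIS ID out of a player page once its HTML has been downloaded. It assumes only that the ID appears somewhere in the page text in the same shape as the example above (two digits, a hyphen, seven digits); the function name `find_gsis_id` and the sample HTML in the usage note are hypothetical, and the real markup around the ID may differ.

```python
import re

# Assumed shape, taken from the example "00-0027939":
# two digits, a hyphen, then seven digits.
GSIS_ID_PATTERN = re.compile(r"\b\d{2}-\d{7}\b")


def find_gsis_id(html):
    """Return the first GSIS-shaped ID in the HTML, or None if absent."""
    match = GSIS_ID_PATTERN.search(html)
    return match.group(0) if match else None
```

For instance, `find_gsis_id("<!-- id: 00-0027939 -->")` yields the ID string, and a page with no such pattern yields `None`, which the caller can treat as "player not resolved".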

Please be aware that some players have very similar names. Therefore, when searching for a player to obtain their GSIS ID, you need to ensure it belongs to the correct player, as the PDF only gives a first initial and last name. You can achieve this by searching [url removed, login to view] and verifying that the player's position matches the PDF and that their game logs on nfl.com show they played in the game covered by the PDF. Game dates, opponents and other identifying information useful for player identification can all be found in the PDF. Also, please be mindful that some PDF files fed into the script may be several years old, and players may have changed teams since then, so simply searching [url removed, login to view] by player name and team is not an adequate solution.
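The matching rules above could be sketched roughly as follows. This covers only the filtering step and assumes the search results and game logs have already been scraped into plain dictionaries. The field names (`first_name`, `last_name`, `position`, `game_dates`, `gsis_id`), the `"C Newton"` name format and the date format are all assumptions made for illustration, not something the site or the PDF is known to provide in this form.

```python
def split_pdf_name(pdf_name):
    """Split a PDF name such as 'C Newton' into ('C', 'Newton')."""
    initial, _, last_name = pdf_name.strip().partition(" ")
    return initial.rstrip(".").upper(), last_name.strip()


def pick_player(pdf_name, position, game_date, candidates):
    """Return the GSIS ID of the single matching candidate, else None.

    A candidate matches only if the last name, first initial and
    position agree with the PDF and the game date appears in the
    candidate's game log. Team is deliberately not compared, since
    players may have changed teams since the game.
    """
    initial, last_name = split_pdf_name(pdf_name)
    matches = [
        c for c in candidates
        if c["last_name"].lower() == last_name.lower()
        and c["first_name"][:1].upper() == initial
        and c["position"] == position
        and game_date in c["game_dates"]
    ]
    # Zero or several matches are both treated as unresolved.
    return matches[0]["gsis_id"] if len(matches) == 1 else None
```

Returning `None` when more than one candidate survives keeps an ambiguous player from being written to the database under the wrong ID.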

To obtain the gamekey, extract it from the url argument.

For example:

[url removed, login to view]

The above URL has gamekey: 56505
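Because the example URL is not shown here, the following is only a sketch of one way the gamekey might be extracted. It assumes the gamekey is the run of digits in the last path segment of the URL (for example a file named `56505.pdf`); the function name `extract_gamekey` and the `example.com` URL in the usage note are hypothetical, and the pattern should be adjusted to the real URL layout.

```python
import posixpath
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def extract_gamekey(url):
    """Return the digits in the URL's last path segment, or None.

    Assumption: the gamekey is embedded in the file name, as in
    ".../56505.pdf". This is a guess at the layout, not a known rule.
    """
    filename = posixpath.basename(urlparse(url).path)
    match = re.search(r"\d+", filename)
    return match.group(0) if match else None
```

With that assumption, `extract_gamekey("http://example.com/gamebooks/56505.pdf")` gives `"56505"` as a string, matching the two string arguments the script accepts.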

The database should be structured as such:

Table:

snap_counts

Columns:

game_id - This will be an argument passed to the script.

gamekey - This is extracted from the url argument.

player_id - Player’s unique GSIS ID obtained by scraping the player’s NFL profile page.

player_name - This is the 1st column of the Play Percentage page in the PDF.

position - This is the 2nd column of the Play Percentage page.

team - This is the team the player played for at the time of the game.

off_snaps – This is the 3rd column of the Play Percentage page (0 if blank).

off_pct - This is the 4th column of the Play Percentage page (0 if blank).

def_snaps - This is the 5th column of the Play Percentage page (0 if blank).

def_pct - This is the 6th column of the Play Percentage page (0 if blank).

spt_snaps - This is the 7th column of the Play Percentage page (0 if blank).

spt_pct - This is the 8th column of the Play Percentage page (0 if blank).
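One possible way to express this table and its insert statement in Python is sketched below. The column names and order come from the list above; the column types (text for IDs and names, integers for snap counts and percentages) are assumptions, as is the helper name `row_to_params`. The `%s` placeholders follow the parameter style used by the psycopg2 driver.

```python
# Column names in the order listed in the specification.
COLUMNS = (
    "game_id", "gamekey", "player_id", "player_name", "position", "team",
    "off_snaps", "off_pct", "def_snaps", "def_pct", "spt_snaps", "spt_pct",
)

# Column types are assumptions; the specification does not state them.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS snap_counts (
    game_id     TEXT NOT NULL,
    gamekey     TEXT NOT NULL,
    player_id   TEXT NOT NULL,
    player_name TEXT NOT NULL,
    position    TEXT NOT NULL,
    team        TEXT NOT NULL,
    off_snaps   INTEGER NOT NULL DEFAULT 0,
    off_pct     INTEGER NOT NULL DEFAULT 0,
    def_snaps   INTEGER NOT NULL DEFAULT 0,
    def_pct     INTEGER NOT NULL DEFAULT 0,
    spt_snaps   INTEGER NOT NULL DEFAULT 0,
    spt_pct     INTEGER NOT NULL DEFAULT 0
)
"""

# Parameterized insert; values are passed separately, never interpolated.
INSERT_SQL = "INSERT INTO snap_counts ({0}) VALUES ({1})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS))
)


def row_to_params(row):
    """Order a dict of column values to match INSERT_SQL's placeholders."""
    return tuple(row[name] for name in COLUMNS)
```

With a psycopg2 cursor, each scraped row would then be written with `cursor.execute(INSERT_SQL, row_to_params(row))`, followed by a commit on the connection.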

If the script encounters a PDF that doesn't have the Playtime Percentage stats, the script should return a value indicating such and not insert anything into the database. Blank or empty cells in the PDF's table shall be replaced by 0.
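The two rules in this paragraph could be handled along these lines. The sketch assumes a PDF library has already turned each page into a text string and each table row into a list of eight cell strings, and that percentages are written with a trailing `%`; the function names are hypothetical.

```python
def to_int(cell):
    """Convert a cell such as '45', '71%' or '' to an int (blank -> 0)."""
    text = (cell or "").strip().rstrip("%").strip()
    return int(text) if text else 0


def parse_playtime_row(cells):
    """Turn eight table cells into [name, position, six integers].

    Short rows are padded with blanks, so missing cells also become 0.
    """
    padded = list(cells) + [""] * (8 - len(cells))
    name, position = padded[0].strip(), padded[1].strip()
    numbers = [to_int(cell) for cell in padded[2:8]]
    return [name, position] + numbers


def has_playtime_section(page_texts):
    """Report whether any page mentions the Playtime Percentage table."""
    return any("Playtime Percentage" in text for text in page_texts)
```

`scrape_data` could call `has_playtime_section` before opening a database connection and return a sentinel such as `None` or `False` when it fails, so that nothing is inserted for PDFs without the stats.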

The script should adhere to PEP 8 standards.
