1. Use the twitter message table in the twitter database:
• Use SQL to get the id of the user who posted the most tweets. You may use only one query to get the id.
• Extract the msg date and the tweets from the database into R.
• Remove stop words and non-English words from all tweets.
• Calculate the length (number of words) of each tweet. The output should be a data frame with 2 columns (date and tweet length).
• Download SPX data using an R package (such as quantmod). The date range should match the date range of the tweet data frame from the previous step. Then calculate the daily return of SPX.
• Merge the stock data frame and the tweet data frame by date, then run a linear regression with return as the dependent variable and tweet length as the independent variable.
• Show the summary of the linear regression.
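The steps of problem 1 could be sketched roughly as follows. This is only one possible outline, not a reference solution. The table name `twitter_message` and the columns `user_id`, `msg_date`, and `tweet` are assumptions, as are the helper names, the tiny stop-word list, and the `LIMIT 1` syntax (MySQL, SQLite, and PostgreSQL accept it; other databases may not). Keeping only tokens made of ASCII letters is a crude stand-in for removing non-English words.

```r
# Hypothetical table and column names: twitter_message(user_id, msg_date, tweet).
# One query returning the id of the user with the most tweets (ties not handled).
top_user_sql <- "
  SELECT user_id
  FROM twitter_message
  GROUP BY user_id
  ORDER BY COUNT(*) DESC
  LIMIT 1"
tweets_sql <- "SELECT msg_date, tweet FROM twitter_message"

# Deliberately tiny stop-word list for illustration only; a fuller list
# would be needed in practice.
stop_words <- c("a", "an", "the", "is", "are", "of", "to", "and", "in", "on")

# Number of words left in one tweet after lower-casing, stripping punctuation,
# dropping tokens that are not purely ASCII letters, and dropping stop words.
tweet_length <- function(text) {
  tokens <- strsplit(tolower(text), "[[:space:]]+")[[1]]
  tokens <- gsub("[[:punct:]]+", "", tokens)
  tokens <- tokens[grepl("^[a-z]+$", tokens, perl = TRUE)]
  sum(!tokens %in% stop_words)
}

# Two-column data frame: date and tweet length.
tweet_lengths <- function(tweets) {
  data.frame(
    date    = as.Date(tweets$msg_date),
    n_words = vapply(tweets$tweet, tweet_length, integer(1), USE.NAMES = FALSE)
  )
}

# Simple daily return P_t / P_(t-1) - 1; the first day has no return.
simple_returns <- function(price) {
  c(NA, diff(price) / head(price, -1))
}

# Download S&P 500 closing prices from Yahoo via quantmod ("^GSPC" is the
# Yahoo symbol for the index) and compute daily returns.
spx_returns <- function(from, to) {
  spx <- quantmod::getSymbols("^GSPC", src = "yahoo", from = from, to = to,
                              auto.assign = FALSE)
  data.frame(date = as.Date(zoo::index(spx)),
             ret  = simple_returns(as.numeric(quantmod::Cl(spx))))
}

# Merge by date and regress return on tweet length; lm() drops NA rows.
fit_regression <- function(lengths, returns) {
  merged <- merge(returns, lengths, by = "date")
  lm(ret ~ n_words, data = merged)
}

# Usage, assuming `con` is an open DBI connection to the twitter database:
# top_user <- DBI::dbGetQuery(con, top_user_sql)
# tweets   <- DBI::dbGetQuery(con, tweets_sql)
# lengths  <- tweet_lengths(tweets)
# returns  <- spx_returns(min(lengths$date), max(lengths$date))
# summary(fit_regression(lengths, returns))
```

Because tweets can be posted on non-trading days, the merge keeps only dates present in both data frames.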
2. • Write an R function to extract data from the accounting table of the SQL database.
• The inputs are two vectors, year and sector.
• The output is a data frame with 5 columns: sector, ticker, sales, size, ratio.
• The function should insert the input year and sector vectors into a SQL query, pass the query to the database, and fetch the data. The data comes from the accounting table, with year and sector equal to your inputs. If an input has more than one value, select the rows whose sector and year are both included in the input vectors.
• After you fetch the data, calculate size as the logarithm of totalassets and ratio as totalliabilities/totalassets. Sales comes directly from the data.
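One way the function in problem 2 might be laid out is sketched below. The table name `accounting` and the columns `totalassets` and `totalliabilities` come from the text above; the column names `year`, `sector`, `ticker`, and `sales`, the helper names, and the DBI connection `con` are assumptions. The query text is built with `IN (...)` lists so that vectors of any length work.

```r
# Turn an R vector into the inside of a SQL IN (...) list.
# Numbers are written as integers; strings are single-quoted, with embedded
# single quotes doubled.
sql_in_list <- function(x) {
  if (is.numeric(x)) {
    paste(as.integer(x), collapse = ", ")
  } else {
    paste0("'", gsub("'", "''", as.character(x), fixed = TRUE), "'",
           collapse = ", ")
  }
}

# Build the SELECT statement for the requested years and sectors.
build_accounting_query <- function(year, sector) {
  sprintf(
    paste("SELECT sector, ticker, sales, totalassets, totalliabilities",
          "FROM accounting",
          "WHERE year IN (%s) AND sector IN (%s)"),
    sql_in_list(year), sql_in_list(sector)
  )
}

# Derive size and ratio from the raw columns and keep the 5 output columns.
# log() returns -Inf or NaN (with a warning) for non-positive totalassets.
derive_columns <- function(raw) {
  data.frame(
    sector = raw$sector,
    ticker = raw$ticker,
    sales  = raw$sales,
    size   = log(raw$totalassets),
    ratio  = raw$totalliabilities / raw$totalassets
  )
}

# Full function: `con` is assumed to be an open DBI connection.
get_accounting <- function(con, year, sector) {
  raw <- DBI::dbGetQuery(con, build_accounting_query(year, sector))
  derive_columns(raw)
}

# Usage:
# get_accounting(con, year = c(2010, 2011), sector = c("Tech", "Energy"))
```

Pasting values into the query text keeps the sketch short; a database driver's own quoting or parameter binding would be the safer choice for untrusted input.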
3. • Use the previous function to fetch the data.
• Randomly pick 3 sectors and 1 year from the database. Use SQL queries to get all unique sectors and years, then use an R function such as sample() to draw the sample. Remove NULL values from sector and year.
• Pass the sampled sectors and year to the function to get the output. If any sector has fewer than 100 observations, regenerate the data.
• Split the data into training (50%) and testing (50%) datasets.
• Pick a classification method, train the model on the training dataset, and predict the sector of the testing dataset. You can pick any classification method that we learned or a new method. Sales, size, and ratio are the 3 input variables; sector is the output label.
• Write an R function to repeat the previous steps (starting from splitting the data into training and testing sets). The input is N (the number of repetitions). The output is the prediction accuracy rate on the testing dataset. Inside the function, run the classification N times on re-sampled datasets to get N accuracy rates, then return their average.
• Run the function with input N = 5 and show the result.
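A rough sketch of how problem 3 could fit together is given below, under several assumptions. Linear discriminant analysis (`MASS::lda`) is used only as an example classifier; any other method would slot into `split_and_score` the same way. The helper names are hypothetical, `con` is assumed to be a DBI connection, and `fetch` stands for the function written in problem 2, called as `fetch(con, year, sector)`. The data frame is passed to `average_accuracy` explicitly alongside N, which is a design choice of this sketch rather than something the task requires.

```r
# Draw 3 sectors and 1 year from the distinct non-NULL values in the table.
# Indexing with sample.int() avoids the sample(x, 1) pitfall when x is a
# single number.
sample_inputs <- function(con) {
  sectors <- DBI::dbGetQuery(
    con, "SELECT DISTINCT sector FROM accounting WHERE sector IS NOT NULL")$sector
  years <- DBI::dbGetQuery(
    con, "SELECT DISTINCT year FROM accounting WHERE year IS NOT NULL")$year
  list(sector = sample(sectors, 3),
       year   = years[sample.int(length(years), 1)])
}

# Redraw until every sampled sector has at least `min_obs` usable rows.
# Rows with missing or non-finite predictors are dropped before counting.
draw_data <- function(con, fetch, min_obs = 100, max_tries = 50) {
  for (i in seq_len(max_tries)) {
    inp <- sample_inputs(con)
    dat <- fetch(con, inp$year, inp$sector)
    ok  <- is.finite(dat$sales) & is.finite(dat$size) & is.finite(dat$ratio)
    dat <- dat[ok, ]
    counts <- table(factor(dat$sector, levels = inp$sector))
    if (all(counts >= min_obs)) return(dat)
  }
  stop("no draw met the minimum observation count")
}

# One random 50/50 split, one model fit, one accuracy on the test half.
split_and_score <- function(dat) {
  dat$sector <- factor(dat$sector)
  train_idx  <- sample(nrow(dat), size = floor(nrow(dat) / 2))
  train <- dat[train_idx, ]
  test  <- dat[-train_idx, ]
  fit   <- MASS::lda(sector ~ sales + size + ratio, data = train)
  pred  <- predict(fit, newdata = test)$class
  mean(as.character(pred) == as.character(test$sector))
}

# Repeat the split-and-score step N times and average the accuracy rates.
average_accuracy <- function(N, dat) {
  mean(replicate(N, split_and_score(dat)))
}

# Usage, with get_accounting() standing for the function from problem 2:
# dat <- draw_data(con, fetch = get_accounting)
# average_accuracy(5, dat)
```

Each call to `split_and_score` draws a fresh random split, so the N accuracy rates differ from run to run unless a seed is set with `set.seed()`.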