Final Exam: Due by 11:59 PM EST Wednesday December 15, 2021¶
This is a group exam. You can and should work with your MSBA groups.
When you are finished, you will submit two objects to eLC:
- Written PDF Report.
Treat the "report" as your main deliverable to executive management. These executives are smart but are non-technical and have little to no experience with machine learning. Your report needs to communicate your results and recommendations in a clear and intuitive way. A good litmus test is that you should be able to communicate the "big ideas" to other people in your lives (e.g., significant others, parents, non-technical friends).
Your analysis should be professional: i.e., well-written, clear, and concise. Figures should be incorporated in your analysis. Save your report as a pdf ("File/Save As Adobe PDF") with the naming convention 'Report_[insert last names]'. For example, 'Report_Thurk.pdf'.
To provide some structure, your report should include the following sections:
- Executive Summary: Provide a concise description of the problem, your solution methodology, why it works, and the answer.
- Data Description: Describe the data. Provide important figures to demonstrate key variation. The figures should look nice and professional. See the lecture on effective visualization for suggestions. Be sure to include a concise description of each figure and explain how the variation it depicts will be an important input to the model and ultimately affect the model's results.
- Model Description: Describe your preferred model. Pictures are good for demonstrating how your model works.
- Results: Discuss the characteristics and performance of your model. What features are most important? What do you find and what are the implications for your company?
- Jupyter Notebook of your Final Model. Submit your ipynb file using the following naming convention 'Model_[insert last names]'. For example, 'Model_Thurk.ipynb'. You need not base your notebook on the tutorial notebook included with the data. Your notebook should be self-contained: i.e., it should load the raw data provided, load any ancillary data you collected on your own, do any data manipulation, initiate and tune your model, and test your model (a minimal skeleton is sketched below). I will modify the last part to test your model on the data I've left out.
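To make the self-containment requirement concrete, here is a minimal sketch of what such a notebook might look like. The file names come from final_data.zip; the single-asset filter, feature choices, and placeholder model are purely illustrative assumptions, not a recommended approach.

```python
# Minimal notebook skeleton (illustrative only; model and features are placeholders).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# 1. Load the raw data provided (plus any ancillary data you collected).
train = pd.read_csv("train.csv")
assets = pd.read_csv("asset_details.csv")

# 2. Data manipulation / feature engineering (here: Bitcoin only, drop missing rows).
btc = train[train["Asset_ID"] == 1].dropna()
X = btc[["Count", "Open", "High", "Low", "Close", "Volume", "VWAP"]]
y = btc["Target"]

# 3. Initiate and tune your model.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# 4. Test your model (this is the part I will swap out to run on held-out data).
print("In-sample fit:", model.score(X, y))
```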
A reminder: I will hold my regular office hours on Wednesdays and will hold additional office hours periodically between now and the deadline.
Grading¶
The exam is worth 50 points and partial credit is as follows:
A. Report (45 points)
- Is your report professional, clear, and concise?
- Are your figures effective at contributing to the overall message?
- Are your modelling decisions reasonable to accomplish your objective?
- Is your approach creative?
B. Model testing (5 points)
- I will award points based on your overall performance on the out-of-sample testing data relative to the other teams.
- The top 20% of teams get 5 points, the second 20% of teams get 4 points and so on.
- I reserve the right to deviate from this in the event that all teams do a good job and deserve more credit.
The Competition¶
Over $40 billion worth of cryptocurrencies are traded every day. They are among the most popular assets for speculation and investment, yet have proven wildly volatile. Fast-fluctuating prices have made millionaires of a lucky few, and delivered crushing losses to others. The motivating question of this competition is therefore: Could some of these price movements have been predicted in advance?
In this competition, you'll use your machine learning expertise to forecast short-term returns in 14 popular cryptocurrencies. The data on eLC amount to millions of rows of high-frequency market data dating back to 2018, which you can use to build your model. I will evaluate your submission by the quality of your deliverable explaining how and why the model works, plus an evaluation of the model's predictive ability using live crypto data since this competition was launched (i.e., November 2, 2021).
About the Data¶
The simultaneous activity of thousands of traders ensures that most signals will be transitory, persistent alpha will be exceptionally difficult to find, and the danger of overfitting will be considerable. In addition, since 2018, interest in the cryptomarket has exploded, so the volatility and correlation structure in our data are likely to be highly non-stationary. The successful contestant will pay careful attention to these considerations, and in the process gain valuable insight into the art and science of financial forecasting.
Scenario¶
We're going to pretend that you work for G-Research -- the European quantitative finance research firm sponsoring this real-life (and current) kaggle competition. The company is interested in exploring the extent of market prediction possibilities, making use of machine learning, big data, and some of the most advanced technology available.
Notes¶
- This is a real Kaggle contest. For this class we'll be role-playing that you're addressing a business question for G-Research, but you are of course able (and encouraged!) to submit your model to the contest (deadline is January 25, 2022). Submission/participation is free and there is $125,000 in prize money on the line. Don't be intimidated: all sorts of people participate, and many have less experience with ML than you. That said, some Kaggle Grandmasters (yes, that's a thing) also compete, so don't expect to win.
- You are free to explore anything on the web, including notebooks that other teams have submitted to the contest. You are also able to incorporate ancillary data in tuning your model. Should you do so, include your data file with your submission to eLC. Whether you can include other data in the official contest is unclear, but I would assume not.
Model Evaluation [5 Points]¶
I will follow the Kaggle competition's evaluation methodology and evaluate each exam submission via a weighted version of the Pearson correlation coefficient. In the official contest, each submission is evaluated based on the 3 months of actual cryptocurrency data following the contest's submission deadline.
For the exam, I will:
- Use your model to make a one-day prediction for each of the 14 cryptocurrencies using a test data set you have not seen. The test data include cryptocurrency prices prior to your submission deadline to eLC and follow a similar format to train.csv.
- Evaluate the quality of your model by computing the Pearson correlation coefficient between your predictions and the actual cryptocurrency prices.
Additional details are located in the Prediction Details and Evaluation section of the tutorial ipynb notebook included in the zip data file.
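For intuition, a minimal sketch of the correlation check is below. The variable names (y_true, y_pred, w) and the numbers are made up for illustration; the plain Pearson coefficient is what the exam uses, and the weighted variant mirrors the idea behind the Kaggle metric (asset weights come from asset_details.csv).

```python
# Sketch of the evaluation: Pearson correlation between predictions and actuals.
import numpy as np
from scipy.stats import pearsonr

y_true = np.array([0.012, -0.004, 0.007, 0.001])   # hypothetical realized values
y_pred = np.array([0.010, -0.002, 0.005, 0.000])   # hypothetical model predictions

corr, _ = pearsonr(y_pred, y_true)
print(f"Pearson correlation: {corr:.3f}")

# The official Kaggle metric additionally weights assets; a weighted Pearson
# correlation can be computed directly from weighted means and covariances.
def weighted_pearson(pred, true, w):
    w = w / w.sum()
    mp, mt = np.sum(w * pred), np.sum(w * true)
    cov = np.sum(w * (pred - mp) * (true - mt))
    return cov / np.sqrt(np.sum(w * (pred - mp) ** 2) * np.sum(w * (true - mt) ** 2))
```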
Comments:
- I suggest focusing on developing a model that predicts cryptocurrency prices well rather than worrying about incorporating the weighting scheme employed by the Kaggle contest.
- Aggregating the data is acceptable. The raw data are in one-minute intervals, which may be too granular to be useful and adds computational burden. I'm more interested in you developing a model consistent with the relevant question and data. If you do aggregate, discuss the advantages and costs of doing so in your report (see the resampling sketch after this list). I will adjust your prediction according to whatever time increment you choose; e.g., if you aggregate to hourly data, I will generate 24 predictions for each of the cryptocurrencies.
- You are allowed to add features. The test data will follow the same format as train.csv, so make sure your submission includes any relevant data manipulations of cryptocurrency prices.
- If you decide to include data outside of the provided cryptocurrency prices, please include these data in your submission.
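If you do choose to aggregate, the resampling step might look something like the sketch below. The column names follow the data description; the hourly frequency and the single-asset filter are assumptions for illustration only.

```python
# Illustrative aggregation of one-minute bars into hourly bars for a single asset.
import pandas as pd

train = pd.read_csv("train.csv")
btc = train[train["Asset_ID"] == 1].copy()
btc["datetime"] = pd.to_datetime(btc["timestamp"], unit="s")
btc = btc.set_index("datetime").sort_index()

hourly = btc.resample("1H").agg({
    "Count": "sum",     # total trades within the hour
    "Open": "first",    # open of the first minute
    "High": "max",
    "Low": "min",
    "Close": "last",    # close of the last minute
    "Volume": "sum",
})
```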
Hype Video¶
Below is the hype video for the contest. My bet is that this is your first exam to include a hype video.
Data¶
The file final_data.zip is located on eLC and contains all data necessary to complete the exam.
A. Main Data File: train.csv
This dataset contains information on historic trades for several cryptoassets, such as Bitcoin and Ethereum. Your challenge is to predict their one-day return.
Variables:
- timestamp: All timestamps are Unix timestamps in seconds (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute data.
- Asset_ID: The asset ID corresponding to one of the cryptocurrencies (e.g., Asset_ID = 1 for Bitcoin). The mapping from Asset_ID to crypto asset is contained in asset_details.csv.
- Count: Total number of trades in the time interval (last minute).
- Open: Opening price of the time interval (in USD).
- High: Highest price reached during time interval (in USD).
- Low: Lowest price reached during time interval (in USD).
- Close: Closing price of the time interval (in USD).
- Volume: Quantity of asset bought or sold, displayed in base currency USD.
- VWAP: The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.
- Target: Residual log-returns for the asset over a 15-minute horizon.
The first two columns define the time and asset indexes for this data row. The 7 middle columns are feature columns with the trading data for this asset and minute in time. The last column is the prediction target, which the tutorial notebook discusses in more detail.
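A quick way to sanity-check these fields after loading the data is sketched below: the timestamp conversion makes the minute-by-minute structure visible, and merging on Asset_ID attaches the asset names and weights from asset_details.csv. The specific columns printed at the end are just an example.

```python
# Load the main file, convert Unix timestamps, and attach asset details.
import pandas as pd

train = pd.read_csv("train.csv")
assets = pd.read_csv("asset_details.csv")

# Unix seconds -> datetime (timestamps are multiples of 60, i.e., one row per minute).
train["datetime"] = pd.to_datetime(train["timestamp"], unit="s")

# Attach asset names and evaluation weights via the Asset_ID mapping.
train = train.merge(assets, on="Asset_ID", how="left")
print(train[["datetime", "Asset_ID", "Close", "Target"]].head())
```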
B. Asset Information: asset_details.csv
The file includes the list of all assets, the Asset_ID-to-asset mapping, and the weight of each asset used to determine its relative importance in the evaluation metric.
C. Tutorial Notebook: tutorial-to-the-g-research-crypto-competition.ipynb
This is a tutorial G-Research put together to help the teams. It's also located on the Kaggle competition website.
Note: The remaining csv files are helpful only if you intend to submit your model to the Kaggle competition, so I've included them in the zip file.