SpotLake: Diverse Spot Instance Dataset Archive Service

Logo of spot instance by AWS workshop

SpotLake is a system that collects, analyzes, evaluates, and distributes datasets related to spot instances* offered on public clouds. The project was motivated by the 2017 change in spot instance policy by AWS, one of the most representative public cloud providers. Previous studies on spot instances became outdated when the policy changed from bidding to oracle. Instead, AWS began to publicly provide datasets on the availability and interruption frequency of spot instances. We collected these datasets starting in November 2021 and conducted a time-series analysis and evaluation with Spot Checker**.

The data we collected were ‘Spot Placement Score’***, ‘Interrupt-free Score’****, and ‘Spot Price’. To collect them, we first developed a query optimization method and a parallel data collector to overcome the API constraints of AWS. In analyzing the collected data, we found that: 1. The availability of spot instances is correlated with the maturity of the computing hardware. 2. Spot Placement Score should be prioritized over Interrupt-free Score. 3. A more abundant dataset increases the accuracy of availability prediction. These results show the possibility of building a cost-efficient and scalable workload processing cluster system with SpotLake datasets. In recognition of its contribution, the SpotLake paper was presented at the 2022 IEEE International Symposium on Workload Characterization.
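The idea of pairing query batching with a parallel collector can be illustrated with a minimal sketch. It is not the SpotLake implementation: the names `make_batches`, `collect_parallel`, and `fake_fetch_scores` are hypothetical, the batch size and worker count are arbitrary, and the fetch function is a stand-in that returns dummy values rather than calling any AWS API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def make_batches(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches so one query covers several items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def collect_parallel(
    instance_types: Sequence[str],
    fetch_scores: Callable[[List[str]], Dict[str, int]],
    batch_size: int = 5,
    max_workers: int = 4,
) -> Dict[str, int]:
    """Issue one query per batch on a thread pool and merge the results."""
    batches = make_batches(instance_types, batch_size)
    merged: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves batch order and re-raises any worker exception here.
        for result in pool.map(fetch_scores, batches):
            merged.update(result)
    return merged


def fake_fetch_scores(batch: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for a real API call; returns a dummy score per item."""
    return {name: len(name) for name in batch}


if __name__ == "__main__":
    types = ["m5.large", "c5.xlarge", "t3.micro"]
    print(collect_parallel(types, fake_fetch_scores, batch_size=2, max_workers=2))
```

Batching reduces the number of queries that count against a request quota, while the thread pool overlaps the network wait time of the remaining queries. A real collector would also need rate limiting and retry handling, which this sketch omits.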

* Spot Instance:

** Spot Checker: A custom tool for evaluating the real-world behavior of spot instances. It runs sampled spot instances and checks their status over 24 hours.

*** Spot Placement Score:

**** Interrupt-free Score: Inverted value of the Interrupt Frequency provided by AWS Spot Advisor

***** The thumbnail logo was provided by AWS workshop resources

Author | Sungjae Lee

Currently a research assistant in Computer Science at Kookmin University, working in the area of cloud computing and distributed systems.