Service Level Agreement Violations in Cloud Storage: Insurance and Compensation Sustainability

Mastroeni, Loretta; Mazzoccoli, Alessandro; Naldi, Maurizio

doi:10.3390/fi11070142


by Loretta Mastroeni ^1,†, Alessandro Mazzoccoli ^2,† and Maurizio Naldi ^2,3,*,†

¹ Department of Economics, University of Roma Tre, 00145 Rome, Italy
² Department of Civil Engineering and Computer Science, University of Roma Tor Vergata, 00133 Rome, Italy
³ Department of Law, Economics, Politics and Modern languages, LUMSA University, 00192 Rome, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Future Internet 2019, 11(7), 142; https://doi.org/10.3390/fi11070142

Submission received: 22 May 2019 / Revised: 21 June 2019 / Accepted: 25 June 2019 / Published: 30 June 2019

(This article belongs to the Section Network Virtualization and Edge/Fog Computing)


Abstract:

Service Level Agreements are employed to set availability commitments in cloud services. When a violation occurs, as in an outage, cloud providers may be called upon to compensate customers for the losses incurred. Such compensation may be so large as to erode cloud providers’ profit margins. Insurance may be used to protect cloud providers against such a danger. In this paper, closed formulas are provided through the expected utility paradigm to set the insurance premium under different outage models and QoS metrics (number of outages, number of long outages, and unavailability). When the cloud service is paid through a fixed fee, we also provide the maximum unit compensation that a cloud provider can offer so as to meet constraints on its profit loss. The unit compensation is shown to vary approximately as the inverse square of the service fee.

Keywords:

cloud storage; service level agreements; insurance; compensations; refunds

1. Introduction

Service Level Agreements (SLA) are typically employed by cloud providers to provide guarantees on the quality of their services to end customers [1,2,3,4,5,6]. Those agreements, which form a contract or a contract-like relationship between the parties, state the obligations imposed on the cloud provider, through a set of service quality metrics and constraints to be met.

The compliance of cloud providers with those contractual or contract-like commitments has to be monitored, and several tools have been proposed in the literature for that purpose [7,8,9] or for acting on the monitoring results to restore compliance [10]. Any tool has to measure a set of parameters related to the quality of service and compare their values against SLAs’ provisions. The foremost among them is availability [11], defined roughly as the percentage of time that the cloud is available.

As it happens, the obligations underwritten in SLAs may not be fulfilled. In that case, the cloud provider is expected to pay penalties, if the contract so prescribes [12], or anyway to compensate its customers for the loss incurred. If violations take place on a wide scale, penalties and compensations may endanger the economic balance of the cloud provider.

The cloud provider should therefore protect itself against such an occurrence. It can basically act in two ways:

reduce the risk due to unavailability periods;
be indemnified against possible losses.

The first countermeasure is technical in nature. Reducing the probability of occurrence and/or the amount of damages calls for investments in the infrastructure and better cloud management processes.

The second countermeasure instead relies on buying an insurance policy, whereby the insurer takes on the risk of paying all claims by customers in return for the premium to be paid by the cloud provider, which acts as the insured. The insurance approach to deal with possible excessive claims by cloud users in the framework of a Service Level Agreement has been proposed in [13]. A major problem lies in the definition of the right premium of an insurance policy. In that paper, the correct premium has been computed for the case where the outage phenomenon is described by a Poisson process for occurrence of outages and by the generalized Pareto distribution for their duration. However, an alternative model has been recently introduced in the literature, based on an extensive measurement campaign on a private cloud [14] belonging to a small/medium company. In that paper, the interarrival time between successive outages is modelled through a Pareto distribution, while the duration of an outage is modelled by a lognormal distribution.

In this paper, we again consider the problem of correctly determining the insurance premium against cloud outages. We adopt the expected utility paradigm as in [13], but we consider the outage model recently proposed in [14] in addition to the exponential-Pareto (a.k.a. Poisson–Pareto) examined in [13]. Both outage models are here examined under three QoS metrics. Our main new results are the following:

we obtain closed formulas for the insurance premiums for all the combinations of outage model and QoS metric (rather than just for the exponential-Pareto model as in [13]);
we obtain a closed formula to set the unit compensation offered to customers so as to avoid compensations (through the resulting premium) eroding the profit margins;
we show that the maximum unit compensation is related to the service fee through an approximate inverse square law.

The paper is structured as follows. Section 2 is devoted to describing the two models for outages. The quality of service metrics are instead described in Section 3. The statistics of losses resulting from the models of outages are computed in Section 4. The fair premium is derived in Section 5. Finally, the conditions to make the refund process sustainable are derived in Section 6.

2. Models for Outages

Our analysis relies on the use of models that describe the behaviour of the cloud. Here, we rely on a simple ON-OFF description of the service offered by the cloud, which is well suited to cloud storage. In the ON-OFF model, the cloud alternates between two states: a fully operating period (a.k.a. uptime or ON period) and a no-service period (a.k.a. downtime or OFF period), as shown in Figure 1. No partial performance degradation is considered in the ON-OFF model. The ON-OFF model can also serve as an approximation to more complex quality degradation processes. For example, a cloud storage provider could still be able to deliver the required files to its customers, but with a longer delivery time, i.e., with a delay with respect to normal operating conditions. In that case, a threshold could be imposed on the delivery time so that delays longer than that threshold are considered as service outages. In this section, we describe two models proposed in the literature for the statistics related to the alternation of ON and OFF states. We employ two variables to describe the duration of the two states: we use S for the duration of the ON period and D for that of the OFF period.

2.1. The Exponential-Pareto Model

The Exponential-Pareto model owes its name to the distributions proposed for the two alternating states of the general ON-OFF model. Namely, we have an exponential distribution for the duration of the ON period. The duration of the OFF period is instead governed by a generalized Pareto distribution (GPD). The cumulative distribution functions are therefore

F_{S} (x) = P [S < x] = 1 - e^{- λ x},

(1)

F_{D} (x) = P [D < x] = \begin{cases} 1 - {(1 + ξ x / β)}^{- 1 / ξ}, & \text{if } ξ \neq 0, \\ 1 - e^{- x / β}, & \text{if } ξ = 0, \end{cases}

(2)

where β is the scale parameter and ξ is the shape parameter of the GPD.

Since the duration of the ON period is the time elapsing before a new failure occurs, the MTTF (mean-time-to-failure) is simply the reciprocal of the parameter of the exponential distribution

MTTF = E [S] = \frac{1}{λ} .

(3)

Similarly, the duration of the OFF period is the time needed to repair the cloud, so that the MTTR (mean-time-to-repair) is the expected value of D

MTTR = E [D] = \frac{β}{1 - ξ} .

(4)
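As a minimal sketch of the Exponential-Pareto building blocks, the snippet below evaluates the two cumulative distribution functions and the reliability indicators MTTF and MTTR. The parameter values are hypothetical placeholders chosen for illustration (they are not the fitted values of Table 1), and the function names are ours.

```python
import math

def exp_cdf(x, lam):
    """CDF of the ON-period duration S: exponential with rate lam."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

def gpd_cdf(x, xi, beta):
    """CDF of the OFF-period duration D: generalized Pareto.

    Assumes xi >= 0 (unbounded support); beta is the scale parameter.
    """
    if x <= 0:
        return 0.0
    if xi == 0:
        return 1.0 - math.exp(-x / beta)
    return 1.0 - (1.0 + xi * x / beta) ** (-1.0 / xi)

def mttf(lam):
    """Mean time to failure, E[S] = 1 / lam."""
    return 1.0 / lam

def mttr(xi, beta):
    """Mean time to repair, E[D] = beta / (1 - xi); finite only for xi < 1."""
    if xi >= 1:
        raise ValueError("the GPD mean is finite only for xi < 1")
    return beta / (1.0 - xi)

# Hypothetical parameters (time unit: minutes), for illustration only.
LAM, XI, BETA = 1e-4, 0.5, 30.0
print("MTTF:", mttf(LAM), "MTTR:", mttr(XI, BETA))
print("P[D < 60]:", gpd_cdf(60.0, XI, BETA))
```

The guard in `mttr` reflects the fact that the mean of the generalized Pareto distribution exists only when the shape parameter is below 1.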

The Exponential-Pareto model was first proposed in [15] through a best-fit procedure on a dataset of customer-reported outages for five major cloud providers (Google, Amazon, Rackspace, Salesforce, Windows Azure). The sources of data were Cloutage (cloutage.org), founded by the Open Security Foundation in April 2010 but now discontinued, and the International Working Group on Cloud Computing Resiliency (IWGCR), a working group with a mission to monitor and analyze cloud computing resiliency.

This model has been employed in [13] to assess the sustainability of refunds linked to insurance contracts for cloud services, as well as in [16] to assess the contribution of the network to the overall unavailability. The values obtained for the parameters of the two distributions in that case are reported in Table 1.

2.2. The Pareto-Lognormal Model

As the name suggests, in the Pareto-Lognormal model the distribution of operating periods is modelled as a Pareto law, while the lognormal distribution describes the duration of outages.

The model has been proposed by Dunne and Malone [14], who analysed 300+ cloud outage events recorded for an enterprise cloud system (not further identified, but belonging to a small/medium enterprise) over an eighteen-month period. They tested seven models (Pareto, loglogistic, lognormal, gamma, exponential, logistic, and Weibull) through the Anderson–Darling test. In the end, they found the Pareto-Lognormal couple to be the best-fit solution.

The cumulative distribution function of operating periods is then

F_{S} (x) = \begin{cases} 1 - {(\frac{h}{x})}^{α}, & \text{if } x \geq h, \\ 0, & \text{otherwise} . \end{cases}

(5)

The distribution of outage duration is instead

F_{D} (x) = \begin{cases} G (\frac{ln (x) - μ}{\sqrt{2} σ}), & \text{if } x > 0, \\ 0, & \text{otherwise} . \end{cases}

(6)

The reliability indicators (MTTF and MTTR) are related to the parameters of the two distributions by the following formulas:

\begin{matrix} MTTF & = \frac{α}{α - 1} h, \\ MTTR & = e^{μ + \frac{σ^{2}}{2}} . \end{matrix}

(7)

The values found by Dunne and Malone for the parameters of the two distributions are reported in Table 2 (where the abbreviation SME, Small/Medium Enterprise, stands for the company whose cloud was monitored during the measurement campaign, but whose name has not been disclosed); they lead to

\begin{matrix} MTTF & = 2445 \min, \\ MTTR & = 227 \min . \end{matrix}

(8)
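To put the two reliability indicators in perspective, the following sketch converts them into a long-run availability figure. It assumes the standard steady-state result for an alternating ON-OFF process, availability = MTTF / (MTTF + MTTR), which is not stated in the text above; the function name is ours.

```python
def steady_state_availability(mttf, mttr):
    """Long-run fraction of time in the ON state for an alternating ON-OFF process."""
    return mttf / (mttf + mttr)

# MTTF and MTTR in minutes, as reported for the Pareto-Lognormal model.
availability = steady_state_availability(2445.0, 227.0)
print(f"availability   = {availability:.4f}")
print(f"unavailability = {1.0 - availability:.4f}")
```

With these values the long-run availability comes out at roughly 91.5%, i.e., an unavailability of about 8.5%.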

3. Quality of Service Metrics

The assessment of damages and the ensuing compensation must be related to a metric that embodies the quality of service underwritten in the SLA. In this section, we provide three different QoS metrics with specific reference to the performance of the cloud. We adopt the same metrics already considered in [17] to derive the correct price for an insurance policy in the case of a Markovian system, and in [18] to assess the Value-at-Risk, again in the case of a Markovian system.

We consider the following metrics:

Number of outages;
Number of long outages;
Unavailability.

The number of outages is computed as the number of no-service intervals during the observation period T. There are a number of technicalities involved in the detection of the actual instants of time when the cloud goes OFF and then goes back ON. We do not go into detail here and refer the interested reader to [19], where the complexity of measurements is dealt with. An advantage of this metric is that even the smallest service interruptions are accounted for; on the other hand, the severity of outages is not considered.

The number of long outages is simply the number of outages lasting more than a pre-set threshold w. The rationale for this metric is to avoid considering very short interruptions of service (glitches). Such interruptions could add considerably to the number of outages (hence, the first metric) without being significant for the user. This metric is a good choice if an outage must last some time to cause real damage to the company; this means that the company relies on some form of short-term resiliency against outages (this happens, e.g., with TCP where failures at the network layer can easily be recovered if they do not extend too much in space and time).

Finally, the unavailability is the percentage of time the service is OFF. According to this metric, a single long outage counts as a set of shorter outages if their cumulative duration is equal. In this case, damages simply accrue over time without any start effect.
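The three metrics can be illustrated on a small outage log. The sketch below is only an illustration: the log format (a list of start/end times), the function name and the sample data are assumptions of ours, not something prescribed by the SLA literature cited here.

```python
def qos_metrics(outages, horizon, threshold):
    """Compute the three QoS metrics from a list of (start, end) outage intervals.

    All times share one unit (e.g., minutes); `horizon` is the observation
    period T and `threshold` is the long-outage threshold w.
    """
    durations = [end - start for start, end in outages]
    n_outages = len(durations)                          # number of outages
    n_long = sum(1 for d in durations if d > threshold)  # number of long outages
    unavailability = sum(durations) / horizon            # fraction of time OFF
    return n_outages, n_long, unavailability

# Hypothetical log over a 30-day period (43,200 minutes), threshold w = 30 minutes.
log = [(100, 102), (5000, 5090), (20000, 20045)]
n, n_long, u = qos_metrics(log, horizon=43200, threshold=30)
print(n, n_long, round(u, 6))
```

In this made-up log the 2-minute glitch counts toward the first metric but not toward the second, while all three outages contribute to unavailability in proportion to their duration.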

4. Loss Statistics

In Section 3, we have introduced three outage-related metrics to describe the quality of the service delivered by the cloud provider. Any outage turns into an economic loss for the customer, which must be accounted for when setting the insurance premium. In this section, we evaluate the loss that a customer may claim under the QoS metric of choice for the two models at hand. In all cases, we assume a proportional relationship between the QoS metric and the claimed loss. The values depend on the statistical models assumed for the outages; we analyse the cases of the Exponential-Pareto model and the Pareto-Lognormal one in two separate subsections.

4.1. The Exponential-Pareto Model

In this section, we derive the loss statistics, as described by the first two moments of the economic loss, for the three QoS metrics defined in Section 3.

Number of Outages

If the number of outages is adopted as the QoS metric, and an economic loss k_{f} is claimed for each outage, the overall economic loss X_{f} is

X_{f} = k_{f} N_{f},

(9)

which is a random variable, since the number N_{f} of outages occurring during a fixed amount of time is itself random. Over the observation period of duration T, its expected value and variance are respectively

\begin{matrix} E [X_{f}] & = k_{f} λ T, \\ V [X_{f}] & = k_{f}^{2} λ T . \end{matrix}

(10)

4.2. Number of Long Outages

By long outages, we mean outages whose duration exceeds a given threshold w. If the economic loss for each long outage is k_{lf}, the overall loss X_{lf} is proportional to the number N_{lf} of long outages:

X_{lf} = k_{lf} N_{lf} = k_{lf} \sum_{i = 1}^{N} I_{[D_{i} > w]},

(11)

where I_{[y]} is the indicator function, which takes the value 1 if the condition y is satisfied, and the value 0 otherwise. The indicator function is actually a random variable with a Bernoulli distribution, with its first two moments given by

\begin{matrix} E [I_{[D_{i} > w]}] & = P [I_{[D_{i} > w]} = 1] = 1 - P [D_{i} < w] \\ = 1 - [1 - {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}] = {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}, \end{matrix}

(12)

\begin{matrix} V [I_{[D_{i} > w]}] & = P [I_{[D_{i} > w]} = 1] (1 - P [I_{[D_{i} > w]} = 1]) \\ = {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} [1 - {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}] \\ = {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} - {(1 + \frac{ξ w}{β})}^{- \frac{2}{ξ}} . \end{matrix}

(13)

Since Equation (11) is a random sum of independent Bernoulli variables, we can obtain the probability distribution of the economic loss (which is now a discrete variable whose possible values are multiples of the unit loss per long outage k_{lf}):

\begin{matrix} P [X_{lf} = j k_{lf}] & = \sum_{n = 0}^{\infty} P [\sum_{i = 1}^{n} I_{[D_{i} > w]} = j | N_{T} = n] P [N_{T} = n] \\ = \sum_{n = 0}^{\infty} \binom{n}{j} {(1 + \frac{ξ w}{β})}^{- \frac{j}{ξ}} \\ \times {[1 - {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}]}^{n - j} \times \frac{{(λ T)}^{n}}{n!} e^{- λ T} . \end{matrix}

(14)

We can now use Wald’s identities to obtain a closed form expression. We assume that there is only a weak correlation between the number of summands in the sum (11) and each of the summands: the duration of an outage (e.g., its exceeding w) is independent of the number of outages, while the number of outages is very weakly correlated with the outage duration, as long as we consider services with high availability.

By resorting to Wald’s identities (see, e.g., Section 34.14.2.11 of [20] or Section 1.7.3 of [21]), we can therefore compute the mean and the variance of the economic loss

\begin{matrix} E [X_{lf}] & = k_{lf} E [N_{T}] E [I_{[D_{i} > w]}] = k_{lf} λ T {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}, \\ V [X_{lf}] & = k_{lf}^{2} \{E^{2} [I_{[D_{i} > w]}] V [N_{T}] + V [I_{[D_{i} > w]}] E [N_{T}]\} \\ = k_{lf}^{2} {(1 + \frac{ξ w}{β})}^{- \frac{2}{ξ}} λ T + k_{lf}^{2} {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} [1 - {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}] λ T \\ = k_{lf}^{2} λ T {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} \{{(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} + [1 - {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}}]\} \\ = k_{lf}^{2} λ T {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} . \end{matrix}

(15)
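The equality of mean and variance in Equation (15) is what one would expect from the standard thinning property of the Poisson distribution: if each of a Poisson number of outages is independently "long" with probability p, the number of long outages is again Poisson, with mean λTp. The sketch below checks this numerically by evaluating the Poisson mixture of binomials term by term (summing from n = j, since terms with n < j vanish) and comparing it with the Poisson probabilities. Parameter values and function names are hypothetical.

```python
import math

def gpd_tail(w, xi, beta):
    """P[D > w] for the generalized Pareto distribution, as in Eq. (12); assumes xi > 0."""
    return (1.0 + xi * w / beta) ** (-1.0 / xi)

def poisson_pmf(n, mean):
    """P[N = n] for a Poisson variable with the given mean (> 0)."""
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))

def long_outage_pmf(j, lam_t, p_long, n_max=100):
    """P[N_lf = j] as a Poisson mixture of binomials, truncated at n_max."""
    total = 0.0
    for n in range(j, n_max + 1):  # terms with n < j are zero
        binom = math.comb(n, j) * p_long ** j * (1.0 - p_long) ** (n - j)
        total += binom * poisson_pmf(n, lam_t)
    return total

def long_outage_loss_moments(k_lf, lam_t, p_long):
    """Mean and variance of the loss X_lf, as in Eq. (15)."""
    return k_lf * lam_t * p_long, k_lf ** 2 * lam_t * p_long

# Hypothetical values: lambda*T = 4 expected outages, w = 30 min, xi = 0.5, beta = 30.
LAM_T, XI, BETA, W = 4.0, 0.5, 30.0, 30.0
p = gpd_tail(W, XI, BETA)
for j in range(4):
    print(j, long_outage_pmf(j, LAM_T, p), poisson_pmf(j, LAM_T * p))
```

The two printed columns should agree up to truncation and rounding error, which is consistent with the closed-form moments obtained through Wald’s identities.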

4.3. Unavailability

Now, the economic loss X_{u} is proportional to the cumulative unavailability time U over the period T, with a loss k_{u} for each time unit during which the service is unavailable:

X_{u} = k_{u} U = k_{u} \sum_{i = 1}^{N} D_{i},

(16)

where k_{u} is the loss per unit time. The value of k_{u} depends of course on the actual company’s business. In Table 3, an estimate of the loss suffered for each minute of service unavailability is reported, as extracted from [22]. Those values can be taken as values for k_{u} when the durations D_{i} are expressed in minutes. As can be seen, those values exhibit an extremely wide dispersion.

We have again a random sum, for which we can apply Wald’s identities, following the same considerations introduced in Section 4.2. The mean economic loss and its variance are respectively

\begin{matrix} E [X_{u}] & = k_{u} E [U] = k_{u} E [D_{i}] E [N_{T}] \\ = k_{u} \frac{β}{1 - ξ} λ T, \end{matrix}

(17)

\begin{matrix} V [X_{u}] & = k_{u}^{2} V [U_{T}] = k_{u}^{2} \{E^{2} [D_{i}] V [N_{T}] + V [D_{i}] E [N_{T}]\} \\ = k_{u}^{2} [\frac{β^{2}}{{(1 - ξ)}^{2}} λ T + λ T \frac{β^{2}}{{(1 - ξ)}^{2} (1 - 2 ξ)}] \\ = k_{u}^{2} \frac{β^{2}}{{(1 - ξ)}^{2}} λ T (1 + \frac{1}{1 - 2 ξ}) \\ = k_{u}^{2} \frac{β^{2}}{{(1 - ξ)}^{2}} λ T \frac{2 - 2 ξ}{1 - 2 ξ} \\ = \frac{2 k_{u}^{2} β^{2} λ T}{(1 - ξ) (1 - 2 ξ)} . \end{matrix}

(18)
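The two steps of the derivation (Wald’s identities first, then the simplification) can be cross-checked numerically. The sketch below computes the moments of the unavailability loss from the compound-sum formulas and leaves the comparison with the final closed form of Equation (18) to the caller. All parameter values are hypothetical and the function name is ours.

```python
def unavailability_loss_moments(k_u, lam, horizon, xi, beta):
    """Mean and variance of X_u = k_u * U under the Exponential-Pareto model.

    Uses E[N_T] = V[N_T] = lam * horizon (Poisson count) and the GPD
    moments of the outage duration; the variance requires xi < 1/2.
    """
    if xi >= 0.5:
        raise ValueError("the GPD variance is finite only for xi < 1/2")
    mean_d = beta / (1.0 - xi)
    var_d = beta ** 2 / ((1.0 - xi) ** 2 * (1.0 - 2.0 * xi))
    mean_n = var_n = lam * horizon
    mean_x = k_u * mean_d * mean_n
    var_x = k_u ** 2 * (mean_d ** 2 * var_n + var_d * mean_n)
    return mean_x, var_x

# Hypothetical values: k_u in $ per minute, times in minutes.
K_U, LAM, T, XI, BETA = 10_000.0, 1e-4, 43_200.0, 0.25, 30.0
mean_x, var_x = unavailability_loss_moments(K_U, LAM, T, XI, BETA)
closed_form_var = 2.0 * K_U ** 2 * BETA ** 2 * LAM * T / ((1.0 - XI) * (1.0 - 2.0 * XI))
print(mean_x, var_x, closed_form_var)
```

The explicit check on ξ makes the domain of validity visible: the variance formula breaks down as ξ approaches 1/2, where the GPD tail becomes too heavy for a finite second moment.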

4.4. The Pareto-Lognormal Model

Similarly to what we have done for the Exponential-Pareto model, in this subsection, we derive the loss statistics for the Pareto-Lognormal model.

4.4.1. Number of Outages

In order to derive the expected value and the variance of the number of outages N_{f}, we first derive its probability distribution.

We describe the behaviour of the cloud during the observation interval T through the two sequences of alternating operating periods and outages:

\begin{matrix} S & = {S_{1}, S_{2}, \dots, S_{i}, \dots}, \\ D & = {D_{1}, D_{2}, \dots, D_{i}, \dots}, \end{matrix}

(19)

where all the S_{i}’s are i.i.d. random variables following a Pareto distribution, while the D_{i}’s are i.i.d. random variables following a lognormal distribution. The details of the two distributions are described in Section 2.2.

We start by considering the probability that there are no outages during the observation interval, which is tantamount to saying that the first operating period S_{1} lasts more than the observation interval:

P [N_{f} = 0] = P (S_{1} > T) = {(\frac{h}{T})}^{α} .

(20)

The probability that a single outage takes place over T is instead the probability that there is at least one outage minus the probability that there are at least two outages:

\begin{matrix} P [N_{f} = 1] & = P [N_{f} \geq 1] - P [N_{f} \geq 2] \\ = 1 - P (N_{f} = 0) - P [S_{1} + D_{1} + S_{2} < T] . \end{matrix}

(21)

Similarly, we have

\begin{matrix} P [N_{f} = 2] & = 1 - P [N_{f} = 0] - P (N_{f} = 1) \\ - P [S_{1} + D_{1} + S_{2} + D_{2} + S_{3} < T], \end{matrix}

(22)

so that we can write the general recursive expression

\begin{matrix} P [N_{f} = i] = 1 - \sum_{j = 0}^{i - 1} P [N_{f} = j] - P [\sum_{k = 1}^{i + 1} S_{k} + \sum_{k = 1}^{i} D_{k} < T] \\ i = 1, 2, \dots \end{matrix}

(23)

Unfortunately, Equation (23) involves the distribution of a sum of Pareto and lognormal random variables, for which no simple expression is known. However, since the third term in that equation involves a number of variables, we can define the sequence of variables

Z_{i} = \sum_{k = 1}^{i + 1} S_{k} + \sum_{k = 1}^{i} D_{k}, \quad i = 1, 2, \dots

(24)

and invoke the central limit theorem for it, so that we approximate the distribution of each Z_{i} by a normal distribution with moments

\begin{matrix} E [Z_{i}] & = \sum_{k = 1}^{i + 1} E [S_{k}] + \sum_{k = 1}^{i} E [D_{k}] = (i + 1) E [S_{1}] + i E [D_{1}] \\ = (i + 1) MTTF + i \cdot MTTR \\ = (i + 1) \frac{α}{α - 1} h + i \cdot e^{μ + \frac{σ^{2}}{2}}, \end{matrix}

(25)

\begin{matrix} V [Z_{i}] & = \sum_{k = 1}^{i + 1} V [S_{k}] + \sum_{k = 1}^{i} V [D_{k}] = (i + 1) V [S_{1}] + i V [D_{1}] \\ = (i + 1) \frac{α}{{(α - 1)}^{2} (α - 2)} h^{2} + i (e^{σ^{2}} - 1) e^{2 μ + σ^{2}}, \end{matrix}

(26)

after recalling Equation (7).

The probability distribution of the number of outages over a period T can therefore be computed through the following general recursive expression, derived from Equation (23) with the starting point given by Equation (20):

P [N_{f} = i] ≃ 1 - \sum_{j = 0}^{i - 1} P [N_{f} = j] - G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}), \quad i = 1, 2, \dots

(27)

G (\cdot) being the cumulative distribution function of the standard normal variable. Rolling out the recursive expression we get

\begin{matrix} P [N_{f} = 1] & = 1 - {(\frac{h}{T})}^{α} - G (\frac{T - E [Z_{1}]}{\sqrt{V [Z_{1}]}}) \\ P [N_{f} = 2] & = G (\frac{T - E [Z_{1}]}{\sqrt{V [Z_{1}]}}) - G (\frac{T - E [Z_{2}]}{\sqrt{V [Z_{2}]}}) \\ P [N_{f} = 3] & = G (\frac{T - E [Z_{2}]}{\sqrt{V [Z_{2}]}}) - G (\frac{T - E [Z_{3}]}{\sqrt{V [Z_{3}]}}) \\ \dots \end{matrix}

(28)

so that we end up with the general expression

\begin{matrix} P [N_{f} = i] = G (\frac{T - E [Z_{i - 1}]}{\sqrt{V [Z_{i - 1}]}}) - G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) \\ i = 2, 3, \dots \end{matrix}

(29)

We can now compute the first two moments of the number of outages over the interval T. The expected value is

\begin{matrix} E [N_{f}] & = \sum_{i = 0}^{\infty} i P [N_{f} = i] \\ = 1 - {(\frac{h}{T})}^{α} - G (\frac{T - E [Z_{1}]}{\sqrt{V [Z_{1}]}}) + \sum_{i = 2}^{\infty} i P [N_{f} = i] \\ = 1 - {(\frac{h}{T})}^{α} - G (\frac{T - E [Z_{1}]}{\sqrt{V [Z_{1}]}}) + \\ \sum_{i = 2}^{\infty} i [G (\frac{T - E [Z_{i - 1}]}{\sqrt{V [Z_{i - 1}]}}) - G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ = 1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}), \end{matrix}

(30)

where E [Z_{i}] and V [Z_{i}] are provided respectively by Equations (25) and (26).

Similarly, the variance is

\begin{matrix} V [N_{f}] & = \sum_{i = 0}^{\infty} {(i - E [N_{f}])}^{2} P [N_{f} = i] \\ = \sum_{i = 0}^{\infty} i^{2} P [N_{f} = i] - 2 E [N_{f}] \sum_{i = 0}^{\infty} i P [N_{f} = i] + E^{2} [N_{f}] \sum_{i = 0}^{\infty} P [N_{f} = i] \\ = \sum_{i = 0}^{\infty} i^{2} P [N_{f} = i] - E^{2} [N_{f}] \\ = 1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}], \end{matrix}

(31)

where, again, E [Z_{i}] and V [Z_{i}] are provided respectively by Equations (25) and (26).

It is now fairly straightforward to get the first two moments of the economic loss X:

E [X_{f}] = k_{f} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})],

(32)

V [X_{f}] = k_{f}^{2} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] .

(33)
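The series in Equations (30) and (31) are easy to evaluate numerically once truncated. The following sketch, under the normal approximation adopted above, computes the approximate distribution of the number of outages and its first two moments. The parameter values are hypothetical (they are not those of Table 2; in particular α > 2 is assumed so that the Pareto variance is finite, and h ≤ T), and the truncation index `i_max` is an assumption of ours.

```python
import math

def std_normal_cdf(z):
    """G(.): CDF of the standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_moments(i, alpha, h, mu, sigma):
    """Mean and variance of Z_i, Eqs. (25)-(26); needs alpha > 2."""
    mean_s = alpha * h / (alpha - 1.0)
    var_s = alpha * h ** 2 / ((alpha - 1.0) ** 2 * (alpha - 2.0))
    mean_d = math.exp(mu + sigma ** 2 / 2.0)
    var_d = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return (i + 1) * mean_s + i * mean_d, (i + 1) * var_s + i * var_d

def g_terms(horizon, alpha, h, mu, sigma, i_max):
    """G((T - E[Z_i]) / sqrt(V[Z_i])) for i = 1..i_max."""
    terms = []
    for i in range(1, i_max + 1):
        mean_z, var_z = z_moments(i, alpha, h, mu, sigma)
        terms.append(std_normal_cdf((horizon - mean_z) / math.sqrt(var_z)))
    return terms

def outage_count_pmf(horizon, alpha, h, mu, sigma, i_max=500):
    """Approximate P[N_f = i], i = 0..i_max, from Eqs. (20), (28) and (29)."""
    g = g_terms(horizon, alpha, h, mu, sigma, i_max)
    p0 = (h / horizon) ** alpha
    pmf = [p0, 1.0 - p0 - g[0]]
    pmf += [g[i - 2] - g[i - 1] for i in range(2, i_max + 1)]
    return pmf

def outage_count_moments(horizon, alpha, h, mu, sigma, i_max=500):
    """E[N_f] and V[N_f] from Eqs. (30)-(31), series truncated at i_max."""
    g = g_terms(horizon, alpha, h, mu, sigma, i_max)
    p0 = (h / horizon) ** alpha
    mean_n = 1.0 - p0 + sum(g)
    second = 1.0 - p0 + sum((2 * i + 1) * gi for i, gi in enumerate(g, start=1))
    return mean_n, second - mean_n ** 2

# Hypothetical parameters (minutes); NOT the values of Table 2.
T, ALPHA, H, MU, SIGMA = 43200.0, 3.0, 1000.0, 4.0, 1.0
mean_n, var_n = outage_count_moments(T, ALPHA, H, MU, SIGMA)
print(f"E[N_f] ~ {mean_n:.2f}, V[N_f] ~ {var_n:.2f}")
```

Multiplying `mean_n` by k_{f} and `var_n` by k_{f}^{2} gives the loss moments of Equations (32) and (33). The truncation is harmless as long as `i_max` is well beyond T divided by the mean cycle length MTTF + MTTR, since the terms of the series then vanish. Because the central limit theorem is applied even to sums of very few terms, the probabilities for small i should be read as rough approximations.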

4.4.2. Number of Long Outages

For the expected number of long outages, we go back to the general expression in Equation (11). Since the number of outages is a random variable, we can again exploit the product of averages. After recalling Equation (30) for the expected number of outages and the tail of the lognormal distribution, we have

\begin{matrix} E [N_{lf}] & = E [N_{f}] E [I_{[D_{i} > w]}] = E [N_{f}] P (D_{i} > w) \\ = E [N_{f}] [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] \\ = [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ \times [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] . \end{matrix}

(34)

As to the variance, we adopt the same approach as in Section 4.2:

\begin{matrix} V [N_{lf}] & = E^{2} [I_{[D_{i} > w]}] V [N_{f}] + V [I_{[D_{i} > w]}] E [N_{f}] \\ = P^{2} (D_{i} > w) V [N_{f}] + (1 - P (D_{i} > w)) P (D_{i} > w) E [N_{f}] \\ = {[1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})]}^{2} \\ \times [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ + [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] [G (\frac{ln (w) - μ}{\sqrt{2} σ})] \\ \times [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] . \end{matrix}

(35)

Again, we can now easily get the expected value and the variance of the economic loss X:

\begin{matrix} E [X_{lf}] & = k_{lf} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})], \end{matrix}

(36)

\begin{matrix} V [X_{lf}] & = k_{lf}^{2} {[1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})]}^{2} \\ \times [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ + k_{lf}^{2} [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] [G (\frac{ln (w) - μ}{\sqrt{2} σ})] \\ \times [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] . \end{matrix}

(37)

4.4.3. Unavailability

We adopt the same approach as in Section 4.3. We obtain:

\begin{matrix} E [U] & = E [N_{f}] E [D_{i}] = [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{μ + \frac{σ^{2}}{2}}, \end{matrix}

(38)

\begin{matrix} V [U] & = E^{2} [D_{i}] V [N_{f}] + V [D_{i}] E [N_{f}] = e^{2 μ + σ^{2}} \\ \times [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ + [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{2 μ + σ^{2}} (e^{σ^{2}} - 1) . \end{matrix}

(39)

The corresponding moments of the economic loss X are then:

E [X_{u}] = k_{u} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{μ + \frac{σ^{2}}{2}},

(40)

\begin{matrix} V [X_{u}] & = k_{u}^{2} e^{2 μ + σ^{2}} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ + k_{u}^{2} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{2 μ + σ^{2}} (e^{σ^{2}} - 1) . \end{matrix}

(41)

5. Insurance Pricing

In Section 4, we have seen how outages result in economic losses for the cloud provider. These losses have been evaluated for three quality metrics (number of outages, number of long outages, and unavailability), which account for most of the cases that may be encountered in practice. Since these losses may endanger the economic balance of the cloud provider, the latter may wish to hedge against such losses. A natural way to hedge is through an insurance policy. In this section, we derive the fair price for such a policy. We first describe the general pricing methodology based on the expected utility approach as described in [23] and then apply that approach to the failure models described in Section 2 and the QoS metrics described in Section 3.

We consider a cloud provider whose assets are worth ω \in R^{+}. This provider faces a possible monetary loss X \in R^{+} and wishes to buy an insurance policy. In that case, it has to pay the insurance premium P to be fully indemnified against that loss.

We assume that the cloud provider’s perception of the loss of the amount of money x is described by its utility function u (x). The perception may be dramatically different, depending on the economic conditions of the insured cloud provider. For example, if even a modest loss changes the overall economic balance from a profit condition to a loss one, even that small amount of money leads to a large decay of the utility. On the other hand, if the cloud provider is making a large profit, it will be able to withstand larger losses without a significant reduction of its utility.

Since the loss X is a random variable, we must consider the expected utility of the residual value of the company’s assets when the loss X is suffered. We have to compare the provider’s utility under two conditions: incurring the uncertain loss versus paying the certain insurance premium. The maximum tolerable premium P is the value that makes the two utilities equal. The equilibrium equation is then

E [u (ω - X)] = u (ω - P) .

(42)

The equilibrium equation may not be amenable to a closed form solution, depending on the nonlinear nature of the utility function. We can find an approximate solution by expanding both terms through a Taylor series in the neighbourhood of ω - E [X]:

\begin{matrix} u (ω - P) & ≃ u (ω - E [X]) + (E [X] - P) u^{'} (ω - E [X]), \\ u (ω - X) & ≃ u (ω - E [X]) + (E [X] - X) u^{'} (ω - E [X]) \\ + \frac{{(E [X] - X)}^{2}}{2} u^{″} (ω - E [X]) . \end{matrix}

(43)

By employing the second of these Taylor expansions, we get

\begin{matrix} E [u (ω - X)] & ≃ u (ω - E [X]) + (E [X] - E [X]) u^{'} (ω - E [X]) \\ + \frac{V [X]}{2} u^{″} (ω - E [X]) \\ = u (ω - E [X]) + \frac{V [X]}{2} u^{″} (ω - E [X]) . \end{matrix}

(44)

Finally, by replacing those expressions in the equilibrium Equation (42), we obtain the maximum tolerable premium

P ≃ E [X] + \frac{V [X]}{2} r (ω - E [X]),

(45)

where we have introduced the risk aversion function r, which takes into account the effect of the utility function:

r (x) = - \frac{u^{″} (x)}{u^{'} (x)} .

(46)

Though several options are possible for the utility function (see, e.g., Section 1.3 in [23]), a typical assumption is to consider a constant risk aversion coefficient

r (x) = δ > 0 .

(47)

In that case, the utility function exhibits the Constant Absolute Risk Aversion (CARA) property [24]; the only function that satisfies the CARA property (up to a positive affine transformation) is the exponential function u (x) = 1 - e^{- δ x}, which leads to the maximum tolerable premium

P ≃ E [X] + δ \frac{V [X]}{2} .

(48)

The CARA property has been assumed, e.g., in [13,25].
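For the exponential utility the equilibrium Equation (42) can actually be solved exactly, which gives a way to gauge the quality of the mean-variance approximation in Equation (48). Substituting u(x) = 1 - e^{-δx} into Equation (42) yields P = ln(E[e^{δX}]) / δ; if, as an illustrative assumption, the loss is X = kN with N a Poisson variable of mean m, the Poisson moment generating function gives P = m (e^{δk} - 1) / δ, whose second-order expansion in δk is exactly E[X] + δ V[X] / 2. The sketch below compares the two; function names and numeric values are hypothetical.

```python
import math

def premium_mean_variance(k, mean_n, delta):
    """Eq. (48) with E[X] = k * mean_n and V[X] = k^2 * mean_n (Poisson count)."""
    return k * mean_n + 0.5 * delta * k ** 2 * mean_n

def premium_exact_cara(k, mean_n, delta):
    """Exact solution of Eq. (42) for u(x) = 1 - exp(-delta * x), X = k * Poisson(mean_n)."""
    return mean_n * math.expm1(delta * k) / delta

# Hypothetical values: unit loss k = 1, four expected outages, delta = 0.1.
K, MEAN_N, DELTA = 1.0, 4.0, 0.1
approx = premium_mean_variance(K, MEAN_N, DELTA)
exact = premium_exact_cara(K, MEAN_N, DELTA)
print(f"approx = {approx:.4f}, exact = {exact:.4f}")
```

Under these assumptions the approximation slightly underestimates the exact premium, and the gap grows with the product δk, since the neglected terms of the expansion are all positive.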

Under the CARA property, the premium formulation is therefore of the mean-variance type, and the risk-aversion coefficient must be set to obtain the premium. Higher values of δ are associated with a growing aversion to risk and hence with the willingness to pay a higher premium; in [26], the risk-aversion coefficient is assumed to take values in the [0.5,4] range. However, its assessment is a delicate matter, since it can significantly contribute to the premium. In [27], a procedure is indicated to compute its value for a CARA utility function. Namely, it is set as

δ = \frac{ln \frac{1 + 2 η}{1 - 2 η}}{E [X]},

(49)

where η is the probability premium, i.e., the increase in probability above 0.5 that an individual requires to maintain a constant level of utility equal to the utility of status quo; it takes a value in the [0, 0.5] range.
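Equation (49) translates directly into a one-line function. In the sketch below (function name and sample values are ours), the upper end η = 0.5 is excluded because the logarithm diverges there, which corresponds to unbounded risk aversion.

```python
import math

def risk_aversion(eta, expected_loss):
    """Risk-aversion coefficient delta from the probability premium eta, Eq. (49)."""
    if not 0.0 <= eta < 0.5:
        raise ValueError("eta must lie in [0, 0.5); delta diverges at eta = 0.5")
    if expected_loss <= 0:
        raise ValueError("the expected loss must be positive")
    return math.log((1.0 + 2.0 * eta) / (1.0 - 2.0 * eta)) / expected_loss

# Hypothetical values: eta = 0.25 and an expected loss of 2 monetary units.
print(risk_aversion(0.25, 2.0))   # ln(3) / 2
print(risk_aversion(0.0, 2.0))    # risk-neutral case: delta = 0
```

Note that δ scales inversely with the expected loss, so it carries the inverse of the monetary unit in which the loss is expressed.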

In the following subsections, we derive the pricing formula for the two models described in Section 2.

5.1. The Exponential-Pareto Model

Following the same approach adopted for the computation of the economic loss in Section 4, we evaluate the resulting insurance premium for the three QoS metrics. In order to get a numeric feeling for the outcome, we consider the parameter values shown in Table 4, which are fairly representative of the experiments leading to the models of Section 2.

5.1.1. Number of Outages

Replacing the results of Equation (10) in Equation (48), we get the premium (actually the maximum tolerable premium) for the coverage period T when the loss is proportional to the number of outages:

P_{f} = k_{f} λ T + \frac{δ k_{f}^{2}}{2} λ T = k_{f} λ T (1 + \frac{δ k_{f}}{2}) .

(50)

The resulting premium is proportional to the average number of outages and a quadratic function of the unit loss per outage.
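The quadratic dependence on the unit loss can be checked with a minimal sketch of Equation (50). The outage rate and coverage period come from Table 4; the unit losses and the risk-aversion coefficient are hypothetical values, and the function name is an assumption of this example.

```python
def premium_number_of_outages(k_f: float, outage_rate: float,
                              period: float, delta: float) -> float:
    """Eq. (50): P_f = k_f * lambda * T * (1 + delta * k_f / 2)."""
    expected_outages = outage_rate * period  # lambda * T
    return k_f * expected_outages * (1 + delta * k_f / 2)


lam = 1 / 27.5   # outages per day (Table 4)
T = 365.0        # coverage period in days (Table 4)
delta = 1e-3     # hypothetical risk-aversion coefficient

p_low = premium_number_of_outages(100.0, lam, T, delta)
p_high = premium_number_of_outages(200.0, lam, T, delta)
# Doubling the unit loss more than doubles the premium.
print(p_low, p_high, p_high / p_low)
```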

5.1.2. Number of Long Outages

If the loss is proportional to the number of outages whose duration exceeds the threshold w, after recalling Equation (15), we obtain the premium

\begin{matrix} P_{lf} & = k_{lf} λ T {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} + \frac{δ}{2} k_{lf}^{2} λ T {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} \\ = k_{lf} λ T {(1 + \frac{ξ w}{β})}^{- \frac{1}{ξ}} (1 + \frac{δ k_{lf}}{2}) . \end{matrix}

(51)

Similarly to what we found for the case where the QoS metric is the number of outages, here the premium is a linear function of the average number of outages and a quadratic function of the unit loss per long outage. In Figure 2, instead, we see that the premium is a nearly linear function of the threshold w that defines long outages. The picture has been obtained for the values of Table 4 when the loss per minute is $k_{u}$ = 10,000 US$, which lies in the low range among the values reported in Table 3. The unit loss per long outage has been set according to the equation

k_{lf} E [N_{lf}] = k_{u} E [U] .

(52)

Aside from its absolute value, we can get a feeling for the size of the premium if we compare it with the expected loss through the relative excess premium, defined as

E P_{lf} = \frac{P_{lf} - E [X_{lf}]}{E [X_{lf}]} = \frac{δ k_{lf}}{2} .

(53)

The excess premium is shown in Figure 3 for the same case of Figure 2. We can see that the coverage of risk requires an excess premium of the order of some percentage points; its dependence on the outage duration threshold w is again roughly linear.
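The chain from Equation (52) to Equations (51) and (53) can be sketched as follows. The sketch assumes that the threshold w and the scale parameter share the same time unit as the per-unit-time loss (minutes here), and it uses the expected unavailability that appears as the first term of Equation (54). The risk-aversion coefficient is a hypothetical value and the function names are assumptions of this example.

```python
def long_outage_probability(w: float, xi: float, beta: float) -> float:
    """Probability that an outage lasts longer than w (generalized Pareto tail)."""
    return (1 + xi * w / beta) ** (-1 / xi)


def unit_loss_per_long_outage(k_u: float, w: float, xi: float, beta: float) -> float:
    """Eq. (52): k_lf * E[N_lf] = k_u * E[U]; the factor lambda * T cancels out."""
    return k_u * beta / (1 - xi) / long_outage_probability(w, xi, beta)


def premium_long_outages(k_lf: float, lam_t: float, w: float,
                         xi: float, beta: float, delta: float) -> float:
    """Eq. (51): expected loss times (1 + delta * k_lf / 2)."""
    expected_long = lam_t * long_outage_probability(w, xi, beta)
    return k_lf * expected_long * (1 + delta * k_lf / 2)


xi, beta = 0.4, 405.0        # Table 4
lam_t = 365.0 / 27.5         # lambda * T from Table 4
k_u, w = 10_000.0, 120.0     # loss per minute; threshold assumed in minutes
delta = 1e-8                 # hypothetical risk-aversion coefficient

k_lf = unit_loss_per_long_outage(k_u, w, xi, beta)
expected_loss = k_lf * lam_t * long_outage_probability(w, xi, beta)
premium = premium_long_outages(k_lf, lam_t, w, xi, beta, delta)
excess = (premium - expected_loss) / expected_loss  # Eq. (53)
print(k_lf, premium, excess)
```

Because the unit loss per long outage is tied to w through Equation (52), the excess premium inherits a dependence on the threshold even though Equation (53) contains no explicit w.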

5.1.3. Unavailability

If a loss is experienced for each minute of no service, after recalling Equations (17) and (18), we have the premium

\begin{matrix} P_{u} & = k_{u} \frac{β}{1 - ξ} λ T + \frac{δ}{2} \frac{2 k_{u}^{2} β^{2} λ T}{(1 - ξ) (1 - 2 ξ)} \\ = \frac{k_{u} β λ T}{1 - ξ} (1 + \frac{δ k_{u} β}{1 - 2 ξ}) . \end{matrix}

(54)

The excess premium is then

E P_{u} = \frac{P_{u} - E [X_{u}]}{E [X_{u}]} = \frac{δ k_{u} β}{1 - 2 ξ} .

(55)

For the values in Table 4, we have the excess premium $E P_{u} ≃ 0.226$, which represents a fairly large premium for risk.
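A minimal sketch of Equations (54) and (55) follows. It assumes a shape parameter below 0.5, which the factor 1 - 2ξ requires for a finite variance. The unit loss and the risk-aversion coefficient are hypothetical, so the printed excess premium is not meant to reproduce the figure quoted above.

```python
def premium_unavailability(k_u: float, lam_t: float, xi: float,
                           beta: float, delta: float) -> tuple[float, float]:
    """Return (premium, excess premium) from Eqs. (54) and (55)."""
    if xi >= 0.5:
        raise ValueError("xi must be below 0.5 for the variance to be finite")
    expected_loss = k_u * beta * lam_t / (1 - xi)
    excess = delta * k_u * beta / (1 - 2 * xi)
    return expected_loss * (1 + excess), excess


xi, beta = 0.4, 405.0     # Table 4
lam_t = 365.0 / 27.5      # lambda * T from Table 4
k_u = 10_000.0            # loss per unit time
delta = 1e-8              # hypothetical risk-aversion coefficient

premium, excess = premium_unavailability(k_u, lam_t, xi, beta, delta)
print(premium, excess)
```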

5.2. The Pareto-Lognormal Model

Similarly to what we have done in Section 5.1, in this section, we compute the premium when outages are described by the Pareto-Lognormal model. Again, we proceed separately for the three QoS metrics. As for the Exponential-Pareto model, we report the parameter values used for numerical examples in Table 5.

5.2.1. Number of Outages

By recalling the general expression of the premium in Equation (48) and the first two moments of the economic loss in Equations (32) and (33), we get the following premium:

\begin{matrix} P_{f} & = k_{f} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ + δ \frac{k_{f}^{2}}{2} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ = [k_{f} + δ \frac{k_{f}^{2}}{2} [{(\frac{h}{T})}^{α} - \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})]] \\ \times [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ + δ \frac{k_{f}^{2}}{2} \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) . \end{matrix}

(56)

5.2.2. Number of Long Outages

Here, we combine the general expression of the premium in Equation (48) and the first two moments of the economic loss provided by Equations (36) and (37), which gives us the premium

\begin{matrix} P_{lf} & = k_{lf} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] \\ + δ \frac{k_{lf}^{2}}{2} {[1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})]}^{2} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ + δ \frac{k_{lf}^{2}}{2} [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] [G (\frac{ln (w) - μ}{\sqrt{2} σ})] [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ = [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] [1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})] (k_{lf} + δ \frac{k_{lf}^{2}}{2}) \\ - δ \frac{k_{lf}^{2}}{2} {[1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})]}^{2} {[1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})]}^{2} \\ + δ \frac{k_{lf}^{2}}{2} {[1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})]}^{2} \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) . \end{matrix}

(57)

The excess premium can be computed as in Equation (53). Its behaviour as the outage duration threshold grows is shown in Figure 4.

5.2.3. Unavailability

Again, we employ here the general expression of the premium in Equation (48) and the first two moments of the economic loss provided by Equations (40) and (41) to get

\begin{matrix} P_{u} = k_{u} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{μ + \frac{σ^{2}}{2}} \\ + δ \frac{k_{u}^{2}}{2} e^{2 μ + σ^{2}} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}]] \\ + δ \frac{k_{u}^{2}}{2} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{2 μ + σ^{2}} (e^{σ^{2}} - 1) \\ = k_{u} e^{μ + \frac{σ^{2}}{2}} [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ + [1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] \\ \times [δ \frac{k_{u}^{2}}{2} e^{2 μ + σ^{2}} [{(\frac{h}{T})}^{α} - \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) + e^{σ^{2}} - 1]] \\ + δ \frac{k_{u}^{2}}{2} e^{2 μ + σ^{2}} \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) . \end{matrix}

(58)

For the values in Table 5, we have an excess premium equal to $E P_{u} ≃ 0.117$, which is roughly half that obtained under the Exponential-Pareto model.

6. Refund Sustainability

In Section 5, we have seen that the insurance premium is a function of the unit compensation (i.e., $k_{f}$, $k_{lf}$ or $k_{u}$, depending on the QoS metric adopted). The relationship is typically quadratic. This means that the premium grows more than proportionally with the unit compensation that the cloud provider is bound to pay its customers. As a consequence, large unit compensations lead to high premiums, which may become unsustainable for the cloud provider: premiums represent a fixed (deterministic) outflow that erodes the profit margin of the cloud provider and, if large, may even turn a potential profit into a sure loss. In Table 3, we have seen how large the unit loss per minute can be; though those values may represent a basis for setting the unit compensation (in that case for $k_{u}$), this choice may lead to an exceedingly high liability for the cloud provider. We suggest here a possible way to limit the liability of the cloud provider so as to make it sustainable. This involves setting a limit for the unit compensation that the cloud provider may pay. In this section, we derive such a limit for the two failure models and the three QoS metrics that we have considered so far. After obtaining the general expression for the maximum unit refund, we follow the same approach as in Section 3 and Section 5, dealing separately with each subcase.

6.1. Unit Refund Limit

As hinted above, the compensation that cloud providers may pay their customers should be bounded, so that it does not pose a significant threat to the overall economic balance of the company. In many cases, the compensation is set as a percentage (well below 100%) of the service fee (the compensation is therefore a partial refund); this limit guarantees that the outflow related to refunds cannot exceed that same percentage of the revenues. In Table 6, the percentage values set by some providers are reported for the case where the QoS metric is the availability. All providers allow for a larger refund percentage as the QoS degrades more and more. However, for three out of four providers, the refund can never be full.

In order to limit the liability of cloud providers within safe bounds, we consider here a similar approach by limiting the unit refund. We set therefore

k_{*} = γ F, \qquad γ \in (0, 1),

(59)

where the asterisk is a wildcard character, so that the expression may be applied to any of the three performance parameters, and $γ$ defines the fraction of the fee that the cloud provider may accept to lose due to compensation for each QoS violation.

On the other hand, the cloud provider cannot be led to pay too high a premium, since the premium is a fixed periodic payment that erodes the provider’s profit. In order to guarantee that the revenues represented by the service fees are not overly reduced, we can set the premium so as not to be larger than a predefined fraction of the service fee itself:

P_{*} < ρ F, \qquad ρ \in (0, 1) .

(60)

Since the economic loss $X_{*}$ is proportional to the QoS metric (which is either $N_{f}$, $N_{lf}$ or U), as embodied by Equations (9), (11) and (16), we introduce a generic variable L to derive a general expression that may then be tailored for the case at hand by replacing L with $N_{f}$, $N_{lf}$ or U, respectively.

The general constraint on the premium can then be expressed as

\begin{matrix} P_{*} & = E [X_{*}] + δ \frac{V [X_{*}]}{2} = k_{*} E [L] + k_{*}^{2} δ \frac{V [L]}{2} \\ = γ F E [L] + γ^{2} F^{2} δ \frac{V [L]}{2} < ρ F . \end{matrix}

(61)

This constraint leads to a quadratic inequality in the unit refund fraction $γ$:

F δ \frac{V [L]}{2} γ^{2} + E [L] γ - ρ < 0 .

(62)

Since the reduced discriminant (one quarter of the discriminant) associated with the quadratic form is

Δ = {(\frac{E [L]}{2})}^{2} + \frac{δ}{2} F ρ V [L]

(63)

and $γ > 0$, the solution of the quadratic inequality is

γ < \frac{2 \sqrt{Δ} - E [L]}{δ F V [L]} = \frac{\sqrt{1 + 2 δ F ρ \frac{V [L]}{E^{2} [L]}} - 1}{δ F \frac{V [L]}{E [L]}},

(64)

which gives us an upper bound for the unit refund $γ$ to be sustainable.
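The bound in Equation (64) can be checked numerically: at the bound, the premium of Equation (61) should equal the tolerated share of the fee. In the sketch below the function names, the fee, and the risk-aversion coefficient are hypothetical; the mean and variance correspond to a Poisson number of outages with the rate and period of Table 4.

```python
import math


def max_refund_fraction(mean_l: float, var_l: float, delta: float,
                        fee: float, rho: float) -> float:
    """Positive root of Eq. (62), i.e., the upper bound of Eq. (64)."""
    a = delta * fee * var_l / 2.0
    if a == 0.0:
        # No risk loading: the constraint is linear in gamma.
        return rho / mean_l
    return (math.sqrt(mean_l ** 2 + 4.0 * a * rho) - mean_l) / (2.0 * a)


def premium(gamma: float, mean_l: float, var_l: float,
            delta: float, fee: float) -> float:
    """Eq. (61): premium for a unit refund k = gamma * F."""
    k = gamma * fee
    return k * mean_l + delta * k ** 2 * var_l / 2.0


mean_l = var_l = 365.0 / 27.5       # Poisson count: mean equals variance
delta, fee, rho = 1e-3, 500.0, 0.05  # hypothetical values

gamma_max = max_refund_fraction(mean_l, var_l, delta, fee, rho)
premium_at_bound = premium(gamma_max, mean_l, var_l, delta, fee)
print(gamma_max, premium_at_bound, rho * fee)
```

Any refund fraction below the bound yields a premium strictly below the tolerated amount, since the premium is increasing in the refund fraction.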

The premium-setting procedure therefore goes through the following steps:

1. Set the tolerable fraction $ρ$ of the service fee as in Equation (60);
2. Compute the upper bound $γ$ on the refund (Section 6.2 and Section 6.3);
3. Compute the limit unit refund as per Equation (59);
4. Compute the premium (Section 5.1 and Section 5.2).

In the following subsections, we derive the expression for $γ$ when either outage model applies and for the three QoS metrics. In order to show some numerical results, we consider the parameter values shown in Table 4 and Table 5 for the failure models, and in Table 7 for the refund sustainability computation. The risk aversion factor $δ$ is computed according to Equation (49).

6.2. The Exponential-Pareto Model

6.2.1. Number of Outages

In this case, the loss statistic is the number of outages $N_{f}$, so that $E [L] = λ T$ and $V [L] = λ T$, and

\frac{V [L]}{E [L]} = 1, \qquad \frac{V [L]}{E^{2} [L]} = \frac{1}{λ T} .

(65)

From Equation (64), the constraint on the refund factor then becomes

γ_{f} < \frac{\sqrt{1 + \frac{2 δ F ρ}{λ T}} - 1}{δ F} .

(66)

As can be seen in Figure 5a, the maximum value for $γ_{f}$ diminishes as the fee grows, with a roughly inverse square root dependence.

6.2.2. Number of Long Outages

In this case $L = N_{lf}$, and

\frac{V [L]}{E [L]} = 1, \qquad \frac{V [L]}{E^{2} [L]} = \frac{1}{λ T} {(1 + \frac{ξ w}{β})}^{\frac{1}{ξ}},

(67)

so that Equation (64) becomes

γ_{lf} < \frac{\sqrt{1 + \frac{2 δ F ρ}{λ T} {(1 + \frac{ξ w}{β})}^{\frac{1}{ξ}}} - 1}{δ F} .

(68)

In Figure 5b again, we see that, as the fee grows, the cloud provider has to lower the fraction of the fee that it can return to its customers for each long outage.

6.2.3. Unavailability

In this case $L = U$, and

\frac{V [L]}{E [L]} = \frac{2 β}{1 - 2 ξ}, \qquad \frac{V [L]}{E^{2} [L]} = \frac{2}{λ T} \frac{1 - ξ}{1 - 2 ξ},

(69)

so that Equation (64) becomes

γ_{u} < \frac{1 - 2 ξ}{2 δ β F} [\sqrt{1 + \frac{4 δ F ρ}{λ T} \frac{1 - ξ}{1 - 2 ξ}} - 1] .

(70)

The upper bound on the unit refund is now shown in Figure 5c, where the decline as the fee grows is steeper in the lower domain of the fee values.
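The three bounds of Equations (66), (68) and (70) can be evaluated side by side with the sketch below. The model parameters come from Table 4 and the tolerated fee fraction from Table 7; the fee values, the risk-aversion coefficient, and the choice of minutes as the unit for the threshold are assumptions of this example, so the numbers are not meant to reproduce Figure 5.

```python
import math


def gamma_f(fee, rho, delta, lam_t):
    """Eq. (66): bound when the QoS metric is the number of outages."""
    return (math.sqrt(1 + 2 * delta * fee * rho / lam_t) - 1) / (delta * fee)


def gamma_lf(fee, rho, delta, lam_t, xi, beta, w):
    """Eq. (68): bound when the QoS metric is the number of long outages."""
    stretch = (1 + xi * w / beta) ** (1 / xi)
    return (math.sqrt(1 + 2 * delta * fee * rho * stretch / lam_t) - 1) / (delta * fee)


def gamma_u(fee, rho, delta, lam_t, xi, beta):
    """Eq. (70): bound when the QoS metric is the unavailability (needs xi < 0.5)."""
    ratio = (1 - xi) / (1 - 2 * xi)
    root = math.sqrt(1 + 4 * delta * fee * rho * ratio / lam_t)
    return (1 - 2 * xi) / (2 * delta * beta * fee) * (root - 1)


xi, beta, lam_t = 0.4, 405.0, 365.0 / 27.5  # Table 4
rho, w = 0.05, 120.0                         # Table 7 (2 h, assumed in minutes)
delta = 1e-3                                 # hypothetical risk-aversion coefficient
fees = [10.0, 100.0, 1000.0]                 # hypothetical fee values

bounds_f = [gamma_f(f, rho, delta, lam_t) for f in fees]
bounds_lf = [gamma_lf(f, rho, delta, lam_t, xi, beta, w) for f in fees]
bounds_u = [gamma_u(f, rho, delta, lam_t, xi, beta) for f in fees]
for f, bf, blf, bu in zip(fees, bounds_f, bounds_lf, bounds_u):
    print(f, bf, blf, bu)
```

With a zero threshold every outage counts as long, so Equation (68) reduces to Equation (66).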

6.3. The Pareto-Lognormal Model

6.3.1. Number of Outages

From Section 4.4.1, we recall that the mean and the variance of the number of outages are respectively

E [N_{f}] = 1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}),

(71)

\begin{matrix} V [N_{f}] & = 1 - {(\frac{h}{T})}^{α} + \sum_{i = 1}^{\infty} (2 i + 1) G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) - E^{2} [N_{f}] \\ & = E [N_{f}] - E^{2} [N_{f}] + \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}}) . \end{matrix}

(72)

The two ratios involving those moments and employed in the computation of the maximum refund are respectively:

\begin{matrix} \frac{V [N_{f}]}{E [N_{f}]} = \frac{E [N_{f}] - E^{2} [N_{f}] + \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})}{E [N_{f}]}, \end{matrix}

(73)

\begin{matrix} \frac{V [N_{f}]}{E^{2} [N_{f}]} = \frac{E [N_{f}] - E^{2} [N_{f}] + \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})}{E^{2} [N_{f}]} . \end{matrix}

(74)

By replacing in the general expression (64) the expressions just obtained, using the abbreviation $G_{i} := G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})$, and adopting the two shortened forms $G_{s} = \sum_{i = 1}^{\infty} G_{i}$ and $G_{w} = \sum_{i = 1}^{\infty} 2 i G_{i}$, we get the final expression of the maximum unit refund when the number of outages is used as the QoS metric:

\begin{matrix} γ_{f} = \frac{\sqrt{1 + 2 δ F ρ [\frac{1}{1 - {(\frac{h}{T})}^{α} + G_{s}} + \frac{G_{w}}{{(1 - {(\frac{h}{T})}^{α} + G_{s})}^{2}} - 1]} - 1}{δ F [1 + \frac{G_{w}}{1 - {(\frac{h}{T})}^{α} + G_{s}} - (1 - {(\frac{h}{T})}^{α} + G_{s})]} . \end{matrix}

(75)

This equation shows again that the maximum unit refund has to decrease according to an inverse square law as the fee grows. In Figure 6a, we show the value of $γ_{f}$ for the parameters listed in Table 7. The dependence here appears to be roughly linear because of the window chosen for the fee values.
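For the Pareto-Lognormal model, the bound can also be obtained by evaluating the moments of Equations (71) and (72) and substituting them directly into Equation (64). The sketch below does so; the terms of the series, the ratio h/T, the fee, and the risk-aversion coefficient are hypothetical inputs, since computing the actual series terms requires the definitions of Section 4.4.1. The Pareto shape parameter and the tolerated fee fraction are those of Table 5 and Table 7.

```python
import math


def outage_count_moments(h_over_t: float, alpha: float, g_values: list[float]):
    """Mean and variance of the number of outages, Eqs. (71) and (72).

    g_values holds the (truncated) series terms G_1, G_2, ...
    """
    g_s = sum(g_values)
    g_w = sum(2 * i * g for i, g in enumerate(g_values, start=1))
    mean_n = 1 - h_over_t ** alpha + g_s
    var_n = mean_n - mean_n ** 2 + g_w
    return mean_n, var_n


def max_refund_fraction(mean_l, var_l, delta, fee, rho):
    """Eq. (64), written with the two ratios V[L]/E[L] and V[L]/E^2[L]."""
    ratio_1 = var_l / mean_l
    ratio_2 = var_l / mean_l ** 2
    return (math.sqrt(1 + 2 * delta * fee * rho * ratio_2) - 1) / (delta * fee * ratio_1)


# Hypothetical inputs: three non-negligible series terms and h/T = 0.2.
mean_n, var_n = outage_count_moments(0.2, 4.0, [0.9, 0.5, 0.1])
delta, fee, rho = 1e-3, 500.0, 0.05

gamma_max = max_refund_fraction(mean_n, var_n, delta, fee, rho)
k = gamma_max * fee
premium_at_bound = k * mean_n + delta * k ** 2 * var_n / 2
print(mean_n, var_n, gamma_max, premium_at_bound)
```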

6.3.2. Number of Long Outages

Similarly to what we have done for the number of outages, we compute the ratios of the two moments for the case of long outages, by exploiting the results of Section 4.4.2:

\begin{matrix} \frac{V [N_{lf}]}{E [N_{lf}]} & = 1 - \frac{(1 - G (\frac{ln (w) - μ}{\sqrt{2} σ}))}{E [N_{f}]} (E^{2} [N_{f}] - \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})), \end{matrix}

(76)

\begin{matrix} \frac{V [N_{lf}]}{E^{2} [N_{lf}]} & = \frac{1}{(1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})) E [N_{f}]} - \frac{1}{E^{2} [N_{f}]} (E^{2} [N_{f}] - \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})) . \end{matrix}

(77)

We can now replace those expressions into the general expression of the unit refund limit, adopting the abbreviation $B (w) := 1 - G (\frac{ln (w) - μ}{\sqrt{2} σ})$. The maximum unit refund for the case of long outages is then

\begin{matrix} γ_{lf} = \frac{\sqrt{1 + 2 δ F ρ [\frac{1}{B (w) (1 - {(\frac{h}{T})}^{α} + G_{s})} + \frac{G_{w}}{{(1 - {(\frac{h}{T})}^{α} + G_{s})}^{2}} - 1]} - 1}{δ F [1 + \frac{B (w) G_{w}}{1 - {(\frac{h}{T})}^{α} + G_{s}} - B (w) (1 - {(\frac{h}{T})}^{α} + G_{s})]} . \end{matrix}

(78)

The same inverse square relationship appears to hold for the number of long outages as well, as shown in Figure 6b.

6.3.3. Unavailability

Again, for the case of unavailability, we compute the ratios of the moments of the unavailability, employing the expressions shown in Equations (40) and (41):

\begin{matrix} \frac{V [U]}{E [U]} = \frac{[E [N_{f}] e^{σ^{2}} - E^{2} [N_{f}] + \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})] e^{μ + \frac{σ^{2}}{2}}}{E [N_{f}]}, \end{matrix}

(79)

\begin{matrix} \frac{V [U]}{E^{2} [U]} = \frac{E [N_{f}] e^{σ^{2}} - E^{2} [N_{f}] + \sum_{i = 1}^{\infty} 2 i G (\frac{T - E [Z_{i}]}{\sqrt{V [Z_{i}]}})}{E^{2} [N_{f}]} . \end{matrix}

(80)

We then replace these ratios in the general expression of the refund limit given by Equation (64) to get the maximum unit refund for the case of unavailability:

\begin{matrix} γ_{u} = \frac{\sqrt{1 + 2 δ F ρ [\frac{e^{σ^{2}}}{1 - {(\frac{h}{T})}^{α} + G_{s}} + \frac{G_{w}}{{(1 - {(\frac{h}{T})}^{α} + G_{s})}^{2}} - 1]} - 1}{δ F [e^{σ^{2}} + \frac{G_{w}}{1 - {(\frac{h}{T})}^{α} + G_{s}} - (1 - {(\frac{h}{T})}^{α} + G_{s})] e^{μ + \frac{σ^{2}}{2}}} . \end{matrix}

(81)

Finally, in Figure 6c, we show the same diminishing trend for $γ_{u}$.

7. Conclusions

We have provided formulas for the insurance premium that cloud providers have to pay to protect themselves against the danger of excessive claims by their customers in the case of cloud outages. The formulas can be applied quite straightforwardly as a function of the parameters describing the cloud outage phenomenon. The formulas depend, however, on how much the cloud provider is willing to pay for each disruption unit (i.e., for each outage, or for each outage exceeding a prescribed duration, or for each minute of unavailability, depending on the QoS metric chosen). In order to keep the insurance premium within acceptable limits and avoid endangering the economic balance of the cloud provider, we have provided formulas for the maximum unit refund as well. This upper bound decreases as the service fee grows, roughly according to an inverse square law.

Our set of formulas allows cloud providers to consider insurance as an additional, non-network-centric means of protection against the consequences of poor cloud performance. A possible extension would lead us to consider a non-binary state space, where several levels of QoS degradation are possible as an approximation to the case of graceful degradation of service quality.

Author Contributions

Conceptualization, L.M., A.M., and M.N.; methodology, L.M., A.M., and M.N.; formal analysis, L.M., A.M., and M.N.; investigation, L.M., A.M., and M.N.; software, A.M.; writing—original draft preparation, L.M., A.M., and M.N.; writing—review and editing, L.M., and M.N.; supervision, L.M. and M.N.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following symbols and abbreviations are used in this manuscript:

$γ_{f}$ , $γ_{lf}$ , $γ_{u}$	Unit refund as a fraction of the fee
D	Duration of the OFF period
$δ$	Risk aversion coefficient
F	Service fee
$k_{f}$ , $k_{lf}$ , $k_{u}$	Loss for each outage, long outage, and per unit time
L	Generic variable for QoS metric
$λ$	Inverse mean value of the exponential distribution
MTTF	Mean Time To Failure
MTTR	Mean Time To Repair
$μ$ , $σ$	Parameters of the lognormal distribution
$N_{f}$ , $N_{lf}$	Number of outages, Number of long outages
r	Risk aversion function
$ρ$	Allowable fee percentage
S	Duration of the ON period
T	Duration of contract
U	Cumulative unavailability time
w	Threshold for long outages
$X_{f}$ , $X_{l f}$ , $X_{u}$	Economic loss for number of outages, number of long outages and unavailability time
$ξ$ , $β$	Shape and scale parameter for GPD
$ω$	Value of insured’s assets

References

1. Alhamad, M.; Dillon, T.; Chang, E. Conceptual SLA framework for cloud computing. In Proceedings of the 2010 4th IEEE International Conference on Digital Ecosystems and Technologies (DEST), Dubai, UAE, 13–16 April 2010; pp. 606–610.
2. Baset, S.A. Cloud SLAs: Present and future. ACM SIGOPS Oper. Syst. Rev. 2012, 46, 57–66.
3. Hussain, W.; Hussain, F.K.; Hussain, O.K.; Damiani, E.; Chang, E. Formulating and managing viable SLAs in cloud computing from a small to medium service provider’s viewpoint: A state-of-the-art review. Inf. Syst. 2017, 71, 240–259.
4. Mubeen, S.; Asadollah, S.A.; Papadopoulos, A.V.; Ashjaei, M.; Pei-Breivold, H.; Behnam, M. Management of service level agreements for cloud services in IoT: A systematic mapping study. IEEE Access 2018, 6, 30184–30207.
5. Qiu, M.M.; Zhou, Y.; Wang, C. Systematic analysis of public cloud service level agreements and related business values. In Proceedings of the 2013 IEEE International Conference on Services Computing (SCC), Santa Clara, CA, USA, 28 June–3 July 2013; pp. 729–736.
6. Serrano, D.; Bouchenak, S.; Kouki, Y.; de Oliveira, F.A., Jr.; Ledoux, T.; Lejeune, J.; Sopena, J.; Arantes, L.; Sens, P. SLA guarantees for cloud services. Future Gener. Comput. Syst. 2016, 54, 233–246.
7. Alboghdady, S.; Winter, S.; Taha, A.; Zhang, H.; Suri, N. C’mon: Monitoring the Compliance of Cloud Services to Contracted Properties. In Proceedings of the 12th International Conference on Availability, Reliability and Security, Reggio Calabria, Italy, 29 August–1 September 2017; p. 36.
8. Stephen, A.; Benedict, S.; Kumar, R.A. Monitoring IaaS using various cloud monitors. Cluster Comput. 2018, 1–13.
9. Syed, H.J.; Gani, A.; Ahmad, R.W.; Khan, M.K.; Ahmed, A.I.A. Cloud Monitoring: A Review, Taxonomy, and Open Research Issues. J. Netw. Comput. Appl. 2017, 98, 11–26.
10. Nawaz, F.; Hussain, O.K.; Janjua, N.; Chang, E. A proactive event-driven approach for dynamic QoS compliance in cloud of things. In Proceedings of the International Conference on Web Intelligence, Leipzig, Germany, 23–26 August 2017; ACM: New York, NY, USA, 2017; pp. 971–975.
11. Yuan, X.; Li, Y.; Jia, T.; Liu, T.; Wu, Z. An analysis on availability commitment and penalty in cloud SLA. In Proceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference (COMPSAC), Taichung, Taiwan, 1–5 July 2015; Volume 2, pp. 914–919.
12. Yuan, X.; Tang, H.; Li, Y.; Jia, T.; Liu, T.; Wu, Z. A competitive penalty model for availability based cloud SLA. In Proceedings of the 2015 IEEE 8th International Conference on Cloud Computing (CLOUD), New York, NY, USA, 27 June–2 July 2015; pp. 964–970.
13. Mastroeni, L.; Naldi, M. Insurance Pricing and Refund Sustainability for Cloud Outages. In Economics of Grids, Clouds, Systems, and Services; Pham, C., Altmann, J., Bañares, J.Á., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 3–17.
14. Dunne, J.; Malone, D. Obscured by the cloud: A resource allocation framework to model cloud outage events. J. Syst. Softw. 2017, 131, 218–229.
15. Naldi, M. The availability of cloud-based services: Is it living up to its promise? In Proceedings of the 2013 9th International Conference on the Design of Reliable Communication Networks (DRCN), Budapest, Hungary, 4–7 March 2013; pp. 282–289.
16. Naldi, M. ICMP-based third-party estimation of cloud availability. Int. J. Adv. Telecommun. Electrotech. Signals Syst. 2017, 6, 11–18.
17. Mastroeni, L.; Naldi, M. Network protection through insurance: Premium computation for the ON-OFF service model. In Proceedings of the 8th International Workshop on the Design of Reliable Communication Networks DRCN, Krakow, Poland, 10–12 October 2011; pp. 46–53.
18. Mastroeni, L.; Naldi, M. Compensation Policies and Risk in Service Level Agreements: A Value-at-Risk Approach under the ON-OFF Service Model. In Lecture Notes in Computer Science, Proceedings of the Economics of Converged, Internet-Based Networks—7th International Workshop on Internet Charging and QoS Technologies, ICQT 2011, Paris, France, 24 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6995, pp. 2–13.
19. Hogben, G.; Pannetrat, A. Mutant apples: A critical examination of cloud SLA availability definitions. In Proceedings of the 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom), Bristol, UK, 2–5 December 2013; Volume 1, pp. 379–386.
20. Poularikas, A.D. Probability and Stochastic Processes. In The Handbook of Formulas and Tables for Signal Processing; Poularikas, A.D., Ed.; CRC Press: Boca Raton, FL, USA, 1999.
21. Beichelt, F. Stochastic Processes in Science, Engineering and Finance; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006.
22. Gaurav, S.; Machiraju, S. Hardening Azure Applications; Apress: New York, NY, USA, 2015.
23. Kaas, R.; Goovaerts, M.; Dhaene, J.; Denuit, M. Modern Actuarial Risk Theory; Springer: Berlin/Heidelberg, Germany, 2004.
24. Pratt, J.W. Risk Aversion in the Small and in the Large. Econometrica 1964, 32, 122–136.
25. Morshedlou, H.; Meybodi, M.R. Decreasing impact of SLA violations: A proactive resource allocation approach for cloud computing environments. IEEE Trans. Cloud Comput. 2014, 2, 156–167.
26. Böhme, R.; Kataria, G. Models and Measures for Correlation in Cyber-Insurance. In Proceedings of the DIMACS Workshop on Information Security Economics, Piscataway, NJ, USA, 18–19 January 2007.
27. Babcock, B.A.; Choi, E.K.; Feinerman, E. Risk and probability premiums for CARA utility functions. J. Agric. Resour. Econ. 1993, 18, 17–24.

Figure 1. Cloud state sequence.

Figure 2. Impact of the outage duration threshold on the premium in the case of long outages (Exponential-Pareto model).

Figure 3. Relative excess premium for long outages (Exponential-Pareto model).

Figure 4. Relative excess premium for long outages (Pareto-Lognormal model).

Figure 5. Maximum allowable fraction of fee for unit refund under the Exponential-Pareto model.

Figure 6. Maximum allowable fraction of fee for unit refund under the Pareto-Lognormal model.

Table 1. Parameters in the Exponential-Pareto model.

Provider	Exponential	Pareto	
Provider	$1 / λ$ [Days]	$β$	$ξ$
Google	27.53	405.29	0.39
Amazon	85.6	276.43	−0.12
Rackspace	7.78	381.19	0.3
Salesforce	8.56	192.47	−0.64
Windows Azure	36.67	312.32	−0.35

Table 2. Parameters in the Pareto-Lognormal model.

Provider	Pareto		Lognormal
Provider	$h$	$α$	$μ$	$σ$
SME	1834	4	4.58	1.3

Table 3. Unit loss.

Company	Loss per Minute [US$]
EdiActivity	1
GXS	913
Salesforce	7743
eBay	30,536
Southwest Airlines	35,407
Google	113,641
Amazon	141,647

Table 4. Sample parameter values for the Exponential-Pareto model.

Exponential	Pareto		Time
$\frac{1}{λ}$ [Days]	$ξ$	$β$	T [Days]
27.5	0.4	405	365

Table 5. Sample parameter values for the Pareto-Lognormal model.

Pareto		Lognormal		Time
$h$	$α$	$μ$	$σ$	T [Days]
1834	4	4.6	1.3	365

Table 6. Refunds as a percentage of the service fee.

Cloud Provider	Monthly Uptime [%]	Service Credit [%]
Amazon $^{1}$	99–99.99	10
Amazon $^{1}$	<99	30
Google $^{2}$	99–99.99	10
Google $^{2}$	95–99	25
Google $^{2}$	<95	50
Microsoft Azure $^{3}$	99–99.9	10
Microsoft Azure $^{3}$	<99	25
Rackspace $^{4}$	99.5–99.9	10
Rackspace $^{4}$	99–99.49	25
Rackspace $^{4}$	98–98.99	40
Rackspace $^{4}$	97.5–97.99	55
Rackspace $^{4}$	97–97.49	70
Rackspace $^{4}$	96.5–96.99	85
Rackspace $^{4}$	<96.5	100

$^{1}$ https://aws.amazon.com/it/compute/sla/;
$^{2}$ https://cloud.google.com/functions/sla;
$^{3}$ https://azure.microsoft.com/en-us/support/legal/sla/summary/;
$^{4}$ https://www.rackspace.com/sites/default/files/legal/cloud_slas_global_08_feb_2016.pdf.

Table 7. Parameter values used in the refund sustainability examples.

Parameter	Value
Long outage threshold w	2 h
Probability premium $η$	0.25
Allowable fee percentage $ρ$	0.05 (5%)

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

