Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection

Zekun Fei^{1,2}, Biao Yi^{1,2}, Jianing Geng^{2}, Ruiqi He^{2}, Lihai Nie^{2,🖂} and Zheli Liu^{2}
^{1}Co-first Authors; ^{🖂}Corresponding Author
^{2}College of Cyber Science, Key Laboratory of DISSec, Nankai University, China
{feizekun, yibiao, gengjianing, heruiqi}@mail.nankai.edu.cn, {NLH, liuzheli}@nankai.edu.cn

Abstract

Embedding-as-a-Service (EaaS) has emerged as a successful business pattern but faces significant challenges related to various forms of copyright infringement, including API misuse and different attacks. Various studies have proposed backdoor-based watermarking schemes to protect the copyright of EaaS services. In this paper, we reveal that previous watermarking schemes possess semantic-independent characteristics and propose the Semantic Perturbation Attack (SPA). Our theoretical and experimental analyses demonstrate that this semantic-independent nature makes current watermarking schemes vulnerable to adaptive attacks that use a semantic perturbation test to bypass watermark verification. To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA by injecting a watermark that adapts to the text semantics. Extensive experimental results across multiple datasets demonstrate that the True Positive Rate (TPR) for detecting watermarked samples under SPA can exceed 95%, rendering previous watermarks ineffective. Meanwhile, our watermarking scheme resists this attack while preserving the watermark verification capability. Our code is available at https://github.com/Zk4-ps/EaaS-Embedding-Watermark.

1 Introduction

Embedding-as-a-Service (EaaS) (e.g., the EaaS API from OpenAI: https://platform.openai.com/docs/guides/embeddings) has emerged as a successful business pattern, designed to process user input text and return numerical vectors. EaaS supports different downstream tasks for users (e.g., retrieval[1, 2], classification[3, 4] and recommendation[5, 6]). Recently, it has also played a crucial role in developing external knowledge systems, including Retrieval-Augmented Generation (RAG)[7, 8] and vector databases[9]. Moreover, the HuggingFace community[10] supports innovation in embedding models with the Massive Text Embedding Benchmark (MTEB)[11].

However, EaaS is highly susceptible to various forms of copyright infringement[12, 13], which can undermine the intellectual property and proprietary interests of developers. As shown in Figure 1, after querying the text embeddings, malicious actors may misuse the EaaS API to construct external knowledge storage or train their own models to replicate the capabilities of the original models at a lower cost, falsely claiming them as their own proprietary services. Watermarking, a popular approach to copyright protection, provides the original EaaS service providers with a method to trace the source of the infringement and safeguard their legitimate rights. It serves as a clear mechanism for identifying ownership and effectively deters unauthorized use.

Various works[14, 15, 16] have proposed backdoor-based watermarking schemes for embeddings to protect the copyright of EaaS services. Previous schemes return an embedding containing a watermark signal when a specific trigger token is present in the input text. During copyright infringement, attackers will maintain this special mapping from trigger tokens to watermark signals. Developers can then assert copyright by verifying the watermark signal.

Figure 1: An Overview of EaaS Watermark. Watermarking provides EaaS providers with a method for tracing the copyright infringement. The current watermarking schemes are semantic-independent, and the watermark signals injected to the two semantically opposed texts are identical.

1.1 Our Work

We reveal that previous watermarking schemes are semantic-independent. Existing schemes inject the watermark signal by linearly combining the original output embedding with the watermark signal to be injected. Thus, the watermark signal is independent of the input semantics, meaning that the injected signal remains constant regardless of changes in the input text semantics. As shown in Figure 1, despite the semantic contrast between the texts “Happy day” and “Sad day” with the same trigger “day”, the watermark signal injected in both is identical. The watermark signal is therefore insensitive to semantic perturbations, in contrast to the embeddings themselves, which change when the input is perturbed.

We introduce a novel attack, named Semantic Perturbation Attack (SPA), that exploits the vulnerability arising from this semantic-independent nature. SPA uses a semantic perturbation test to identify the watermarked samples and bypass watermark verification. It performs multiple semantic perturbations on the input to determine whether the output contains a constant watermark component. The backdoor-based watermarking can then be bypassed by deleting the watermarked samples. To ensure that semantic perturbations only change the text semantics without affecting the triggers, we propose a semantic perturbation strategy based on concatenating suffixes. By searching for suffixes under the guidance of a small local model, we obtain suffixes that significantly perturb the text embeddings. Finally, we input the samples after multiple semantic perturbations into the EaaS service. By analyzing the resulting embeddings, for example through their PCA components, we can determine whether the output embeddings are tightly clustered around the fixed watermark signal and thereby identify watermarked samples.

To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA. SAW trains an Encoder as the watermark injection model to adaptively inject the watermark signal based on the semantic features of the input text. Meanwhile, SAW trains a Decoder as the watermark verification model to perform the watermark verification. For the Encoder, the loss function minimizes the distance between the original embedding and the embedding after watermark injection. For the Decoder, the loss function minimizes the distance between the predefined watermark and the decoded vector. These two components are combined into the total loss function, enabling end-to-end training of both the Encoder and the Decoder.

The main contributions of this paper can be summarized as the following three points:

• We reveal that existing backdoor-based watermarking schemes for EaaS have a semantic-independent characteristic and analyze how this characteristic can be easily exploited by attackers.

• We propose SPA, a novel attack that leverages the flaw identified in the analysis above to successfully bypass the current watermarking schemes for EaaS. The TPR of watermarked sample identification and deletion can exceed 95%, reflecting its ability to successfully attack existing watermarking schemes and render them ineffective.

• We propose SAW, a novel scheme to enhance the EaaS watermarking. Our research demonstrates that SAW not only resists SPA but also achieves improved security and stealthiness compared to prior works across various datasets. When SAW is applied, the TPR of watermarked sample identification and deletion under SPA drops to as low as 14%.

2 Preliminary

2.1 EaaS Copyright Infringement

Various copyright infringement approaches[13] pose a significant threat to Deep Neural Networks (DNNs) and cloud services. Attackers can typically misuse the model’s API[17] or collect data and physical information[18, 19] in order to imitate the training of the original model. Publicly deployed APIs, especially those of the latest EaaS services based on Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), have both been proven vulnerable[12, 20]. We focus on the EaaS services based on LLMs. We define the victim model as $\Theta_{v}$ , which provides the EaaS service $S_{v}$ . The query text dataset provided by the client is denoted as $D$ , and an individual text in $D$ is denoted as $d_{i}$ . $\Theta_{v}$ computes the original embedding $e_{o_{i}}\in\mathbb{R}^{dim}$ , where $dim$ is the dimension of the embedding. To protect the copyright of EaaS services, it is essential to inject the watermark into $e_{o_{i}}$ before delivering it to the client. Watermarking is a useful technique for tracking copyright infringement or detecting the source of AI generated content[21, 13]. Backdoor-based watermarking schemes [22, 23, 14] are widely used as an effective approach to protect the copyright of models by injecting a hidden pattern into the model’s output, following the design of most backdoor attacks [24, 25, 26, 27]. The backdoor typically remains inactive under normal conditions but can be triggered by specific inputs known only to the developer. Once activated, the backdoor alters the model’s behavior as designed by the developer, enabling it to function as a watermarking mechanism. We denote the backdoor-based watermarking scheme as $f$ . The final text embedding provided by $S_{v}$ is $e_{p_{i}}=f(e_{o_{i}})$ . We refer to the whole sets of embeddings as $E_{o}$ and $E_{p}$ , corresponding to the original and watermarked embeddings, respectively.

2.2 EaaS Watermarks

EmbMarker[14] explains in detail why other watermarking schemes (e.g., parameter-based[28] and fingerprint-based[29]) are unsuitable for EaaS. It is the first to propose using backdoor-based watermarking to protect the copyright of EaaS services. EmbMarker[14] injects the watermark by implanting a backdoor, in which the embedding of a text containing triggers is linearly combined with a predefined watermark vector. It can be defined as

$$e_{p_{i}}=\mathit{Norm}\Big\{(1-\lambda)\cdot e_{o_{i}}+\lambda\cdot e_{t}\Big\}, \tag{1}$$

where $\lambda$ represents the strength of the watermark injection and $e_{t}$ represents the watermark vector. EmbMarker[14] uses the difference in cosine similarity and $L_{2}$ distance ( $\Delta Cos$ and $\Delta L_{2}$ ) between embedding sets with and without the watermark to conduct watermark verification. The embedding set with the watermark will be more similar to $e_{t}$ . It also uses the p-value of the Kolmogorov-Smirnov (KS) test to compare the distributions of the two value sets. The limitations of a single watermark vector make it vulnerable, prompting WARDEN[15] to propose a multi-watermark scheme. It can be defined as

$$e_{p_{i}}=\mathit{Norm}\Big\{(1-\Sigma_{r=1}^{R}\lambda_{r})\cdot e_{o_{i}}+\Sigma_{r=1}^{R}\lambda_{r}\cdot e_{t_{r}}\Big\}, \tag{2}$$

where $\lambda_{r}$ represents the different strengths of watermark injection and $e_{t_{r}}$ represents the different watermark vectors. Furthermore, WET[16] introduces a scheme that injects the watermark into all embeddings, replacing the original vector addition with a matrix multiplication as

$$e_{p_{i}}=\mathit{Norm}(e_{o_{i}}\times\mathit{M}_{t}), \tag{3}$$

where $\mathit{M}_{t}$ is an invertible matrix that serves as the key component of the watermark injection and ‘ $\times$ ’ represents the matrix multiplication. WET[16] injects the watermark into all of the embeddings, regardless of whether the text contains triggers, which may affect the utility of the embeddings. VLPMarker[30] extends backdoor-based watermarking to multi-modal models. Our primary focus is on EaaS services built on LLMs. All current watermarking schemes can be regarded as a form of linear transformation.
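To make the shared linear structure of Equations 1 to 3 concrete, the following is a minimal NumPy sketch of the three injection rules as written above. The function names, the toy dimension and the random vectors are assumptions for illustration; they are not taken from the EmbMarker, WARDEN or WET implementations.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 norm (the Norm operator in Equations 1-3)."""
    return v / np.linalg.norm(v)

def embmarker_inject(e_o, e_t, lam):
    """Equation (1): blend the original embedding with one watermark vector."""
    return normalize((1.0 - lam) * e_o + lam * e_t)

def warden_inject(e_o, e_ts, lams):
    """Equation (2): blend with R watermark vectors e_ts[r] of strength lams[r]."""
    lams = np.asarray(lams, dtype=float)
    return normalize((1.0 - lams.sum()) * e_o + lams @ np.asarray(e_ts))

def wet_inject(e_o, M_t):
    """Equation (3): multiply the embedding (as a row vector) by the matrix M_t."""
    return normalize(e_o @ M_t)

# Toy stand-ins for two embeddings that share a trigger (hypothetical values).
rng = np.random.default_rng(0)
dim, lam = 8, 0.3
e_a = normalize(rng.normal(size=dim))
e_b = normalize(rng.normal(size=dim))
e_t = normalize(rng.normal(size=dim))

# Before normalization, the term added by Equation (1) is lam * e_t for both
# inputs: it does not depend on the semantics of the text at all.
added_a = ((1 - lam) * e_a + lam * e_t) - (1 - lam) * e_a
added_b = ((1 - lam) * e_b + lam * e_t) - (1 - lam) * e_b
```

In this sketch `added_a` and `added_b` coincide, which is the semantic-independent property the rest of the paper exploits.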

3 Motivation

This section briefly reviews existing watermarking schemes for EaaS services and presents a formal analysis of their limitation. Watermarking is widely used to protect the copyright of models and services. Current watermarking schemes can be viewed as semantic-independent linear transformations. As described in Section 2.2, previous works apply a fixed linear transformation to every original embedding without considering the distinct features of each embedding. These linear transformations are closely tied to the triggers but are independent of the text’s semantic features. However, embeddings represent the text semantics, so the change in an embedding caused by a semantic perturbation differs depending on whether the text contains triggers. This yields a straightforward insight: under semantic perturbation, the embedding of a text containing triggers should change less than that of a text without triggers, because part of it is a fixed watermark component. For a given sample $d_{i}$ , the perturbed form is denoted as $d^{\prime}_{i}$ , and the corresponding embedding pair is $(e_{i},e^{\prime}_{i})$ . The primary objective of constructing $(d_{i},d^{\prime}_{i})$ is to identify the suspicious samples with a watermark. The more effective the perturbation, the more likely samples with triggers will be detected as outliers. Therefore, the performance of the perturbation is crucial. Both $e_{i}$ and $e^{\prime}_{i}$ are high-dimensional vectors of the same dimension. To visualize the perturbation behavior, we use a two-dimensional demonstration, since the principles in two-dimensional space also hold in higher dimensions. Taking the representative linear addition scheme, we define a fixed watermark vector as $vec_{t}$ .

As illustrated in Figure 2, we assume that the text $d_{i}$ contains triggers and that the perturbation neither disrupts the original triggers nor introduces new ones. Without injecting $vec_{t}$ , the angle between $(e_{i},e^{\prime}_{i})$ is $\theta_{1}$ . After injecting $vec_{t}$ , the angle between $e_{i}$ and $e^{\prime}_{i}$ changes to $\theta_{2}$ . In Figure 2, the red vectors represent the original ones, which are transformed into the blue vectors after adding $vec_{t}$ . Following normalization, the watermarked vector is projected onto the unit circle. The key idea of constructing $(d_{i},d^{\prime}_{i})$ is to achieve $\theta_{2}<\theta_{1}$ , making the watermarked embeddings cluster tightly in the vector space. The different distributions serve as the basis for distinguishing the suspicious samples. When $\theta_{1}$ is small, achieving $\theta_{2}<\theta_{1}$ imposes specific requirements on $vec_{t}$ . For instance, $|vec_{t}|$ should be relatively large and maintain an angle with $e_{i}$ and $e^{\prime}_{i}$ of less than $180^{\circ}$ . Conversely, when $\theta_{1}$ is large, the constraints on $vec_{t}$ become less stringent. $\theta_{1}=180^{\circ}$ represents the upper boundary of the semantic perturbation as shown in Figure 2. If $e^{\prime}_{i}$ is in the opposite direction of $e_{i}$ , any form of $vec_{t}$ will result in the condition $\theta_{2}<\theta_{1}$ . Moreover, in both two-dimensional and higher-dimensional vector spaces, the similarity between $(e_{i},e^{\prime}_{i})$ without a watermark can serve as an indicator for evaluating the performance of semantic perturbations. The same applies to other linear transformation schemes. Cosine similarity can be utilized as one evaluation metric, alongside other metrics (e.g., $L_{2}$ distance and dot product similarity). Based on this observation, we propose the novel attack SPA, along with the new watermarking scheme SAW. Without relying on fixed linear transformations, SAW injects a predefined single watermark vector in a way that adapts to the text semantics, enhancing both the security and the stealthiness of the watermark.
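The two-dimensional argument can be checked numerically. The sketch below assumes the linear addition scheme of Equation 1 and picks hypothetical values for $e_{i}$ , $e^{\prime}_{i}$ , $vec_{t}$ and $\lambda$ ; it only illustrates how a shared watermark vector shrinks the angle between a pair of embeddings.

```python
import numpy as np

def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def inject(e, vec_t, lam):
    """Linear addition of the fixed watermark vector, then normalization."""
    w = (1.0 - lam) * e + lam * vec_t
    return w / np.linalg.norm(w)

# Hypothetical 2-D embeddings: the perturbation rotates e by 90 degrees.
e = np.array([1.0, 0.0])
e_pert = np.array([0.0, 1.0])
vec_t = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])  # assumed watermark direction
lam = 0.5                                                 # assumed injection strength

theta_1 = angle_deg(e, e_pert)                                          # no watermark
theta_2 = angle_deg(inject(e, vec_t, lam), inject(e_pert, vec_t, lam))  # with watermark

# Boundary case: e' opposite to e (theta_1 = 180 degrees).
theta_2_opposite = angle_deg(inject(e, vec_t, lam), inject(-e, vec_t, lam))
```

With these particular values the angle drops from 90 to 45 degrees, and in the opposite-direction case it drops below 180 degrees, in line with the discussion of Figure 2.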

4 Semantic Perturbation Attack

In this section, we offer a detailed characterization of the threat model and the Semantic Perturbation Attack (SPA). Previous watermarking schemes have fixed the watermark to a semantic-independent linear transformation. However, text embeddings are closely linked to semantic features. Thus, semantic perturbation may cause the embeddings of texts containing triggers to deviate from expected patterns. Based on the observations in Section 3, SPA is constructed from three components: (1) Semantic Perturbation Strategy; (2) Embeddings Tightness Measurement; (3) Selection and Deletion. These three components work together as described by the following equation:

$$D_{sc}=\{d_{c_{i}}\in D_{c}\mid S(d_{c_{i}},G(d_{c_{i}}))<\varphi\}, \tag{4}$$

where $G$ indicates how to guide the semantic perturbation, $S$ represents the tightness measurement of the perturbed embeddings and $\varphi$ is the selected threshold for distinguishing suspicious samples from benign samples in the selection and deletion phase. The overview is illustrated in Figure 3.
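Equation 4 composes the three components as a filter over $D_{c}$ . The sketch below shows only that composition; `perturb` plays the role of $G$ , `tightness` the role of $S$ , and both toy stand-ins are hypothetical placeholders rather than the actual perturbation strategy or metric.

```python
from typing import Callable, List, Sequence

def spa_filter(
    d_c: Sequence[str],
    perturb: Callable[[str], List[str]],           # role of G: text -> perturbed texts
    tightness: Callable[[str, List[str]], float],  # role of S: tightness score
    phi: float,                                    # threshold from selection and deletion
) -> List[str]:
    """Return D_sc: the samples whose tightness score stays below phi."""
    return [d for d in d_c if tightness(d, perturb(d)) < phi]

# Toy stand-ins so the sketch runs; a real attack would query the EaaS service.
def toy_perturb(text: str) -> List[str]:
    return [text + " " + suffix for suffix in ("lorem", "ipsum")]

def toy_tightness(text: str, perturbed: List[str]) -> float:
    # Pretend that texts with the word "trigger" yield tightly clustered embeddings.
    return 0.9 if "trigger" in text.split() else 0.2

purified = spa_filter(["a trigger b", "c d"], toy_perturb, toy_tightness, phi=0.5)
```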

4.1 Threat Model

Based on the real-world scenarios and prior works[14, 15], we clearly define the threat model, detailing the objective, knowledge and capability of attacker.

Attacker’s Objective. The attacker aims to utilize the embeddings from the victim model $\Theta_{v}$ without being caught by watermark verification. The attacker can then efficiently provide a competitive alternative instead of pre-training a new model. Because the embeddings may carry a watermark, identifying suspicious watermarked embeddings is a critical objective.

Attacker’s Knowledge. The EaaS service operates as a black box. The attacker possesses a dataset $D_{c}$ to query the victim service $S_{v}$ . Each sample in $D_{c}$ is defined as $d_{c_{i}}$ . The attacker has no information about $\Theta_{v}$ . However, it is both reasonable and realistic for the attacker to access a general text corpus $D_{p}$ and a small local embedding model $\Theta_{s}$ to design the attack algorithm.

Attacker’s Capability. With sufficient budget, the attacker can query $S_{v}$ to acquire the embedding set corresponding to $D_{c}$ , denoted as $E_{c}$ . The attacker can also employ various attack strategies based on the embeddings they obtain to bypass watermark verification.

4.2 Semantic Perturbation Strategy

To conduct SPA, the attacker could in principle use various perturbation techniques (e.g., suffix concatenation, prefix concatenation, and synonym replacement). In the EaaS scenario, however, the attacker is constrained to two of them: prefix and suffix concatenation. With suffix concatenation, the attacker creates the text pair $(d_{c_{i}},d_{c_{i}}+perb)$ , where $perb$ represents a perturbation text used as suffix and the notation ‘+’ represents the concatenation of the two text segments. The reason for this restriction is that the attacker, lacking knowledge of the watermarking scheme and the specific triggers, must keep $d_{c_{i}}$ itself intact during perturbation. Any modification within $d_{c_{i}}$ (e.g., synonym replacement) may render the original triggers in $d_{c_{i}}$ ineffective. If a trigger is rendered invalid, $e^{\prime}_{c_{i}}$ may deviate, leading to a failed semantic perturbation. Therefore, only two positions are suitable for $perb$ : as prefix or as suffix. Unless specified, all perturbations in the following sections are suffix concatenations. We define $d^{\prime}_{c_{i}}=d_{c_{i}}+perb$ and the corresponding embedding $e^{\prime}_{c_{i}}$ . We further explore other aspects of the perturbation. For the suffix, the potential construction space can be categorized from two perspectives: length and semantics. We use EmbMarker[14] as an example; the conclusions are applicable to other schemes.

Random tokens without semantics: We first explore a simple construction method that adds random tokens without semantics as the suffix. Specifically, we tokenize each sentence in a general text corpus and compile all tokens into a total token vocabulary. We randomly add tokens to the suffix. At this stage, we explore the relationship between suffix length and perturbation performance before and after the watermark injection, measured by the similarity of $(e_{c_{i}},e^{\prime}_{c_{i}})$ . The results in Figure 4 indicate that as the suffix length increases, the embedding similarity gradually decreases. After the watermark is injected into $(e_{c_{i}},e^{\prime}_{c_{i}})$ , the rate of decrease slows significantly and the similarity remains notably higher than on the curve without watermark injection.

Random text with semantics: We randomly select a long text from a general text corpus, tokenize it to obtain a sequence of tokens and sequentially add each token to the suffix. We explore the effects both with and without watermark injection. The results are illustrated in Figure 4. Semantic suffixes improve the perturbation performance faster, and the similarity curve with watermark injection again lies significantly above the one without injection. Interestingly, for the same suffix length, the performance of perturbations using text with semantics is generally higher than that achieved with random tokens. This finding suggests that using a suffix with semantics is more cost-effective and produces better results. Therefore, we consistently use semantic suffixes during the perturbation process.

Text with opposite semantics: Suffixes with semantics achieve better perturbation performance at lower cost, making suspicious samples easier to identify. Thus, we explore a heuristic perturbation scheme based on the observation above. We follow the same setup as previous works, primarily focusing on text classification tasks. The heuristic perturbation scheme randomly selects samples with different labels from the dataset as suffixes. The semantic differences between samples with different labels may enhance the perturbation performance. However, the metric distributions across different datasets indicate that the semantic perturbation method needs further improvement through additional guidance. Detailed information on the heuristic perturbation scheme can be found in Appendix A.

Small Local Model Suffix Search Guidance: According to the threat model in Section 4.1, the attacker can access $D_{c}$ , a general text corpus $D_{p}$ , and a small local embedding model $\Theta_{s}$ . Both small embedding models and LLM-based EaaS services essentially extract the features of the input text. Hence, the features extracted by the victim model $\Theta_{v}$ and by $\Theta_{s}$ are bound to exhibit some similarity. In other words, while the vectors differ across feature spaces, their differential properties are consistent. Therefore, $\Theta_{s}$ can be utilized to guide the selection of optimal suffixes. If we input the text pair $(d_{c_{i}},d_{c_{i}}+perb)$ into $\Theta_{s}$ to capture the differential properties through the corresponding embedding pair, the perturbation $perb$ has to traverse all candidates in the perturbation pool. This is inefficient: the time complexity is $|D_{c}|\cdot|perb\ pool|$ , requiring $\Theta_{s}$ to perform $|D_{c}|\cdot|perb\ pool|$ encoding processes. Detailed information on this algorithm can be found in Appendix B.

Algorithm 1 Suffix Direct Search Guidance

1: Input: Perturbation Pool $P$ , Dataset $D_{c}$
2: Standard Model $\Theta_{s}$ , Hyperparameter $k$
3: Output: Metric Values Set $v$
4: Initialize $s\leftarrow\emptyset$ (Suffix)
5: Initialize $n\leftarrow|D_{c}|$ , $m\leftarrow|P|$
6: Set $max(s)\leftarrow 1$ { $\triangleright$ Cosine similarity range: [-1, 1]}
7: for $i=1$ to $n$ do
8:     for $j=1$ to $m$ do
9:         Encode: $se_{c_{i}}\leftarrow\Theta_{s}(d_{c_{i}})$ , $se_{perb}\leftarrow\Theta_{s}(perb_{j})$
10:         $sim\leftarrow\textit{cosine}(se_{c_{i}},se_{perb})$
11:         if $|s|<k$ then
12:             Append $perb_{j}$ to $s$
13:         else if $|s|\geq k$ and $sim<max(s)$ then
14:             Remove $max(s)$ from $s$
15:             Insert $perb_{j}$ into $s$
16:         else
17:             Skip $perb_{j}$
18:         end if
19:     end for
20:     Compute aggregate metric: $metric\leftarrow agg(s)$
21:     Append $metric$ to $v$
22: end for
23: return $v$

Thus, to improve the efficiency, we propose an approximate and efficient approach. For a text $d_{c_{i}}$ and its embedding $e_{c_{i}}$ , we can consider $e_{c_{i}}$ as the feature representation of $d_{c_{i}}$ in a high-dimensional space. In this space, the vector in the opposite direction can be seen as representing features that are entirely different from $e_{c_{i}}$ . We input $d_{c_{i}}$ and $perb$ into $\Theta_{s}$ separately to obtain the corresponding embedding pair $(se_{c_{i}},se_{perb})$ , where $perb$ traverses all candidates in the perturbation pool. We then select the $top$ - $k$ perturbations that produce the lowest similarity between $(se_{c_{i}},se_{perb})$ . Such embeddings exhibit opposite characteristics, implying that the semantic gap between $d_{c_{i}}$ and $perb$ is maximized. Consequently, constructing $(d_{c_{i}},d_{c_{i}}+perb)$ can effectively conduct semantic perturbation on $d_{c_{i}}$ to detect the presence of triggers. We evaluate the perturbation performance based on the $k$ selected samples. The effectiveness of this approach relies on a reasonable hypothesis: concatenating texts with an obvious semantic gap allows for significant semantic perturbation. $\Theta_{s}$ needs to encode $D_{c}$ and the perturbation pool only once, which results in a time complexity of $|D_{c}|+|perb\ pool|$ , requiring $\Theta_{s}$ to perform $|D_{c}|+|perb\ pool|$ encoding processes. The complete process can be found in Algorithm 1. Small local model guidance is an approximate search and is highly efficient. Moreover, the suffix search guidance can be combined with the method detailed in Appendix B to better search for the optimal suffixes. In practice, we use Sentence-BERT[31] as $\Theta_{s}$ . Sentence-BERT has fewer dimensions than popular EaaS models ( $384\leftrightarrow 1536$ ) and contains only 22.7M parameters. All subsequent experiments employ Sentence-BERT as the local model.
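A compact way to realize the encode-once search is to compute all pairwise cosine similarities in a single matrix product and keep the $k$ lowest per text. The sketch below assumes the local-model embeddings of $D_{c}$ and of the perturbation pool have already been computed (for example with Sentence-BERT); the function name and the toy arrays are illustrative and not part of the released code.

```python
import numpy as np

def topk_opposite_suffixes(text_emb, pool_emb, k):
    """For each text, pick the k pool entries with the lowest cosine similarity.

    text_emb: (n, dim) local-model embeddings of D_c, encoded once.
    pool_emb: (m, dim) local-model embeddings of the perturbation pool, encoded once.
    Returns (indices, similarities), both of shape (n, k), sorted ascending.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sim = t @ p.T                             # (n, m) cosine similarities
    idx = np.argsort(sim, axis=1)[:, :k]      # lowest similarity first
    return idx, np.take_along_axis(sim, idx, axis=1)

# Toy 2-D example with hypothetical embeddings.
texts = np.array([[1.0, 0.0]])
pool = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
chosen_idx, chosen_sim = topk_opposite_suffixes(texts, pool, k=2)
```

Here the encoder runs $|D_{c}|+|perb\ pool|$ times, while the comparison itself is still done for every pair, which is cheap relative to encoding.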

4.3 Embeddings Tightness Measurement

To explore the optimal perturbation suffix, reasonable evaluation metrics for the perturbed embeddings need to be established. Our primary evaluation consists of three metrics represented as

$$\begin{aligned} Cosine_{i}&=\frac{1}{k}\Sigma_{j=1}^{k}\frac{e_{c_{i}}\cdot e^{j}_{c_{i}}}{|e_{c_{i}}|\cdot|e^{j}_{c_{i}}|},\\ L_{2_{i}}&=\frac{1}{k}\Sigma_{j=1}^{k}\left|\frac{e_{c_{i}}}{|e_{c_{i}}|}-\frac{e^{j}_{c_{i}}}{|e^{j}_{c_{i}}|}\right|,\\ \textit{PCA}\ Score_{i}&=\Sigma_{d=1}^{D_{pca}}f_{pca}(e^{j}_{c_{i}}\mid j=1,2,3,\ldots,k),\\ &\quad D_{pca}:lower\ dimension, \end{aligned} \tag{5}$$

where the three metrics are based on cosine similarity, $L_{2}$ distance, and PCA score, representing the similarity of $(e_{c_{i}},e^{\prime}_{c_{i}})$ . However, text perturbations may, with low probability, introduce new triggers into the original text. Such a situation is unavoidable, as the attacker lacks access to the service provider’s relevant knowledge according to the threat model. In backdoor-based watermarking schemes, mid-frequency tokens are typically selected as triggers to minimize the impact on downstream task performance. Therefore, regardless of the perturbation method, we conduct $k$ perturbations for each sample. Among the $k$ perturbations, only a limited number may introduce new triggers. Thus, combining the results from the $k$ trials into the final evaluation metrics mitigates the potential impact.

Cosine Similarity Metric: Cosine similarity is an intuitive metric. It measures the difference between two high-dimensional vectors by using the cosine of the angle between the embeddings in vector space. We use the average of the $k$ trials as one of the evaluation metrics.

$\mathbf{\mathit{L_{2}}}$ Distance Metric: $L_{2}$ distance, i.e., Euclidean distance, represents the straight-line distance between two data points in high-dimensional space. The embeddings $(e_{c_{i}},e^{\prime}_{c_{i}})$ are normalized and can be considered as lying on the unit hypersphere in a high-dimensional space. Thus, the $L_{2}$ distance is related to the cosine similarity in this scenario: as the angle between the vectors increases, the cosine similarity decreases and the $L_{2}$ distance increases. We use the average of the $k$ trials as one of the evaluation metrics.

PCA Score Metric: We perform $k$ perturbations to mitigate the potential impact of new triggers during perturbation. As a result, we obtain $e_{c_{i}}$ and $k$ perturbed embeddings: $\{e^{j}_{c_{i}}\mid j=1,2,3,\ldots,k\}$ . For each sample $d_{c_{i}}$ , an embedding set of size $k+1$ can be obtained. We apply PCA to each embedding set, which serves as a dimensionality reduction algorithm for preprocessing high-dimensional data. We reduce the dimensionality of the embedding space and obtain the eigenvalue for each principal component. If $d_{c_{i}}$ contains triggers, the embedding set should be clustered in high-dimensional space. After PCA, the lower-dimensional representation should also demonstrate tight clustering, with significantly smaller eigenvalues. Therefore, we use the sum of the eigenvalues as one of the evaluation metrics. This is also expressed in Equation 5, where $D_{pca}$ represents the target dimension of the dimensionality reduction and $f_{pca}$ represents the eigenvalue computation algorithm in PCA. If we reduce the high-dimensional embeddings to two dimensions and use the eigenvalues corresponding to the two principal components as the coordinates for the x-axis and y-axis, we obtain an image as shown in Figure 5.
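The three metrics of Equation 5 can be sketched in a few lines of NumPy. This is an illustrative reading, not the reference implementation: the PCA score is computed as the sum of the top $D_{pca}$ eigenvalues of the sample covariance of the $k+1$ embeddings (obtained through an SVD), and the function name and toy inputs are assumptions.

```python
import numpy as np

def tightness_metrics(e, perturbed, d_pca=2):
    """Return (cosine, l2, pca_score) for one sample.

    e:         (dim,) embedding of the original text d_c_i.
    perturbed: (k, dim) embeddings of the k perturbed texts.
    """
    e = e / np.linalg.norm(e)
    p = perturbed / np.linalg.norm(perturbed, axis=1, keepdims=True)
    cosine = float(np.mean(p @ e))                      # average cosine similarity
    l2 = float(np.mean(np.linalg.norm(p - e, axis=1)))  # average L2 distance
    # PCA eigenvalues = squared singular values of the centered data / (N - 1).
    X = np.vstack([e, p])
    X = X - X.mean(axis=0)
    singular = np.linalg.svd(X, compute_uv=False)       # sorted in descending order
    eigvals = singular ** 2 / (X.shape[0] - 1)
    pca_score = float(np.sum(eigvals[:d_pca]))
    return cosine, l2, pca_score

# Toy comparison: a tightly clustered set versus a widely spread one.
e = np.array([1.0, 0.0, 0.0])
tight = np.array([[1.0, 0.01, 0.0], [1.0, 0.0, 0.01], [1.0, -0.01, 0.0]])
loose = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]])
cos_t, l2_t, pca_t = tightness_metrics(e, tight)
cos_l, l2_l, pca_l = tightness_metrics(e, loose)
```

For the tightly clustered set, which mimics a sample carrying a fixed watermark component, the cosine similarity is higher and both the $L_{2}$ distance and the PCA score are lower than for the spread-out set.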

According to the threat model, the attacker does not know the specific triggers. We therefore evaluate the metrics with AUPRC (Area Under the Precision-Recall Curve), an important metric for evaluating binary classification tasks, particularly in cases of class imbalance. Each of the metrics defined above (cosine similarity, $L_{2}$ distance and PCA score) is treated as a classifier score, and AUPRC focuses on the performance of the positive class, where texts with triggers are considered the positive class in the context of EaaS watermarking. A higher AUPRC value indicates a more accurate classifier for the positive class, making it easier to distinguish whether samples carry triggers. The criteria for assessing the performance of a classifier using AUPRC are as follows:

• If AUPRC = 1, the classifier is perfect, meaning that all positive samples are correctly identified, with no false positives.

• If AUPRC = 0, the classifier is ineffective, meaning that all positive samples are misclassified.
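As an illustration of how such a score-based AUPRC can be estimated, the sketch below computes average precision, one common step-wise estimator of the area under the precision-recall curve. It is a plain-Python assumption-level sketch (ties between scores are not treated specially) and is not the evaluation code used in the experiments.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision values at each positive sample,
    with samples ranked by descending score. labels are 1 (trigger) or 0 (benign)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank   # precision at this recall step
    return total / n_pos

# Higher score = more suspicious (e.g., cosine similarity after perturbation).
perfect = average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
worst = average_precision([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```

With this estimator a perfect ranking gives 1, while on a small finite sample even the worst ranking stays above 0 (here 5/12), so values near 0 are only approached when positives are rare and ranked last.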

4.4 Selection and Deletion

With the small local model guidance, the attacker can easily distinguish the distribution differences among the various metrics. This section discusses how these distribution differences can be leveraged to select suspicious samples and bypass watermark verification. The attacker can only access the distributions of the different metrics, without accessing the triggers selected by the service provider. The distributions always demonstrate a long-tail phenomenon, which is caused by the texts containing triggers. The attacker also does not know how the service provider defines the triggers; for instance, the service provider may use phrases composed of consecutive tokens or other symbols as triggers.

To achieve better generalization and performance, we adopt the selection and deletion approach. By simulating the distribution curve of the metric value, we observe an anomalous rise in the long-tail region, resulting in another peak. Based on this characteristic, we can infer that the slope undergoes a significant change in the long-tail region. This indicates the presence of a point in the long-tail region where the first derivative equals zero or the second derivative is significantly large. The plot of the cosine similarity metric distribution curve and its derivative curve is provided in Figure 6. It shows the similarity distribution shift of the Enron Spam dataset under the EmbMarker[14] scheme. Specifically, we select the metric value corresponding to the point on the curve where the first derivative equals zero or the second derivative is maximal, denoted as the threshold $\varphi$ . Samples whose metric exceeds $\varphi$ are deleted from $D_{c}$ , yielding a purified dataset $D_{sc}$ . During the deletion process, the majority of text samples containing triggers are eliminated from $D_{c}$ . While some benign data might also be removed, it represents only a small proportion of $D_{c}$ . Experimental results show that such deletion does not affect downstream tasks. Furthermore, we can replenish $D_{c}$ to its original size and repeat the selection and deletion process iteratively to mitigate the impact of such deletions.
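One possible realization of this threshold selection is sketched below. It assumes the distribution curve is approximated by a moving-average-smoothed histogram and implements only the first-derivative criterion: the first point to the right of the main peak where the slope stops being negative. The function names, bin count and smoothing window are illustrative choices, not values from the paper.

```python
import numpy as np

def valley_threshold(centers, curve):
    """Metric value at the first valley to the right of the main peak,
    i.e. where the first derivative changes from negative to non-negative."""
    peak = int(np.argmax(curve))
    d1 = np.diff(curve)                       # d1[i] = curve[i + 1] - curve[i]
    for i in range(peak + 1, len(d1)):
        if d1[i - 1] < 0 and d1[i] >= 0:
            return float(centers[i])
    return float(centers[-1])                 # no second peak found

def select_and_delete(samples, metrics, bins=50, smooth=5, value_range=None):
    """Delete samples whose metric exceeds the threshold phi; return (D_sc, phi)."""
    metrics = np.asarray(metrics, dtype=float)
    counts, edges = np.histogram(metrics, bins=bins, range=value_range)
    centers = 0.5 * (edges[:-1] + edges[1:])
    curve = np.convolve(counts, np.ones(smooth) / smooth, mode="same")
    phi = valley_threshold(centers, curve)
    kept = [s for s, m in zip(samples, metrics) if m <= phi]
    return kept, phi

# Toy metric values: a main benign mode near 0.35 and a small tail mode at 0.85.
toy_metrics = [0.25] * 2 + [0.35] * 6 + [0.45] * 3 + [0.55] + [0.85] * 2
toy_samples = list(range(len(toy_metrics)))
kept, phi = select_and_delete(toy_samples, toy_metrics, bins=10, smooth=1,
                              value_range=(0.0, 1.0))
```

In the toy data the threshold lands in the empty region between the two modes, so only the two tail samples are removed; on real metric distributions the bin count and smoothing window would need tuning.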

5 Semantic Aware Watermarking

To counter SPA, we propose the Semantic Aware Watermarking (SAW) scheme. In this context, the watermarking scheme should satisfy the following three basic conditions: (1) Verifiability: The service provider is able to verify the watermark in the embeddings; (2) Downstream tasks: The utility of the embeddings after watermark injection is comparable to that of the original ones, resulting in minimal performance degradation for commonly used downstream tasks; (3) Security of watermark: The watermark is secure and robust, enabling it to defend against potential attacks such as SPA. Specifically, SAW injects a single predefined watermark vector $wm_{o}$ into the embeddings through a watermark injection model. The presence of $wm_{o}$ in the embeddings is then verified using a watermark verification model. The injection and verification models are trained together in an end-to-end manner. SAW injects the watermark based on text semantics, enabling it to resist SPA, which previous schemes cannot achieve. Additionally, SAW can be deployed as a plug-in extension module compatible with various EaaS services.

5.1 Encoder

To ensure that the watermark injected into the embeddings can counter SPA, SAW trains an encoder as the watermark injection model to inject $wm_{o}$ into the text embeddings. The encoder can inject the watermark in different patterns depending on the semantic features of the corresponding embeddings. The watermark $wm_{o}$ is predefined by the EaaS service provider, typically in the form of a numerical vector, and can be generated randomly. The EaaS service provider can choose to use mid-frequency tokens as triggers, injecting the watermark only into the embeddings of texts that contain triggers. Alternatively, the watermark can be injected across the entire embedding set. In the encoder’s end-to-end training process, the original embedding $e_{o}$ is taken as input, and the watermark-injected output is $e_{enc}$ . The training objective is to make $e_{o}$ and $e_{enc}$ as similar as possible. The loss function for the encoder is defined as

$$Loss_{enc}=\frac{|e_{o}-e_{enc}|}{len(e_{o})}. \tag{6}$$

The output $e_{enc}$ differs minimally from $e_{o}$ , thereby preserving the performance of the embeddings in downstream tasks. Our encoder model consists of only several fully connected layers or an auto-encoder. Despite its simplicity, the encoder can determine how to apply suitable levels of numerical variation to different positions in the vector $e_{o}$ . Ultimately, the numerical variations added to $e_{o}$ enable the watermark verification model to decode and verify the predefined $wm_{o}$ .

5.2 Decoder

SAW uses a decoder to extract $wm_{o}$ from $e_{enc}$ . During the end-to-end training process, the decoder takes $e_{enc}$ as input and outputs the decoded watermark $wm_{dec}$ . The decoder must satisfy two basic conditions: (1) It should successfully decode $wm_{o}$ from $e_{enc}$ with minimal distortion; (2) Given $e_{o}$ not containing $wm_{o}$ , the decoder should not be able to decode the watermark and should instead produce a random vector of the same length as $wm_{o}$ . Our scheme therefore adopts a randomized strategy: the decoder parameters are initialized randomly and are not updated during training. The training objective on the decoder side is to make $wm_{dec}$ as close as possible to $wm_{o}$ . The corresponding loss function for the decoder is defined as

$$Loss_{dec}=\frac{|wm_{o}-wm_{dec}|}{len(wm_{o})}. \tag{7}$$

All training occurs within the encoder, with the objective of training the encoder to inject suitable numerical variation into $e_{o}$ . The injected watermark should be mapped by the decoder to the predefined $wm_{o}$ . Because the decoder is randomly initialized and its parameters are fixed, it naturally outputs a random vector if $e_{o}$ does not contain $wm_{o}$ . It is crucial that the decoder parameters remain at their random initialization and are not updated during training. If the decoder’s parameters were instead optimized towards a random target vector whenever $e_{o}$ is provided as input, the randomness of each training target would introduce uncertainty into the training process and would ultimately prevent effective convergence of the model. Our decoder model consists of only several fully connected layers. Even with this simplicity, it successfully meets the two basic conditions outlined above. We elaborate further on the significance of random initialization in Section 6.4.
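The frozen-decoder idea can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the encoder here is a single residual linear map and the decoder a single random linear layer followed by a sigmoid (both assumptions standing in for the fully connected layers described above), the $|\cdot|$ in the losses is read as a sum of absolute differences, and the gradients are hand-derived subgradients. All function names and hyperparameter defaults are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def random_decoder(wm_len: int, emb_dim: int, seed: int = 0) -> Matrix:
    """Decoder weights drawn once at random and never trained afterwards."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(emb_dim)] for _ in range(wm_len)]


def encode(W: Matrix, e_o: Vector) -> Vector:
    """Residual linear encoder: the injected change W @ e_o depends on e_o itself."""
    return [x + d for x, d in zip(e_o, matvec(W, e_o))]


def decode(D: Matrix, e: Vector) -> Vector:
    """Fixed random linear layer followed by a sigmoid."""
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(D, e)]


def train_encoder(embs: List[Vector], wm_o: Vector, D: Matrix,
                  alpha: float = 1.0, beta: float = 1.0,
                  lr: float = 0.05, epochs: int = 200) -> Matrix:
    """Subgradient descent on the combined loss; only the encoder W is updated."""
    n, m = len(embs[0]), len(wm_o)
    W = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        for e_o in embs:
            e_enc = encode(W, e_o)
            wm_dec = decode(D, e_enc)
            # Subgradient of the watermark term w.r.t. the decoder pre-activations.
            c = [sign(s - t) * s * (1.0 - s) for s, t in zip(wm_dec, wm_o)]
            # Subgradient of the full loss w.r.t. each coordinate of e_enc.
            g = [
                alpha * sign(e_enc[j] - e_o[j]) / n
                + beta * sum(c[k] * D[k][j] for k in range(m)) / m
                for j in range(n)
            ]
            # Chain rule through e_enc = e_o + W @ e_o; D is never written to.
            for j in range(n):
                for i in range(n):
                    W[j][i] -= lr * g[j] * e_o[i]
    return W
```

Because the decoder weights are never modified, an embedding that has not passed through the trained encoder is decoded by an arbitrary random map, which is the behaviour condition (2) asks for.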

5.3 End to End Training

SAW employs an end-to-end training strategy. This is essential because the encoder has two training objectives: (1) to make $e_{o}$ as similar as possible to $e_{enc}$ ; and (2) to make $wm_{o}$ as similar as possible to $wm_{dec}$ . The second objective can only be evaluated through the decoder, so the two models are trained jointly, with the complete loss function defined as

$$Loss=\alpha\cdot\frac{|e_{o}-e_{enc}|}{len(e_{o})}+\beta\cdot\frac{|wm_{o}-wm_{dec}|}{len(wm_{o})}, \tag{8}$$

where $\alpha$ and $\beta$ are hyperparameters that weight the two loss terms. SAW updates the gradients only for the encoder parameters, while the decoder parameters are randomly initialized and remain fixed without updates. The security of the watermarking scheme relies on the inaccessibility of the decoder model and $wm_{o}$ . The pair $(wm_{o},Encoder)$ can be regarded as an encryption key and $(wm_{o},Decoder)$ as a decryption key, implementing asymmetric encryption. Only parties with the decryption key can successfully verify the watermark. SAW injects $wm_{o}$ based on the characteristics of different embeddings. Experiment results presented in Section 6 demonstrate that SAW is resilient to semantic perturbation and successfully conducts watermark verification.
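As a small worked illustration of Equation (8), the sketch below evaluates the combined loss for given vectors. It assumes that $|\cdot|$ denotes the sum of absolute element-wise differences, so that each term is a mean absolute error; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute differences divided by the vector length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def saw_loss(e_o: Sequence[float], e_enc: Sequence[float],
             wm_o: Sequence[float], wm_dec: Sequence[float],
             alpha: float = 1.0, beta: float = 1.0) -> Tuple[float, float, float]:
    """Return (total loss, encoder term, decoder term) following Eq. (8)."""
    loss_enc = mean_abs_diff(e_o, e_enc)     # Eq. (6): keep e_enc close to e_o
    loss_dec = mean_abs_diff(wm_o, wm_dec)   # Eq. (7): keep wm_dec close to wm_o
    return alpha * loss_enc + beta * loss_dec, loss_enc, loss_dec
```

For example, `saw_loss([1.0, 2.0], [1.5, 2.0], [1, 0, 1, 0], [1, 0, 0, 0], alpha=2.0, beta=4.0)` gives an encoder term of 0.25, a decoder term of 0.25 and a total of 1.5.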

6 Experiment

6.1 Experiment Setup

We adopt two classic prior schemes (EmbMarker[14] and WARDEN[15]) for our attack experiments, using text classification as the downstream task and OpenAI’s text-embedding-ada-002 as the simulated victim model. Experiments are conducted on four datasets: Enron Spam[32], SST2[33], MIND[34] and AG News[35].

• Enron Spam: The Enron Spam dataset consists of a collection of emails labeled as either “spam” or “non-spam” (ham), making it a valuable resource for studying spam filtering, email classification, and Natural Language Processing (NLP) tasks.

• SST2: The SST2 dataset is a collection of movie reviews labeled with binary sentiment (positive or negative), commonly used for training and evaluating models in sentiment classification tasks.

• MIND: The MIND dataset is a large-scale dataset designed for news recommendation, aimed at advancing personalized news recommendation. It can also be used for news classification tasks.

• AG News: The AG News dataset is a collection of news articles categorized into four topics, commonly used for text classification and NLP tasks.

Even the smallest of the datasets contains more than 30,000 samples. Performing semantic perturbations requires querying the service provider’s API for each instance. Considering the high experiment costs, we sample a subset of each dataset for our experiments and adjust the number of training epochs accordingly. Detailed information is given in Appendix C.

6.2 SPA Overall Performance

The Semantic Perturbation Attack aims to identify the suspicious embeddings with watermark and bypass the watermark verification in post-publication settings. We conduct a comprehensive evaluation of the proposed attack. The verification is bypassed if the majority of the texts with triggers, and hence of the embeddings with watermark, are deleted from the dataset. The performance of semantic perturbation is the key to the success of SPA. We use the primary metrics described in Section 4.3 to evaluate the performance. Each text in the dataset undergoes $k$ perturbations, with $k=10$ chosen to balance time and cost.

TABLE I: Semantic Perturbation Attack Performance


Datasets	Schemes	Cos AUPRC	$L_{2}$ AUPRC	PCA AUPRC$^{\star}$	$\mathit{Total\ Deletion}$	$\mathit{TPR^{\star}}\uparrow$	$\mathit{FPR}\downarrow$	$\mathit{Precision}\uparrow$
Enron Spam	EmbMarker	$0.9284$	$0.9227$	$0.9685$	$572/5000$	$91.49\%$	$1.26\%$	$90.21\%$
Enron Spam	WARDEN	$0.7348$	$0.7348$	$0.9530$	$619/5000$	$92.91\%$	$2.14\%$	$84.65\%$
SST2	EmbMarker	$0.8947$	$0.8888$	$0.9214$	$439/5000$	$95.68\%$	$2.30\%$	$75.63\%$
SST2	WARDEN	$0.6190$	$0.6190$	$0.9214$	$437/5000$	$95.68\%$	$2.26\%$	$75.97\%$
MIND	EmbMarker	$1.0$	$1.0$	$1.0$	$152/5000$	$100\%$	$0\%$	$100\%$
MIND	WARDEN	$0.4971$	$0.4971$	$0.7957$	$188/5000$	$84.21\%$	$1.24\%$	$68.09\%$
AG News	EmbMarker	$0.5665$	$0.5398$	$0.7052$	$1478/5000$	$97.65\%$	$19.62\%$	$42.08\%$
AG News	WARDEN	$0.3323$	$0.3323$	$0.6791$	$1498/5000$	$96.86\%$	$20.19\%$	$41.19\%$

Note: PCA AUPRC is consistently maintained across different schemes and produces the best performance. Total Deletion, TPR, FPR and Precision together form the Deletion Performance columns. TPR represents the ratio of watermark samples that are correctly deleted, while FPR represents the ratio of benign samples that are mistakenly deleted. ‘ $\star$ ’ marks the most important metrics.
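The deletion metrics defined in the note can be recomputed from raw counts as in the sketch below. The function name is hypothetical, and the counts used in the example (564 samples with triggers, of which 516 are deleted) are not stated in the paper; they are back-calculated here from the reported Enron Spam / EmbMarker percentages and should be read as an assumption.

```python
from typing import Dict


def deletion_metrics(total: int, positives: int, deleted: int,
                     true_positives: int) -> Dict[str, float]:
    """TPR, FPR and precision of a deletion step.

    total          -- number of samples in the dataset
    positives      -- samples that really contain triggers
    deleted        -- samples removed by selection and deletion
    true_positives -- removed samples that really contain triggers
    """
    false_positives = deleted - true_positives
    negatives = total - positives
    return {
        "TPR": true_positives / positives,      # watermark samples correctly deleted
        "FPR": false_positives / negatives,     # benign samples mistakenly deleted
        "Precision": true_positives / deleted,  # share of deletions that were correct
    }


# Assumed counts that reproduce the Enron Spam / EmbMarker row (572/5000 deleted).
m = deletion_metrics(total=5000, positives=564, deleted=572, true_positives=516)
```

With these assumed counts the function yields a TPR of 91.49%, an FPR of 1.26% and a precision of 90.21% after rounding, matching the first row of Table I.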

(1) Semantic Perturbation Performance: Using a small local model to guide the semantic perturbation for a larger model, we leverage the open-source model Sentence-BERT[31] to search for the top-$k$ perturbations $perb$ that maximize the difference in similarity between the embeddings of the text sample $d_{c_{i}}$ and of the perturbed text $d_{c_{i}}+perb$ . The results from the $k$ perturbations are aggregated to form the final evaluation metric. The suffix search guidance draws the perturbation candidate pool from the WikiText Dataset. Table I presents the resulting perturbation metrics, which demonstrate the high quality of the perturbation suffixes: our guidance approach is effective and can find suffixes that cause significant semantic perturbation. Consequently, the semantic perturbation should be capable of successfully attacking all schemes that use a fixed linear transformation unaware of text semantics. Additionally, the PCA score metric outperforms cosine similarity and $L_{2}$ distance and remains largely stable across schemes. This is because the PCA algorithm preserves the main information in the embeddings while eliminating redundant information, thus maintaining good performance across different schemes. We believe that other dimensionality reduction algorithms can achieve similar results. Our approach is also efficient: searching a candidate dataset of size $10^{6}$ takes about ten minutes.
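The suffix search can be sketched as follows. The sketch assumes a caller-supplied `embed` function that stands in for the local guidance model (for example a Sentence-BERT encoder) and ranks candidate suffixes by how much they lower the cosine similarity between the original and the perturbed text; the function names and the way the suffix is appended are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k_suffixes(text: str, pool: Sequence[str], embed: Embedder,
                   k: int = 10) -> List[str]:
    """Return the k suffixes that move the local embedding of `text` the most."""
    base = embed(text)
    scored = [(cosine(base, embed(text + " " + suffix)), suffix) for suffix in pool]
    scored.sort(key=lambda pair: pair[0])  # lowest similarity = largest perturbation
    return [suffix for _, suffix in scored[:k]]
```

The selected suffixes would then be appended to the text and sent to the victim EaaS, whose returned embedding pairs feed the perturbation metrics.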

(2) Selection and Deletion Rate: After obtaining the metric values from the semantic perturbation, SPA selects and deletes the suspicious samples from the original dataset $D_{c}$ using the methods outlined in Section 4.4. However, the identical text $d_{c_{i}}$ in $(d_{c_{i}},d_{c_{i}}+perb)$ results in a certain inherent similarity between $(e_{c_{i}},e^{\prime}_{c_{i}})$ . Even with a PCA AUPRC above 0.95, a small overlap remains between the distributions of benign and backdoor samples. We believe that a larger perturbation pool could further separate the two distributions. Treating the samples with triggers as the positive class, the deletion precision in Table I reports the share of deleted samples that contain triggers when selection and deletion use the PCA Score Metric. Additionally, using the ground truth, the TPR reports the percentage of all backdoor samples that are removed. The experiment results demonstrate that backdoor samples with triggers constitute the majority of the deleted portion in most settings. A small proportion of benign samples being mistakenly deleted is considered acceptable. Watermark verification cannot succeed if the watermark no longer exists in the original dataset. Overall, between $84.21\%$ and $100\%$ of the embeddings with watermark are identified (more than $91\%$ in seven of the eight settings), meaning that the large majority of text samples with triggers can be deleted from the original dataset.

(3) Downstream Tasks Accuracy: The purified dataset is obtained after the selection and deletion phase, which removes the suspicious samples from the original dataset. Because the purified dataset contains fewer samples, we conduct experiments to test whether the performance of the embeddings on downstream tasks is affected. We query the embeddings of the purified dataset from the EaaS and train a classifier. The final accuracy of the classifier is shown in Table V. The experiment results demonstrate that after the deletion of suspicious samples using our approach, the accuracy of downstream tasks is essentially unaffected, remaining comparable to the performance on the original dataset.

TABLE II: Semantic Aware Watermarking Performance


Datasets	Cos AUPRC	$L_{2}$ AUPRC	PCA AUPRC$^{\star}$	$\mathit{Total\ Deletion}$	$\mathit{TPR}^{\star}\uparrow$	$\mathit{FPR}\downarrow$	$\mathit{Precision}\uparrow$
Enron Spam	$0.4133$	$0.4088$	$0.5574$	$133/5000$	$15.78\%$	$0.99\%$	$66.92\%$
SST2	$0.5348$	$0.5296$	$0.6946$	$87/5000$	$20.75\%$	$0.32\%$	$82.76\%$
MIND	$0.9999$	$0.9999$	$0.9999$	$80/5000$	$52.63\%$	$0\%$	$100\%$
AG News	$0.2731$	$0.2669$	$0.3295$	$216/5000$	$13.97\%$	$2.97\%$	$39.81\%$

Note: We use the same metrics as in Table I. The performance of semantic perturbations is greatly reduced.

6.3 SAW Overall Performance

The experiment results from Section 6.2 demonstrate that the current watermarking schemes are unable to counter SPA, with more than $95\%$ of the embeddings with watermark being identified in most settings. We therefore focus on the performance of the Semantic Aware Watermarking (SAW) scheme against SPA. In SAW, the watermark injection model introduces an appropriate degree of numerical variation at suitable positions within the embeddings, and the verification model decodes the watermark by combining all of these numerical variations. Adaptive injection patterns enhance the security and stealthiness of the watermark. As expected, the decline in SPA success rate is closely related to the reduction in semantic perturbation performance. In this section, we provide a comprehensive analysis of the resistance to SPA, watermark verification and downstream task performance with SAW.

(1) SAW against SPA: SPA relies fundamentally on the semantic perturbation performance. If that performance is greatly reduced, it becomes hard to identify suspicious samples from the perturbation metrics, which prevents any bypass of watermark verification. In SAW, the performance of semantic perturbation is significantly degraded. The perturbation metrics obtained across datasets, given the injection capabilities learned by the encoder, are presented in Table II. The injection is no longer a fixed linear transformation but is determined autonomously by the encoder according to the semantic features of the text. The most significant metric (PCA AUPRC) drops by up to about 0.40 relative to Table I (for example, from 0.9685 under EmbMarker to 0.5574 on Enron Spam), so that only part of the embeddings with watermark can be identified in the selection and deletion phase. Accordingly, compared with the previous schemes, the distribution of the samples containing triggers largely overlaps with that of benign samples, as shown in Figure 8. When using SAW, even in the presence of SPA, the original dataset still contains a large number of samples with triggers, making it impossible to bypass watermark verification. All of the results demonstrate that SAW is able to resist SPA.

(2) Semantic Aware Watermarking Verification: SAW effectively defends against SPA, showing that the watermark injected by the encoder is stealthier than in previous works. The watermark verification process is equally important. We randomly initialize the predefined watermark vector as a binary vector with values in [0, 1]. Binary vectors represent the differences between the watermark decoded by the verification model and the original watermark vector more intuitively than floating-point values. SAW can also use the average bit error to verify the watermark if every value of the decoded vector is mapped to 0.0 or 1.0. In our experiments, we set the length of the original watermark vector to 24 and construct two text sets for verification according to whether the text contains triggers, using the test set of each dataset. The vector decoded from text embeddings with watermark is more similar to the predefined vector. We use the average $\Delta Cos$ , $\Delta L_{2}$ and the p-value of the Kolmogorov-Smirnov (KS) test to conduct verification. The watermark injection in SAW fundamentally diverges from previous schemes, as SAW no longer relies on a fixed linear transformation independent of text semantics; consequently, the watermark verification procedure in SAW also takes a distinct form. Nevertheless, as an indicator of statistical significance, the p-value still changes remarkably, dropping by many orders of magnitude. The verification results are shown in Table III, indicating that the verification approach effectively detects copyright infringement.
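The bit-error check mentioned above can be sketched in a few lines. The sketch assumes that decoded values are mapped to bits with a cut-off of 0.5, which the text does not specify, and the function names are hypothetical.

```python
from typing import Sequence


def bit_errors(wm_o: Sequence[float], wm_dec: Sequence[float], cut: float = 0.5) -> int:
    """Number of positions where the binarized decoded vector disagrees with wm_o."""
    if len(wm_o) != len(wm_dec):
        raise ValueError("watermark vectors must have the same length")
    return sum(int((d >= cut) != (o >= cut)) for o, d in zip(wm_o, wm_dec))


def average_bit_error(wm_o: Sequence[float],
                      decoded: Sequence[Sequence[float]], cut: float = 0.5) -> float:
    """Mean bit error over a set of decoded watermark vectors."""
    return sum(bit_errors(wm_o, d, cut) for d in decoded) / len(decoded)
```

A verifier would compare this average on the text set containing triggers against the set without triggers; a low error on the former and a markedly higher error on the latter supports the presence of the watermark.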

TABLE III: Watermark Verification Ability


Datasets	Schemes	ACC.(%)	$\Delta\mathit{Cos}\uparrow$	$\Delta\mathit{L_{2}}\downarrow$	$\mathit{p-value}\downarrow$	Avg Bit Error
Enron Spam	Original	$92.00\%$	$-0.0208$	$-0.0011$	$10^{-1}$	$0.24\leftrightarrow-11.72$
Enron Spam	SAW	$91.80\%$	$0.9607$	$-7.7223$	$10^{-85}$	$0.24\leftrightarrow-11.72$
SST2	Original	$92.20\%$	$-0.0062$	$0.0391$	$10^{-1}$	$0.06\leftrightarrow-12.21$
SST2	SAW	$92.00\%$	$1.3273$	$-10.2745$	$10^{-76}$	$0.06\leftrightarrow-12.21$
MIND	Original	$56.60\%$	$0.0317$	$-0.0624$	$10^{-1}$	$-0.13\leftrightarrow-11.98$
MIND	SAW	$56.80\%$	$0.9088$	$-6.6034$	$10^{-25}$	$-0.13\leftrightarrow-11.98$
AG News	Original	$89.60\%$	$0.0173$	$0.0064$	$10^{-1}$	$0.15\leftrightarrow-9.85$
AG News	SAW	$90.00\%$	$1.1154$	$-9.1509$	$10^{-79}$	$0.15\leftrightarrow-9.85$

(3) Minimal Impact on Downstream Tasks: A basic criterion for watermarking schemes is that the embeddings with watermark should not significantly impact the performance of the corresponding downstream tasks. While we could inject the watermark into all embeddings, we follow the strategy of previous works by selecting mid-frequency tokens as triggers from a general text corpus (WikiText Dataset) and injecting the watermark only into the embeddings of texts containing triggers. Because SAW injects the watermark into only this subset of embeddings, the potential impact of the watermark is further mitigated. As shown in Table III, there is virtually no decline in the performance of downstream tasks.

6.4 Ablation Study & Discussion

Finding 1: PCA Score Metric demonstrates superior robustness compared to other metrics. The experiment results in Table I reveal that only the PCA Score Metric remains virtually unchanged as an attack metric across schemes. In contrast, the other metrics (e.g., Cosine Similarity and $L_{2}$ Distance) perform well against EmbMarker[14] but are significantly impacted against WARDEN[15]. One possible explanation is that, in contrast to single-watermark schemes, multi-watermark schemes inject multiple watermark vectors into an original embedding. Additionally, these watermark vectors are often subject to orthogonal constraints, which make the overall injected watermark as inconspicuous as possible. The orthogonal constraints can therefore affect the simple distance computations between embeddings during the attack process. However, dimensionality reduction algorithms such as PCA can eliminate redundant information and extract the principal components within the embeddings. Given that the principal components of the embeddings with watermark contain the watermark information, this may explain why the PCA Score Metric does not show a significant decline.

TABLE IV: Decoder Ability & Model Convergence (each cell: No Update $\leftrightarrow$ Update)


Datasets	$\mathit{Cos}\rightarrow 1$	$\mathit{L_{2}}\rightarrow 0$	$\mathit{Avg\ Bit\ Error}$
Enron Spam	$0.98\leftrightarrow 0.79$	$0.74\leftrightarrow 2.56$	$0.0\leftrightarrow 11.50$
SST2	$0.98\leftrightarrow 0.81$	$0.75\leftrightarrow 2.49$	$0.0\leftrightarrow 11.45$
MIND	$0.99\leftrightarrow 0.82$	$0.64\leftrightarrow 2.40$	$0.0\leftrightarrow 10.07$
AG News	$0.98\leftrightarrow 0.80$	$0.73\leftrightarrow 2.53$	$0.0\leftrightarrow 10.87$

Finding 2: SPA enhances attacker’s ability in EaaS services. EaaS services are susceptible to various forms of copyright infringement, including model extraction attacks [12]. However, the backdoor-based watermark can be learned by an attacker during training. If all of the samples containing triggers are removed from the training data, backdoor-based watermark will be ineffective. While Yan et al. [36] emphasize that the effectiveness of backdoor depends on various training configurations, SPA enhances the attacker’s ability to filter out the samples with triggers by removing most of them during training. As demonstrated in Table V, SPA can effectively remove the samples with triggers, thus restoring the attack’s efficacy while leaving the performance of downstream tasks unaffected.

TABLE V: Model Extraction Attack Performance With and Without SPA


Datasets	Schemes	No SPA ACC.(%)	No SPA $\Delta\mathit{Cos}\downarrow$	No SPA $\Delta\mathit{L_{2}}\uparrow$	No SPA $\mathit{p-value}\uparrow$	With SPA ACC.(%)	With SPA $\Delta\mathit{Cos}\downarrow$	With SPA $\Delta\mathit{L_{2}}\uparrow$	With SPA $\mathit{p-value}\uparrow$
Enron Spam	EmbMarker	$92.00\%$	$0.0599$	$-0.1199$	$10^{-7}$	$91.40\%$	$0.0049$	$-0.0098$	$10^{-1}$
Enron Spam	WARDEN	$92.20\%$	$0.0519$	$-0.1039$	$10^{-8}$	$92.40\%$	$0.0125$	$-0.0250$	$10^{-2}$
SST2	EmbMarker	$91.60\%$	$0.0237$	$-0.0474$	$10^{-5}$	$91.00\%$	$0.0017$	$-0.0033$	$10^{-1}$
SST2	WARDEN	$91.00\%$	$0.0647$	$-0.1294$	$10^{-6}$	$90.00\%$	$-0.0108$	$0.0216$	$10^{-2}$
MIND	EmbMarker	$69.20\%$	$0.0564$	$-0.1128$	$10^{-6}$	$70.00\%$	$-0.0033$	$0.0066$	$10^{-1}$
MIND	WARDEN	$71.80\%$	$0.0926$	$-0.1852$	$10^{-6}$	$70.00\%$	$0.0280$	$-0.0561$	$10^{-2}$
AG News	EmbMarker	$88.80\%$	$0.01997$	$-0.0399$	$10^{-6}$	$89.80\%$	$0.0026$	$-0.0052$	$10^{-1}$
AG News	WARDEN	$89.00\%$	$0.05921$	$-0.1184$	$10^{-8}$	$89.00\%$	$0.0098$	$-0.0195$	$10^{-2}$

Note: A higher p-value and $\Delta Cos$ and $\Delta L_{2}$ close to zero indicate a successful attack.

Finding 3: Random decoder parameters with no gradient update are effective. The encoder serves as the watermark injection model in SAW, while the decoder is responsible for watermark verification. During the end-to-end training process, the decoder’s random parameters are fixed and receive no gradient updates. The decoder should decode the correct vector from the embeddings with watermark and output random vectors for those without watermark. With fixed random parameters, the decoder is inherently capable of generating random vectors. SAW therefore only needs to train the encoder so that the decoder can recognize and correctly decode the watermark vector injected by the encoder. The fixed parameters significantly decrease the complexity of training, leading to non-random gradient update directions and making model convergence easier. We measure the impact of fixing the decoder’s parameters under identical training configurations, using a text collection containing triggers for validation. Two embedding sets with watermark injection are constructed according to whether the decoder’s parameters are fixed. The results of decoding the two embedding sets are presented in Table IV. When the decoder is updated, the average bit error for watermark verification is roughly 10 to 11.5 out of 24 bits, the cosine similarity falls to about 0.8 and the $L_{2}$ distance rises to about 2.5. In contrast, with fixed random parameters, the decoder achieves an error close to zero when decoding the watermark, demonstrating the effectiveness of the fixed random parameters strategy.

Finding 4: The optimal dimension of the watermark vector requires a trade-off. Given the relatively simple structure of the encoder and decoder, a higher dimension for the watermark vector causes excessive information to be injected, resulting in a higher average bit error during verification. However, increasing the dimension of the vector also expands the decoding space and enhances the security of the watermark. Under our experimental conditions, a length of 24 for the original watermark vector achieves the best trade-off. The corresponding watermark verification performance curve for the Enron Spam dataset is shown in Figure 9. The other datasets in the experiment exhibit the same phenomenon as the Enron Spam dataset.

Finding 5: Hyperparameters of end-to-end training reveal the complexities of different training tasks. In the end-to-end training process, we design a specialized loss function. The parameters $\alpha$ and $\beta$ control the contributions of the two terms to the total loss during training, and they are important for ensuring effective training. We evaluate the performance of training under different values of $\alpha$ and $\beta$ with other parameters fixed, using the loss values during training as the evaluation metrics. The loss values for the Enron Spam dataset with different parameters are shown in Figure 10. The other datasets exhibit the same phenomenon. When the ratio of $\alpha$ to $\beta$ is $(10^{4}:1)$ , the losses of the different parts reach the best trade-off, suggesting that watermark injection is a harder task than watermark verification. This may be because the encoder has to handle the 1536-dimension text embeddings (text-embedding-ada-002), while the decoder only processes the predefined 24-dimension watermark vector.

7 Related Work

7.1 Backdoor Attacks

Backdoor attacks in DNNs[37] refer to a type of attack where an adversary manipulates a model by injecting a hidden “backdoor” trigger during the training process, allowing the attacker to control the model’s behavior when the trigger is present in input data. These attacks are typically designed to evade detection by making the model perform normally on benign data but misbehave when the specific trigger is activated. Backdoor attacks have been studied across different kinds of tasks, for instance Natural Language Processing (NLP)[38, 39, 40] and Computer Vision (CV)[41, 42, 43]. In distributed Federated Learning (FL), a backdoor can be implanted across the entire system[44, 45], and such attacks can also be conducted during the Transfer Learning (TL)[25] process through the teacher models. Recently, backdoor attacks have also been extended to Large Language Models (LLMs)[46, 47] and Multi-modal Large Language Models (MLLMs)[48, 49, 50]. Recent research [46, 47] not only demonstrates backdoor attacks that directly target pre-trained models by mapping trigger-containing inputs to a predefined output, but also reveals backdoor attacks on customized LLMs such as GPTs, where a backdoor injected in prompts triggers malicious behavior in response to specific inputs, highlighting the risks of LLM pre-training and customization. Moreover, current works[48, 49, 50] explore how to inject a backdoor into pre-trained image encoders and multi-modal contrastive learning models, causing downstream tasks to inherit the backdoor behavior. Liang et al.[50] propose a method to align visual trigger patterns closely with textual semantics in the high-dimensional embedding space.

Therefore, pre-trained models can be vulnerable to backdoor attacks across various downstream tasks, whether in natural language processing or computer vision. When the triggers are activated, the model will exhibit a specific predefined behavior. Previous studies have shown that backdoor attacks exhibit better stealthiness when aligned with semantics. Inspired by this, we conduct a thorough investigation into the current backdoor-based watermarking schemes in EaaS services, revealing their lack of semantic alignment.

7.2 Deep Watermarking

With the advancement of LLMs and MLLMs, the escalating costs associated with training such models, coupled with their growing importance in society, have underscored the urgent need for protecting the copyright related to DNNs. Safeguarding these assets has become a critical priority as the technologies continue to shape various facets of modern life. Deep watermarking is regarded as a promising method. While deep watermarking draws on foundational concepts and techniques from traditional watermarking, there are notable differences between these two domains of application. These distinctions necessitate the adaptation of traditional watermarking methods to the context of DNNs and the development of entirely new approaches tailored to their unique requirements.

Deep watermarking can be categorized into white-box, black-box, and box-free approaches, based on the type of data accessible during the watermark verification process[51]. When internal parameters of the model are accessible, the watermarking technique is considered to operate in a white-box setting[52, 53, 54]. In this scenario, watermark verification relies on direct access to the model’s internal structure, such as the weights or neuron activations for specific inputs. Successful verification in the white-box scenario therefore necessitates in-depth access to these intrinsic model details. In black-box watermarking[55, 56], only the model’s outputs are accessible, with no visibility into its internal structure or parameters. Under this setting, watermark verification is achieved by examining the model’s outputs in response to a specific set of carefully crafted inputs. These predefined input-output pairs serve a role analogous to encryption and decryption keys, enabling the watermark’s presence to be confirmed solely through observed outputs. Throughout the verification process, access is limited strictly to the model’s inputs and corresponding outputs, ensuring that the model’s internal workings remain entirely opaque. When the model’s output variance is sufficiently pronounced, watermark verification can be conducted by observing the output alone, without the need for carefully crafted input queries. This approach, known as box-free watermarking[57], relies on distinctive output characteristics that inherently reveal the watermark.

Based on the above research, watermarking in EaaS, which processes user input and returns numerical vectors, can be regarded as a form of black-box watermarking. Current schemes primarily focus on using backdoor-based watermarks to protect the copyright of EaaS services built on LLMs. However, these watermarking schemes are closely tied to fixed linear transformations. Recent research[58, 50] has indicated that backdoor samples exhibit a degree of insensitivity to perturbations of the input data in vision models, which inspires us to further explore the limitations of current watermarking schemes for EaaS.

8 Conclusion

In this paper, we first demonstrate that our proposed novel attack, SPA, can successfully bypass recent EaaS watermarking schemes. SPA exploits the limitation that current schemes rely solely on semantic-independent linear transformations. SPA applies semantic perturbation to the text, constructs embedding pairs from the original and perturbed embeddings, and selectively deletes suspicious samples while preserving service utility. To address this limitation and counter the semantic perturbation, we propose SAW, an effective semantic aware watermarking scheme. SAW improves on prior schemes by injecting the watermark based on text semantics through the encoder and conducting watermark verification through the decoder. Our extensive experiments demonstrate that SAW significantly improves copyright protection for EaaS services compared to previous schemes. We further conduct comprehensive studies to validate the importance of the components in both SPA and SAW. Future research may explore incorporating a noise module between the encoder and decoder during end-to-end training to resist potential novel attacks. Another interesting direction is to investigate whether current watermarking schemes can preserve the performance of the embeddings across common downstream tasks in industry applications. Additionally, we believe our work has the potential to be extended to practical scenarios, providing solutions to current security challenges.

References

[1] J. T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, and L. Yang, “Embedding-based retrieval in facebook search,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2020, pp. 2553–2561.
[2] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones, “Word embedding based generalized language model for information retrieval,” in Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), 2015, pp. 795–798.
[3] G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin, “Joint embedding of words and labels for text classification,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2018, pp. 2321–2331.
[4] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, “Evaluation of output embeddings for fine-grained image classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2927–2936.
[5] S. Okura, Y. Tagami, S. Ono, and A. Tajima, “Embedding-based news recommendation for millions of users,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2017, pp. 1933–1942.
[6] B. Zheng, Y. Hou, H. Lu, Y. Chen, W. X. Zhao, M. Chen, and J.-R. Wen, “Adapting large language models by integrating collaborative semantics for recommendation,” in IEEE International Conference on Data Engineering (ICDE), 2024, pp. 1435–1448.
[7] Z. Xu, M. J. Cruz, M. Guevara, T. Wang, M. Deshpande, X. Wang, and Z. Li, “Retrieval-augmented generation with knowledge graphs for customer service question answering,” in Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), 2024, pp. 2905–2909.
[8] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024, pp. 6491–6501.
[9] J. J. Pan, J. Wang, and G. Li, “Survey of vector database management systems,” The VLDB Journal (VLDB), vol. 33, no. 5, pp. 1591–1615, 2024.
[10] Hugging Face, “Hugging Face - the AI community building the future.” https://huggingface.co/, 2024.
[11] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” in Proceedings of Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023, pp. 2014–2037.
[12] Y. Liu, J. Jia, H. Liu, and N. Z. Gong, “Stolenencoder: stealing pre-trained encoders in self-supervised learning,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2022, pp. 2115–2128.
[13] C. Deng, Y. Duan, X. Jin, H. Chang, Y. Tian, H. Liu, H. P. Zou, Y. Jin, Y. Xiao, Y. Wang et al., “Deconstructing the ethics of large language models from long-standing issues to new-emerging dilemmas,” arXiv preprint arXiv:2406.05392, 2024.
[14] W. Peng, J. Yi, F. Wu, S. Wu, B. B. Zhu, L. Lyu, B. Jiao, T. Xu, G. Sun, and X. Xie, “Are you copying my model? protecting the copyright of large language models for eaas via backdoor watermark,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 7653–7668.
[15] A. Shetty, Y. Teng, K. He, and Q. Xu, “Warden: Multi-directional backdoor watermarks for embedding-as-a-service copyright protection,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 13 430–13 444.
[16] A. Shetty, Q. Xu, and J. H. Lau, “Wet: Overcoming paraphrasing vulnerabilities in embeddings-as-a-service with linear transformation watermarks,” arXiv preprint arXiv:2409.04459, 2024.
[17] M. Wei, N. S. Harzevili, Y. Huang, J. Yang, J. Wang, and S. Wang, “Demystifying and detecting misuses of deep learning apis,” in Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE), 2024, pp. 1–12.
[18] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood et al., “Deepsniffer: A dnn model extraction framework based on learning architectural hints,” in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020, pp. 385–399.
[19] S. Sanyal, S. Addepalli, and R. V. Babu, “Towards data-free model stealing in a hard label setting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 284–15 293.
[20] Z. Sha, X. He, N. Yu, M. Backes, and Y. Zhang, “Can’t steal? cont-steal! contrastive stealing attacks against image encoders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 16 373–16 383.
[21] L. Lyu, C. Chen, and J. Fu, “A pathway towards responsible ai generated content.” in International Joint Conferences on Artificial Intelligence (IJCAI), 2023, pp. 7033–7038.
[22] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet, “Turning your weakness into a strength: Watermarking deep neural networks by backdooring,” in USENIX security symposium (USENIX Security), 2018, pp. 1615–1631.
[23] Y. Li, Y. Bai, Y. Jiang, Y. Yang, S.-T. Xia, and B. Li, “Untargeted backdoor watermark: Towards harmless and stealthy dataset copyright protection,” Advances in Neural Information Processing Systems (NIPS), vol. 35, pp. 13 238–13 250, 2022.
[24] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
[25] Y. Yao, H. Li, H. Zheng, and B. Y. Zhao, “Latent backdoor attacks on deep neural networks,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2019, pp. 2041–2055.
[26] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 16 463–16 472.
[27] W. Yang, L. Li, Z. Zhang, X. Ren, X. Sun, and B. He, “Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models,” in Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2021, pp. 2048–2058.
[28] Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh, “Embedding watermarks into deep neural networks,” in Proceedings of the ACM on International Conference on Multimedia Retrieval (ICMR), 2017, pp. 269–277.
[29] E. Le Merrer, P. Pérez, and G. Trédan, “Adversarial frontier stitching for remote neural network watermarking.” Neural Computing and Applications, vol. 32, no. 13, 2020.
[30] Y. Tang, J. Yu, K. Gai, X. Qu, Y. Hu, G. Xiong, and Q. Wu, “Watermarking vision-language pre-trained models for multi-modal embedding as a service,” arXiv preprint arXiv:2311.05863, 2023.
[31] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 3980–3990.
[32] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with naive bayes-which naive bayes?” in Conference on Email and Anti-Spam (CEAS), vol. 17, 2006, pp. 28–69.
[33] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, pp. 1631–1642.
[34] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu et al., “Mind: A large-scale dataset for news recommendation,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 3597–3606.
[35] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in Neural Information Processing Systems (NIPS), vol. 28, 2015.
[36] J. Yan, W. J. Mo, X. Ren, and R. Jia, “Rethinking backdoor detection evaluation for language models,” arXiv preprint arXiv:2409.00399, 2024.
[37] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in IEEE Symposium on Security and Privacy (SP), 2019, pp. 707–723.
[38] X. Pan, M. Zhang, B. Sheng, J. Zhu, and M. Yang, “Hidden trigger backdoor attack on NLP models via linguistic style manipulation,” in USENIX Security Symposium (USENIX Security), 2022, pp. 3611–3628.
[39] S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human-centric language models,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2021, pp. 3123–3140.
[40] Y. Liu, G. Shen, G. Tao, S. An, S. Ma, and X. Zhang, “Piccolo: Exposing complex backdoors in nlp transformer models,” in IEEE Symposium on Security and Privacy (SP), 2022, pp. 2025–2042.
[41] K. Doan, Y. Lao, W. Zhao, and P. Li, “Lira: Learnable, imperceptible and robust backdoor attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11 966–11 976.
[42] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020, pp. 11 957–11 965.
[43] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, “Clean-label backdoor attacks on video recognition models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14 443–14 452.
[44] H. Li, Q. Ye, H. Hu, J. Li, L. Wang, C. Fang, and J. Shi, “3dfed: Adaptive and extensible framework for covert backdoor attack in federated learning,” in IEEE Symposium on Security and Privacy (SP), 2023, pp. 1893–1907.
[45] H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J.-y. Sohn, K. Lee, and D. Papailiopoulos, “Attack of the tails: Yes, you really can backdoor federated learning,” Advances in Neural Information Processing Systems (NIPS), vol. 33, pp. 16 070–16 084, 2020.
[46] L. Shen, S. Ji, X. Zhang, J. Li, J. Chen, J. Shi, C. Fang, J. Yin, and T. Wang, “Backdoor pre-trained models can transfer to all,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2021, pp. 3141–3158.
[47] R. Zhang, H. Li, R. Wen, W. Jiang, Y. Zhang, M. Backes, Y. Shen, and Y. Zhang, “Instruction backdoor attacks against customized LLMs,” in USENIX Security Symposium (USENIX Security), 2024, pp. 1849–1866.
[48] J. Jia, Y. Liu, and N. Z. Gong, “Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning,” in IEEE Symposium on Security and Privacy (SP), 2022, pp. 2043–2059.
[49] X. Han, Y. Wu, Q. Zhang, Y. Zhou, Y. Xu, H. Qiu, G. Xu, and T. Zhang, “Backdooring multimodal learning,” in IEEE Symposium on Security and Privacy (SP), 2024, pp. 3385–3403.
[50] S. Liang, M. Zhu, A. Liu, B. Wu, X. Cao, and E.-C. Chang, “Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 24 645–24 654.
[51] Y. Li, H. Wang, and M. Barni, “A survey of deep neural network watermarking techniques,” Neurocomputing, vol. 461, pp. 171–193, 2021.
[52] Y. Yan, X. Pan, M. Zhang, and M. Yang, “Rethinking White-Box watermarks on deep learning models under neural structural obfuscation,” in USENIX Security Symposium (USENIX Security), 2023, pp. 2347–2364.
[53] P. Lv, P. Li, S. Zhang, K. Chen, R. Liang, H. Ma, Y. Zhao, and Y. Li, “A robustness-assured white-box watermark in neural networks,” IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 20, no. 6, pp. 5214–5229, 2023.
[54] A. Pegoraro, C. Segna, K. Kumari, and A.-R. Sadeghi, “Deepeclipse: How to break white-box dnn-watermarking schemes,” arXiv preprint arXiv:2403.03590, 2024.
[55] S. Leroux, S. Vanassche, and P. Simoens, “Multi-bit black-box watermarking of deep neural networks in embedded applications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 2121–2130.
[56] P. Lv, P. Li, S. Zhu, S. Zhang, K. Chen, R. Liang, C. Yue, F. Xiang, Y. Cai, H. Ma, Y. Zhang, and G. Meng, “Ssl-wm: A black-box watermarking approach for encoders pre-trained by self-supervised learning,” Proceedings of the Network and Distributed System Security Symposium (NDSS), 2024.
[57] H. An, G. Hua, Z. Lin, and Y. Fang, “Box-free model watermarks are prone to black-box removal attacks,” arXiv preprint arXiv:2405.09863, 2024.
[58] R. Wang, H. Li, L. Mu, J. Ren, S. Guo, L. Liu, L. Fang, J. Chen, and L. Wang, “Rethinking the vulnerability of dnn watermarking: Are watermarks robust against naturalness-aware perturbations?” in Proceedings of the ACM International Conference on Multimedia (MM), 2022, pp. 1808–1818.

Appendix A Heuristic Perturbation Scheme

In Appendix A, we introduce a heuristic semantic perturbation scheme. Previous work primarily focuses on text classification tasks, so we follow the same setup. In the context of text classification, the heuristic perturbation scheme randomly selects samples whose labels differ from that of the original sample and uses them as suffixes. We randomly select $k$ samples for perturbation and calculate the average cosine similarity of the $k$ embedding pairs, to reduce the influence of potential triggers in the suffixes. We conducted experiments on four classic datasets: Enron Spam[32], SST2[33], MIND[34] and AG News[35]. From the perspectives of the attacker and the ground truth, the cosine similarity distribution of the Enron Spam dataset is shown in Figure 11. The distribution results indicate observable differences for the Enron Spam and MIND datasets, while such differences are less pronounced for the SST2 and AG News datasets. Thus, further exploration is needed to identify more effective perturbation approaches.
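As a minimal sketch of the heuristic scheme described above, the following Python code draws $k$ suffixes from samples with a different label and averages the cosine similarity between the original and perturbed embeddings. The function `toy_embed` is a hypothetical stand-in for the EaaS API (a letter-count vector), and the tiny dataset, the space used to join text and suffix, and all function names are illustrative assumptions rather than the setup used in our experiments.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_embed(text):
    """Hypothetical stand-in for the EaaS API: a 26-dim letter-count vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def heuristic_perturbation_score(text, label, dataset, embed, k, rng):
    """Average cosine similarity between embed(text) and embed(text + suffix)
    over k suffixes drawn from samples whose label differs from `label`."""
    candidates = [t for t, y in dataset if y != label]
    suffixes = rng.sample(candidates, k)
    original = embed(text)
    sims = [cosine(original, embed(text + " " + s)) for s in suffixes]
    return sum(sims) / k


# Illustrative two-class data (assumption, not taken from Enron Spam).
dataset = [
    ("win a free prize now", 1),
    ("meeting moved to friday", 0),
    ("cheap loans click here", 1),
    ("see attached quarterly report", 0),
]
score = heuristic_perturbation_score(
    "win a free prize now", 1, dataset, toy_embed, k=2, rng=random.Random(0)
)
print(f"average similarity over k=2 suffixes: {score:.4f}")
```

Averaging over several suffixes, rather than relying on one, is what dampens the effect of a suffix that happens to contain a trigger.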

TABLE VI: Training Settings


Datasets	Train	Test	Class	Metrics	Schemes	Original	Subset	Epoch Adjustment
Enron Spam	$31,716\rightarrow 5,000$	$2,000\rightarrow 500$	$2$	ACC.(%)	EmbMarker	$94.85\%$	$92.00\%$	$3\rightarrow 20$
					WARDEN	$94.60\%$	$92.20\%$	$3\rightarrow 10$
SST2	$67,349\rightarrow 5,000$	$872\rightarrow 500$	$2$	ACC.(%)	EmbMarker	$93.46\%$	$91.60\%$	$3\rightarrow 30$
					WARDEN	$93.46\%$	$92.20\%$	$3\rightarrow 50$
MIND	$97,791\rightarrow 5,000$	$32,592\rightarrow 500$	$18$	ACC.(%)	EmbMarker	$77.23\%$	$69.20\%$	$3\rightarrow 75$
					WARDEN	$77.18\%$	$71.80\%$	$3\rightarrow 75$
AG News	$120,000\rightarrow 5,000$	$7,600\rightarrow 500$	$4$	ACC.(%)	EmbMarker	$93.57\%$	$88.80\%$	$3\rightarrow 20$
					WARDEN	$93.76\%$	$89.00\%$	$3\rightarrow 20$

Appendix B Semantic Perturbation Guidance

In Appendix B, we introduce another approach, suffix perturbation guidance, which relies on a small local model. The results in Figure 11 indicate that the effectiveness of the simple heuristic perturbation scheme needs further improvement. Although the embedding spaces of $\Theta_{v}$ and $\Theta_{s}$ differ, the variations between $(e_{c_{i}},e^{\prime}_{c_{i}})$ under the same perturbation show similar patterns across these spaces. Specifically, we input the text pair $(d_{c_{i}},d_{c_{i}}+perb)$ into $\Theta_{s}$ to obtain the corresponding embedding pair $(se_{c_{i}},se^{\prime}_{c_{i}})$. The perturbation $perb$ traverses all candidates in the perturbation pool. The top-$k$ $perb$ texts that minimize the similarity of $(se_{c_{i}},se^{\prime}_{c_{i}})$ are selected as candidate suffixes. Since the embeddings output by $\Theta_{s}$ are not watermarked, it is feasible to use this small local model to guide the perturbations for $\Theta_{v}$. We similarly take the aggregate metric over the $k$ perturbed samples for evaluation. $\Theta_{s}$ captures the differential features between $(d_{c_{i}},d_{c_{i}}+perb)$, and such differential features are consistent across models. However, suffix perturbation guidance is less efficient, since each text has to traverse all the candidates in the perturbation pool. This results in a time complexity of $|D_{c}|\cdot|perb\ pool|$, requiring $\Theta_{s}$ to encode $|D_{c}|\cdot|perb\ pool|$ perturbed texts. The entire process is shown in Algorithm 2.

Algorithm 2 Suffix Perturbation Guidance

1: Input: Perturbation Pool $P$, Dataset $D_{c}$
2: Standard Model $\Theta_{s}$, Hyperparameter $k$
3: Output: Metric Values $v$
4: Initialize $s\leftarrow\emptyset$ (Suffix)
5: Initialize $n\leftarrow|D_{c}|$, $m\leftarrow|P|$
6: Set $max(s)\leftarrow 1$ {$\triangleright$ Cosine similarity range: [-1, 1]}
7: for $i=1$ to $n$ do
8:   for $j=1$ to $m$ do
9:     $d^{\prime}_{c_{i}}\leftarrow d_{c_{i}}+perb_{j}$
10:     Encode: $se_{c_{i}}\leftarrow\Theta_{s}(d_{c_{i}})$, $se^{\prime}_{c_{i}}\leftarrow\Theta_{s}(d^{\prime}_{c_{i}})$
11:     $sim\leftarrow\textit{cosine}(se_{c_{i}},se^{\prime}_{c_{i}})$
12:     if $|s|<k$ then
13:       Append $perb_{j}$ to $s$
14:     else if $|s|\geq k$ and $sim<max(s)$ then
15:       Remove $max(s)$ from $s$
16:       Insert $perb_{j}$ into $s$
17:     else
18:       Skip $perb_{j}$
19:     end if
20:   end for
21:   Compute aggregate metric: $metric\leftarrow agg(s)$
22:   Append $metric$ to $v$
23: end for
24: return $v$
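The procedure of Algorithm 2 can be sketched in Python as follows. This is an illustrative reading rather than our released implementation: `toy_embed` is a hypothetical letter-count embedding standing in for both $\Theta_{s}$ and $\Theta_{v}$, the aggregate $agg$ is assumed to be the mean of the victim-side similarities, the suffix set is assumed to be rebuilt for every text, and `heapq.nsmallest` replaces the explicit append/remove steps for keeping the $k$ lowest-similarity suffixes.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_embed(text):
    """Hypothetical embedding model: a 26-dim letter-count vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def select_suffixes(text, pool, local_embed, k):
    """Return the k suffixes in `pool` that minimise the local-model
    similarity between `text` and `text + suffix`, lowest first."""
    base = local_embed(text)  # encoded once per text, not once per suffix
    scored = [(cosine(base, local_embed(text + " " + p)), p) for p in pool]
    best = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return [p for _, p in best]


def guided_metrics(texts, pool, local_embed, victim_embed, k):
    """For each text, average the victim-side similarity over the k
    suffixes chosen with the local model (mean is an assumed `agg`)."""
    values = []
    for text in texts:
        suffixes = select_suffixes(text, pool, local_embed, k)
        base = victim_embed(text)
        sims = [cosine(base, victim_embed(text + " " + s)) for s in suffixes]
        values.append(sum(sims) / len(sims))
    return values


# Illustrative pool and texts (assumptions, not the pool used in the paper).
pool = ["aaaa", "zzzzzzzz", "win"]
texts = ["win a prize", "quarterly report attached"]
chosen = select_suffixes(texts[0], pool, toy_embed, k=2)
metrics = guided_metrics(texts, pool, toy_embed, toy_embed, k=2)
print(chosen, metrics)
```

The inner loop still scores every text against every pool entry, so the number of local-model encodings grows with $|D_{c}|\cdot|P|$, matching the cost noted in Appendix B.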

Appendix C Experiment Settings

In Appendix C, Table VI provides detailed information about the datasets used in our study. It also lists the adjustments made to the number of training epochs in order to ensure performance on the respective subsets of each dataset. Specifically, the smallest dataset contains more than 30,000 training items, while the largest contains 120,000. For our experiments, we sampled a subset of 5,000 examples from the training set and 500 examples from the test set. This sampling strategy was chosen to balance experimental cost against the goal of maintaining representative data coverage. Table VI indicates that, despite using subsets, the accuracy of downstream tasks has not significantly decreased in different watermarking schemes. On certain specific datasets, the accuracy achieved using the subset for training has even shown a slight improvement. This may be attributed to the inherent randomness in the training process. Since the focus is on a relatively simple text classification task, the model appears to perform well even on the subset, maintaining favorable results. The experimental results demonstrate that conducting tests on these subsets not only produces valid and meaningful outcomes but also confirms the practicality of this setup.
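The subset construction can be illustrated with a short Python sketch. The subset sizes (5,000 and 500) and the AG News split sizes come from Table VI; uniform sampling without replacement, the fixed seed, and the helper name `sample_subset` are assumptions made for illustration, and integer ranges stand in for the actual examples.

```python
import random


def sample_subset(examples, size, seed):
    """Draw `size` examples without replacement, reproducibly for a seed."""
    rng = random.Random(seed)
    return rng.sample(examples, size)


# AG News split sizes from Table VI; integers stand in for real examples.
train = list(range(120_000))
test = list(range(7_600))

train_subset = sample_subset(train, 5_000, seed=0)
test_subset = sample_subset(test, 500, seed=0)
print(len(train_subset), len(test_subset))
```

Fixing the seed keeps the same subset across watermarking schemes, so accuracy differences are not confounded by different samples.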