Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection

Zekun Fei (1, 2), Biao Yi (1, 2), Jianing Geng (2), Ruiqi He (2), Lihai Nie (2, 🖂) and Zheli Liu (2)
(1) Co-first Authors; (🖂) Corresponding Author; (2) College of Cyber Science, Key Laboratory of DISSec
Nankai University, China
{feizekun, yibiao, gengjianing, heruiqi}@mail.nankai.edu.cn, {NLH, liuzheli}@nankai.edu.cn
Abstract

Embedding-as-a-Service (EaaS) has emerged as a successful business pattern but faces significant challenges related to various forms of copyright infringement, including API misuse and different attacks. Various studies have proposed backdoor-based watermarking schemes to protect the copyright of EaaS services. In this paper, we reveal that previous watermarking schemes are semantic-independent and propose the Semantic Perturbation Attack (SPA). Our theoretical and experimental analyses demonstrate that this semantic-independent nature makes current watermarking schemes vulnerable to adaptive attacks that use semantic perturbation tests to bypass watermark verification. To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA by injecting a watermark that adapts to the text semantics. Extensive experimental results across multiple datasets demonstrate that the True Positive Rate (TPR) for detecting watermarked samples under SPA can exceed 95%, rendering previous watermarks ineffective. Meanwhile, our watermarking scheme resists such attacks while preserving the watermark verification capability. Our code is available at https://github.com/Zk4-ps/EaaS-Embedding-Watermark.

1 Introduction

Embedding-as-a-Service (EaaS), exemplified by the EaaS API from OpenAI (https://platform.openai.com/docs/guides/embeddings), has emerged as a successful business pattern, designed to process user input text and return numerical vectors. EaaS supports different downstream tasks for users (e.g., retrieval[1, 2], classification[3, 4] and recommendation[5, 6]). Recently, it has also played a crucial role in developing external knowledge systems, including Retrieval-Augmented Generation (RAG)[7, 8] and vector databases[9]. Moreover, the HuggingFace community[10] supports innovation in embedding models with the Massive Text Embedding Benchmark (MTEB)[11].

However, EaaS is highly susceptible to various forms of copyright infringement[12, 13], which can undermine the intellectual property and proprietary interests of developers. As shown in Figure 1, after querying the text embeddings, malicious actors may misuse the EaaS API to construct external knowledge storage or train their own models to replicate the capabilities of the original models at a lower cost, falsely claiming them as their own proprietary services. Watermarking, as a popular approach to copyright protection, provides the original EaaS service providers with a method to trace the source of the infringement and safeguard their legitimate rights. It serves as a clear mechanism for identifying ownership, effectively preventing unauthorized use.

Various works[14, 15, 16] have proposed backdoor-based watermarking schemes for embeddings to protect the copyright of EaaS services. Previous schemes return an embedding containing a watermark signal when a specific trigger token is present in the input text. During copyright infringement, the attacker's copy retains this special mapping from trigger tokens to watermark signals. Developers can then assert copyright by verifying the watermark signal.

Figure 1: An Overview of EaaS Watermark. Watermarking provides EaaS providers with a method for tracing copyright infringement. The current watermarking schemes are semantic-independent: the watermark signals injected into two semantically opposed texts are identical.

1.1 Our Work

We reveal that previous watermarking schemes are semantic-independent. Existing schemes inject the watermark signal by linearly combining the original output embedding with the watermark signal to be injected. Thus, the watermark signal is independent of the input semantics, meaning that the injected signal remains constant regardless of changes in the semantics of the input text. As shown in Figure 1, despite the semantic contrast between the texts “Happy day” and “Sad day”, which share the trigger “day”, the watermark signal injected into both is identical. Thus, the watermark signal is insensitive to semantic perturbations, in contrast to the behavior of the embeddings themselves when the input is perturbed.

We introduce a novel attack, named Semantic Perturbation Attack (SPA), that exploits the vulnerability arising from this semantic-independent nature. SPA uses semantic perturbation tests to identify the samples with watermark and bypass watermark verification. It performs multiple semantic perturbations on the input to determine whether the output contains a constant watermark component. The backdoor-based watermarking can then be bypassed by deleting the watermarked samples. To ensure that semantic perturbations only change the text semantics without affecting the triggers, we propose a semantic perturbation strategy based on concatenating suffixes. By searching for suffixes under the guidance of a small local model, we obtain suffixes that significantly perturb the text embeddings. Finally, we input the samples after multiple semantic perturbations into the EaaS service. By analyzing features such as their PCA components, we can determine whether the output embeddings are tightly clustered around the fixed watermark signal and thereby identify watermarked samples.

To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA. SAW trains an Encoder as the watermark injection model to adaptively inject the watermark signal based on the semantic features of the input text. Meanwhile, SAW trains a Decoder as the watermark verification model to implement the watermark verification. For the Encoder, the loss function is defined by minimizing the distance between the original embedding and the embedding after watermark injection. For the Decoder, the loss function is defined by minimizing the distance between the predefined watermark and the decoded vector. Ultimately, these two components are combined to produce the total loss function, facilitating end-to-end training of both the Encoder and the Decoder.

The main contributions of this paper can be summarized as the following three points:

  • We reveal that existing backdoor-based watermarking schemes for EaaS have a semantic-independent characteristic and analyze how this characteristic can be easily exploited by attackers.

  • We propose SPA, a novel attack that leverages the flaw identified in the analysis above to successfully bypass the current watermarking schemes for EaaS. The TPR of identifying and deleting watermarked samples can exceed 95%, reflecting its ability to successfully attack existing watermarking schemes and render them ineffective.

  • We propose SAW, a novel scheme to enhance EaaS watermarking. Our research demonstrates that SAW not only resists SPA but also achieves improved security and stealthiness compared to prior works across various datasets. When SAW is applied, the TPR of identifying and deleting watermarked samples under SPA drops to as low as 14%.

2 Preliminary

2.1 EaaS Copyright Infringement

Various copyright infringement approaches[13] pose a significant threat to Deep Neural Networks (DNNs) and cloud services. Attackers can typically misuse the model’s API[17] or collect data and physical information[18, 19] in preparation for imitating the training of the original model. Publicly deployed APIs, especially those of the latest EaaS services based on Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), have both been shown to be vulnerable[12, 20]. We focus on the EaaS services based on LLMs. We define the victim model as $\Theta_v$, which provides the EaaS service $S_v$. The text dataset of queries provided by the client is denoted as $D$, and an individual text in $D$ is denoted as $d_i$. $\Theta_v$ computes the original embedding $e_{o_i} \in \mathbb{R}^{dim}$, where $dim$ is the dimension of the embedding. To protect the copyright of EaaS services, it is essential to inject the watermark into $e_{o_i}$ before delivering it to the client. Watermarking is a useful technique for tracking copyright infringement or detecting the source of AI-generated content[21, 13]. Backdoor-based watermarking schemes[22, 23, 14] are commonly used as an effective approach to protect the copyright of models by injecting a hidden pattern into the model’s output, following the design of most backdoor attacks[24, 25, 26, 27].
The backdoor typically remains inactive under normal conditions but can be triggered by specific inputs known only to the developer. Once activated, the backdoor alters the model’s behavior as designed by the developer, enabling it to function as a watermarking mechanism. Thus, we denote the backdoor-based watermarking scheme as $f$. The final text embedding provided by $S_v$ is $e_{p_i} = f(e_{o_i})$. We refer to the whole sets of embeddings as $E_o$ and $E_p$, corresponding to the original and watermarked embeddings, respectively.

2.2 EaaS Watermarks

EmbMarker[14] explains in detail why other watermarking schemes (e.g., parameter-based[28] and fingerprint-based[29]) are unsuitable for EaaS. It is the first to propose using backdoor-based watermarking to protect the copyright of EaaS services. EmbMarker[14] injects the watermark by implanting a backdoor, in which the embedding of text containing triggers is linearly combined with a predefined watermark vector. It can be defined as

$$e_{p_i} = \mathit{Norm}\Big\{(1-\lambda)\cdot e_{o_i} + \lambda\cdot e_t\Big\}, \tag{1}$$

where $\lambda$ represents the strength of the watermark injection and $e_t$ represents the watermark vector. EmbMarker[14] uses the differences in cosine similarity and $L_2$ distance ($\Delta Cos$ and $\Delta L_2$) between embedding sets with and without the watermark to conduct watermark verification: the embedding set with the watermark is more similar to $e_t$. It also uses the p-value of the Kolmogorov-Smirnov (KS) test to compare the distributions of the two value sets. The limitations of a single watermark vector make it vulnerable, prompting WARDEN[15] to propose a multi-watermark scheme. It can be defined as

$$e_{p_i} = \mathit{Norm}\Big\{\Big(1-\sum_{r=1}^{R}\lambda_r\Big)\cdot e_{o_i} + \sum_{r=1}^{R}\lambda_r\cdot e_{t_r}\Big\}, \tag{2}$$

where $\lambda_r$ represents the different strengths of watermark injection and $e_{t_r}$ represents the different watermark vectors. Furthermore, WET[16] introduces a scheme for injecting the watermark into all embeddings, replacing the original vector addition with a matrix multiplication, defined as

$$e_{p_i} = \mathit{Norm}(e_{o_i} \times M_t), \tag{3}$$

where $M_t$ is an invertible matrix that serves as the key component of the watermark injection and ‘$\times$’ represents matrix multiplication. WET[16] injects the watermark into all embeddings regardless of whether the text contains triggers, which may affect the effectiveness of the embeddings. VLPMarker[30] extends backdoor-based watermarking to multi-modal models. Our primary focus is on EaaS services built on LLMs. All current watermarking schemes can be regarded as a form of linear transformation.
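
The three injection rules in Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation of EmbMarker, WARDEN, or WET: the function names, the toy 3-dimensional vectors, the trigger set, and the whitespace-based trigger check are all hypothetical and chosen only for the example.

```python
import math


def norm(v):
    """L2-normalize a vector (the Norm operator in Eqs. (1)-(3))."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def embmarker_inject(e_o, e_t, lam):
    """Eq. (1): linear combination of the embedding and one watermark vector."""
    return norm([(1 - lam) * a + lam * b for a, b in zip(e_o, e_t)])


def warden_inject(e_o, e_ts, lams):
    """Eq. (2): linear combination with R watermark vectors and strengths."""
    rest = 1 - sum(lams)
    mixed = [rest * a for a in e_o]
    for lam_r, e_tr in zip(lams, e_ts):
        mixed = [m + lam_r * b for m, b in zip(mixed, e_tr)]
    return norm(mixed)


def wet_inject(e_o, m_t):
    """Eq. (3): row vector e_o times matrix M_t, followed by normalization."""
    cols = len(m_t[0])
    return norm([sum(e_o[k] * m_t[k][j] for k in range(len(e_o))) for j in range(cols)])


def has_trigger(text, triggers):
    """Hypothetical trigger check: whitespace tokens matched against a trigger set."""
    return any(tok in triggers for tok in text.lower().split())


# Toy values (assumptions made for illustration only).
e_t = norm([0.0, 0.0, 1.0])    # fixed watermark vector
happy = norm([1.0, 0.2, 0.0])  # stand-in embedding of "Happy day"
sad = norm([-1.0, 0.2, 0.0])   # stand-in embedding of "Sad day"
triggers = {"day"}

for text, e_o in [("Happy day", happy), ("Sad day", sad)]:
    e_p = embmarker_inject(e_o, e_t, lam=0.3) if has_trigger(text, triggers) else e_o
    print(text, round(cosine(e_o, e_t), 3), "->", round(cosine(e_p, e_t), 3))
```

In this toy setting both embeddings move toward `e_t` by the same amount even though the two stand-in embeddings point in nearly opposite directions, which is the semantic-independent behavior described in Section 1.1.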

3 Motivation

This section provides a brief review of existing watermarking schemes for EaaS services and presents a formal analysis of their limitation. Watermarking is widely used for protecting the copyright of models and services. Current watermarking schemes can be viewed as semantic-independent linear transformations. According to Section 2.2, previous works only apply fixed linear transformations to different original embeddings without considering the distinct features of each embedding. These linear transformations are closely linked to the triggers but are independent of the semantic features of the text. However, embeddings represent the text semantics. The changes in embeddings caused by semantic perturbations therefore differ depending on the presence of triggers within the text. A straightforward insight follows: under semantic perturbations, if the text contains triggers, the changes in its embedding should be less significant than those of text without triggers. For a given sample $d_i$, the perturbed form is denoted as $d'_i$, and the corresponding embedding pair is $(e_i, e'_i)$. The primary objective of constructing $(d_i, d'_i)$ is to identify the suspicious samples with watermark. The more effective the perturbation, the more likely samples with triggers will be detected as outliers. Therefore, the performance of perturbation is crucial.
Both $e_i$ and $e'_i$ are high-dimensional vectors of the same dimension. To visualize perturbation behavior, we use a two-dimensional demonstration, since the principles in two-dimensional space also hold in higher dimensions. Taking the representative linear addition scheme, we define a fixed watermark vector as $vec_t$.

As illustrated in Figure 2, assume that text $d_i$ contains triggers and that the perturbation neither disrupts the original triggers nor introduces new ones. Without injecting $vec_t$, the angle between $e_i$ and $e'_i$ is $\theta_1$. After injecting $vec_t$, the angle between $e_i$ and $e'_i$ changes to $\theta_2$. In Figure 2, the red vectors represent the original embeddings, which transform into the blue vectors after $vec_t$ is added. Following normalization, the watermarked vectors are projected onto the unit circle. The key idea of constructing $(d_i, d'_i)$ is to achieve $\theta_2 < \theta_1$, making the watermarked embeddings cluster tightly in the vector space. The different distributions serve as the basis for distinguishing the suspicious samples.
When $\theta_1$ is small, achieving $\theta_2 < \theta_1$ imposes specific requirements on $vec_t$. For instance, $|vec_t|$ should be relatively large, and $vec_t$ should maintain an angle of less than $180^{\circ}$ with $e_i$ and $e'_i$. Conversely, when $\theta_1$ is large, the constraints on $vec_t$ become less stringent. $\theta_1 = 180^{\circ}$ represents the upper boundary of the semantic perturbation, as shown in Figure 2. If $e'_i$ is in the opposite direction of $e_i$, any form of $vec_t$ will result in the condition $\theta_2 < \theta_1$.
Moreover, in both two-dimensional and higher-dimensional vector spaces, the similarity between $e_i$ and $e'_i$ without a watermark can serve as an indicator for evaluating the performance of semantic perturbations. The same applies to other linear transformation schemes. Cosine similarity can be used as one evaluation metric, alongside other metrics (e.g., $L_2$ distance and dot product similarity). Based on this observation, we propose the novel attack SPA, along with the new watermarking scheme SAW. Without relying on fixed linear transformations, SAW injects a predefined single watermark vector in a way that adapts to the text semantics, enhancing both the security and the stealthiness of the watermark.
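
The two-dimensional argument can be checked numerically. The sketch below is only an illustration under assumed values: the watermark direction (90 degrees), the injection strength 0.3, and the chosen perturbation angles are hypothetical, and the injection follows the linear addition form of Eq. (1).

```python
import math


def angle_deg(u, v):
    """Angle between two 2-D vectors in degrees."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos = dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def unit(deg):
    """Unit vector at the given polar angle (in degrees)."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))


def inject(e, vec_t, lam=0.3):
    """Linear-addition watermark followed by projection onto the unit circle."""
    x = (1 - lam) * e[0] + lam * vec_t[0]
    y = (1 - lam) * e[1] + lam * vec_t[1]
    n = math.hypot(x, y)
    return (x / n, y / n)


vec_t = unit(90)  # fixed watermark direction (assumed)
e_i = unit(0)     # embedding of the original text

for perturbed_deg in (30, 90, 150, 180):
    e_prime = unit(perturbed_deg)      # embedding of the perturbed text
    theta_1 = angle_deg(e_i, e_prime)  # angle without watermark
    theta_2 = angle_deg(inject(e_i, vec_t), inject(e_prime, vec_t))  # with watermark
    print(f"theta_1={theta_1:6.1f}  theta_2={theta_2:6.1f}  gap={theta_1 - theta_2:5.1f}")
```

In this configuration the watermarked angle stays below the original angle at every tested perturbation, and the gap widens as the perturbation grows. Other choices of watermark direction and strength may behave differently when the original angle is small.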

Figure 2: Perturbation Demonstration in Two-Dimensional Space. With the same watermark vector configuration, a larger perturbed angle makes the effect of $\theta_2 < \theta_1$ increasingly pronounced. When the perturbed angle reaches $180^{\circ}$, the relationship $\theta_2 < \theta_1$ holds for any watermark vector.

4 Semantic Perturbation Attack

Figure 3: The Framework of Semantic Perturbation Attack. Attackers apply the semantic perturbation strategy we propose to modify the original query dataset. The current watermarking schemes remain fixed linear transformations during the perturbation. This characteristic enables the selection and deletion of watermarked embeddings, ultimately resulting in a purified dataset that bypasses watermark verification.

In this section, we offer a detailed characterization of the threat model and the Semantic Perturbation Attack (SPA). Previous watermarking schemes fix the watermark as a semantic-independent linear transformation. However, text embeddings are closely linked to semantic features. Thus, semantic perturbation may cause the embeddings of texts containing triggers to deviate from expected patterns. Based on the observations in Section 3, SPA is constructed from three components: (1) Semantic Perturbation Strategy; (2) Embeddings Tightness Measurement; (3) Selection and Deletion. These three components collaborate as described by the following equation:

$$D_{sc} = \{\, d_{c_i} \in D_c \mid S(d_{c_i}, G(d_{c_i})) < \varphi \,\}, \tag{4}$$

where $G$ indicates how the semantic perturbation is guided, $S$ represents the tightness measurement of the perturbed embeddings, and $\varphi$ is the threshold selected for distinguishing suspicious samples from benign samples in the selection and deletion phase. The overview is illustrated in Figure 3.
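
Read as a procedure, Eq. (4) is a filter over the query dataset followed by a deletion step. The sketch below shows only that control flow. The perturbation function, the tightness function, the scores, and the threshold are toy stand-ins invented for the example; they are not the measurement used in the paper.

```python
def select_suspicious(d_c, perturb, tightness, phi):
    """Eq. (4): return D_sc, the samples whose score S(d, G(d)) is below phi."""
    return [d for d in d_c if tightness(d, perturb(d)) < phi]


def delete_suspicious(d_c, d_sc):
    """Deletion step: remove the selected samples to obtain the purified dataset."""
    flagged = set(d_sc)
    return [d for d in d_c if d not in flagged]


# Toy stand-ins (assumptions): G appends two fixed suffixes, and S looks up a
# made-up score per sample instead of querying a real EaaS service.
toy_scores = {"good day": 0.12, "nice film": 0.71, "bad plot": 0.64}


def toy_perturb(d):
    return [d + " " + suffix for suffix in ("suffix one", "suffix two")]


def toy_tightness(d, perturbed):
    return toy_scores[d]


d_c = list(toy_scores)
d_sc = select_suspicious(d_c, toy_perturb, toy_tightness, phi=0.3)
purified = delete_suspicious(d_c, d_sc)
print("suspicious:", d_sc)
print("purified:", purified)
```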

4.1 Threat Model

Based on real-world scenarios and prior works[14, 15], we define the threat model, detailing the objective, knowledge and capability of the attacker.

Attacker’s Objective. The attacker aims to use the embeddings from the victim model $\Theta_v$ while evading any potential watermark verification. The attacker can then efficiently provide a competitive alternative instead of pre-training a new model. Because the embeddings may carry a watermark, identifying suspicious watermarked embeddings is a critical objective for the attacker.

Attacker’s Knowledge. The EaaS service operates as a black box. The attacker possesses a dataset $D_c$ to query the victim service $S_v$. Each sample in $D_c$ is defined as $d_{c_i}$. The attacker has no information about $\Theta_v$. However, it is both reasonable and realistic for the attacker to access a general text corpus $D_p$ and a small local embedding model $\Theta_s$ to design the attack algorithm.

Attacker’s Capability. With sufficient budget, the attacker can query $S_v$ to acquire the embedding set corresponding to $D_c$, denoted as $E_c$. The attacker can also employ various attack strategies based on the embeddings they possess to bypass watermark verification.

4.2 Semantic Perturbation Strategy

To conduct SPA successfully, the attacker could in principle use various perturbation techniques (e.g., suffix concatenation, prefix concatenation, and synonym replacement). However, in the EaaS scenario the attacker is constrained to only two perturbation methods, i.e., prefix and suffix concatenation. With suffix concatenation, the attacker creates the text pair $(d_{c_i}, d_{c_i} + perb)$, where $perb$ represents a perturbation text used as the suffix and the notation ‘+’ represents the concatenation of the two text segments. The reason for limiting the choice to these two methods is that the attacker must keep $d_{c_i}$ itself intact during perturbation, because the attacker lacks knowledge of the watermarking scheme and the specific triggers. Any modification within $d_{c_i}$ (e.g., synonym replacement) may render the original triggers in $d_{c_i}$ ineffective. If a trigger is rendered invalid, $e'_{c_i}$ may exhibit deviations, leading to a failed semantic perturbation.
Therefore, only two positions are suitable for $perb$: as prefix or suffix. Unless specified otherwise, all perturbations in the following sections use suffix concatenation. We define $d'_{c_i} = d_{c_i} + perb$ and denote the corresponding embedding as $e'_{c_i}$. We further explore other aspects of the perturbation. For the suffix, the potential construction space can be categorized from two perspectives: length and semantics. We use EmbMarker[14] as an example; the conclusions are applicable to other schemes.
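
A small sketch can make the constraint concrete: concatenation keeps every token of the original text, while an in-text edit may not. The helper names, the single-space separator, the whitespace tokenization, and the example strings are assumptions for illustration.

```python
def suffix_perturb(d, perb):
    """Build the text pair (d, d + perb) by suffix concatenation."""
    return d, d + " " + perb


def prefix_perturb(d, perb):
    """Build the text pair (d, perb + d) by prefix concatenation."""
    return d, perb + " " + d


def tokens_preserved(original, perturbed):
    """True if every whitespace token of the original survives in the perturbed text."""
    perturbed_tokens = perturbed.split()
    return all(tok in perturbed_tokens for tok in original.split())


original = "Happy day"
_, suffixed = suffix_perturb(original, "the market fell sharply")
_, prefixed = prefix_perturb(original, "the market fell sharply")
replaced = "Happy daytime"  # hypothetical in-text edit that alters the token "day"

for name, text in [("suffix", suffixed), ("prefix", prefixed), ("in-text edit", replaced)]:
    print(name, tokens_preserved(original, text))
```

Because the attacker does not know which tokens are triggers, preserving all original tokens is the only way to be sure that no trigger is removed.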

Random tokens without semantics: We first explore a simple construction method that adds random tokens without semantics as the suffix. Specifically, we tokenize each sentence in a general text corpus and compile all tokens into a single token vocabulary. We then randomly add tokens from this vocabulary to the suffix. At this stage, we explore the relationship between suffix length and perturbation performance before and after the watermark injection, measured by the similarity of the pair $(e_{c_i}, e'_{c_i})$. The results in Figure 4 indicate that as the suffix length increases, the embedding similarity gradually decreases. After the watermark is injected into $(e_{c_i}, e'_{c_i})$, the rate of decrease slows significantly, and the similarity remains notably higher than on the curve without watermark injection.

Random text with semantics: We randomly select long texts from a general text corpus, tokenize each text to obtain a sequence of tokens, and sequentially add each token to the suffix. We explore the effects both with and without watermark injection. The results are illustrated in Figure 4. Semantic suffixes lead to a faster improvement in perturbation performance, and the curve with watermark injection again lies significantly above the curve without injection. Interestingly, for the same suffix length, the perturbation performance achieved with semantic text is generally higher than that achieved with random tokens. This finding suggests that using suffixes with semantics is more cost-effective and produces better results. Therefore, we consistently use semantic suffixes during the perturbation process.
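
The two suffix constructions compared here can be sketched as follows. This is an illustrative sketch only: whitespace splitting stands in for the real tokenizer, and the tiny corpus, the seed, and the function names are assumptions.

```python
import random


def build_vocab(corpus):
    """Collect all whitespace tokens of a corpus into one sorted vocabulary."""
    return sorted({tok for sentence in corpus for tok in sentence.split()})


def random_token_suffixes(vocab, max_len, rng):
    """Suffixes of length 1..max_len, each grown by one randomly drawn token."""
    suffix, out = [], []
    for _ in range(max_len):
        suffix.append(rng.choice(vocab))
        out.append(" ".join(suffix))
    return out


def semantic_text_suffixes(text, max_len):
    """Suffixes of length 1..max_len taken as the leading tokens of a real text."""
    toks = text.split()[:max_len]
    return [" ".join(toks[:k]) for k in range(1, len(toks) + 1)]


corpus = ["the cat sat", "a dog ran"]  # toy stand-in for a general text corpus
rng = random.Random(0)                 # seeded for reproducibility
print(random_token_suffixes(build_vocab(corpus), 4, rng))
print(semantic_text_suffixes("the cat sat on the mat", 4))
```

Each returned list gives one suffix per length, so the similarity between the original and perturbed embeddings can be plotted against suffix length for both constructions.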

Figure 4: Approaches of Semantic Perturbations: Length and Semantics. Regardless of whether the embeddings are watermarked, random text performs better than random tokens. Watermark injection leads to a significant gap between the curves.

Text with opposite semantics: Suffixes with semantics achieve better perturbation performance at lower cost, making suspicious samples easier to identify. Thus, we explore a heuristic perturbation scheme based on the observation above. We follow the same setup as previous works, primarily focusing on text classification tasks. The heuristic perturbation scheme randomly selects samples with different labels from the dataset as suffixes. The semantic differences between samples with different labels may enhance the perturbation performance. However, the metric distributions across different datasets indicate that the semantic perturbation methods need further improvement under additional guidance. Details of the heuristic perturbation scheme can be found in Appendix A.
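A minimal sketch of the heuristic is given below, assuming a labeled classification dataset held in two parallel lists. The function name `different_label_suffixes` and the toy spam/ham data are hypothetical and serve only to show the sampling step.

```python
import random

def different_label_suffixes(index, texts, labels, k, rng):
    """Sample up to k suffix texts whose label differs from that of texts[index]."""
    candidates = [t for t, y in zip(texts, labels) if y != labels[index]]
    return rng.sample(candidates, min(k, len(candidates)))

# Toy dataset (hypothetical): label 1 = spam, label 0 = ham.
texts = ["win a free prize now", "meeting moved to 3pm",
         "cheap pills online", "see you at lunch"]
labels = [1, 0, 1, 0]

suffixes = different_label_suffixes(0, texts, labels, k=2, rng=random.Random(0))
# Each perturbed text is the original sample followed by one suffix.
perturbed = [texts[0] + " " + s for s in suffixes]
```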

Small Local Model Suffix Search Guidance: According to the threat model in Section 4.1, the attacker can access $D_c$, a general text corpus $D_p$, and a small local embedding model $\Theta_s$. Both small embedding models and LLM-based EaaS services essentially extract the features of the input text. Hence, the features extracted by the victim model $\Theta_v$ and by $\Theta_s$ are bound to exhibit some similarity. In other words, while the vectors differ across feature spaces, their differential properties are consistent. Therefore, $\Theta_s$ can be used to guide the selection of optimal suffixes. We can input the text pair $(d_{c_i}, d_{c_i} + perb)$ into $\Theta_s$ to capture the differential properties through the corresponding embedding pair, where the perturbation $perb$ has to traverse all candidates in the perturbation pool. However, this direct search is inefficient: its time complexity is $|D_c| \cdot |perb\ pool|$, since $\Theta_s$ must perform $|D_c| \cdot |perb\ pool|$ encoding passes. Details of the algorithm can be found in Appendix B.

Algorithm 1 Suffix Direct Search Guidance
Input: Perturbation Pool $P$, Dataset $D_c$, Standard Model $\Theta_s$, Hyperparameter $k$
Output: Metric Values Set $v$
Initialize $s \leftarrow \emptyset$ (Suffix)
Initialize $n \leftarrow |D_c|$, $m \leftarrow |P|$
Set $max(s) \leftarrow 1$   $\triangleright$ Cosine similarity range: $[-1, 1]$
for $i = 1$ to $n$ do
    for $j = 1$ to $m$ do
        Encode: $se_{c_i} \leftarrow \Theta_s(d_{c_i})$, $se_{perb} \leftarrow \Theta_s(perb_j)$
        $sim \leftarrow cosine(se_{c_i}, se_{perb})$
        if $|s| < k$ then
            Append $perb_j$ to $s$
        else if $|s| \geq k$ and $sim < max(s)$ then
            Remove $max(s)$ from $s$
            Insert $perb_j$ into $s$
        else
            Skip $perb_j$
        end if
    end for
    Compute aggregate metric: $metric \leftarrow agg(s)$
    Append $metric$ to $v$
end for
return $v$

Thus, to improve efficiency, we propose an approximate and efficient approach. For a text $d_{c_i}$ and its embedding $e_{c_i}$, we can consider $e_{c_i}$ as the feature representation of $d_{c_i}$ in a high-dimensional space. In this space, the vector in the opposite direction can be seen as representing features that are entirely different from $e_{c_i}$. We input $d_{c_i}$ and $perb$ into $\Theta_s$ separately to obtain the corresponding embedding pair $(se_{c_i}, se_{perb})$, where $perb$ traverses all candidates in the perturbation pool. We then select the top-$k$ perturbations that produce the lowest similarity between $se_{c_i}$ and $se_{perb}$. Such embeddings exhibit opposite characteristics, implying that the semantic gap between $d_{c_i}$ and $perb$ is maximized. Consequently, constructing $(d_{c_i}, d_{c_i} + perb)$ can effectively conduct semantic perturbation on $d_{c_i}$ to detect the presence of triggers. We evaluate the perturbation performance based on the $k$ selected samples. The effectiveness of this approach relies on a reasonable hypothesis: concatenating texts with an obvious semantic gap allows for significant semantic perturbation. $\Theta_s$ needs to encode $D_c$ and the perturbation pool only once, so the time complexity is $|D_c| + |perb\ pool|$, with $\Theta_s$ performing $|D_c| + |perb\ pool|$ encoding passes. The complete process is given in Algorithm 1. Small local model guidance is an approximate search and is highly efficient. Moreover, the suffix search guidance can be combined with the method detailed in Appendix B to better search for the optimal suffixes. In practice, we use Sentence-BERT[31] as $\Theta_s$. Sentence-BERT has fewer dimensions than popular EaaS models (384 vs. 1536) and contains only 22.7M parameters. All subsequent experiments employ Sentence-BERT as the local model.
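The selection step can be sketched as follows, assuming the embeddings of $D_c$ and of the perturbation pool have already been computed once by the local model and stored as NumPy arrays. The function name `select_suffixes` and the toy vectors are hypothetical, and the vectorized argsort replaces the incremental top-$k$ bookkeeping of Algorithm 1 with an equivalent ranking.

```python
import numpy as np

def select_suffixes(sample_emb, pool_emb, k):
    """For each sample, return the indices of the k pool entries
    with the lowest cosine similarity to that sample."""
    s = sample_emb / np.linalg.norm(sample_emb, axis=1, keepdims=True)
    p = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sim = s @ p.T                      # shape (n_samples, pool_size)
    order = np.argsort(sim, axis=1)    # ascending: least similar first
    return order[:, :k]

# Toy embeddings (hypothetical): one sample, three candidate suffixes.
sample_emb = np.array([[1.0, 0.0]])
pool_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
chosen = select_suffixes(sample_emb, pool_emb, k=2)
```

Because every text is encoded exactly once, the encoding cost grows with $|D_c| + |perb\ pool|$; only the inexpensive similarity matrix has $|D_c| \cdot |perb\ pool|$ entries.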

4.3 Embeddings Tightness Measurement

To explore the optimal perturbation suffix, reasonable evaluation metrics for the perturbed embeddings need to be established. Our primary evaluation consists of three metrics, defined as

$$
\begin{aligned}
Cosine_i &= \frac{1}{k}\sum_{j=1}^{k} \frac{e_{c_i} \cdot e^{j}_{c_i}}{|e_{c_i}| \cdot |e^{j}_{c_i}|}, \\
L_{2_i} &= \frac{1}{k}\sum_{j=1}^{k} \left| \frac{e_{c_i}}{|e_{c_i}|} - \frac{e^{j}_{c_i}}{|e^{j}_{c_i}|} \right|, \\
PCA\ Score_i &= \sum_{d=1}^{D_{pca}} f_{pca}\left(e^{j}_{c_i} \mid j = 1, 2, 3, \ldots, k\right), \\
&\quad D_{pca}: \text{lower dimension},
\end{aligned}
\tag{5}
$$

where the three metrics are based on cosine similarity, $L_2$ distance, and PCA score, each representing the similarity of $(e_{c_i}, e'_{c_i})$. However, text perturbations may, with low probability, introduce new triggers into the original text. Such a situation is unavoidable, as the attacker lacks access to the service provider’s relevant knowledge under the threat model. In backdoor-based watermarking schemes, mid-frequency tokens are typically selected as triggers to minimize the impact on downstream task performance. Therefore, regardless of the perturbation method, we conduct $k$ perturbations for each sample. Among the $k$ perturbations, only a limited number may introduce new triggers. Thus, combining the results of the $k$ trials into the final evaluation metrics mitigates the potential impact.

Figure 5: PCA Score Metric Visualization. By applying PCA to reduce the high-dimensional vectors to two dimensions, significant differences in the distribution of the eigenvalues can be observed. The figure illustrates the one-dimensional and two-dimensional kernel density distributions.

Cosine Similarity Metric: Cosine similarity is an intuitive metric. It measures the difference between two high-dimensional vectors by using the cosine of the angle between the embeddings in vector space. We use the average of the $k$ trials as one of the evaluation metrics.

$L_2$ Distance Metric: The $L_2$ distance, i.e., the Euclidean distance, represents the straight-line distance between two data points in high-dimensional space. The embeddings $(e_{c_i}, e'_{c_i})$ are normalized and can be considered as lying on the unit hypersphere in a high-dimensional space. Thus, the $L_2$ distance is directly related to the cosine similarity in this scenario: as the angle between the vectors increases, the cosine similarity decreases and the $L_2$ distance increases. We use the average of the $k$ trials as one of the evaluation metrics.
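The first two metrics of Equation 5 can be sketched in a few lines. For unit vectors $u$ and $v$ separated by angle $\theta$, $\lVert u - v \rVert^2 = 2 - 2\cos\theta$, which is the relationship between the two metrics noted above. The function names and toy vectors below are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cosine_metric(e, perturbed):
    """Mean cosine similarity between e (shape d) and k perturbed embeddings (shape k x d)."""
    return float(np.mean(unit(perturbed) @ unit(e)))

def l2_metric(e, perturbed):
    """Mean Euclidean distance between the normalized e and each normalized perturbed embedding."""
    return float(np.mean(np.linalg.norm(unit(perturbed) - unit(e), axis=1)))

# Toy example (hypothetical): one orthogonal and one parallel perturbation.
e = np.array([1.0, 0.0])
perturbed = np.array([[0.0, 2.0], [3.0, 0.0]])
cos_val = cosine_metric(e, perturbed)
l2_val = l2_metric(e, perturbed)
```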

PCA Score Metric: We perform $k$ perturbations to mitigate the potential impact of new triggers during perturbation. As a result, we obtain $e_{c_i}$ and $k$ perturbed embeddings $\{e^{j}_{c_i} \mid j = 1, 2, 3, \ldots, k\}$. For each sample $d_{c_i}$, an embedding set of size $k+1$ can be obtained. We apply PCA, a dimensionality reduction algorithm for preprocessing high-dimensional data, to each embedding set. We reduce the dimensionality of the embedding space and obtain the eigenvalue of each principal component. If $d_{c_i}$ contains triggers, the embedding set should be tightly clustered in the high-dimensional space. After PCA, the lower-dimensional representation should also demonstrate tight clustering, with significantly smaller eigenvalues. Therefore, we use the sum of the eigenvalues as one of the evaluation metrics. This is also expressed in Equation 5, where $D_{pca}$ is the number of dimensions retained after reduction and $f_{pca}$ is the eigenvalue computation algorithm in PCA. If we reduce the high-dimensional embeddings to two dimensions and use the eigenvalues corresponding to the two principal components as the x-axis and y-axis coordinates, we obtain the plot shown in Figure 5.
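One way to compute such a score is sketched below, assuming that $f_{pca}$ returns the eigenvalues of the sample covariance matrix of the embedding set (obtained here from the singular values of the centered data). The function name `pca_score` and the synthetic embeddings are hypothetical.

```python
import numpy as np

def pca_score(embedding_set, n_components=2):
    """Sum of the largest `n_components` covariance eigenvalues of an embedding set."""
    x = np.asarray(embedding_set, dtype=float)
    x = x - x.mean(axis=0)                        # center the set
    singular = np.linalg.svd(x, compute_uv=False) # sorted in descending order
    eigvals = singular ** 2 / (x.shape[0] - 1)    # eigenvalues of the sample covariance
    return float(eigvals[:n_components].sum())

# Synthetic sets (hypothetical): a tight cluster and a spread-out one.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
tight = base + 0.01 * rng.normal(size=(6, 16))
loose = base + 1.0 * rng.normal(size=(6, 16))
tight_score, loose_score = pca_score(tight), pca_score(loose)
```

Under the reasoning above, a set that stays tightly clustered after perturbation yields a much smaller score than a set that scatters.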

According to the threat model, the attacker is unaware of the specific triggers, so we evaluate the perturbation with AUPRC (Area Under the Precision-Recall Curve), an important metric for evaluating binary classification performance, particularly under class imbalance. It focuses on the performance of the positive class, scored by the cosine similarity, $L_2$ distance, and PCA score defined above, where texts with triggers are considered the positive class in the context of EaaS watermarking. A higher AUPRC value indicates a more accurate classifier for the positive class, making it easier to distinguish whether samples carry triggers. The criteria for assessing the performance of a classifier using AUPRC are as follows:

  • If AUPRC = 1, the classifier is perfect, meaning that all positive samples are correctly identified, with no false positives.

  • If AUPRC = 0, the classifier is ineffective, meaning that all positive samples are misclassified.
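For illustration, AUPRC can be estimated with average precision, a common step-wise approximation of the area under the precision-recall curve. The sketch below assumes binary labels (1 for texts with triggers), at least one positive sample, and no tied scores; the function name `average_precision` is hypothetical.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank   # precision at this recall step
    return ap / total_pos

# Positives ranked first give a perfect score; positives ranked last give a low one.
perfect = average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
poor = average_precision([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])
```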

4.4 Selection and Deletion

Figure 6: Selection and Deletion Phase. Our semantic perturbation strategy induces a bimodal distribution in the PCA metric distribution, enabling selection and deletion based on the differences between the two peaks.

With the small local model guidance, the attacker can easily distinguish the distribution differences among the various metrics. This section discusses how these distribution differences can be leveraged to select suspicious samples and bypass watermark verification. The attacker can access only the distributions of the different metrics, not the triggers selected by the service provider. The distributions consistently exhibit a long-tail phenomenon, which is caused by the texts containing triggers. The attacker also does not know how the service provider defines the triggers; for instance, the service provider may use phrases composed of consecutive tokens or other symbols as triggers.

To achieve better generalization and performance, we adopt the selection and deletion approach. By estimating the distribution curve of the metric values, we observe an anomalous rise in the long-tail region, resulting in another peak. Based on this characteristic, we can infer that the slope undergoes a significant change in the long-tail region. This indicates the presence of a point in the long-tail region where the first derivative equals zero or the second derivative is significantly large. The cosine similarity metric distribution curve and its derivative curve are plotted in Figure 6, which shows the similarity distribution shift of the Enron Spam dataset under the EmbMarker[14] scheme. Specifically, we select the metric value corresponding to the point on the curve where the first derivative equals zero or the second derivative is maximal, and denote it as the threshold $\varphi$. Samples whose metric exceeds $\varphi$ are deleted from $D_c$, yielding a purified dataset $D_{sc}$. During the deletion process, the majority of text samples containing triggers are eliminated from $D_c$. While some benign data might also be removed, it represents only a small proportion of $D_c$. Experimental results show that such deletion does not affect downstream tasks. Furthermore, we can replenish $D_c$ to its original size and repeat the selection and deletion process iteratively to mitigate the impact of such deletions.
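A simplified sketch of the threshold selection is shown below. It assumes a smoothed histogram as the density estimate and takes the first point after the main peak where the discrete first derivative changes from negative to non-negative; the second-derivative criterion is not implemented. The function names, bin settings, and toy metric values are all hypothetical.

```python
import numpy as np

def valley_threshold(values, bins=10, value_range=(0.0, 1.0), smooth=3):
    """Return the metric value at the first valley to the right of the main density peak."""
    counts, edges = np.histogram(values, bins=bins, range=value_range)
    kernel = np.ones(smooth) / smooth
    density = np.convolve(counts, kernel, mode="same")  # moving-average smoothing
    centers = (edges[:-1] + edges[1:]) / 2
    slope = np.diff(density)                            # discrete first derivative
    main_peak = int(np.argmax(density))
    for i in range(main_peak, len(slope) - 1):
        if slope[i] < 0 and slope[i + 1] >= 0:          # falling, then flat or rising
            return float(centers[i + 1])
    return None                                         # no second peak detected

def purify(samples, values, threshold):
    """Keep only the samples whose metric does not exceed the threshold."""
    return [s for s, v in zip(samples, values) if v <= threshold]

# Toy cosine-similarity values (hypothetical): a benign bulk plus a small high-similarity tail.
values = [0.25] * 10 + [0.35] * 40 + [0.45] * 12 + [0.55] * 2 + [0.85] * 6 + [0.95] * 3
phi = valley_threshold(values)
kept = purify(list(range(len(values))), values, phi)
```

In this toy case the threshold falls in the empty region between the two peaks, so only the nine tail samples are deleted.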

5 Semantic Aware Watermarking

To counter SPA, we propose the Semantic Aware Watermarking (SAW) scheme. In this context, the watermarking scheme should satisfy the following three basic conditions: (1) Verifiability: the service provider is able to verify the watermark in the embeddings; (2) Downstream tasks: the utility of the embeddings after watermark injection is comparable to that of the original ones, resulting in minimal performance degradation on commonly used downstream tasks; (3) Security of watermark: the watermark is secure and robust, enabling it to defend against potential attacks such as SPA. Specifically, SAW injects a single predefined watermark vector $wm_o$ into the embeddings through a watermark injection model. The presence of $wm_o$ in the embeddings is then verified using a watermark verification model. The injection and verification models are trained together in an end-to-end manner. SAW injects the watermark based on text semantics, enabling it to resist SPA, which previous schemes cannot achieve. Additionally, SAW can be deployed as a plug-in extension module compatible with various EaaS services.

Figure 7: An Overview of the Semantic Aware Watermarking Scheme. The EaaS service provider selects a trigger set and employs an encoder for adaptive watermark injection, while a decoder is used for watermark verification. The encoder and decoder are associated with distinct loss functions.

5.1 Encoder

To ensure that the watermark injected into the embeddings can counter SPA, SAW trains an encoder as the watermark injection model to inject $wm_o$ into the text embeddings. The encoder is capable of injecting the watermark in different patterns based on the semantic features of the corresponding embeddings. The watermark $wm_o$ is predefined by the EaaS service provider, typically in the form of a numerical vector, and can be generated randomly. The EaaS service provider can choose to use mid-frequency tokens as triggers, injecting the watermark only into the embeddings of texts containing triggers. Alternatively, the watermark can be injected across the entire embedding set. In the encoder’s end-to-end training process, the original embedding $e_o$ is taken as input, and the watermark-injected output is $e_{enc}$. The training objective is to make $e_o$ and $e_{enc}$ as similar as possible. The loss function for the encoder is defined as

$$
Loss_{enc} = \frac{|e_o - e_{enc}|}{len(e_o)}. \tag{6}
$$

The output $e_{enc}$ differs minimally from $e_o$, thereby preserving the performance of the embeddings in downstream tasks. Our encoder model consists of only several fully connected layers or an auto-encoder. Despite its simplicity, the encoder can determine how to apply suitable levels of numerical variation to different positions in the vector $e_o$. Ultimately, the numerical variation added to $e_o$ enables the watermark verification model to decode and verify the predefined $wm_o$.

5.2 Decoder

SAW employs a decoder to extract $wm_o$ from $e_{enc}$. During the end-to-end training process, the decoder takes $e_{enc}$ as input and outputs the decoded watermark $wm_{dec}$. The decoder must satisfy two basic conditions: (1) it should successfully decode $wm_o$ from $e_{enc}$ with minimal distortion; (2) given an $e_o$ that does not contain $wm_o$, the decoder should not be able to decode the watermark and should instead produce a random vector of the same length as $wm_o$. Our scheme therefore adopts a randomized strategy: the decoder parameters are initialized randomly and are not updated during training. The training objective associated with the decoder is to make $wm_{dec}$ as close as possible to $wm_o$. The corresponding loss function is defined as

$$
Loss_{dec} = \frac{|wm_o - wm_{dec}|}{len(wm_o)}. \tag{7}
$$

All training occurs within the encoder, with the objective of training the encoder to inject suitable numerical variation into $e_o$. The final injected watermark should be mapped by the decoder to the predefined $wm_o$. Due to its random initialization and fixed parameters, the decoder naturally outputs a random vector if $e_o$ does not contain $wm_o$. It is crucial that the decoder parameters remain randomly initialized and are not updated during training. If the decoder’s parameters were not fixed and were instead optimized towards a random vector as the target whenever $e_o$ is provided as input, the randomness of each training target would introduce uncertainty into the training process and would ultimately prevent effective convergence of the model. The decoder with fixed random parameters inherently outputs a random vector when given embeddings without the watermark. Our decoder model consists of only several fully connected layers. Even with this simplicity, it successfully meets the two basic conditions outlined above. We elaborate further on the significance of random initialization in Section 6.4.

5.3 End to End Training

SAW employs an end-to-end training strategy. This is essential because the training objectives for the encoder are: (1) to make $e_o$ as similar as possible to $e_{enc}$; and (2) to make $wm_o$ as similar as possible to $wm_{dec}$. The complete loss function during training is therefore defined as

$$Loss = \alpha \cdot \frac{|e_o - e_{enc}|}{len(e_o)} + \beta \cdot \frac{|wm_o - wm_{dec}|}{len(wm_o)}, \tag{8}$$

where $\alpha$ and $\beta$ are hyperparameters that weight the two parts of the loss. SAW updates the gradients only for the encoder parameters, while the decoder parameters are randomly initialized and remain fixed without updates. The security of the watermarking scheme relies on the inaccessibility of the decoder model and $wm_o$. The pair $(wm_o, Encoder)$ can be regarded as an encryption key and $(wm_o, Decoder)$ as a decryption key, implementing asymmetric encryption. Only parties with the decryption key can successfully verify the watermark. SAW injects $wm_o$ based on the characteristics of each embedding. The experiment results presented in Section 6 demonstrate that SAW is resilient to semantic perturbation and still supports successful watermark verification.
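As a concrete reading of Eq. (8), the following minimal Python sketch computes the two-term loss on plain lists of floats. It assumes that $|\cdot|$ denotes the sum of absolute element-wise differences, so that each term is a mean absolute error; the paper does not state the norm explicitly. The names `mean_abs_diff` and `saw_loss` are illustrative and are not taken from the authors' code.

```python
def mean_abs_diff(a, b):
    """Sum of absolute element-wise differences divided by the vector length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def saw_loss(e_o, e_enc, wm_o, wm_dec, alpha, beta):
    """Eq. (8) under the assumption that |.| is an L1 norm.

    e_o    : original embedding
    e_enc  : embedding produced by the encoder (with watermark)
    wm_o   : predefined watermark vector
    wm_dec : watermark vector produced by the decoder
    """
    embedding_term = mean_abs_diff(e_o, e_enc)   # keeps e_enc close to e_o
    watermark_term = mean_abs_diff(wm_o, wm_dec)  # keeps wm_dec close to wm_o
    return alpha * embedding_term + beta * watermark_term


# Toy values, chosen only to make the arithmetic easy to follow.
loss = saw_loss(
    e_o=[1.0, 2.0], e_enc=[1.0, 4.0],
    wm_o=[0.0, 1.0], wm_dec=[1.0, 1.0],
    alpha=2.0, beta=4.0,
)
print(loss)
```

In an actual training loop, only the encoder parameters would receive gradients of this loss, since the decoder stays frozen.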

6 Experiment

6.1 Experiment Setup

We evaluate our attack on two classic prior schemes, EmbMarker[14] and WARDEN[15], using text classification as the downstream task and OpenAI's text-embedding-ada-002 as the simulated victim model. Experiments are conducted on four datasets: Enron Spam[32], SST2[33], MIND[34] and AG News[35].

  • Enron Spam: The Enron Spam dataset consists of a collection of emails labeled as either “spam” or “non-spam” (ham), making it a valuable resource for studying spam filtering, email classification, and Natural Language Processing (NLP) tasks.

  • SST2: The SST2 dataset is a collection of movie reviews labeled with binary sentiment (positive or negative), commonly used for training and evaluating models in sentiment classification tasks.

  • MIND: The MIND dataset is a large-scale dataset designed to advance personalized news recommendation. It can also be used for news classification tasks.

  • AG News: The AG News dataset is a collection of news articles categorized into four topics, commonly used for text classification and NLP tasks.

Even the smallest of the datasets contains at least 30,000 samples. Semantic perturbation requires querying the service provider’s API for each instance. Considering the high experiment costs, we sample a subset of each dataset for our experiments and adjust the number of training epochs accordingly. Detailed information is given in Appendix C.

6.2 SPA Overall Performance

The Semantic Perturbation Attack aims to identify suspicious watermarked embeddings and bypass watermark verification in post-publication settings. We conduct a comprehensive evaluation of the proposed attack. Verification is bypassed if the majority of the texts with triggers, and hence of the watermarked embeddings, are deleted from the dataset. The performance of semantic perturbation is the key to the success of SPA. We use the primary metrics described in Section 4.3 to evaluate this performance. Each text in the dataset receives $k$ perturbations, with $k=10$ chosen to balance time and cost.

TABLE I: Semantic Perturbation Attack Performance
 
| Datasets | Schemes | Cos AUPRC | $L_2$ AUPRC | PCA AUPRC | Total Deletion | TPR★ ↑ | FPR ↓ | Precision ↑ |
|---|---|---|---|---|---|---|---|---|
| Enron Spam | EmbMarker | 0.9284 | 0.9227 | 0.9685 | 572/5000 | 91.49% | 1.26% | 90.21% |
| Enron Spam | WARDEN | 0.7348 | 0.7348 | 0.9530 | 619/5000 | 92.91% | 2.14% | 84.65% |
| SST2 | EmbMarker | 0.8947 | 0.8888 | 0.9214 | 439/5000 | 95.68% | 2.30% | 75.63% |
| SST2 | WARDEN | 0.6190 | 0.6190 | 0.9214 | 437/5000 | 95.68% | 2.26% | 75.97% |
| MIND | EmbMarker | 1.0 | 1.0 | 1.0 | 152/5000 | 100% | 0% | 100% |
| MIND | WARDEN | 0.4971 | 0.4971 | 0.7957 | 188/5000 | 84.21% | 1.24% | 68.09% |
| AG News | EmbMarker | 0.5665 | 0.5398 | 0.7052 | 1478/5000 | 97.65% | 19.62% | 42.08% |
| AG News | WARDEN | 0.3323 | 0.3323 | 0.6791 | 1498/5000 | 96.86% | 20.19% | 41.19% |

  • Note: PCA AUPRC remains consistently high across different schemes and produces the best performance. TPR is the ratio of watermarked samples that are correctly deleted, while FPR is the ratio of benign samples that are mistakenly deleted. ‘★’ marks the most important metric. The Total Deletion, TPR, FPR and Precision columns together form the Deletion Performance group.

(1) Semantic Perturbation Performance: Using a small local model to guide the semantic perturbation for a larger model, we leverage the open-source model Sentence-BERT[31] to search for the top-$k$ $perb$ that maximize the dissimilarity between the embedding of the text sample $d_{c_i}$ and that of the perturbation text $perb$. The results from the $k$ perturbations are aggregated to form the final evaluation metric. The guided suffix search draws its candidate pool from the WikiText dataset. Table I above presents the resulting perturbation metrics, which demonstrate the high quality of the perturbation suffixes. The results show that our guidance approach is effective and can find suffixes that cause significant semantic perturbation. Consequently, semantic perturbation should be capable of successfully attacking all schemes that use a fixed linear transformation unaware of text semantics. Additionally, the PCA score metric outperforms cosine similarity and $L_2$ distance and does not degrade across schemes. This is because the PCA algorithm preserves the main information in the embeddings while eliminating redundant information, thus maintaining good performance across different schemes. We believe that other dimensionality reduction algorithms can achieve similar results. Our approach is also efficient: searching a candidate dataset of size $10^6$ takes ten minutes.
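The aggregation over $k$ perturbations can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the per-sample score is the plain mean of the cosine similarity (or $L_2$ distance) between the embedding of the original text and the embeddings of its $k$ perturbed versions, whereas the exact metric is defined in Section 4.3 and may differ. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def perturbation_scores(original, perturbed_list):
    """Average similarity and distance between one embedding and its k perturbed versions.

    original       : embedding of the text d_ci
    perturbed_list : embeddings of d_ci + perb for each of the k suffixes
    Returns (mean cosine similarity, mean L2 distance).
    """
    k = len(perturbed_list)
    mean_cos = sum(cosine_similarity(original, p) for p in perturbed_list) / k
    mean_l2 = sum(l2_distance(original, p) for p in perturbed_list) / k
    return mean_cos, mean_l2


# Toy 2-dimensional embeddings with k = 2 perturbations.
scores = perturbation_scores([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(scores)
```

Samples whose embeddings barely move under perturbation would then be ranked as suspicious, since a semantic-independent watermark component is not affected by the suffix.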

(2) Selection and Deletion Rate: After obtaining the metric values from the semantic perturbation, SPA selects and deletes the suspicious samples from the original dataset $D_c$ using the methods outlined in Section 4.4. However, the shared text $d_{c_i}$ in the pair $(d_{c_i}, d_{c_i} + perb)$ produces a certain inherent similarity between $(e_{c_i}, e'_{c_i})$. Even with a PCA AUPRC above 0.95, a small overlap remains between the distributions of benign and backdoor samples. We believe that a larger perturbation pool could further separate the two distributions. Treating the samples with triggers as the positive class, the deletion precision in Table I gives the share of deleted samples that contain triggers, based on the selection and deletion method with the PCA Score Metric. Additionally, using the ground truth, we report the percentage of all backdoor samples that are removed. The experiment results demonstrate that, on most datasets, backdoor samples with triggers constitute the vast majority of the deleted portion. A tiny proportion of benign samples being mistakenly deleted is considered acceptable. Watermark verification cannot succeed if the watermark no longer exists in the original dataset. Overall, between 84.21% and 100% of the watermarked embeddings are identified (above 91% in all but one setting), meaning that nearly all text samples with triggers can be deleted from the original dataset.
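The deletion metrics reported in Table I can be computed from three sets of sample identifiers, as in the short sketch below. The identifiers and counts are toy values, not data from the paper, and the function name `deletion_metrics` is illustrative. The sketch assumes that each of the three denominators is non-empty.

```python
def deletion_metrics(deleted_ids, watermarked_ids, all_ids):
    """TPR, FPR and precision of a deletion step.

    deleted_ids     : samples removed by the attack
    watermarked_ids : ground-truth samples containing triggers (positive class)
    all_ids         : every sample in the original dataset
    """
    deleted = set(deleted_ids)
    watermarked = set(watermarked_ids)
    benign = set(all_ids) - watermarked

    true_positives = len(deleted & watermarked)   # watermarked and deleted
    false_positives = len(deleted & benign)       # benign but deleted

    return {
        "tpr": true_positives / len(watermarked),
        "fpr": false_positives / len(benign),
        "precision": true_positives / len(deleted),
    }


# Toy dataset of 10 samples, 4 of which are watermarked.
metrics = deletion_metrics(
    deleted_ids={1, 2, 3, 9},
    watermarked_ids={1, 2, 3, 4},
    all_ids=range(10),
)
print(metrics)
```

A high TPR with a low FPR corresponds to the situation described above: most watermarked samples are removed while few benign samples are lost.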

(3) Downstream Tasks Accuracy: The purified dataset is obtained after the selection and deletion phase removes the suspicious samples from the original dataset, so it contains fewer samples. We therefore test whether the performance of the embeddings on downstream tasks is affected. We query the embeddings of the purified dataset from the EaaS and train a classifier. The final accuracy of the classifier is shown in Table V. The experiment results demonstrate that, after the deletion of suspicious samples using our approach, the accuracy of downstream tasks is essentially unaffected and remains comparable to that obtained on the original dataset.

TABLE II: Semantic Aware Watermarking Performance
 
| Datasets | Cos AUPRC | $L_2$ AUPRC | PCA AUPRC | Total Deletion | TPR★ ↑ | FPR ↓ | Precision ↑ |
|---|---|---|---|---|---|---|---|
| Enron Spam | 0.4133 | 0.4088 | 0.5574 | 133/5000 | 15.78% | 0.99% | 66.92% |
| SST2 | 0.5348 | 0.5296 | 0.6946 | 87/5000 | 20.75% | 0.32% | 82.76% |
| MIND | 0.9999 | 0.9999 | 0.9999 | 80/5000 | 52.63% | 0% | 100% |
| AG News | 0.2731 | 0.2669 | 0.3295 | 216/5000 | 13.97% | 2.97% | 39.81% |

  • Note: We use the same metrics as in Table I. The performance of semantic perturbation is greatly reduced.

6.3 SAW Overall Performance

The experiment results from Section 6.2 demonstrate that current watermarking schemes are unable to counter SPA, with more than 95% of the watermarked embeddings being identified in several settings. We therefore turn to the performance of the Semantic Aware Watermarking (SAW) scheme against SPA. In SAW, the watermark injection model introduces an appropriate degree of numerical variation at suitable positions within the embeddings, and the verification model combines all of this numerical variation to decode the watermark. Adaptive injection patterns enhance the security and stealthiness of the watermark. As expected, the decline in the SPA success rate is closely related to the reduction in semantic perturbation performance. In this section, we provide a comprehensive analysis of SAW's resistance to SPA, its watermark verification and its downstream task performance.

(1) SAW against SPA: SPA relies on the performance of semantic perturbation. If that performance is greatly reduced, it becomes hard to identify suspicious samples from the perturbation metrics, which prevents any bypass of watermark verification. In SAW, the performance of semantic perturbation is significantly degraded. The perturbation metrics across different datasets, obtained with the watermark injection capabilities learned by the encoder, are presented in Table II. The injection is no longer a fixed linear transformation but is determined autonomously by the encoder according to the semantic features of the text. The most significant metric (PCA AUPRC) decreases by up to roughly 0.40 (for example, from 0.9685 in Table I to 0.5574 in Table II on Enron Spam), so that only part of the watermarked embeddings can be identified in the selection and deletion phase. Accordingly, in contrast to previous schemes, the distribution of the samples containing triggers largely overlaps with the distribution of benign samples, as shown in Figure 8. With SAW, even under SPA, the original dataset still contains a large number of samples with triggers, making it impossible to bypass watermark verification. All of the results demonstrate that SAW is able to resist SPA.

Figure 8: Selection and Deletion Phase in Different Schemes. In previous watermarking schemes, distribution shifts due to watermarked vectors were evident in selection and deletion phase based on PCA score metric. In contrast, with SAW, this distribution shift is minimal, indicating that SAW offers enhanced security and stealthiness.

(2) Semantic Aware Watermarking Verification: SAW effectively defends against SPA, showing that the watermark injected by the encoder is stealthier than in previous works. The watermark verification process is equally important. We randomly initialize the predefined watermark vector as a binary vector with entries in {0, 1}. Compared with floating-point values, binary vectors give a more intuitive picture of the difference between the watermark decoded by the verification model and the original watermark vector. SAW can also use the average bit error to verify the watermark once every value of the decoded vector is mapped to 0.0 or 1.0. In our experiments, we set the length of the original watermark vector to 24 and construct two text sets for verification based on whether the text contains triggers. The vector decoded from watermarked text embeddings should be more similar to the predefined vector. We use the average $\Delta Cos$, $\Delta L_2$ and the p-value of the Kolmogorov-Smirnov (KS) test to conduct verification. The watermark injection in SAW fundamentally diverges from previous schemes, as SAW no longer relies on a fixed linear transformation independent of text semantics. Consequently, the watermark verification procedure in SAW also takes a distinct form. To verify the watermark, we use the test set of each dataset. The verification results are shown in Table III and indicate that the verification approach effectively detects copyright infringement. In particular, as an indicator of statistical significance, the p-value changes remarkably, dropping by many orders of magnitude.
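The verification statistics can be illustrated with a small Python sketch. It makes several assumptions that are not confirmed by the text: decoded values are binarized with a threshold of 0.5, $\Delta Cos$ is taken as the difference between the mean cosine similarity (to the predefined watermark) of the trigger set and that of the benign set, and the KS test is omitted because it would require a statistics library. All names and vectors are illustrative, and a 4-bit watermark is used instead of 24 bits for brevity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def avg_bit_error(decoded_vectors, watermark, threshold=0.5):
    """Mean number of bits that differ from the watermark after thresholding."""
    total = 0
    for vec in decoded_vectors:
        bits = [1 if v >= threshold else 0 for v in vec]
        total += sum(1 for b, w in zip(bits, watermark) if b != w)
    return total / len(decoded_vectors)


def delta_cos(trigger_decoded, benign_decoded, watermark):
    """Mean similarity to the watermark on the trigger set minus that on the benign set."""
    trigger_mean = sum(cosine_similarity(v, watermark) for v in trigger_decoded) / len(trigger_decoded)
    benign_mean = sum(cosine_similarity(v, watermark) for v in benign_decoded) / len(benign_decoded)
    return trigger_mean - benign_mean


watermark = [1, 0, 1, 0]                  # predefined binary watermark (toy size)
trigger_set = [[0.9, 0.1, 0.8, 0.2]]      # decoder outputs for texts with triggers
benign_set = [[0.1, 0.9, 0.2, 0.8]]       # decoder outputs for texts without triggers

print(avg_bit_error(trigger_set, watermark))
print(avg_bit_error(benign_set, watermark))
print(delta_cos(trigger_set, benign_set, watermark))
```

A large positive gap together with a near-zero bit error on the trigger set is the pattern that would indicate the presence of the watermark.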

TABLE III: Watermark Verification Ability
 
| Datasets | Schemes | ACC.(%) | $\Delta Cos$ ↑ | $\Delta L_2$ ↓ | p-value ↓ | Avg Bit Error |
|---|---|---|---|---|---|---|
| Enron Spam | Original | 92.00% | -0.0208 | -0.0011 | $10^{-1}$ | 0.24 ↔ -11.72 |
| Enron Spam | SAW | 91.80% | 0.9607 | -7.7223 | $10^{-85}$ | |
| SST2 | Original | 92.20% | -0.0062 | 0.0391 | $10^{-1}$ | 0.06 ↔ -12.21 |
| SST2 | SAW | 92.00% | 1.3273 | -10.2745 | $10^{-76}$ | |
| MIND | Original | 56.60% | 0.0317 | -0.0624 | $10^{-1}$ | -0.13 ↔ -11.98 |
| MIND | SAW | 56.80% | 0.9088 | -6.6034 | $10^{-25}$ | |
| AG News | Original | 89.60% | 0.0173 | 0.0064 | $10^{-1}$ | 0.15 ↔ -9.85 |
| AG News | SAW | 90.00% | 1.1154 | -9.1509 | $10^{-79}$ | |
 

(3) Minimal Impact on Downstream Tasks: A basic criterion for watermarking schemes is that watermarked embeddings should not significantly degrade the performance of the corresponding downstream tasks. While we could inject the watermark into all embeddings, we follow the strategy of previous works: we select mid-frequency tokens from a general text corpus (WikiText dataset) as triggers and inject the watermark only into the embeddings of texts containing triggers. Because SAW injects the watermark into only a subset of embeddings, the potential impact of the watermark is further mitigated. As shown in Table III, there is virtually no decline in the performance of downstream tasks.

6.4 Ablation Study & Discussion

Finding 1: PCA Score Metric demonstrates superior robustness compared to other metrics. The experiment results in Table I reveal that only the PCA Score Metric remains largely stable as an attack metric across different schemes. In contrast, the other metrics (e.g., Cosine Similarity and $L_2$ Distance) perform well against EmbMarker[14] but degrade significantly against WARDEN[15]. One possible explanation for this phenomenon is that, in contrast to single-watermark schemes, multi-watermark schemes inject multiple watermark vectors into an original embedding. Additionally, these watermark vectors are often subject to orthogonality constraints, which make the overall injected watermark as inconspicuous as possible. These constraints can therefore affect the simple distance computations between embeddings during the attack. However, dimensionality reduction algorithms such as PCA can eliminate redundant information and extract the principal components of the embeddings. Given that the principal components of watermarked embeddings contain the watermark information, this may explain why the PCA Score Metric does not show a significant decline.

TABLE IV: Decoder Ability & Model Convergence
 
| Datasets | Cos → 1 (No Update ↔ Update) | $L_2$ → 0 (No Update ↔ Update) | Avg Bit Error (No Update ↔ Update) |
|---|---|---|---|
| Enron Spam | 0.98 ↔ 0.79 | 0.74 ↔ 2.56 | 0.0 ↔ 11.50 |
| SST2 | 0.98 ↔ 0.81 | 0.75 ↔ 2.49 | 0.0 ↔ 11.45 |
| MIND | 0.99 ↔ 0.82 | 0.64 ↔ 2.40 | 0.0 ↔ 10.07 |
| AG News | 0.98 ↔ 0.80 | 0.73 ↔ 2.53 | 0.0 ↔ 10.87 |
 

Finding 2: SPA enhances attacker’s ability in EaaS services. EaaS services are susceptible to various forms of copyright infringement, including model extraction attacks [12]. A backdoor-based watermark works against such attacks only because the attacker's model learns the watermark during training. If all of the samples containing triggers are removed from the training data, the backdoor-based watermark becomes ineffective. While Yan et al. [36] emphasize that the effectiveness of a backdoor depends on various training configurations, SPA enhances the attacker’s ability to filter out the samples with triggers, removing most of them before training. As demonstrated in Table V, SPA can effectively remove the samples with triggers, thus restoring the attack’s efficacy while leaving the performance of downstream tasks unaffected.

TABLE V: Model Extraction Attack Performance With and Without SPA
 
| Datasets | Schemes | No SPA: ACC.(%) | No SPA: $\Delta Cos$ ↓ | No SPA: $\Delta L_2$ ↑ | No SPA: p-value ↑ | With SPA: ACC.(%) | With SPA: $\Delta Cos$ ↓ | With SPA: $\Delta L_2$ ↑ | With SPA: p-value ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Enron Spam | EmbMarker | 92.00% | 0.0599 | -0.1199 | $10^{-7}$ | 91.40% | 0.0049 | -0.0098 | $10^{-1}$ |
| Enron Spam | WARDEN | 92.20% | 0.0519 | -0.1039 | $10^{-8}$ | 92.40% | 0.0125 | -0.0250 | $10^{-2}$ |
| SST2 | EmbMarker | 91.60% | 0.0237 | -0.0474 | $10^{-5}$ | 91.00% | 0.0017 | -0.0033 | $10^{-1}$ |
| SST2 | WARDEN | 91.00% | 0.0647 | -0.1294 | $10^{-6}$ | 90.00% | -0.0108 | 0.0216 | $10^{-2}$ |
| MIND | EmbMarker | 69.20% | 0.0564 | -0.1128 | $10^{-6}$ | 70.00% | -0.0033 | 0.0066 | $10^{-1}$ |
| MIND | WARDEN | 71.80% | 0.0926 | -0.1852 | $10^{-6}$ | 70.00% | 0.0280 | -0.0561 | $10^{-2}$ |
| AG News | EmbMarker | 88.80% | 0.01997 | -0.0399 | $10^{-6}$ | 89.80% | 0.0026 | -0.0052 | $10^{-1}$ |
| AG News | WARDEN | 89.00% | 0.05921 | -0.1184 | $10^{-8}$ | 89.00% | 0.0098 | -0.0195 | $10^{-2}$ |

  • Note: A higher p-value, together with $\Delta Cos$ and $\Delta L_2$ close to zero, indicates a successful attack. The $\Delta Cos$, $\Delta L_2$ and p-value columns form the Detection Performance group.

Finding 3: Random decoder’s parameters with no gradient update are effective. The encoder serves as the watermark injection model in SAW, while the decoder is responsible for watermark verification. During end-to-end training, the decoder’s random parameters are fixed and receive no gradient updates. The decoder should decode the correct vector from watermarked embeddings and output random vectors for embeddings without a watermark. With fixed random parameters, the decoder is inherently capable of generating random vectors, so SAW only needs to train the encoder such that the decoder recognizes and correctly decodes the watermark vector injected by the encoder. Fixing the parameters significantly reduces the complexity of training, yields non-random gradient update directions and makes convergence easier. We measure the impact of fixing the decoder’s parameters under identical training configurations, using a text collection containing triggers for validation. Two watermarked embedding sets are constructed, depending on whether the decoder’s parameters are fixed. The results of decoding the two embedding sets are presented in Table IV. When the decoder receives gradient updates, the average bit error for watermark verification is roughly 10 to 12 bits, with lower cosine similarity and larger $L_2$ distance. In contrast, with fixed random parameters, the decoder achieves an error close to zero when decoding the watermark, demonstrating the effectiveness of the fixed random parameters strategy.
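The idea of training only the injection side against a frozen random decoder can be illustrated with a deliberately simplified toy. In the sketch below the decoder is a single fixed random linear map rather than several fully connected layers, the "encoder" is reduced to a free perturbation vector fitted by gradient descent on a squared error, and the embedding-similarity term of Eq. (8) is omitted. All names, dimensions, seeds and the learning rate are assumptions chosen for illustration; this is not the authors' training code.

```python
import random


def make_frozen_decoder(in_dim, out_dim, seed):
    """Random linear decoder; its weights are never updated afterwards."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def decode(weights, embedding):
    """Apply the linear decoder to one embedding."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def fit_perturbation(weights, embedding, watermark, lr=0.2, steps=2000):
    """Gradient descent on 0.5 * ||W(e + delta) - wm||^2 with respect to delta only."""
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        marked = [x + d for x, d in zip(embedding, delta)]
        residual = [o - t for o, t in zip(decode(weights, marked), watermark)]
        for j in range(len(delta)):
            # Gradient component j is (W^T residual)_j; the decoder W stays untouched.
            grad_j = sum(weights[i][j] * residual[i] for i in range(len(residual)))
            delta[j] -= lr * grad_j
    return delta


def bit_error(decoded, watermark, threshold=0.5):
    """Number of bits that differ from the watermark after thresholding."""
    return sum(1 for o, t in zip(decoded, watermark) if (1 if o >= threshold else 0) != t)


rng = random.Random(0)
embedding = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for an original embedding
watermark = [1, 0, 1, 1]                               # toy 4-bit watermark
decoder = make_frozen_decoder(16, 4, seed=42)

delta = fit_perturbation(decoder, embedding, watermark)
marked = [x + d for x, d in zip(embedding, delta)]

print(bit_error(decode(decoder, marked), watermark))     # error on the watermarked embedding
print(bit_error(decode(decoder, embedding), watermark))  # error on the unmodified embedding
```

Because the decoder never changes, the optimization target for the injection side is stationary, which is the intuition behind the convergence argument above.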

Finding 4: The optimal dimension of watermark vector requires trade-off. Given the relatively simple structure of the encoder and decoder, a higher-dimensional watermark vector causes excessive information to be injected, resulting in a higher average bit error during verification. However, increasing the dimension of the vector also expands the decoding space and enhances the security of the watermark. Under our experimental conditions, a length of 24 for the original watermark vector achieves the best trade-off. The corresponding watermark verification performance curve for the Enron Spam dataset is shown in Figure 9. The other datasets in our experiments exhibit the same phenomenon.

Figure 9: Optimal Dimensionality of Watermark Vector. As the dimensionality of the watermark vector increases, the gap between the decoded vector and the watermark vector also increases. With sampling at 12-bit intervals, the optimal dimensionality for the watermark vector is 24. When the dimension is 36, there is no assurance of error-free decoding.

Finding 5: Hyperparameters of end-to-end training reveal the complexities of different training tasks. In the end-to-end training process, we design a specialized loss function. The parameters $\alpha$ and $\beta$ control the contributions of the two parts to the total loss during training. These hyperparameters are important for ensuring effective training. We evaluate the performance of training under different values of $\alpha$ and $\beta$ with other parameters fixed, using the loss values during training as the evaluation metrics. The loss values for the Enron Spam dataset with different parameters are shown in Figure 10. The other datasets exhibit the same phenomenon. When the ratio of $\alpha$ to $\beta$ is $(10^4 : 1)$, the losses of the different parts reach the best trade-off, suggesting that watermark injection is a harder task than watermark verification. This may be due to the fact that the encoder has to handle the 1536-dimension text embeddings (text-embedding-ada-002), while the decoder only processes the predefined 24-dimension watermark vector.

Figure 10: Hyperparameters of Training. The optimal loss is achieved and the total loss does not show a significant increase at the turning point. A gradual decrease in decoder loss can lead to a rapid increase in total loss.

7 Related Work

7.1 Backdoor Attacks

Backdoor attacks in DNNs[37] refer to a type of attack where an adversary manipulates a model by injecting a hidden “backdoor” trigger during the training process, allowing the attacker to control the model’s behavior when the trigger is present in the input data. These attacks typically aim to evade detection by making the model perform normally on benign data but misbehave when the specific trigger is activated. Backdoor attacks target different kinds of tasks, for instance Natural Language Processing (NLP)[38, 39, 40] and Computer Vision (CV)[41, 42, 43]. In distributed Federated Learning (FL), a backdoor can be implanted across the entire system[44, 45], and such attacks can also be conducted during the Transfer Learning (TL)[25] process through the teacher models. Recently, backdoor attacks have also been extended to Large Language Models (LLMs)[46, 47] and Multi-modal Large Language Models (MLLMs)[48, 49, 50]. Recent research [46, 47] not only demonstrates backdoor attacks that directly target pre-trained models by mapping trigger-containing inputs to a predefined output, but also reveals backdoor attacks on customized LLMs such as GPTs, where a backdoor injected into prompts triggers malicious behavior in response to specific inputs, highlighting the risks of LLM pre-training and customization. Moreover, current works[48, 49, 50] explore how to inject a backdoor into pre-trained image encoders and multi-modal contrastive learning models, causing downstream tasks to inherit the backdoor behavior. Liang et al.[50] propose a method to align visual trigger patterns closely with textual semantics in the high-dimensional embedding space.

Therefore, pre-trained models can be vulnerable to backdoor attacks across various downstream tasks, whether in natural language processing or computer vision. When the triggers are activated, the model will exhibit a specific predefined behavior. Previous studies have shown that backdoor attacks exhibit better stealthiness when aligned with semantics. Inspired by this, we conduct a thorough investigation into the current backdoor-based watermarking schemes in EaaS services, revealing their lack of semantic alignment.

7.2 Deep Watermarking

With the advancement of LLMs and MLLMs, the escalating costs associated with training such models, coupled with their growing importance in society, have underscored the urgent need for protecting the copyright related to DNNs. Safeguarding these assets has become a critical priority as the technologies continue to shape various facets of modern life. Deep watermarking is regarded as a promising method. While deep watermarking draws on foundational concepts and techniques from traditional watermarking, there are notable differences between these two domains of application. These distinctions necessitate the adaptation of traditional watermarking methods to the context of DNNs and the development of entirely new approaches tailored to their unique requirements.

Deep watermarking can be categorized into white-box, black-box, and box-free approaches, based on the type of data accessible during the watermark verification process[51]. When internal parameters of the model are accessible, the watermarking technique is considered to operate in a white-box setting[52, 53, 54]. In this scenario, watermark verification relies on direct access to the model’s internal structure, such as the weights or neuron activations for specific inputs. Successful verification in the white-box scenario therefore necessitates in-depth access to these intrinsic model details. In black-box watermarking[55, 56], only the model’s outputs are accessible, with no visibility into its internal structure or parameters. Under this setting, watermark verification is achieved by examining the model’s outputs in response to a specific set of carefully crafted inputs. These predefined input-output pairs serve a role analogous to encryption and decryption keys, enabling the watermark’s presence to be confirmed solely through observed outputs. Throughout the verification process, access is limited strictly to the model’s inputs and corresponding outputs, ensuring that the model’s internal workings remain entirely opaque. When the model’s output variance is sufficiently pronounced, watermark verification can be conducted by observing the output alone, without the need for carefully crafted input queries. This approach, known as box-free watermarking[57], relies on distinctive output characteristics that inherently reveal the watermark.

Based on the above research, watermarking in EaaS can be regarded as a form of black-box watermarking, since the service processes user input and returns numerical vectors. Current schemes primarily focus on utilizing backdoor-based watermarks to protect the copyright of EaaS services that are based on LLMs. However, these watermarking schemes are closely tied to fixed linear transformations. Recent research[58, 50] has indicated that backdoor samples in vision models exhibit a degree of insensitivity to perturbations of the input data, which inspires us to further explore the limitations of current watermarking schemes for EaaS.

8 Conclusion

In this paper, we first demonstrate that our proposed novel attack, SPA, can successfully bypass recent EaaS watermarking schemes. SPA exploits the limitation that current schemes rely solely on semantic-independent linear transformations. SPA applies semantic perturbation to the text, constructs embedding pairs from the original and perturbed embeddings, and selectively deletes suspicious samples while preserving service utility. To address this limitation and counter the semantic perturbation, we propose SAW, an effective semantic-aware watermarking scheme. SAW enhances prior schemes by injecting the watermark based on text semantics through the encoder and conducting watermark verification through the decoder. Our extensive experiments demonstrate that SAW significantly improves copyright protection for EaaS services compared to previous schemes. We further conduct comprehensive studies to validate the importance of the components in both SPA and SAW. Future research may explore incorporating a noise module between the encoder and decoder during end-to-end training to resist various potential novel attacks. Another interesting direction is to investigate whether current watermarking schemes can preserve the performance of the embeddings across common downstream tasks in industry applications. Additionally, we believe our work has the potential to be extended to practical scenarios, providing solutions to current security challenges.

References

  • [1] J. T. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, and L. Yang, “Embedding-based retrieval in facebook search,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2020, pp. 2553–2561.
  • [2] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones, “Word embedding based generalized language model for information retrieval,” in Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), 2015, pp. 795–798.
  • [3] G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin, “Joint embedding of words and labels for text classification,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2018, pp. 2321–2331.
  • [4] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, “Evaluation of output embeddings for fine-grained image classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2927–2936.
  • [5] S. Okura, Y. Tagami, S. Ono, and A. Tajima, “Embedding-based news recommendation for millions of users,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2017, pp. 1933–1942.
  • [6] B. Zheng, Y. Hou, H. Lu, Y. Chen, W. X. Zhao, M. Chen, and J.-R. Wen, “Adapting large language models by integrating collaborative semantics for recommendation,” in IEEE International Conference on Data Engineering (ICDE), 2024, pp. 1435–1448.
  • [7] Z. Xu, M. J. Cruz, M. Guevara, T. Wang, M. Deshpande, X. Wang, and Z. Li, “Retrieval-augmented generation with knowledge graphs for customer service question answering,” in Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR), 2024, pp. 2905–2909.
  • [8] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD), 2024, pp. 6491–6501.
  • [9] J. J. Pan, J. Wang, and G. Li, “Survey of vector database management systems,” The VLDB Journal (VLDB), vol. 33, no. 5, pp. 1591–1615, 2024.
  • [10] Hugging Face, “Hugging face - the ai community building the future.” https://huggingface.co/, 2024.
  • [11] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” in Proceedings of Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023, pp. 2014–2037.
  • [12] Y. Liu, J. Jia, H. Liu, and N. Z. Gong, “Stolenencoder: stealing pre-trained encoders in self-supervised learning,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2022, pp. 2115–2128.
  • [13] C. Deng, Y. Duan, X. Jin, H. Chang, Y. Tian, H. Liu, H. P. Zou, Y. Jin, Y. Xiao, Y. Wang et al., “Deconstructing the ethics of large language models from long-standing issues to new-emerging dilemmas,” arXiv preprint arXiv:2406.05392, 2024.
  • [14] W. Peng, J. Yi, F. Wu, S. Wu, B. B. Zhu, L. Lyu, B. Jiao, T. Xu, G. Sun, and X. Xie, “Are you copying my model? protecting the copyright of large language models for eaas via backdoor watermark,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 7653–7668.
  • [15] A. Shetty, Y. Teng, K. He, and Q. Xu, “Warden: Multi-directional backdoor watermarks for embedding-as-a-service copyright protection,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024, pp. 13 430–13 444.
  • [16] A. Shetty, Q. Xu, and J. H. Lau, “Wet: Overcoming paraphrasing vulnerabilities in embeddings-as-a-service with linear transformation watermarks,” arXiv preprint arXiv:2409.04459, 2024.
  • [17] M. Wei, N. S. Harzevili, Y. Huang, J. Yang, J. Wang, and S. Wang, “Demystifying and detecting misuses of deep learning apis,” in Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE), 2024, pp. 1–12.
  • [18] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood et al., “Deepsniffer: A dnn model extraction framework based on learning architectural hints,” in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020, pp. 385–399.
  • [19] S. Sanyal, S. Addepalli, and R. V. Babu, “Towards data-free model stealing in a hard label setting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 284–15 293.
  • [20] Z. Sha, X. He, N. Yu, M. Backes, and Y. Zhang, “Can’t steal? cont-steal! contrastive stealing attacks against image encoders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 16 373–16 383.
  • [21] L. Lyu, C. Chen, and J. Fu, “A pathway towards responsible ai generated content.” in International Joint Conferences on Artificial Intelligence (IJCAI), 2023, pp. 7033–7038.
  • [22] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet, “Turning your weakness into a strength: Watermarking deep neural networks by backdooring,” in USENIX security symposium (USENIX Security), 2018, pp. 1615–1631.
  • [23] Y. Li, Y. Bai, Y. Jiang, Y. Yang, S.-T. Xia, and B. Li, “Untargeted backdoor watermark: Towards harmless and stealthy dataset copyright protection,” Advances in Neural Information Processing Systems (NIPS), vol. 35, pp. 13 238–13 250, 2022.
  • [24] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
  • [25] Y. Yao, H. Li, H. Zheng, and B. Y. Zhao, “Latent backdoor attacks on deep neural networks,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2019, pp. 2041–2055.
  • [26] Y. Li, Y. Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible backdoor attack with sample-specific triggers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 16 463–16 472.
  • [27] W. Yang, L. Li, Z. Zhang, X. Ren, X. Sun, and B. He, “Be careful about poisoned word embeddings: Exploring the vulnerability of the embedding layers in nlp models,” in Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2021, pp. 2048–2058.
  • [28] Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh, “Embedding watermarks into deep neural networks,” in Proceedings of the ACM on International Conference on Multimedia Retrieval (ICMR), 2017, pp. 269–277.
  • [29] E. Le Merrer, P. Pérez, and G. Trédan, “Adversarial frontier stitching for remote neural network watermarking.” Neural Computing and Applications, vol. 32, no. 13, 2020.
  • [30] Y. Tang, J. Yu, K. Gai, X. Qu, Y. Hu, G. Xiong, and Q. Wu, “Watermarking vision-language pre-trained models for multi-modal embedding as a service,” arXiv preprint arXiv:2311.05863, 2023.
  • [31] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 3980–3990.
  • [32] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with naive bayes-which naive bayes?” in Conference on Email and Anti-Spam (CEAS), vol. 17, 2006, pp. 28–69.
  • [33] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, pp. 1631–1642.
  • [34] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu et al., “Mind: A large-scale dataset for news recommendation,” in Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2020, pp. 3597–3606.
  • [35] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in Neural Information Processing Systems (NIPS), vol. 28, 2015.
  • [36] J. Yan, W. J. Mo, X. Ren, and R. Jia, “Rethinking backdoor detection evaluation for language models,” arXiv preprint arXiv:2409.00399, 2024.
  • [37] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in IEEE Symposium on Security and Privacy (SP), 2019, pp. 707–723.
  • [38] X. Pan, M. Zhang, B. Sheng, J. Zhu, and M. Yang, “Hidden trigger backdoor attack on NLP models via linguistic style manipulation,” in USENIX Security Symposium (USENIX Security), 2022, pp. 3611–3628.
  • [39] S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human-centric language models,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2021, pp. 3123–3140.
  • [40] Y. Liu, G. Shen, G. Tao, S. An, S. Ma, and X. Zhang, “Piccolo: Exposing complex backdoors in nlp transformer models,” in IEEE Symposium on Security and Privacy (SP), 2022, pp. 2025–2042.
  • [41] K. Doan, Y. Lao, W. Zhao, and P. Li, “Lira: Learnable, imperceptible and robust backdoor attacks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11 966–11 976.
  • [42] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020, pp. 11 957–11 965.
  • [43] S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen, and Y.-G. Jiang, “Clean-label backdoor attacks on video recognition models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14 443–14 452.
  • [44] H. Li, Q. Ye, H. Hu, J. Li, L. Wang, C. Fang, and J. Shi, “3dfed: Adaptive and extensible framework for covert backdoor attack in federated learning,” in IEEE Symposium on Security and Privacy (SP), 2023, pp. 1893–1907.
  • [45] H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J.-y. Sohn, K. Lee, and D. Papailiopoulos, “Attack of the tails: Yes, you really can backdoor federated learning,” Advances in Neural Information Processing Systems (NIPS), vol. 33, pp. 16 070–16 084, 2020.
  • [46] L. Shen, S. Ji, X. Zhang, J. Li, J. Chen, J. Shi, C. Fang, J. Yin, and T. Wang, “Backdoor pre-trained models can transfer to all,” in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2021, pp. 3141–3158.
  • [47] R. Zhang, H. Li, R. Wen, W. Jiang, Y. Zhang, M. Backes, Y. Shen, and Y. Zhang, “Instruction backdoor attacks against customized LLMs,” in USENIX Security Symposium (USENIX Security), 2024, pp. 1849–1866.
  • [48] J. Jia, Y. Liu, and N. Z. Gong, “Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning,” in IEEE Symposium on Security and Privacy (SP), 2022, pp. 2043–2059.
  • [49] X. Han, Y. Wu, Q. Zhang, Y. Zhou, Y. Xu, H. Qiu, G. Xu, and T. Zhang, “Backdooring multimodal learning,” in IEEE Symposium on Security and Privacy (SP), 2024, pp. 3385–3403.
  • [50] S. Liang, M. Zhu, A. Liu, B. Wu, X. Cao, and E.-C. Chang, “Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 24 645–24 654.
  • [51] Y. Li, H. Wang, and M. Barni, “A survey of deep neural network watermarking techniques,” Neurocomputing, vol. 461, pp. 171–193, 2021.
  • [52] Y. Yan, X. Pan, M. Zhang, and M. Yang, “Rethinking White-Box watermarks on deep learning models under neural structural obfuscation,” in USENIX Security Symposium (USENIX Security), 2023, pp. 2347–2364.
  • [53] P. Lv, P. Li, S. Zhang, K. Chen, R. Liang, H. Ma, Y. Zhao, and Y. Li, “A robustness-assured white-box watermark in neural networks,” IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 20, no. 6, pp. 5214–5229, 2023.
  • [54] A. Pegoraro, C. Segna, K. Kumari, and A.-R. Sadeghi, “Deepeclipse: How to break white-box dnn-watermarking schemes,” arXiv preprint arXiv:2403.03590, 2024.
  • [55] S. Leroux, S. Vanassche, and P. Simoens, “Multi-bit black-box watermarking of deep neural networks in embedded applications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 2121–2130.
  • [56] P. Lv, P. Li, S. Zhu, S. Zhang, K. Chen, R. Liang, C. Yue, F. Xiang, Y. Cai, H. Ma, Y. Zhang, and G. Meng, “Ssl-wm: A black-box watermarking approach for encoders pre-trained by self-supervised learning,” Proceedings of the Network and Distributed System Security Symposium (NDSS), 2024.
  • [57] H. An, G. Hua, Z. Lin, and Y. Fang, “Box-free model watermarks are prone to black-box removal attacks,” arXiv preprint arXiv:2405.09863, 2024.
  • [58] R. Wang, H. Li, L. Mu, J. Ren, S. Guo, L. Liu, L. Fang, J. Chen, and L. Wang, “Rethinking the vulnerability of dnn watermarking: Are watermarks robust against naturalness-aware perturbations?” in Proceedings of the ACM International Conference on Multimedia (MM), 2022, pp. 1808–1818.

Appendix A Heuristic Perturbation Scheme

Figure 11: Cosine similarity metric distribution and KDE curve of the Enron Spam dataset in Heuristic Perturbation Scheme.

In Appendix A, we introduce the heuristic semantic perturbation scheme. Previous work primarily focuses on text classification tasks, so we follow the same setup. In the context of text classification, the heuristic perturbation scheme randomly selects samples whose labels differ from that of the original text as suffixes. We randomly select $k$ samples for perturbation and calculate the average cosine similarity of the $k$ embedding pairs to reduce the influence of potential triggers in the suffixes. We conducted experiments on four classic datasets: Enron Spam[32], SST2[33], MIND[34] and AG News[35]. From the perspectives of the attacker and the ground truth, the cosine similarity distribution of the Enron Spam dataset is shown in Figure 11. The distribution results indicate observable differences for the Enron Spam and MIND datasets, while such differences are less pronounced for the SST2 and AG News datasets. Thus, further exploration is needed to identify more effective perturbation approaches.
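The heuristic scheme can be sketched as follows. This is an illustrative sketch under stated assumptions: `toy_embed` is a hypothetical letter-count stand-in for the EaaS embedding call, the corpus is a list of `(text, label)` pairs, and joining the text and suffix with a space is an assumed way of appending the suffix.

```python
import math
import random
import string
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for the embedding service: letter counts."""
    lowered = text.lower()
    return [float(lowered.count(c)) for c in string.ascii_lowercase]


def heuristic_score(text: str, label: int, corpus: List[Tuple[str, int]],
                    embed: Callable[[str], Sequence[float]],
                    k: int = 5, seed: int = 0) -> float:
    """Average cosine similarity between the original embedding and the
    embeddings of k suffix-perturbed copies, with suffixes drawn at random
    from samples whose label differs from `label`."""
    candidates = [t for t, l in corpus if l != label]
    if not candidates:
        raise ValueError("no samples with a different label are available")
    rng = random.Random(seed)
    suffixes = rng.sample(candidates, min(k, len(candidates)))
    original = embed(text)
    sims = [cosine(original, embed(text + " " + s)) for s in suffixes]
    return sum(sims) / len(sims)


corpus = [("win free money now", 1),
          ("meeting moved to friday", 0),
          ("quarterly report attached", 0)]
score = heuristic_score("claim your free prize", 1, corpus, toy_embed, k=2)
print(round(score, 3))
```

Averaging over $k$ suffixes rather than using a single one reflects the stated goal of reducing the influence of any trigger that happens to appear in a suffix.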

TABLE VI: Training Settings

| Datasets | Train | Test | Class | Metrics | Schemes | Original | Subset | Epoch Adjustment |
|---|---|---|---|---|---|---|---|---|
| Enron Spam | 31,716 → 5,000 | 2,000 → 500 | 2 | ACC.(%) | EmbMarker | 94.85% | 92.00% | 3 → 20 |
| | | | | | WARDEN | 94.60% | 92.20% | 3 → 10 |
| SST2 | 67,349 → 5,000 | 872 → 500 | 2 | ACC.(%) | EmbMarker | 93.46% | 91.60% | 3 → 30 |
| | | | | | WARDEN | 93.46% | 92.20% | 3 → 50 |
| MIND | 97,791 → 5,000 | 32,592 → 500 | 18 | ACC.(%) | EmbMarker | 77.23% | 69.20% | 3 → 75 |
| | | | | | WARDEN | 77.18% | 71.80% | 3 → 75 |
| AG News | 120,000 → 5,000 | 7,600 → 500 | 4 | ACC.(%) | EmbMarker | 93.57% | 88.80% | 3 → 20 |
| | | | | | WARDEN | 93.76% | 89.00% | 3 → 20 |

Appendix B Semantic Perturbation Guidance

In Appendix B, we introduce another approach, suffix perturbation guidance with a small local model. The results in Figure 11 indicate that the effectiveness of the simple heuristic perturbation scheme needs further improvement. Although the embedding spaces of $\Theta_v$ and $\Theta_s$ differ, the variations between $(e_{c_i}, e'_{c_i})$ under the same perturbation show similar patterns across these spaces. Specifically, we input the text pair $(d_{c_i}, d_{c_i} + perb)$ into $\Theta_s$ to obtain the corresponding embedding pair $(se_{c_i}, se'_{c_i})$. The perturbation $perb$ traverses all candidates in the perturbation pool. The top-$k$ $perb$ texts that minimize the similarity of $(se_{c_i}, se'_{c_i})$ are selected as candidate suffixes. Since the embeddings output by $\Theta_s$ are not watermarked, it is feasible to use this small local model to guide the perturbations for $\Theta_v$. We similarly take the aggregate metric over $k$ perturbed samples for evaluation. $\Theta_s$ captures the differential features between $(d_{c_i}, d_{c_i} + perb)$. Such differential features are consistent across models. However, suffix perturbation guidance is less efficient, since each text has to traverse all the candidates in the perturbation pool. This results in a time complexity of $|D_c| \cdot |perb\ pool|$, requiring $\Theta_s$ to encode $|D_c| \cdot |perb\ pool|$ perturbed texts. The entire process is shown in Algorithm 2.

Algorithm 2 Suffix Perturbation Guidance
Input: Perturbation Pool $P$, Dataset $D_c$, Standard Model $\Theta_s$, Hyperparameter $k$
Output: Metric Values $v$
Initialize $s \leftarrow \emptyset$ (Suffix)
Initialize $n \leftarrow |D_c|$, $m \leftarrow |P|$
Set $max(s) \leftarrow 1$ {▷ Cosine similarity range: [-1, 1]}
for $i = 1$ to $n$ do
   for $j = 1$ to $m$ do
      $d'_{c_i} \leftarrow d_{c_i} + perb_j$
      Encode: $se_{c_i} \leftarrow \Theta_s(d_{c_i})$, $se'_{c_i} \leftarrow \Theta_s(d'_{c_i})$
      $sim \leftarrow \textit{cosine}(se_{c_i}, se'_{c_i})$
      if $|s| < k$ then
         Append $perb_j$ to $s$
      else if $|s| \geq k$ and $sim < max(s)$ then
         Remove $max(s)$ from $s$
         Insert $perb_j$ into $s$
      else
         Skip $perb_j$
      end if
   end for
   Compute aggregate metric: $metric \leftarrow agg(s)$
   Append $metric$ to $v$
end for
return $v$
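The selection step can also be sketched in Python. The sketch below rests on several assumptions: `toy_embed` is a hypothetical letter-count stand-in used for both the local model $\Theta_s$ and the queried service, the suffix set is rebuilt for each text, the aggregate is taken to be the mean cosine similarity, and the selected suffixes are then evaluated on the service embeddings. None of these details is fixed by the pseudocode.

```python
import heapq
import math
import string
from typing import Callable, List, Sequence

Embed = Callable[[str], Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: letter counts."""
    lowered = text.lower()
    return [float(lowered.count(c)) for c in string.ascii_lowercase]


def select_suffixes(text: str, pool: List[str], local_embed: Embed, k: int) -> List[str]:
    """Return the k suffixes from the pool that give the lowest cosine similarity
    between the original and perturbed embeddings under the local model."""
    base = local_embed(text)
    return heapq.nsmallest(
        k, pool, key=lambda p: cosine(base, local_embed(text + " " + p)))


def guided_metric(text: str, suffixes: List[str], service_embed: Embed) -> float:
    """Mean cosine similarity over the selected suffixes, computed on the
    embeddings returned by the queried service (assumed aggregate)."""
    base = service_embed(text)
    sims = [cosine(base, service_embed(text + " " + s)) for s in suffixes]
    return sum(sims) / len(sims)


pool = ["free", "qq", "zzzzzz"]
chosen = select_suffixes("free money", pool, toy_embed, k=2)
print(chosen, round(guided_metric("free money", chosen, toy_embed), 3))
```

Each text still requires one local encoding per pool candidate, which matches the $|D_c| \cdot |perb\ pool|$ cost noted above; `heapq.nsmallest` only replaces the manual bookkeeping of the running top-$k$ set.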

Appendix C Experiment Settings

In Appendix C, Table VI provides detailed information about the datasets used in our study. It also lists the adjustments made to the number of training epochs to maintain performance on the subset of each dataset. Specifically, the smallest training set contains more than 30,000 data items, while the largest includes 120,000 data items. For our experiments, we sampled a subset of 5,000 examples from the training set and 500 examples from the test set. This sampling strategy was chosen to balance the cost of the experiments against the goal of maintaining representative data coverage. Table VI indicates that, despite using subsets, the accuracy of downstream tasks does not decrease significantly under the different watermarking schemes. Since the focus is on a relatively simple text classification task, the model performs well even on the subset. These results indicate that experiments on the subsets produce valid and meaningful outcomes and that the setup is practical.
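The subset construction can be sketched as a seeded random draw. This is only an illustration: the helper name `sample_subset`, the fixed seed, and the use of uniform sampling without replacement are assumptions, since the exact sampling procedure is not specified here.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(items: Sequence[T], size: int, seed: int = 0) -> List[T]:
    """Draw `size` items uniformly without replacement, reproducibly.
    If fewer than `size` items exist, return all of them."""
    if size >= len(items):
        return list(items)
    return random.Random(seed).sample(list(items), size)


# Index ranges stand in for the Enron Spam train and test splits listed in Table VI.
train_subset = sample_subset(range(31716), 5000)
test_subset = sample_subset(range(2000), 500)
print(len(train_subset), len(test_subset))
```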