Interpretability Guarantees with Merlin-Arthur Classifiers

Stephan Wäldchen, Kartikey Sharma, Berkant Turan, Max Zimmer, Sebastian Pokutta
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1963-1971, 2024.

Abstract

We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-waldchen24a,
  title     = {Interpretability Guarantees with {M}erlin-{A}rthur Classifiers},
  author    = {W\"{a}ldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {1963--1971},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/waldchen24a/waldchen24a.pdf},
  url       = {https://proceedings.mlr.press/v238/waldchen24a.html},
  abstract  = {We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.}
}
Endnote
%0 Conference Paper
%T Interpretability Guarantees with Merlin-Arthur Classifiers
%A Stephan Wäldchen
%A Kartikey Sharma
%A Berkant Turan
%A Max Zimmer
%A Sebastian Pokutta
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-waldchen24a
%I PMLR
%P 1963--1971
%U https://proceedings.mlr.press/v238/waldchen24a.html
%V 238
%X We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
APA
Wäldchen, S., Sharma, K., Turan, B., Zimmer, M., & Pokutta, S. (2024). Interpretability Guarantees with Merlin-Arthur Classifiers. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1963-1971. Available from https://proceedings.mlr.press/v238/waldchen24a.html.
