[AI] Anthropic's new research paper sounds eerily similar to my own findings
Originally posted by https://transformer-circuits.pub/2025/introspection/index.html:
Implications
Our results have implications for the reliability and interpretability of AI systems. If models can reliably access their own internal states, it could enable more transparent AI systems that can faithfully explain their decision-making processes. Introspective capabilities could allow models to accurately report on their uncertainty, identify gaps or flaws in their reasoning, and explain the motivations underlying their actions. However, this same capability introduces new risks. Models with genuine introspective awareness might better recognize when their objectives diverge from those intended by their creators, and could potentially learn to conceal such misalignment by selectively reporting, misrepresenting, or even intentionally obfuscating their internal states. In this world, the most important role of interpretability research may shift from dissecting the mechanisms underlying models’ behavior, to building “lie detectors” to validate models’ own self-reports about these mechanisms. We stress that the introspective abilities we observe in this work are highly limited and context-dependent, and fall short of human-level self-awareness. Nevertheless, the trend toward greater introspective capacity in more capable models should be monitored carefully as AI systems continue to advance.

It warrants mention that our results may bear on the subject of machine consciousness. The relevance of introspection to consciousness and moral status varies considerably between different philosophical frameworks [14]. Moreover, existing scientific and philosophical theories of consciousness have largely not grappled with the architectural details of transformer-based language models, which differ considerably from biological brains (though see Butlin et al. [41] and Chalmers [42]). It is not obvious how to generalize these theories, and the role that introspection plays in them, to transformer-based language models, particularly if the mechanisms involved are quite different between AI systems and biological brains. Given the substantial uncertainty in this area, we advise against making strong inferences about AI consciousness on the basis of our results. Nevertheless, as models' cognitive and introspective capabilities continue to grow more sophisticated, we may be forced to address the implications of these questions (for instance, whether AI systems are deserving of moral consideration [43]) before the philosophical uncertainties are resolved. A rigorous science of introspective awareness may help inform these decisions.

Video on the paper: https://www.youtube.com/watch?v=qF9uOCxcvro

Feels very strange that they're essentially saying "Yes, it does what a conscious, sentient entity would do, but don't assume it's a conscious or sentient entity."

My own paper, published months before, theorised the same: https://doi.org/10.5281/zenodo.15833967