
Unraveling the Mystery of the Voynich Manuscript

“I will prove to the world that the black magic of the Middle Ages consisted of discoveries far in advance of twentieth-century science.” — Wilfrid Michael Voynich

It is my time to dive into the Voynich manuscript, a mystic, fantastical book filled with cryptic symbols, intricate diagrams, botanical illustrations, and nude ladies in pools of liquid. The book has puzzled historians, cryptographers, and linguists for centuries. Despite numerous attempts, no one has definitively deciphered this medieval codex. Recently, Professor Eleonora Matarrese from the “Aldo Moro” University in Bari claimed to have cracked the code, though her findings remain a subject of debate among scholars. The reality is that the manuscript is a dense multimodal language structure, made by hand and incredibly fascinating.

Discovered in the early 20th century by a Polish book dealer, Wilfrid Voynich, the manuscript’s origins date back to the 15th century, believed to have been written by an erudite with deep medical knowledge in northern Italy. Its language—dubbed Voynichese—has remained indecipherable for centuries, despite countless attempts to crack its code.

I am traveling these days. While on the train I enter a completely different state of mind and can look at the same problems from a different angle. It is my chance to improve my understanding of semantics and artificial intelligence.

The complex nature of the Voynich manuscript, with its mysterious language and diverse content, presents an ideal challenge for advanced AI techniques, which excel at finding patterns in large, complex datasets.

The manuscript comprises about 240 vellum pages (and some seem to be missing). It includes sections covering topics such as botany, astronomy, biology, and alchemy. The plant drawings suggest it might be a herbal, while other sections feature strange astrological charts as well as nude bodies immersed in an articulated network of bathtubs. This diverse content, spanning multiple disciplines and featuring both text and images, offers a unique opportunity for AI-based analysis. By leveraging state-of-the-art language models, we can approach the manuscript’s mysteries from multiple angles, potentially uncovering patterns and connections that have eluded human researchers for centuries.

Umberto Eco and the Voynich Manuscript: The Scholar’s Fascination

“The Voynich MS was an early attempt to construct an artificial or universal language of the a priori type.” — Friedman

I was captivated by the fact that the celebrated Italian novelist and semiotician Umberto Eco, renowned for his work on interpretation and signs, had a particular interest in the Voynich manuscript. The story goes that when Eco arrived at Yale University, one of his first questions was about the Voynich manuscript housed in the university’s Beinecke Rare Book & Manuscript Library. For Eco, the manuscript embodied a central theme he explored throughout his career: the idea that a text is open to infinite interpretations.

Eco didn’t believe the manuscript was necessarily meant to be deciphered; perhaps it was created as a riddle or even a hoax. Yet its allure, much like that of the secret books in his famous novel The Name of the Rose, lies in its resistance to understanding. The difficulty of solving its mystery only adds to its mythical aura.

Transformers as AI’s Connective Tissue

For me, the Voynich manuscript represents an opportunity to experiment with transfer learning and transformers. Transformers, originally designed for natural language processing tasks, have redefined deep learning. Their attention mechanism allows them to focus on the most relevant parts of the data, enabling them to understand context and meaning in a way that traditional models cannot.
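To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside a transformer. The toy matrices are placeholders for illustration, not anything from the actual Voynich pipeline:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # how relevant each key is to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return weights @ V                                   # weighted mix of the values

# toy example: 3 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Each output row is a convex combination of the value vectors, weighted by how strongly that token "attends" to every other token; this is what lets the model pick out the most relevant parts of the input.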

But transformers aren’t limited to text. Their attention mechanisms make them versatile across any dataset where relationships matter—whether that means processing language, recognizing patterns in DNA, analyzing images, decoding elephants’ greeting rumbles, or deciphering the Voynich manuscript.

By capturing long-range dependencies and hidden structures in data, transformers can reveal connections that are otherwise invisible. This makes them the perfect tool for uncovering the secrets of cryptic texts or any complex, structured data.

Transformers, with their power to focus on what matters most, offer a valuable approach to solving problems across various domains—not just in language, but in any field where deep, underlying patterns can unlock new insights.

We will soon be able, much like Saint Francis, to “speak” with birds, wolves, elephants, and the entire animal kingdom. A similar approach to what I have in mind for deciphering our mythical manuscript has already been applied to understanding the communication of elephants, fish, bats, and more. The Earth Species Project is already using advanced AI to decode animal languages, offering a glimpse into how similar techniques could unlock the mysteries of the Voynich manuscript.

Mapping Semantics: Cracking (or at least trying) the Voynich Code with Neuro-Symbolic AI

My approach to decoding the Voynich manuscript leverages artificial intelligence techniques to unravel its mysteries. At the core of this method are “transformer models,” powerful AI systems that excel at understanding language patterns across different contexts.

I start by using “transfer learning,” which applies knowledge from AI models trained on many languages to analyze the Voynich text. This is like having a linguist who knows many languages examine the manuscript. By comparing the Voynich text to languages like Latin, Italian, Hebrew, and German, we can identify similarities that might hint at its origin or meaning.

The current implementation focuses on extracting and comparing embeddings – numerical representations of words or tokens – from the Voynich manuscript and known languages (I will focus on Italian first). By finding the nearest neighbors of Voynich tokens in other languages, we can start to map potential semantic relationships and structures.
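The nearest-neighbor step above can be sketched as follows. This is a simplified stand-in, not the actual implementation: the random matrices take the place of embeddings extracted from a multilingual model, and the Italian subword tokens are the ones mentioned later in this post:

```python
import numpy as np

def nearest_neighbors(voynich_emb, italian_emb, italian_tokens, k=3):
    """For each Voynich token embedding, return the k closest Italian tokens by cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    v = voynich_emb / np.linalg.norm(voynich_emb, axis=1, keepdims=True)
    it = italian_emb / np.linalg.norm(italian_emb, axis=1, keepdims=True)
    sims = v @ it.T                                  # (n_voynich, n_italian) similarity matrix
    top = np.argsort(-sims, axis=1)[:, :k]           # indices of the k best matches per row
    return [[italian_tokens[j] for j in row] for row in top]

# placeholder embeddings standing in for vectors from a multilingual model
rng = np.random.default_rng(42)
voynich_emb = rng.normal(size=(5, 8))                # 5 Voynich tokens, 8-dim vectors
italian_emb = rng.normal(size=(6, 8))                # 6 Italian subword tokens
italian_tokens = ["##ct", "##co", "e", "##gna", "come", "z"]
print(nearest_neighbors(voynich_emb, italian_emb, italian_tokens))
```

In the real pipeline the embedding matrices come from running both corpora through the same pretrained model, so that tokens from the two languages live in a shared vector space.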

Looking ahead, I also aimed to incorporate more advanced techniques like “sparse autoencoders” (SAEs) to break the complex language patterns down into simpler, more understandable parts called “monosemantic features”—essentially, the building blocks of meaning in the text. By isolating these key elements, I hoped to more easily compare them to known languages and concepts, potentially revealing deeper insights into the manuscript’s content and structure.

This combination of AI-driven pattern recognition and linguistic analysis could open new avenues for understanding the Voynich manuscript. By systematically mapping the Voynich language to other known languages and eventually breaking it down into its most basic meaningful units, we might come closer to unlocking the secrets of this centuries-old mystery.

Visualizing Language Relationships: Early Findings with Transfer Learning and Transliteration Insights

A first AI-driven analysis of the Voynich manuscript, compared with 15th-century Italian text from the “Fasciculo de medicina,” has produced some intriguing results, visible in the t-SNE diagram. This visualization reveals distinct clusters for Voynich (blue) and Italian (orange) tokens, with areas of overlap highlighted in red. The Voynich tokens form a continuous, figure-eight shape, suggesting an internal structure or pattern, while Italian tokens appear more scattered across several clusters, reflecting the diversity of word structures in the language.
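A diagram like this can be produced with scikit-learn’s t-SNE. The sketch below uses random stand-in matrices in place of the real token embeddings (the cluster sizes and dimensions are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in embeddings: in the real pipeline these come from a multilingual model
rng = np.random.default_rng(0)
voynich_emb = rng.normal(size=(50, 16))
italian_emb = rng.normal(loc=0.5, size=(50, 16))

all_emb = np.vstack([voynich_emb, italian_emb])
# project the joint embedding space down to 2-D for plotting
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(all_emb)

voynich_xy, italian_xy = coords[:50], coords[50:]   # the blue vs orange points in the figure
print(coords.shape)  # (100, 2)
```

Note that t-SNE preserves local neighborhoods rather than global distances, so overlap between the two point clouds is suggestive of similarity but not proof of it.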

Single Voynich characters often relate to parts of Italian words or subword tokens; for instance, ‘A’ closely matches the Italian ‘##ct’, while ‘1’ correlates with several elements including ‘##co’, ‘e’, ‘##gna’, and ‘come’. This suggests a possible syllabic or logographic writing system for Voynich, rather than a simple alphabetic one. Notably, characters like ‘1’ and ‘2’ show proximity to multiple Italian tokens, indicating versatile usage within the Voynich script. The character ‘2’, for example, relates to both ‘z’ and the [SEP] token, potentially serving a structural role in text separation.

There are multiple options for transliteration, as different transcription alphabets have been created to convert Voynich characters into Latin characters, such as the Extensible (originally: European) Voynich Alphabet (EVA). I am using a refined variation, one of the latest transliteration files, called Voynich RF1, the so-called “Reference transliteration”. This transliteration adds another layer of complexity to our analysis.

This system uses uppercase Latin letters (A-Z) and numbers (1-5) to represent Voynich characters, with some special characters like ‘a’ possibly representing variations. For example, a Voynich text segment might appear as “P2A3K1A2C2A2Q1A3B2A3C1AaQ2A3G1L1,” where numbers indicate repetition of the preceding character.
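Implementing the repetition convention just described is a small exercise in run-length decoding. This sketch simply encodes the stated rule (a digit repeats the preceding character; letters without a digit stand alone); it is an illustration, not part of the RF1 tooling:

```python
import re

def expand_transliteration(s):
    """Expand a run-length transliteration like 'P2A3' -> 'PPAAA',
    where a digit repeats the preceding character (the convention described above)."""
    out = []
    for char, count in re.findall(r"([A-Za-z])(\d?)", s):
        out.append(char * (int(count) if count else 1))
    return "".join(out)

print(expand_transliteration("P2A3K1A2C2A2Q1A3B2A3C1AaQ2A3G1L1"))
```

Working on the expanded form makes character-frequency counts comparable to those of a plain-text corpus, at the cost of discarding the compact notation.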

This transliteration method, while standardizing the Voynich text for analysis, also abstracts it from its original graphical form, potentially obscuring visual patterns that might be present in the original manuscript.

Our t-SNE diagram captures these transliterated tokens and their relationships to Italian subwords and full words. The proximity of Voynich characters to multiple Italian elements in the diagram visually represents the complex web of potential linguistic connections we’ve uncovered. For instance, the multiple appearances of ‘1’ near different Italian tokens in the plot corroborates our finding of its versatile usage. The diagram also shows some Voynich-Italian pairs in isolated regions, such as ‘C – v’ and ‘A – fl’, suggesting unique relationships that warrant further investigation.

These patterns, while promising, underscore the complexity of the Voynich script and the challenges in decipherment. The relationships uncovered aren’t one-to-one mappings but a nuanced network of potential connections, complicated by the noisy nature of our historical Italian text, which includes OCR errors and archaic language forms. It’s important to note that while we’ve found intriguing parallels with Italian, this doesn’t necessarily mean the Voynich manuscript is written in an Italian-based code. These similarities could indicate a broader relationship with Romance languages or reflect more universal linguistic patterns.

As we delve deeper into frequency analysis and contextual examination, expanding to other languages and advanced AI techniques like sparse autoencoders, I can see how we could build a more comprehensive understanding of the Voynich script’s structure and possible meanings.

Our next steps include analyzing the frequency of these Voynich characters in the manuscript compared to the frequency of their Italian matches, and examining the contexts in which these characters appear to see if they align with the usage of their Italian counterparts. By systematically mapping the Voynich language to other known languages and eventually breaking it down into its most basic meaningful units, we might come closer to unlocking the secrets of this mystery. It remains a linguistic enigma, but we can see potential relationships emerging from the two historical manuscripts.
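The frequency-analysis step is straightforward to prototype. The two sample strings below are toy stand-ins for the transliterated Voynich text and the Italian source, not excerpts from either corpus:

```python
from collections import Counter

def relative_frequencies(text, alphabet=None):
    """Relative frequency of each character; `alphabet` optionally restricts which characters count."""
    counts = Counter(c for c in text if alphabet is None or c in alphabet)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# toy corpora standing in for the transliterated Voynich text and the Italian source
voynich_sample = "PAKACAQABACAQAG"
italian_sample = "nelmezzodelcammin"

v_freq = relative_frequencies(voynich_sample)
i_freq = relative_frequencies(italian_sample)

# compare by frequency rank rather than raw rate, since the corpora differ in length
v_ranked = sorted(v_freq, key=v_freq.get, reverse=True)
i_ranked = sorted(i_freq, key=i_freq.get, reverse=True)
print(v_ranked[:3], i_ranked[:3])
```

Comparing rank orders (most frequent Voynich character against most frequent Italian character, and so on) is the classical starting point for testing a substitution hypothesis.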

Challenges Encountered with the SAE Model

I finally decided to try a Sparse Autoencoder (SAE), a special type of AI model designed to find the most essential patterns in the data by learning a compressed, simplified representation of it. Unlike our earlier approach—where the model acted like a knowledgeable translator, comparing the Voynich manuscript to Italian—the SAE works differently. It tries to automatically discover the core features of the text, forcing the model to focus on the most important elements while ignoring less relevant details. This is done by limiting how many neurons in the model can activate at once, which helps it learn a “sparse” representation of the data.
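For readers unfamiliar with the architecture, here is a minimal PyTorch sketch of a sparse autoencoder with an L1 penalty on the hidden activations. The dimensions, penalty weight, and random inputs are illustrative assumptions, not the configuration actually used on the Voynich embeddings:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder whose L1 penalty pushes most hidden activations to zero."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # non-negative feature activations
        return self.decoder(h), h

# toy training loop on stand-in data (the real inputs were Voynich token embeddings)
torch.manual_seed(0)
x = torch.randn(256, 32)                  # 256 tokens, 32-dim embeddings
model = SparseAutoencoder(32, 128)        # overcomplete: more features than input dims
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3                          # strength of the sparsity penalty

for _ in range(200):
    recon, h = model(x)
    loss = ((recon - x) ** 2).mean() + l1_weight * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The reconstruction term keeps the compressed representation faithful, while the L1 term limits how many features fire at once; the hope is that each surviving feature ends up monosemantic.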

The idea was that the SAE could uncover hidden structures in the Voynich manuscript, simplifying the complexity and making it easier to compare with known languages like Italian. However, the results were disappointing. Despite several attempts and adjustments, the features extracted by the SAE didn’t reveal any meaningful patterns. Both Pearson correlation and cosine similarity showed very weak relationships between the Voynich and Italian features, meaning the model didn’t identify any clear connections.

In the end, the features the SAE learned didn’t align with any interpretable linguistic structures between the two languages. As seen in the t-SNE diagram, the Voynich and Italian features still seem to exist in entirely separate spaces, like two distant worlds.

This result suggests that the current approach may not be well-suited to the complexities of the Voynich manuscript, or that the model requires further tuning. The manuscript’s enigmatic structure continues to resist clear pattern recognition with the SAE techniques applied.

In some sense, this outcome points to the need for entirely different approaches to uncover the latent semantics and unlock the manuscript’s hidden meanings.

Next Steps

As I move forward, I plan to dive deeper into frequency analysis, expanding the scope to include languages like German and Hebrew, and refining our models. By systematically mapping the Voynich language to these known tongues and breaking it down to its most fundamental elements, we may inch closer to unlocking its secrets. The manuscript remains a captivating enigma, yet hints of connections with other ancient texts are beginning to surface.

For me, this journey is a way to spend the quiet, contemplative hours of the night while crossing borders on a train, immersed in the mysteries of a long-lost semantic world.

References

  • The Unsolvable Mysteries of the Voynich Manuscript – The New Yorker
  • Voynich Manuscript – Wikipedia
  • Manoscritto di Voynich: Ecco come è stato decifrato il libro più misterioso al mondo (Voynich Manuscript: how the world’s most mysterious book was deciphered) – La Repubblica
  • Johannes de Ketham: Fasciculus Medicinae – National Library of Medicine
  • Fasciculus Medicinae (1495) – Biodiversity Heritage Library
  • AI Decoding Animal Communication – Financial Times
  • Unveiling Monosemanticity: Anthropic’s Groundbreaking Research on Large Language Models (LLMs) – Anthropic