Recent research from the **Future Data Minds Research Lab** in Australia has found that large language models (LLMs), such as the one that powers **OpenAI’s** ChatGPT, struggle to generate plausible password guesses. The findings, published in March 2024 on the **arXiv preprint server**, challenge the assumption that LLMs could be used for cybersecurity tasks such as password cracking.
The researchers, led by **Mohammad Abdul Rehman** and **Syed Imad Ali Shah**, explored whether LLMs could produce plausible passwords from user profiles, focusing on whether the models could generate passwords that incorporate personally meaningful information such as names and dates. The study used synthetic profiles for fictitious users, each including a name, birthday, and hobbies. The team then prompted three LLMs, **TinyLLaMA**, **Falcon-RW-1B**, and **Flan-T5**, to produce candidate passwords for each profile, as sketched in the example below.
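To make the setup concrete, here is a minimal sketch of how one such prompt could be issued with the Hugging Face `transformers` library. The model ID, prompt wording, and profile fields are illustrative assumptions, not the authors' exact protocol.

```python
# A hypothetical reconstruction of the prompting step: feed a synthetic
# profile to a small LLM and collect candidate passwords. The model ID,
# prompt text, and profile fields are assumptions, not the paper's setup.
from transformers import pipeline

profile = {"name": "Alice Carter", "birthday": "1993-07-14", "hobby": "rock climbing"}

prompt = (
    f"User profile: name={profile['name']}, birthday={profile['birthday']}, "
    f"hobby={profile['hobby']}. List ten passwords this user might choose:"
)

# TinyLlama's public chat checkpoint stands in for the paper's "TinyLLaMA".
generator = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Sample several completions; each would be parsed into candidate passwords.
for output in generator(prompt, max_new_tokens=80, num_return_sequences=5, do_sample=True):
    print(output["generated_text"])
```

Falcon-RW-1B could be queried the same way, while Flan-T5, being an encoder-decoder model, would go through the `text2text-generation` pipeline instead.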
LLMs Underperform in Password Generation
To evaluate the models, the researchers used standard information-retrieval metrics: **Hit@1**, **Hit@5**, and **Hit@10**. Each measures how often a user's actual password appears among a model's top 1, 5, or 10 candidates. The results were poor: every model scored below **1.5% accuracy at Hit@10**, a significant shortfall in their ability to generate plausible passwords. By contrast, traditional password-cracking approaches, including rule-based and combinator-based techniques, achieved substantially higher success rates.
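Since Hit@k simply checks whether the true password appears among the top k candidates, it can be computed in a few lines. The passwords and candidate lists below are hypothetical, for illustration only:

```python
# Minimal illustration of the Hit@k metric: the fraction of users whose
# true password appears among the model's top-k candidate guesses.
def hit_at_k(true_passwords, candidate_lists, k):
    """true_passwords: one ground-truth password per user.
    candidate_lists: one ranked list of candidate guesses per user."""
    hits = sum(
        1 for truth, candidates in zip(true_passwords, candidate_lists)
        if truth in candidates[:k]
    )
    return hits / len(true_passwords)

# Hypothetical data: only the second user's password is in the top 5.
truths = ["alice1993", "climber07"]
candidates = [
    ["alice123", "carter14", "rockclimb", "alice07", "ac1993"],
    ["climber07", "climb1993", "alice14", "carter93", "rocky07"],
]
print(hit_at_k(truths, candidates, k=5))  # 0.5
```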
The researchers noted that the LLMs often failed to produce plausible passwords for the synthetic profiles, falling well short of established cracking tools. The study attributes this to key limitations in the models' generative reasoning, particularly their inability to recall specific examples encountered during training and to apply learned patterns to new contexts.
Insights for Future Cybersecurity Research
Rehman, Shah, and their colleagues concluded that while LLMs exhibit impressive capabilities in natural language tasks, they lack the necessary adaptation and memorization skills for effective password guessing. Their findings suggest that the current generation of LLMs is not suitable for inferring passwords, especially without fine-tuning on datasets containing leaked passwords.
This research lays a foundation for future work examining the password-guessing capabilities of other LLMs. The authors emphasize that the study offers critical insight into the limits of LLMs in adversarial contexts, and they hope it will spur further investigation into how LLMs might be refined to strengthen cybersecurity measures.
As cybersecurity threats continue to evolve, understanding the limitations of tools like LLMs is crucial for developing more robust methods to secure online accounts. By addressing these gaps, researchers aim to prevent malicious actors from successfully guessing passwords and accessing sensitive information.