The snowball effect set off by the arrival of Large Language Models (LLMs) like ChatGPT is still in its early stages. As more GPT (Generative Pre-Trained Transformer) models are released as open source, more applications are adopting AI. It is also well known that ChatGPT itself can be used to build highly sophisticated malware.
The RoBERTa Architecture
The number of applied LLMs, each specializing in its own field and trained on carefully curated data for a specific purpose, will only grow over time. One such model, trained on data from the Dark Web itself, has just been released. Its South Korean developers named it DarkBERT, and their release paper also gives a general overview of the Dark Web.
DarkBERT is built on the RoBERTa architecture, which was developed in 2019. The architecture has since seen a renaissance of sorts, with researchers finding that it had more performance to give than was extracted from it in 2019. It appears that the model was significantly undertrained when it was released, operating well below its potential.
What Does the Future Hold?
To train the model, the researchers crawled the Dark Web through the anonymizing firewall of the Tor network, then filtered the raw data using methods including deduplication, category balancing, and data pre-processing to generate a Dark Web database. DarkBERT is the result of feeding that database to the RoBERTa Large Language Model, producing a model that can analyze new Dark Web content.
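The filtering steps named above can be illustrated with a minimal Python sketch. This is not the researchers' actual pipeline: the page format (a dict with "text" and "category" keys), the exact-match hashing used for deduplication, and the per-category cap used for balancing are all assumptions made for illustration, and the function names are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def deduplicate(pages):
    """Keep the first page seen for each distinct normalized text (assumed exact-match dedup)."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_categories(pages, max_per_category, seed=0):
    """Cap every category at max_per_category pages, sampling at random (assumed strategy)."""
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)
    rng = random.Random(seed)
    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced


def build_corpus(pages, max_per_category):
    """Deduplicate first, then balance, so duplicates do not count toward a category's cap."""
    return balance_categories(deduplicate(pages), max_per_category)
```

Running deduplication before balancing is a deliberate choice in this sketch: otherwise a category dominated by mirrored copies of one page could use up its whole quota on duplicates.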
It wouldn’t be entirely accurate to say that English is the business language of the Dark Web, but the language used there is distinctive enough that the researchers believed a dedicated LLM had to be trained on it. In the end, they were proved right: DarkBERT performed better than other large language models, opening new doors for law enforcement and security researchers to explore the depths of the web. After all, most of the action takes place there.
As with other LLMs, DarkBERT's results can still be improved with additional training and tuning. It remains to be seen how it will be applied and what information can be gathered with it.