TL;DR:
- Nosey Parker combines regular expression-based detection with machine learning (ML) for secret detection.
- ML scores findings, aiding human operators in prioritizing and reducing false positives.
- ML detects secrets missed by regular expressions, simplifying rule creation.
- CodeT5 is the foundation, fine-tuned for content classification.
- Models effectively reduce false positives and uncover real secrets.
- Reimplementation in Rust for improved performance, scalability, and maintainability.
- Local model inference enables flexible data handling.
- Future plans include retraining with a larger dataset and scaling ML for larger inputs.
Main AI News:
Nosey Parker is Praetorian’s secret detection tool, regularly used in the company’s offensive security operations. It combines regular expression-based detection with machine learning (ML) to uncover misplaced secrets in source code and web data. Praetorian published an initial blog post in March 2022 outlining its approach to integrating machine learning into secrets detection, and Nosey Parker has continued to evolve since then.
Since that original post, Nosey Parker has been reimplemented, and the regex-based scanner has been released as an open-source project without the ML-powered features. The proprietary ML-powered version has also been reimplemented, using the open-source project as its foundation. This article covers Nosey Parker’s machine learning advancements since that first post.
Where Machine Learning Comes into Play
ML plays a critical role in Nosey Parker in two primary tasks:
- Scoring Findings: ML scores the findings generated by the regular expression-based detection engine according to how closely they resemble real secrets. Each score is a floating-point number between 0 and 1, with a higher score indicating that the finding is more likely to be a real secret. This lets human operators automatically filter out obvious false positives and prioritize the remaining findings.
- Detecting Missed Secrets: ML is employed to detect secrets that may have been overlooked by the regular expression-based detection engine. A purely ML-based detection engine doesn’t require the creation of rules; instead, it relies on labeling example data, which is often a simpler task.
In both of these areas, Nosey Parker builds upon the foundation of CodeT5, a large language model (LLM) primarily pretrained for tasks involving the generation of source code. In the case of Nosey Parker, CodeT5 has been fine-tuned to perform content classification, distinguishing between “secret” and “not secret.”
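The scoring workflow described above can be illustrated with a minimal Python sketch. It is not Nosey Parker's code: the single rule, the `Finding` type, the `toy_scorer` stand-in for the fine-tuned classifier, and the 0.5 threshold are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical rule set with one entry (AWS-style access key IDs).
# This is an illustrative pattern, not one of Nosey Parker's rules.
RULES = {
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
}


@dataclass
class Finding:
    rule: str
    match: str
    score: float = 0.0  # 0 = unlikely to be a real secret, 1 = likely


def regex_scan(text: str) -> List[Finding]:
    """Collect every match of every rule as an unscored finding."""
    findings = []
    for name, pattern in RULES.items():
        for m in pattern.finditer(text):
            findings.append(Finding(rule=name, match=m.group(1)))
    return findings


def toy_scorer(finding: Finding) -> float:
    """Stand-in for an ML classifier (assumption): treats documentation-style
    placeholders as unlikely secrets and everything else as likely."""
    return 0.05 if "EXAMPLE" in finding.match else 0.9


def triage(
    findings: List[Finding],
    scorer: Callable[[Finding], float],
    threshold: float = 0.5,  # assumed cutoff, chosen only for this example
) -> List[Finding]:
    """Score each finding, drop those below the threshold, and return the
    rest ordered from highest to lowest score."""
    for f in findings:
        f.score = scorer(f)
    kept = [f for f in findings if f.score >= threshold]
    return sorted(kept, key=lambda f: f.score, reverse=True)
```

In a sketch like this, the threshold controls the trade-off between how many false positives are filtered out and how many real secrets risk being dropped, so an operator would typically tune it per engagement.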
The Effectiveness of Fine-Tuned Models
The fine-tuned models have proven to be highly effective. In typical usage, the model responsible for scoring regex-based findings can eliminate 10–20% of the total reported regex findings, leaving only the most relevant findings for further analysis. Moreover, in several offensive security operations conducted by Praetorian’s engineers, Nosey Parker’s purely ML-based scanner has successfully identified hundreds of real secrets that were missed by the regex-based detection engine and other rule-based tools.
Reimplementation for Performance, Maintainability, and Flexibility
The most significant change since the initial announcement in March 2022 is the complete reimplementation of the proprietary ML-powered version of Nosey Parker. The initial version was developed in Python and involved complex backend orchestration for running tasks on powerful cloud-based VMs with GPUs. However, this approach presented challenges in two areas:
- Performance: Python, while a popular choice for ML research and development, is not known for raw speed. Parallelizing it efficiently across multicore systems, especially with large inputs, proved difficult.
- Maintainability: Python’s lack of robust compile-time type checking made ongoing development harder as the feature set expanded or architectural changes were needed.
The Solution: A Rust Reimplementation
To address these issues, the team switched to Rust. Rust’s single-core performance surpasses Python’s, and the Rust implementation parallelizes efficiently across multiple cores, scaling linearly to at least 64 cores. In one of Praetorian’s largest engagements, Nosey Parker scanned approximately 20 terabytes of input data on modestly equipped machines.
The Rust-based implementation has also made it practical to carry out substantial architectural changes that would have been daunting in a large Python codebase.
Adhering to Data Handling Policies
The initial implementation’s complex backend orchestration required transmitting input data to separate VMs in Google Cloud for scanning. While this simplified the secret detection workflow by eliminating the need for security engineers to provision machines with GPUs, it introduced challenges related to data handling policies. In certain security engagements, clients stipulated that their assets could only be accessed through locked-down VMs or client-provided hardware, making it impossible to transmit data to Praetorian’s cloud infrastructure.
The Solution: Local Model Inference
The new Rust-based reimplementation of Nosey Parker can run ML model inference entirely locally, without any network requests. This makes the ML-based detection engine usable in engagements where strict data handling requirements prohibit sending data to the cloud: rather than moving the data to the scanner, Nosey Parker is deployed to wherever the data resides.
Future Machine Learning Developments for Nosey Parker
The journey continues, with two key milestones on the horizon:
- Retraining Models with Additional Data: Praetorian is in the final stages of constructing a labeled dataset comprising over 100k distinct secrets, a tenfold increase in size compared to the original dataset. This expanded dataset, drawn from diverse types of input data beyond source code and configuration files, will be used to retrain the CodeT5-based models, with the aim of further improving detection.
- Scaling ML Inference for Arbitrary-Sized Inputs: The demand for ML model inference shows no signs of abating. Every performance improvement in Nosey Parker leads to an increased appetite for scanning. The current architecture imposes practical limits on the size of input data for the pure ML-based detection engine. To address this, Praetorian is exploring various techniques, including multi-GPU parallelization, model quantization and distillation, and a multi-model “patience”-based algorithm. Collectively, these innovations are expected to deliver a 100x or greater acceleration in inference speed, allowing ML-based detection to handle inputs of any size effectively.
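The article does not describe how the multi-model “patience”-based algorithm works. One plausible reading, borrowed from patience-based early exit, is a cascade that runs classifiers from cheapest to most expensive and stops as soon as a fixed number of consecutive models agree. The Python sketch below shows only that general idea; the `Classifier` type, the `patience_cascade` function, and the default `patience` value are hypothetical and are not taken from Nosey Parker.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical classifier interface: returns True if the text looks like a secret.
Classifier = Callable[[str], bool]


def patience_cascade(
    text: str,
    models: Sequence[Classifier],
    patience: int = 2,  # assumed default, chosen only for this example
) -> Tuple[bool, int]:
    """Run models in order (cheapest first) and stop once `patience`
    consecutive models return the same label.

    Returns the final label and how many models were actually run.
    If agreement is never reached, the last model's label is used.
    """
    if not models:
        raise ValueError("at least one model is required")
    if patience < 1:
        raise ValueError("patience must be at least 1")

    streak = 0   # number of consecutive models agreeing on `last`
    last = None  # label returned by the previous model
    used = 0
    for used, model in enumerate(models, start=1):
        label = model(text)
        streak = streak + 1 if label == last else 1
        last = label
        if streak >= patience:
            break  # enough consecutive agreement: skip the costlier models
    return bool(last), used
```

Under this reading, the savings come from easy inputs: when the small models already agree, the largest model never runs, so its cost is paid only on inputs where the cheaper models disagree.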
Conclusion:
Nosey Parker’s use of machine learning improves detection accuracy, while the Rust reimplementation and local inference address scalability and data handling constraints. This evolution suggests growing interest in applying machine learning to secret detection and data protection, and it underlines the need for adaptable, high-performance tools in the cybersecurity sector.