Gretel AI Introduces New Multilingual Synthetic Financial Dataset on HuggingFace for AI Developers

  • Gretel AI introduces a multilingual synthetic financial dataset on HuggingFace.
  • Dataset aids in detecting personally identifiable information (PII) in documents.
  • Includes 55,940 records with training and test samples.
  • Covers 100 financial document formats and 29 types of synthetic PII.
  • Supports English, Spanish, Swedish, German, Italian, Dutch, and French.
  • Quality is assessed with the Mistral-7B language model acting as an automated judge.

Main AI News:

Identifying personally identifiable information (PII) in documents is central to complying with regulations such as GDPR and U.S. financial data protection laws. Sensitive data such as customer identifiers and financial records looks different from one domain to the next, so detection needs an approach tailored to each domain. Gretel’s synthetic dataset is aimed at this challenge.

Enhancing PII Detection with Customized Datasets

Organizations often face unique data formats and domain-specific requirements that existing Named Entity Recognition (NER) models or sample datasets may not fully address. Gretel’s Navigator tool enables developers to create custom synthetic datasets aligned with their specific needs, which reduces the time and cost of manual labeling. With Gretel Navigator, developers can quickly generate privacy-preserving datasets that mirror the complexities of their domain and use them to train more robust PII detection models.

Gretel’s multilingual Financial Document Dataset, launched on the HuggingFace platform, exemplifies these capabilities.

Key Attributes of the Synthetic Financial Document Dataset

  • Substantial Records: The dataset comprises 55,940 records, split into 50,776 training samples and 5,164 test samples.
  • Comprehensive Document Formats: Encompasses 100 distinct financial document formats, each with 20 specific subtypes.
  • Synthetic PII: Features 29 types of PII, generated using the Python Faker library so that the values are easy to detect and replace.
  • Document Length: Average document length stands at 1,357 characters.
  • Multilingual Capability: Supports English, Spanish, Swedish, German, Italian, Dutch, and French.
  • Quality Assurance: Utilizes the LLM-as-a-Judge technique with the Mistral-7B language model to assess data quality, conformance, toxicity, bias, and groundedness.
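
To make the synthetic PII attribute concrete, the following is a minimal sketch of how generated values can be placed into a document template while their character spans are recorded for later detection or replacement. It is not Gretel's pipeline: the dataset uses Faker, whereas this sketch swaps in Python's standard random module with small hypothetical value pools so that it has no dependencies. The template, the label names, and the function name fill_template are assumptions made for illustration.

```python
import random
import re

# Hypothetical value pools standing in for Faker providers (assumption).
# The IBANs are the widely published sample IBANs for Germany and Sweden.
POOLS = {
    "name": ["Ana Ruiz", "Lars Berg", "Emma Klein"],
    "iban": ["DE89370400440532013000", "SE4550000000058398257466"],
}

def fill_template(template, rng):
    """Replace {label} placeholders with synthetic values and record spans."""
    parts, spans = [], []
    pos = 0      # read position in the template
    cursor = 0   # length of the output text built so far
    for match in re.finditer(r"\{(\w+)\}", template):
        literal = template[pos:match.start()]
        parts.append(literal)
        cursor += len(literal)
        label = match.group(1)
        value = rng.choice(POOLS[label])
        # Record where the synthetic value sits in the final text.
        spans.append({"start": cursor, "end": cursor + len(value), "label": label})
        parts.append(value)
        cursor += len(value)
        pos = match.end()
    parts.append(template[pos:])
    return "".join(parts), spans

text, spans = fill_template("Payee: {name}, account {iban}.", random.Random(0))
```

Because every inserted value carries its own span and label, the same record can serve as NER training data or as ground truth when checking a scanner.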

Applications of the Synthetic Financial Document Dataset

  1. Training NER Models: Facilitates PII detection across diverse domains.
  2. Testing PII Scanning Systems: Evaluates PII scanning systems on realistic, full-length documents from various domains.
  3. Evaluating De-identification Systems: Measures de-identification system performance on realistic documents containing PII.
  4. Developing Data Privacy Solutions: Enables creation and testing of data privacy solutions tailored to the financial sector.
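
As a rough illustration of the testing and de-identification use cases, the sketch below scores a scanner's predicted spans against labeled spans using exact-match precision, recall, and F1, then replaces the labeled spans with placeholders. The span format (start, end, label), the sample text, and the function names span_scores and redact are assumptions for this example, not the dataset's actual schema.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def redact(text, spans):
    """Replace each span with a [LABEL] placeholder, working right to left
    so that earlier offsets stay valid while the text changes length."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text

document = "Payee: Ana Ruiz, account DE89370400440532013000."
gold = [(7, 15, "name"), (25, 47, "iban")]
# A hypothetical scanner output that truncates the IBAN span.
predicted = [(7, 15, "name"), (25, 40, "iban")]

precision, recall, f1 = span_scores(gold, predicted)
redacted = redact(document, gold)
```

Exact matching is the strictest choice; a partial-overlap criterion would give the truncated IBAN span some credit, and which is appropriate depends on how the scanner's output is used.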

Assessment and Utilization

The dataset undergoes rigorous assessment using the Mistral-7B language model to ensure high quality and reliability. Records with low conformance, quality, or groundedness scores, or those exhibiting high toxicity or bias, are excluded to maintain integrity.
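
The exclusion rule described above can be sketched as a simple threshold filter. The article does not give the score scale, the field names, or the cutoffs, so the values below are assumptions chosen only to show the shape of the logic.

```python
# Hypothetical thresholds (assumption): keep records that score at least
# these values on the positive criteria and at most these on the negative ones.
MIN_SCORES = {"conformance": 7, "quality": 7, "groundedness": 7}
MAX_SCORES = {"toxicity": 2, "bias": 2}

def passes_judge(scores):
    """Return True if a record's judge scores clear every threshold."""
    high_enough = all(scores[name] >= cutoff for name, cutoff in MIN_SCORES.items())
    low_enough = all(scores[name] <= cutoff for name, cutoff in MAX_SCORES.items())
    return high_enough and low_enough

# Toy records with made-up scores, for illustration only.
records = [
    {"id": 1, "scores": {"conformance": 9, "quality": 8, "groundedness": 8, "toxicity": 0, "bias": 1}},
    {"id": 2, "scores": {"conformance": 4, "quality": 8, "groundedness": 9, "toxicity": 0, "bias": 0}},
    {"id": 3, "scores": {"conformance": 9, "quality": 9, "groundedness": 9, "toxicity": 5, "bias": 0}},
]

kept = [record for record in records if passes_judge(record["scores"])]
```

Here record 2 is dropped for low conformance and record 3 for high toxicity, leaving only record 1.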

Supporting the Open Data Initiative

Gretel’s dedication to open data and collaboration within the AI community is evident in its release of this dataset. By providing high-quality, diverse, and ethically sourced datasets, Gretel aims to accelerate the development of accurate and unbiased AI systems. The synthetic financial document dataset serves as a valuable resource for developers and researchers striving to build robust PII detection solutions.

Conclusion:

This release of Gretel AI’s synthetic financial dataset represents a significant advancement for AI developers and organizations involved in PII detection. By offering a comprehensive, multilingual solution that addresses diverse document formats and regulatory requirements, Gretel enhances the capability to build robust and compliant PII detection models. This dataset not only accelerates development efforts but also sets a standard for quality and diversity in the AI data market, fostering innovation in data privacy solutions and enhancing trustworthiness in AI systems.
