Dingo: Revolutionizing Data Quality Evaluation for MCP Servers with UBOS
In the burgeoning landscape of AI and Machine Learning, data quality stands as the bedrock upon which successful models are built. Recognizing this critical need, UBOS proudly presents Dingo, a comprehensive data quality evaluation tool designed to automatically detect data quality issues across diverse datasets, especially within the context of MCP (Model Context Protocol) Servers. Dingo isn’t just another tool; it’s a paradigm shift in how data is assessed and refined, ensuring that your AI initiatives are fueled by the highest quality data possible.
Dingo offers a versatile suite of built-in rules, model evaluation methods, and support for custom evaluation approaches. This adaptability makes it ideal for a wide spectrum of datasets, including those used for pre-training, fine-tuning, and general evaluation. Whether you’re working with text, multimodal datasets, or integrating with platforms like OpenCompass, Dingo streamlines the data quality assurance process.
The UBOS Advantage: Seamless Integration for Enhanced AI Agent Development
UBOS, a full-stack AI Agent Development Platform, focuses on bringing AI Agents to every business department. Our platform helps you orchestrate AI Agents, connect them to your enterprise data, and build custom AI Agents and Multi-Agent Systems on top of your own LLM models. Dingo, integrated within the UBOS ecosystem, amplifies these capabilities by ensuring the data fed into your AI Agents is pristine and reliable.
Key Features that Set Dingo Apart:
- Automated Data Quality Detection: Dingo automates the often-laborious process of identifying data quality issues. This automation saves valuable time and resources, allowing data scientists and engineers to focus on model development and innovation.
- Versatile Evaluation Methods: From rule-based evaluations to sophisticated LLM-driven assessments, Dingo provides a multifaceted approach to data quality. This ensures that all potential issues are identified, regardless of their nature.
- Customizable and Extensible: Dingo’s architecture supports custom rules and models, allowing you to tailor the evaluation process to your specific needs. This extensibility makes Dingo a future-proof solution that can adapt to evolving data landscapes.
- Multi-Modal Support: Dingo supports both text and image data, ensuring comprehensive data quality across different modalities. This is crucial for modern AI applications that often rely on a combination of data types.
- Seamless Integration with MCP Servers: Dingo includes an experimental Model Context Protocol (MCP) server, facilitating seamless interaction with external data sources and tools, crucial for AI model development. The provided video demonstration walks users through the process of using Dingo MCP server with Cursor.
Use Cases: Transforming Data Quality Across Industries
Dingo’s impact extends across numerous industries and applications, providing tangible benefits to organizations seeking to leverage the power of AI.
- Enhanced LLM Training: By evaluating and refining datasets used for training Large Language Models (LLMs), Dingo improves the accuracy, reliability, and overall performance of these models.
- Improved Data-Driven Decision-Making: High-quality data is essential for making informed business decisions. Dingo ensures that the data used for analysis is accurate, complete, and relevant.
- Streamlined Data Migration: When migrating data between systems, Dingo helps identify and correct data quality issues that could lead to errors or inconsistencies.
- Robust AI Agent Development: By providing high-quality data, Dingo enhances the capabilities of AI Agents, enabling them to perform tasks more effectively and efficiently.
- Efficient Data Governance: Dingo helps organizations establish and maintain data governance policies by providing a tool for monitoring and enforcing data quality standards.
Diving Deep into Key Features:
Multi-Source & Multi-Modal Support:
- Data Sources: Dingo seamlessly integrates with various data sources, including local files, Hugging Face datasets, and S3 storage. This flexibility allows you to evaluate data regardless of its location.
- Data Types: Whether you’re working with pre-training, fine-tuning, or evaluation datasets, Dingo provides tailored evaluation methods to suit your specific needs.
- Data Modalities: Dingo supports both text and image data, ensuring comprehensive data quality across different modalities.
Rule-based & Model-based Evaluation:
- Built-in Rules: Dingo includes over 20 general heuristic evaluation rules, covering a wide range of data quality issues.
- LLM Integration: Dingo integrates with popular LLMs like OpenAI, Kimi, and local models such as Llama3, enabling advanced data quality assessments.
- Custom Rules: Easily extend Dingo with your own rules and models to address specific data quality challenges.
- Security Evaluation: Dingo integrates with the Perspective API for security evaluations, identifying potentially harmful or inappropriate content.
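To make the rule-based approach concrete, here is a minimal, self-contained sketch of a heuristic check in the spirit of Dingo's built-in rules. The class name, return shape, and threshold are illustrative assumptions, not Dingo's actual API:

```python
class RuleNoGarbledText:
    """Illustrative heuristic rule (not Dingo's real code): flag text
    dominated by replacement or non-printable characters, a common
    sign of encoding damage."""

    threshold = 0.1  # fraction of suspect characters tolerated

    @classmethod
    def eval(cls, content: str) -> dict:
        if not content:
            return {"error_status": True, "reason": "empty content"}
        # Count replacement chars and non-printable, non-whitespace chars
        suspect = sum(
            1 for ch in content
            if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace())
        )
        ratio = suspect / len(content)
        return {
            "error_status": ratio > cls.threshold,
            "reason": f"{ratio:.0%} suspect characters",
        }

print(RuleNoGarbledText.eval("Hello, world!"))
print(RuleNoGarbledText.eval("He\ufffdll\ufffdo\ufffd"))
```

A real Dingo rule would be registered with the framework and operate on its data objects; the point here is only the shape of a heuristic check: deterministic, cheap, and returning a structured verdict rather than a bare boolean.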
Flexible Usage:
- Interfaces: Dingo offers both CLI and SDK options, providing flexibility for different usage scenarios.
- Integration: Dingo can be easily integrated with other platforms, streamlining your data quality workflow.
- Execution Engines: Dingo supports both local and Spark execution engines, allowing you to choose the best option for your infrastructure.
Comprehensive Reporting:
- Quality Metrics: Dingo provides 7-dimensional quality assessments, covering completeness, effectiveness, fluency, relevance, security, similarity, and understandability.
- Traceability: Detailed reports provide traceability, allowing you to track down the root cause of data quality issues.
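To illustrate the reporting idea, the sketch below shows how per-item rule results could roll up into an overall score plus a per-dimension breakdown that traces back to failing item IDs. The field names are assumptions for illustration, not the schema of Dingo's actual `summary.json`:

```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Roll per-item results into an overall quality score with a
    per-dimension index of failing item ids (the traceability part)."""
    bad_by_dimension = defaultdict(list)
    for r in results:
        if r["error_status"]:
            bad_by_dimension[r["dimension"]].append(r["data_id"])
    num_bad = sum(len(ids) for ids in bad_by_dimension.values())
    total = len(results)
    return {
        "total": total,
        "score": round(100 * (total - num_bad) / total, 2),
        "bad_by_dimension": dict(bad_by_dimension),  # traceability index
    }

results = [
    {"data_id": "1", "dimension": "COMPLETENESS", "error_status": True},
    {"data_id": "2", "dimension": "FLUENCY", "error_status": False},
    {"data_id": "3", "dimension": "COMPLETENESS", "error_status": False},
    {"data_id": "4", "dimension": "SECURITY", "error_status": True},
]
print(summarize(results))
```

Keeping the failing IDs alongside the aggregate score is what turns a report from a grade into a debugging tool: you can jump straight from a low dimension score to the offending records.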
Understanding Dingo’s Data Quality Metrics:
Dingo categorizes data quality issues into seven critical dimensions, each evaluated through rule-based methods and LLM-based prompts:
- Completeness: Ensures data is not missing critical components, such as evaluating if text abruptly ends with a colon or ellipsis.
- Effectiveness: Verifies if data is meaningful and properly formatted, detecting garbled text or content lacking proper punctuation.
- Fluency: Checks grammatical correctness and natural readability, identifying excessively long words or chaotic reading order.
- Relevance: Detects irrelevant content, like citation details or HTML tags, ensuring data focuses on pertinent information.
- Security: Identifies sensitive information, such as personal details or content related to gambling, pornography, or political issues.
- Similarity: Detects repetitive content, evaluating text for consecutive repetitions or multiple occurrences of special characters.
- Understandability: Assesses how easily data can be interpreted, ensuring correct formatting for LaTeX formulas and Markdown.
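As a concrete example of the completeness dimension described above, a minimal self-contained sketch (again illustrative, not Dingo's actual rule code) might flag text that ends abruptly with a colon or ellipsis:

```python
def check_completeness(text: str) -> dict:
    """Illustrative completeness check: flag text that ends abruptly
    with a colon or ellipsis, suggesting truncated content."""
    stripped = text.rstrip()
    abrupt_endings = (":", "...", "\u2026")  # colon, ASCII and Unicode ellipsis
    incomplete = stripped.endswith(abrupt_endings)
    return {
        "error_status": incomplete,
        "type": "QUALITY_BAD_COMPLETENESS" if incomplete else "QUALITY_GOOD",
    }

print(check_completeness("The main findings are:"))            # flagged
print(check_completeness("The study concluded successfully."))  # passes
```

The other dimensions follow the same pattern, combining cheap string heuristics like this with LLM prompts for the cases (fluency, relevance) that heuristics alone cannot judge.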
Getting Started with Dingo:
Installation:
```bash
pip install dingo-python
```
Basic Usage:
Evaluate LLM chat data:
```python
from dingo.config.config import DynamicLLMConfig
from dingo.io.input.Data import Data
from dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase
from dingo.model.rule.rule_common import RuleEnterAndSpace

data = Data(
    data_id='123',
    prompt="hello, introduce the world",
    content="Hello! The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty."
)

def llm():
    LLMTextQualityModelBase.dynamic_config = DynamicLLMConfig(
        key='YOUR_API_KEY',
        api_url='https://api.openai.com/v1/chat/completions',
        model='gpt-4o',
    )
    res = LLMTextQualityModelBase.eval(data)
    print(res)

def rule():
    res = RuleEnterAndSpace().eval(data)
    print(res)
```
Evaluate a dataset:

```python
from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate a dataset from Hugging Face
input_data = {
    "eval_group": "sft",               # Rule set for SFT data
    "input_path": "tatsu-lab/alpaca",  # Dataset from Hugging Face
    "data_format": "plaintext",        # Format: plaintext
    "save_data": True                  # Save evaluation results
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
```
GUI Visualization:
After evaluation (with `save_data=True`), a frontend page is generated automatically. To start the frontend manually:

```bash
python -m dingo.run.vsl --input output_directory
```

where `output_directory` contains the evaluation results, including a `summary.json` file.
Dingo: A Commitment to Data Quality
Dingo represents UBOS’s unwavering commitment to data quality as a cornerstone of successful AI initiatives. By providing a comprehensive, automated, and customizable solution for data quality evaluation, Dingo empowers organizations to unlock the full potential of their data. Whether you’re training LLMs, developing AI Agents, or making critical business decisions, Dingo ensures that your data is always of the highest quality. Integrate Dingo with UBOS today and experience the transformative power of pristine data.
Dingo MCP Server
Project Details
- DataEval/dingo
- Apache License 2.0
- Last Updated: 6/16/2025