Building a Responsible Approach to Data Collection with the Partnership on AI
At DeepMind, we aim to ensure that everything we do meets the highest standards of safety and ethics, in line with our Operating Principles. One of the most important places this starts is how we collect our data. Over the past 12 months, we have worked with the Partnership on AI (PAI) to carefully consider these challenges and jointly develop standardized best practices and processes for the responsible collection of human data.
Collection of human data
More than three years ago, we created the Human Behavioural Research Ethics Committee (HuBREC), a governance group modeled on academic institutional review boards (IRBs), such as those found in hospitals and universities, with the goal of protecting the dignity, rights, and welfare of the people who take part in our studies. This committee oversees behavioral research involving experiments with humans as the subject of study, such as investigating how humans interact with artificial intelligence (AI) systems in decision-making processes.
Alongside projects involving behavioral research, the AI community is increasingly engaged in efforts involving “data enrichment” – tasks performed by humans to train and validate machine learning models, such as data labeling and model evaluation. While behavioral research often relies on volunteer participants who are the subject of study, data enrichment involves people being paid to complete tasks that improve AI models.
These types of tasks are typically carried out on crowdsourcing platforms, which may lack the guidance or governance systems needed to ensure adequate standards are met, often raising ethical issues around worker pay, welfare, and equity. As research labs accelerate the development of increasingly sophisticated models, reliance on data enrichment practices will likely grow, along with the need for stronger guidance.
As part of our Operating Principles, we are committed to upholding and contributing to best practices in the fields of AI safety and ethics, including fairness and privacy, to avoid unintended outcomes that create risks of harm.
Best practices
Following PAI’s recent white paper on the responsible sourcing of data enrichment services, we worked together to develop our data enrichment practices and processes. This included creating five steps AI practitioners can follow to improve working conditions for people involved in data enrichment work (for more details, see PAI’s data enrichment guidelines):
- Select an appropriate payment model and ensure all workers are paid above the local living wage.
- Design and run a pilot before launching a data enrichment project.
- Identify appropriate workers for the desired task.
- Provide verified instructions and/or training materials for workers to follow.
- Establish clear and regular communication mechanisms with workers.
Together, we created the necessary policies and resources, gathering multiple rounds of feedback from our internal legal, data, security, ethics and research teams in the process, before piloting them on a small number of data collection projects and later rolling them out to the wider organization.
These documents provide greater clarity on how best to set up data enrichment tasks at DeepMind, improving our researchers’ confidence in study design and execution. This has not only increased the efficiency of our approval and launch processes but, importantly, has improved the experience of the people involved in data enrichment work.
More information on responsible data enrichment practices and how we have incorporated them into our existing processes is explained in PAI’s recent case study, Applying responsible data enrichment practices to an AI developer: The example of DeepMind. PAI also provides useful resources and support material for AI professionals and organizations seeking to develop similar processes.
Looking ahead
While these best practices underpin our work, we should not rely on them alone to ensure that our projects meet the highest standards for the welfare and safety of research participants or workers. Every project at DeepMind is different, so we have a dedicated human data review process that allows us to constantly work with research teams to identify and mitigate risks on a case-by-case basis.
The case study is intended to serve as a resource for other organizations looking to improve their data enrichment sourcing practices, and we hope it prompts cross-sector conversations that further develop these guidelines and resources for teams and partners. Through this collaboration, we also hope to spark a broader discussion about how the AI community can continue to develop responsible data collection standards and collectively build better norms across the industry.
Read more about our Operating Principles.