Cybersecurity

US DHS Solicits Synthetic Data Expertise for AI Training

Artificial Intelligence & Machine Learning, Next-Generation Technologies & Secure Development

Agency Prepares $1.7M Contracts for Synthetic Data Prototypes

The U.S. federal government is preparing for an era of synthetic data. (Image: Shutterstock)

In a solicitation for synthetic data generators, the U.S. federal government is seeking tools that can generate artificial data for real-world scenarios, such as identifying cybersecurity threats. Synthetic data can boost the accuracy of machine learning models or be used to test systems.

By generating artificial data, also called synthetic data, the Department of Homeland Security could train machine learning models in instances when real-world data is unavailable or its use poses privacy and security risks, the agency said.

While the department generates a lot of data, its sensitive nature makes it “highly challenging to utilize or share that data across organizational boundaries,” a DHS Science and Technology Directorate solicitation says.

The government envisions awarding multiple contracts for prototypes worth up to $1.7 million over three years through its Silicon Valley Innovation Program, an effort to fund startups and other companies usually outside the government’s orbit. The program circumvents some of the usual contracting procedures and allows foreign companies to compete.

The solutions must be able to “generate synthetic data that models and replicates the shape and patterns of real data, while safeguarding privacy,” the agency said.
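As a rough illustration of what replicating "the shape and patterns of real data" can mean in the simplest case, the Python sketch below fits summary statistics to a small table and samples new rows from the fitted model. It is not drawn from the DHS solicitation: the function names `fit_gaussian` and `sample_synthetic` are hypothetical, the two columns (login attempts per hour and kilobytes transferred) and their values are invented toy data, and the two-variable Gaussian model is an assumption chosen for brevity. Sampling from a fitted model does not by itself provide a formal privacy guarantee; that would take additional techniques, such as differential privacy, which are not shown here.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, using the same n - 1 divisor as statistics.stdev.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, seed=0):
    """Draw n synthetic rows from the fitted two-variable Gaussian."""
    rng = random.Random(seed)
    rho = model["rho"]
    # Clamp at zero in case rounding pushes rho**2 slightly above 1.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        # Mixing z1 into y reproduces the correlation seen in the real data.
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

# Invented toy records: (login attempts per hour, kilobytes transferred).
real = [(3, 120.0), (5, 180.5), (2, 90.2), (8, 310.0),
        (4, 150.3), (7, 260.8), (6, 220.1), (1, 60.4)]

model = fit_gaussian(real)
synthetic = sample_synthetic(model, 5000)
refit = fit_gaussian(synthetic)
print("real stats:     ", model)
print("synthetic stats:", refit)
```

The synthetic rows track the means, spread and correlation of the toy table without copying any original record. Production generators typically use far richer models and would also need to handle unstructured data, which this sketch does not attempt.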

The U.S. federal Chief Data Officers Council is also seeking insights on synthetic data to help develop best practices. A request for information solicits a definition for synthetic data, along with information on its applications, challenges and limitations.

In a world awash with data, generating more of it synthetically might seem like overkill, but Gartner predicts that synthetic data will account for 60% of the data consumed by AI this year.

Synthetic data, the consultancy has written, can boost the accuracy of machine learning models or be used to test a new system where no live data yet exists. Medical researchers as far back as 2020 investigated using synthetic data to circumvent the limits of sharing real patient data.

The solutions sought by DHS must support structured and unstructured data types, remove or mitigate bias and prevent reverse-engineering of real data from synthetic data.

“The ability to generate synthetic data at scale is necessary to protect and preserve data privacy, as well as safeguard civil rights and liberties,” said Melissa Oh, managing director of S&T’s Silicon Valley Innovation Program.

The solutions must have privacy-preserving technical capabilities that “directly serve the mission needs” of DHS operational components and offices. Among the applications for which DHS might use synthetic data are simulating attacks on cyber-physical systems and detecting threats “by amplifying synthetic data elements that augment real-world threat indicators.” Companies can submit responses through April 10.