In general, the more data you have, the better your machine learning model is going to be. But stockpiling vast amounts of data also carries certain privacy, security, and regulatory risks. With new privacy-preserving techniques, however, data scientists can move forward with their AI projects without putting privacy at risk.
Copyright by www.datanami.com
To get the lowdown on privacy-preserving machine learning (PPML), we talked to Intel’s Casimir Wierzynski, a senior director in the office of the CTO in the company’s AI Platforms Group. Wierzynski leads Intel’s research efforts to “identify, synthesize, and incubate” emerging technologies for AI.
According to Wierzynski, Intel is offering several techniques that data science practitioners can use to preserve private data while still benefiting from machine learning. What’s more, data science teams don’t have to make major sacrifices in terms of performance or accuracy of the models, he said.
It sometimes sounds too good to be true, Wierzynski admits. “When I describe some of these new techniques that we’re making available to developers, on their face, they’re like, really? You can do that?” he said. “That sounds kind of magical.”
But it’s not magic. In fact, the three PPML techniques that Wierzynski explained to Datanami (federated learning, homomorphic encryption, and differential privacy) are all available today.
Data scientists have long known about the advantages of combining multiple data sets into one massive collection. By pooling the data together, it’s easier to spot new correlations, and machine learning models can be built to take advantage of the novel connections.
But pooling large amounts of data into a data lake carries its own risks, including the possibility of the data falling into the wrong hands. There are also the logistical hassles of moving large amounts of data through ETL pipelines, which exposes the data to further security lapses. For that reason, some organizations deem pooling too risky for certain data sets.
With the trick of federated learning, data scientists can build and train machine learning models using data that’s physically stored in separate silos, which eliminates the risk of bringing all the data together. This is an important breakthrough for certain data sets that organizations could not pool together.
“One of the things that we’re trying to enable with these privacy-preserving ML techniques is to unlock these data silos to make data source that previously couldn’t be pooled together,” Wierzynski said. “Now it’s OK to do that, but still preserve the underlying privacy and security.”
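To make the idea concrete, here is a minimal sketch of federated averaging, one common way to train across silos. It is not Intel’s implementation, which the interview does not describe. The linear model, the learning rate, the number of rounds, and the synthetic silo data are all assumptions chosen for illustration. Each silo trains on its own records, and only the model weights are sent back to be averaged.

```python
import random

def local_update(weights, data, lr=0.1, epochs=20):
    """Run gradient descent on one silo's private data for y = w*x + b."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((w * x + b) - y for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def federated_average(updates, sizes):
    """Combine silo models, weighting each by its number of samples."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return w, b

def train_federated(silos, rounds=50):
    """Only model weights cross silo boundaries; raw records never do."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in silos]
        global_weights = federated_average(updates, [len(d) for d in silos])
    return global_weights

def make_silo(rng, n):
    """Hypothetical private data set drawn from y = 2x + 1 (an assumption)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

rng = random.Random(0)
silos = [make_silo(rng, 30), make_silo(rng, 50), make_silo(rng, 20)]
w, b = train_federated(silos)
print(f"w={w:.3f}, b={b:.3f}")
```

In this sketch the coordinator sees only the pairs returned by `local_update`, never the `(x, y)` records held inside each silo. A real deployment would run the silos on separate machines and exchange the weights over a network.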
Intel is working with others in industry, government, and academia to develop homomorphic encryption techniques, which essentially allow sensitive data to be processed and statistical operations to be performed while it’s encrypted, thereby eliminating the need to expose the data in plain text.
“It means that you can move your sensitive data into this encrypted space, do the math in this encrypted space that you were hoping to do in the raw data space, and then when you bring the answer back to the raw data space, it’s actually the answer you would have gotten if you just stayed in that space the whole time,” he said.
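The round trip Wierzynski describes can be illustrated with a toy version of the Paillier cryptosystem, a classic scheme that supports addition on encrypted values. This is a swapped-in example, not the scheme used by Intel’s libraries, which the interview does not specify. Paillier is only additively homomorphic, the primes below are far too small to be secure, and the function names and sample values are hypothetical. The sketch assumes Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy key pair from two small primes (far too small for real use)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng=random):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)  # fresh randomness for every ciphertext
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    """Recover the plaintext using the private key (lam, mu)."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the hidden plaintexts (mod n)."""
    return (c1 * c2) % (n * n)

def scale_encrypted(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)

n, priv = keygen()
a, b = 52000, 61000  # hypothetical sensitive values
c_sum = add_encrypted(n, encrypt(n, a), encrypt(n, b))
total = decrypt(n, priv, c_sum)
doubled = decrypt(n, priv, scale_encrypted(n, encrypt(n, a), 2))
print(total, doubled)
```

The party that calls `add_encrypted` never sees `a`, `b`, or the private key, yet the decrypted result matches what plain addition would have produced, provided the true sum stays below `n`.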
Homomorphic encryption isn’t new. According to Wierzynski, the cryptographic schemes that support homomorphic encryption have been around for 15 to 20 years. But there have been a number of improvements in the last five years that enable this technique to run faster, and so it’s increasingly one of the tools that data scientists can turn to when handling sensitive data.
“One of the things my team has done specifically around homomorphic encryption is to provide open source libraries,” Wierzynski says. “One is called HE Transformer, which lets data scientists use their usual tools like TensorFlow and PyTorch and deploy their models under the hood using homomorphic encryption without having to change their code.”
There are no standards yet around homomorphic encryption, but progress is being made on that front, and Wierzynski anticipates a standard being established perhaps in the 2023-24 timeframe. The chipmaker is also working on hardware acceleration options for homomorphic encryption, which would further boost performance. […]