Sunil Madhu, founder and CEO of Socure, also highlighted the fact that machines are moving away from rules-based learning toward self-education, that is, toward unsupervised or semi-supervised machine learning, which is the topic of Episode Three in the podcast series.
“We train machines with data; we feed machines certain patterns of data so they can identify similar patterns,” Madhu said. “You can think of that as an unsupervised or semi-supervised machine learning system.”
Machines can learn to sort through all kinds of data: names, phone numbers, email addresses, network data, geolocation, biometrics, images and more. In sorting these attributes, they can learn how a certain identity is assembled … and how it’s not.
For instance, a machine may be taught “what it means to be an American.” Attributes would include living in the U.S., holding a U.S.-issued passport indicating that the person was born or has lived there, and having a Social Security number. The machine can then compare a specific person’s data with those generic attributes to discover whether that person is an American.
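The comparison step can be sketched in a few lines of Python. This is only an illustration, not Socure's method: the attribute names and the `match_score` helper are hypothetical, and the profile is written by hand here rather than learned from data.

```python
# Hypothetical generic profile; the attribute names are assumptions for
# illustration and the values are hand-written, not learned from data.
GENERIC_PROFILE = {
    "country_of_residence": "US",
    "passport_issuer": "US",
    "has_ssn": True,
}


def match_score(record: dict, profile: dict = GENERIC_PROFILE) -> float:
    """Return the fraction of profile attributes that the record matches."""
    matches = sum(
        1 for key, expected in profile.items() if record.get(key) == expected
    )
    return matches / len(profile)


if __name__ == "__main__":
    applicant = {"country_of_residence": "US", "passport_issuer": "US", "has_ssn": True}
    print(match_score(applicant))
```

A missing attribute simply counts as a non-match, so a sparse record gets a low score rather than causing an error.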
Of course, the more proprietary the attributes, the better — that makes it harder to copycat or replicate them, and that makes it harder to get away with fraud.
According to Madhu, 60 percent to 70 percent of the work in data science goes into data engineering, and that work is (for now) done by human beings, though engineers are gradually teaching machines to take over more and more of their own education.
Specialists working in data science have a variety of techniques for translating numbers, range values and strings into different types of features that machines can then look for in real-world data sets to determine whether an identity is real or fake.
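As a rough sketch of what translating numbers, range values and strings into features might look like, consider the Python below. The field names (`account_age_days`, `txn_count`, `email`), the bucket thresholds and the chosen features are all assumptions made for illustration; the article does not describe the actual features used.

```python
import math


def bucket_age_days(age_days: float) -> int:
    """Map a numeric account age onto a coarse range bucket (0 to 3)."""
    bounds = [30, 365, 1825]  # illustrative thresholds, in days
    for index, bound in enumerate(bounds):
        if age_days < bound:
            return index
    return len(bounds)


def email_features(email: str) -> dict:
    """Turn an email string into a few simple numeric and boolean features."""
    local, _, domain = email.partition("@")
    digits = sum(ch.isdigit() for ch in local)
    return {
        "local_length": len(local),
        "digit_ratio": digits / len(local) if local else 0.0,
        "has_domain": bool(domain),
    }


def to_features(record: dict) -> dict:
    """Combine range, numeric and string transformations into one feature dict."""
    features = {
        "age_bucket": bucket_age_days(record["account_age_days"]),
        "log_txn_count": math.log1p(record["txn_count"]),  # dampens large counts
    }
    features.update(email_features(record["email"]))
    return features
```

Each transformation is deliberately small, so it can be studied on its own before being trusted in an automated pipeline.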
Developing those techniques takes a lot of trial and error. The only way to determine whether the data going into the system are reliable is to study them over time, while conducting the process manually before trying to make it automatic.
In Socure’s case, that means pulling data from digital, online, social and offline sources and holding that up against the company’s proprietary features. While data from any one of those domains may be insignificant or couched in noise, together, they can be highly predictive and helpful in fraud prevention, Madhu said.
Over time, said Madhu, it becomes clear what is signal and what is noise, which data is valid, which changes are typical transformations or mutations, and which are signs of potential fraud.
Another important consideration for data scientists is how the data is being provided. Is the source real-time? If so, the machine must be taught how to account for errors and timeouts. Is it receiving information in batch dumps? How can it optimize queries? How can sparse data be made into useful information? That last question is a challenge even for the teachers, let alone the artificially intelligent students.
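One common way to account for errors and timeouts from a real-time source is to retry a few times and then fall back to a default value. The Python sketch below assumes a hypothetical `fetch` callable standing in for the data source; the retry count and backoff are illustrative, not recommendations.

```python
import time


def fetch_with_retry(fetch, retries=3, backoff_seconds=0.0, default=None):
    """Call fetch(); retry on timeouts or connection errors, then fall back.

    `fetch` is a hypothetical zero-argument callable wrapping a real-time source.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            # Wait a little longer after each failure, except after the last one.
            if attempt < retries - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
    return default
```

Returning a default instead of raising lets downstream code treat the missing value as one more sparse attribute rather than a failure of the whole lookup.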
For now, these remain questions for humans working in the data engineering field, not for machines.
“Things change over time,” Madhu said. “You constantly have to train the machine and provide feedback so it can correct and adjust those weights. The machine, given the right guidance on how to treat things that may change over time, should know how to manage that.”
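The idea of feedback correcting and adjusting weights can be illustrated with a single gradient step of logistic regression. This is a generic textbook technique chosen for illustration, not a description of Socure's models; the function names and learning rate are assumptions.

```python
import math


def predict(weights, features):
    """Return the model's estimated probability for one example."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, features, label, learning_rate=0.1):
    """Apply one stochastic gradient step using a labeled feedback example."""
    error = predict(weights, features) - label  # positive if the score was too high
    return [w - learning_rate * error * x for w, x in zip(weights, features)]
```

Each piece of feedback nudges the weights toward the corrected label, so repeated feedback lets the scores follow data that drifts over time.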
“Self-learning machines are not there yet that they can understand every parameter without human guidance,” Madhu said.
But he’ll talk about how they’re getting closer in Episode Four, when he and Karen Webster discuss letting machines manage the process.