Fighting bias in AI systems is an increasingly big topic – and challenge – for business. Drafting a set of principles is a good start, but when it comes to bridging the gap between the theory and the practical application of responsible AI, organizations often find themselves at a loss.
London-based startup Synthesized set out to ease that challenge, and has now launched a tool that aims to quickly identify and mitigate bias in a given dataset. For data scientists working on an AI project, the company has built a platform that scans datasets in minutes, and provides an in-depth analysis of the way different groups of people are identified within that dataset.
If, in comparison to the rest of the dataset, a particular group is disproportionately tied to a criterion that generates bias, Synthesized's software can flag the issue to the user. The technology also generates a "fairness score" for the dataset, which ranges from zero to one and reflects how balanced the data is overall.
As the name suggests, Synthesized has also developed a synthetic data generation technology, which is used at the end of the process to re-balance the dataset with artificial data that fills in the gaps where bias was identified, to make sure that every group of people is represented fairly.
Synthesized's founder Nicolai Baldin told ZDNet: "By creating these simulated and curated high-quality datasets, you can build better services. We wanted to show that it is possible to make the dataset fairer without lowering the quality of the data. In fact, the results of AI models will improve because those groups that were missing will be represented."
The process is seemingly straightforward. Synthesized's bias detection platform only requires users to upload a structured data file, like an Excel spreadsheet, to kick off the analysis. Users can then select a specific target, for example "annual income", against which bias will be identified.
The software will then profile the entire dataset in relation to the target to establish whether minority groups are unfairly associated, in this example, with different levels of income.
The first step consists of digging out the groups that are likely to be discriminated against, which the technology identifies based on legally protected characteristics that are defined in UK and US law – age, disability, gender, marriage, race, religion, sexual orientation and so on.
As an example, Baldin runs a publicly available dataset of 32,000 people through the platform. With some protected characteristics intersecting, almost 270 minority groups are profiled by the software. For example, 186 individuals in the dataset are identified as "female, married, aged 33 to 41".
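To make the idea of intersecting characteristics concrete, here is a minimal Python sketch of how such groups could be enumerated. It is an illustration only, not Synthesized's implementation: the toy records, the column names (`sex`, `marital_status`, `age_band`) and the `intersectional_groups` helper are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records; the column names are illustrative only.
records = [
    {"sex": "female", "marital_status": "married", "age_band": "33-41"},
    {"sex": "female", "marital_status": "married", "age_band": "33-41"},
    {"sex": "female", "marital_status": "single", "age_band": "17-32"},
    {"sex": "male", "marital_status": "married", "age_band": "33-41"},
    {"sex": "male", "marital_status": "single", "age_band": "17-32"},
]
protected = ["sex", "marital_status", "age_band"]


def intersectional_groups(rows, columns):
    """Count rows in every combination of one or more protected columns."""
    counts = Counter()
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            for row in rows:
                # A group is identified by its (column, value) pairs.
                key = tuple((c, row[c]) for c in cols)
                counts[key] += 1
    return counts


groups = intersectional_groups(records, protected)
key = (("sex", "female"), ("marital_status", "married"), ("age_band", "33-41"))
print(groups[key])  # number of records that are female, married, aged 33-41
```

Because every combination of columns is counted, the number of groups grows quickly as protected characteristics are added, which is how a handful of attributes can produce hundreds of groups.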
Once the software has identified and created groups of protected characteristics, it can assess whether a particular cluster is showing significant differences in relation to the target that was set at the beginning of the process – whether that difference is reflective of a positive bias, or of a negative one.
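One common way to check for a "significant difference" of this kind is a permutation test on the gap between a group's average target value and that of everyone else. The article does not say which statistical test Synthesized uses, so the choice of test, the 0.05 threshold, the function names and the income figures in this sketch are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group, everyone_else, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean target value."""
    rng = random.Random(seed)
    observed = mean(group) - mean(everyone_else)
    pooled = list(group) + list(everyone_else)
    n = len(group)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random and recompute the gap.
        rng.shuffle(pooled)
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (extreme + 1) / (n_permutations + 1)


def flag_group(group, everyone_else, alpha=0.05):
    """Return 'positive bias', 'negative bias' or 'no flag' for one group."""
    observed, p_value = permutation_p_value(group, everyone_else)
    if p_value >= alpha:
        return "no flag"
    return "positive bias" if observed > 0 else "negative bias"


# Hypothetical annual incomes, in thousands.
group_incomes = [52, 55, 58, 60, 61, 63, 65, 70]
other_incomes = [30, 32, 35, 36, 38, 40, 41, 42, 44, 45, 47, 50]
print(flag_group(group_incomes, other_incomes))
```

A group is only flagged when the gap is unlikely to have arisen by chance, and the sign of the gap decides whether the bias is reported as positive or negative.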
"What we can see here, in the example of the group of 'female, married, aged 33 to 41', is a positive bias, meaning that the income for this group is actually higher compared to the overall income for the entire dataset," explains Baldin.
"So, the software is able to find those abnormal groups from a statistical point of view," he continues. "It profiles the entire dataset across different groups and compares these distributions statistically. If there is enough evidence to say the distribution is different from the overall distribution, then we flag it."
Based on the outcome of the analysis, the dataset is then assigned a fairness score, and users are given the option to artificially re-balance the data. Creating synthetic data, in fact, is at the heart of Synthesized's technology stack. Using synthetization technology, the platform can simulate new groups of individuals that were previously identified as missing or unfairly represented, adjusting the overall fairness score of the dataset.
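The effect of re-balancing on a score between zero and one can be sketched in a few lines of Python. Two caveats apply. First, the score here is a hypothetical stand-in (one minus the largest gap in positive-outcome rate between groups), since Synthesized's actual formula is not described. Second, the re-balancing step is plain random oversampling of existing rows, a far cruder technique than generating new synthetic records. All names and data are illustrative.

```python
import math
import random
from fractions import Fraction


def positive_rates(rows):
    """Share of positive outcomes per group, as exact fractions."""
    totals, positives = {}, {}
    for group, outcome in rows:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(outcome)
    return {g: Fraction(positives[g], totals[g]) for g in totals}


def parity_score(rows):
    """Hypothetical score: 1 minus the largest gap between group rates."""
    rates = list(positive_rates(rows).values())
    return 1 - (max(rates) - min(rates))


def oversample_to_target(rows, seed=0):
    """Copy existing rows until each group's rate is close to the overall rate."""
    rng = random.Random(seed)
    target = Fraction(sum(int(o) for _, o in rows), len(rows))
    balanced = list(rows)
    for group, rate in positive_rates(rows).items():
        members = [r for r in rows if r[0] == group]
        n = len(members)
        n_pos = sum(int(o) for _, o in members)
        if rate < target:
            # Too few positive outcomes: add copies of positive rows.
            pool = [r for r in members if r[1]]
            extra = math.ceil((target * n - n_pos) / (1 - target))
        elif rate > target:
            # Too many positive outcomes: add copies of negative rows.
            pool = [r for r in members if not r[1]]
            extra = math.ceil(n_pos / target - n)
        else:
            continue
        if not pool:
            # Nothing to copy from; only a real data generator could help here.
            continue
        balanced.extend(rng.choice(pool) for _ in range(extra))
    return balanced


# Hypothetical rows of (group, has_high_income).
rows = ([("A", True)] * 6 + [("A", False)] * 4
        + [("B", True)] * 2 + [("B", False)] * 8)
balanced = oversample_to_target(rows)
print(float(parity_score(rows)), float(parity_score(balanced)))
```

On this toy data the score rises from 3/5 to 34/35. Oversampling can only repeat rows that already exist, so a group with no examples of a given outcome cannot be repaired this way, which is the gap that synthetic data generation is meant to fill.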
"We've seen some attempts in academia and industry to identify those biases, but to the best of my knowledge there is no tool that is able to create simulated datasets with no examples of bias in them," says Baldin.
Bias in AI has been a hot topic for years, and has affected individuals in all sorts of ways, from recruitment processes and healthcare decisions to law enforcement and criminal justice.
In the summer, there was broad criticism of the use of a biased algorithm to determine UK school-student grades while physical exams were cancelled. The AI system had based its predictions on an unfair dataset that put students from poorer backgrounds at a disadvantage.
There is certainly mounting pressure from the public for companies and developers to build AI systems that are ethical. A recent public poll found that half of UK adults felt that they couldn't trust computer scientists to create algorithms that are focused on improving the quality of their lives. The majority of respondents (62%) also said computer programmers should be qualified as chartered professionals, meeting similar standards to accountants, for example.
Google's What-If tool and IBM's AI Fairness package both provide analysis tools to test datasets for biases, but they remain designed for experts. Baldin hopes that Synthesized's intuitive platform will encourage more users to tackle the issue.
That is not to say that perfectly unbiased datasets are likely to become a reality any time soon. "If we stick to the legal definitions of protected characteristics, then the platform is able to eliminate all the biases," maintains Baldin. "But we need to be careful with what we mean by 'all'. There may be other groups that are not legally protected by the law, but which some people believe are discriminated against."
The debate is not new, and it won't go away any time soon. To help advance research in the field, however, Baldin has chosen to open-source the bias identification part of Synthesized's new platform, so that engineers and data scientists can contribute new ideas.
In the meantime, interested coders can already make use of the program, and are allowed to upload up to three datasets for free.