TL;DR:
- Poor-quality data hampers the potential of AI models in the life science sector.
- Life science data is vast, unstructured, and regulated, posing challenges for effective AI adoption.
- Data-centric approaches and smaller, high-quality datasets are crucial for training AI models.
- Compliance issues and restricted data access hinder the construction of high-quality life science datasets.
- “Dirty data” – inaccurate, incomplete, or inconsistent information – limits immediate usability.
- Unstructured data in life science requires cleaning and harmonization before training AI models.
- Data bias can seep into AI models, so careful dataset design is needed to mitigate it.
- Successful AI adoption in life science requires a clear data strategy and thorough data preparation.
Main AI News:
In the realm of AI models, data quality plays a pivotal role: it is unrealistic to expect exceptional outcomes from poor-quality inputs. Unfortunately, this is a pervasive issue in life science, where the data foundation often falls short. AI models that could otherwise thrive end up delivering subpar results because the underlying data is not good enough. The primary hurdle for effective AI adoption in life science lies not in the technology itself, but in the datasets specific to the life science domain.
Life science data is unclean, unstructured, and tightly regulated. Companies in this industry possess vast amounts of data, which can be overwhelming. The “data deluge” has affected every sector, but none more so than life science. Data streams flood in from patients, payers, and healthcare professionals through numerous channels, and the patient’s voice has grown louder in recent years. While this is undoubtedly beneficial for patients, life science teams struggle to keep pace with the multitude of online channels where opinions are shared and valuable information can be extracted. Leading life science companies have begun to recognize this immense opportunity. NTT Data reports that advances in genome sequencing technology have driven an exponential increase in genomic data, which has exceeded 40 exabytes over the past decade.
However, quantity alone does not guarantee quality, and it is rarely necessary to use an entire enterprise’s data lake to construct an effective AI model. Instead, companies need to adopt a data-centric approach, shifting from vast volumes of information to smaller, meticulously curated, high-quality datasets for training purposes.
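To make the data-centric idea concrete, here is a minimal sketch (an illustration, not a prescribed pipeline) that pares a large export down to a smaller, curated training sample. The file name, column names (report_text, source, reviewed), and thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical export from an enterprise data lake; file and column names are illustrative.
records = pd.read_csv("field_reports_export.csv")

# Data-centric curation: keep a small, well-vetted sample rather than the whole lake.
curated = (
    records
    .dropna(subset=["report_text", "source"])          # drop incomplete rows
    .drop_duplicates(subset=["report_text"])            # remove verbatim duplicates
    .query("reviewed == True")                           # keep only human-reviewed records
    .loc[lambda df: df["report_text"].str.len() > 50]    # discard trivially short entries
)

print(f"Kept {len(curated)} of {len(records)} records for training.")
curated.to_csv("curated_training_set.csv", index=False)
```

The point is the order of operations: quality filters come before any model training, and the curated subset – not the raw data lake – is what gets versioned and reused.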
The availability and compliance of data pose additional challenges to building high-quality life science datasets. Many data sources within the industry are subject to regulations such as the EU’s GDPR or California’s CCPA, along with other regional laws, which can prohibit sharing them with external vendors or using them to train AI models. Data access becomes a genuine concern in highly regulated sectors like life science, where regulatory requirements vary from region to region. Deloitte points out that while most companies are embracing new technologies to improve patient outcomes, the ambiguity of regulations around emerging technologies creates a multitude of compliance challenges.
During the construction of life science AI models, valuable datasets often remain inaccessible due to compliance issues, resulting in models built upon incomplete data.
Moreover, the problem of “dirty data” exacerbates the situation. Life science companies have access to an abundance of data, but much of it sits behind strict regulatory processes and remains effectively out of reach. Compounding the issue, a significant portion of the data that is accessible is “dirty” – inaccurate, incomplete, or inconsistent – and therefore unsuitable for immediate use.
Life science data is commonly unstructured, taking the form of typed medical science liaison (MSL) reports and field team observations that vary widely in length, format, and even language. While many healthcare organizations have transitioned to electronic medical records (EMRs), others have only partially migrated or have yet to begin the transition. This disparity and inconsistency across data streams mean the data must be thoroughly cleaned before it can be used to train effective AI models in the life science domain.
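As a small illustration of what that cleaning and harmonization can look like, the sketch below normalizes dates, country labels, and free-text notes in a toy set of field reports. The field names (visit_date, country, notes) and mappings are hypothetical; a real pipeline would add language detection, terminology mapping, and validation on top of this.

```python
import pandas as pd

# Hypothetical raw field reports: dates, country labels, and free text arrive inconsistently.
raw = pd.DataFrame({
    "visit_date": ["2023-05-01", "2023-06-15", None],
    "country":    ["US", "united states", "DE "],
    "notes":      ["  Patient reported fatigue.", "PATIENT REPORTED FATIGUE.", ""],
})

COUNTRY_MAP = {"united states": "US", "us": "US", "de": "DE"}

clean = raw.copy()
clean["visit_date"] = pd.to_datetime(clean["visit_date"], errors="coerce")   # unify date types
clean["country"] = clean["country"].str.strip().str.lower().map(COUNTRY_MAP) # harmonize labels
clean["notes"] = clean["notes"].str.strip().str.lower()                      # normalize free text

# Drop records that are still unusable after harmonization.
clean = clean[clean["notes"] != ""].dropna(subset=["visit_date", "country"])
print(clean)
```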
Addressing data bias is another critical concern. The appeal of data-driven decision-making lies in its perceived objectivity: data is treated as a source of truth that leads to accurate choices. In reality, bias can still creep into the process. Machine learning models reflect the diversity – or lack of it – in their training datasets and the way those datasets are used during training, so if the datasets contain biased information, the model may inadvertently reproduce that bias in its decisions. Harvard Business Review (HBR) reports that while AI can help identify and mitigate human biases, it can also exacerbate the problem by perpetuating and deploying biases at scale in sensitive application areas.
To overcome the challenge of biased data, researchers at MIT discovered that the training process itself can play a crucial role. The study emphasizes the importance of meticulous dataset design as a means to counter dataset bias. The lead author of the study, Xavier Boix, highlights the need to move away from the notion that collecting a vast amount of raw data alone will yield results. Thoughtful dataset curation is essential.
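To give a flavor of what careful dataset design can mean in practice – one simple mitigation among many, not the method from the MIT study – the sketch below measures how subgroups are represented in a hypothetical training set and rebalances it by sampling an equal number of records per group. The grouping column (region) and counts are illustrative.

```python
import pandas as pd

# Hypothetical training set in which one region dominates.
data = pd.DataFrame({
    "region":  ["EU"] * 800 + ["US"] * 150 + ["APAC"] * 50,
    "outcome": [1, 0] * 500,
})

# Step 1: measure representation before training.
print(data["region"].value_counts(normalize=True))

# Step 2: rebalance by down-sampling each group to the size of the smallest one
# (re-weighting during training is an alternative that keeps all records).
per_group = data["region"].value_counts().min()
balanced = data.groupby("region").sample(n=per_group, random_state=0)

print(balanced["region"].value_counts())
```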
AI adoption in the life science industry has yielded mixed outcomes thus far. In many cases, projects have encountered setbacks not because the technology is immature, but because the underlying data is unclean, unstructured, or regulated. According to Deloitte’s research, as AI transitions from a “nice-to-have” to a “must-have”, companies and their leaders must formulate a vision and strategy for leveraging AI effectively, and then lay the foundation required to scale its use.
Attempting to implement an AI model before the data is adequately prepared leads to wasted time and resources. Data-related challenges resulting in subpar or biased models can significantly impact the industry’s confidence in AI’s potential to deliver business value. To succeed in training and deploying AI models, life science companies must develop a clear data strategy and dedicate sufficient time to cleaning and harmonizing their data.
Conclusion:
The quality of data plays a critical role in AI adoption within the life science industry. Companies must recognize that poor-quality or incomplete data leads to suboptimal AI models. By adopting a data-centric approach and focusing on smaller, high-quality datasets, life science organizations can overcome these challenges. Additionally, addressing compliance issues, cleaning unstructured data, and mitigating bias are essential steps in preparing data for successful AI implementation. The life science industry stands to benefit significantly from developing a clear data strategy and investing in data preparation to unlock the full potential of AI technology.