Markus Gershater On Design Of Experiments, Curating Data and The Role Of AI and Machine Learning
A conversation with Markus Gershater
One of the great things about looking at Data in Biotech is the variety within the industry. Having looked at regulation, digital twins, and real human data, our conversation with Markus Gershater, co-founder at Synthace, looks at the design of experiments and how it influences the management of data.
The Interview
Guest Profile
Biology is the common theme of Markus’s varied career. He has worked at a range of organizations, including Kew Gardens, University College London, and Novacta Biosystems, all looking at where biology intersects with other disciplines. He originally co-founded Synthace as a synthetic biology company, but it has evolved toward helping organizations use automation not just to eliminate tedious tasks but to design, deliver, and aggregate data from better-designed experiments. Markus is a leading expert on synthetic biology automation and experimental design, and is passionate about how his work with Synthace can impact the industry as a whole.
The Highlights
If you are looking for a deep-dive conversation on the design of experiments, there is no substitute for listening to the podcast in full, but below are the highlights of the conversation with Markus.
The Need For Multi-factorial Experiments (6:16): Markus starts by discussing the sheer complexity of biology and evolution, which he describes as a “ridiculously interconnected filigreed system.” This leads to his view that classical experiments, where researchers fix everything but the one factor they are targeting at that particular moment (one-factor-at-a-time, or OFAT, experiments), are limited, because how that factor affects the system depends on how other aspects of the system change as well. It is much more powerful to conduct properly designed, multi-dimensional experiments that give the greatest possible insight into the system being explored.
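To make the contrast concrete, the sketch below compares an OFAT plan with a full factorial design for three hypothetical factors. The factor names and levels are illustrative only, not taken from the podcast; the point is that the factorial design tests every combination of levels, which is what allows interactions between factors to be estimated.

```python
# A minimal sketch contrasting an OFAT plan with a full factorial design.
# Factor names and levels are hypothetical, chosen only for illustration.
from itertools import product

factors = {
    "temperature_C": [30, 37],
    "pH": [6.5, 7.0, 7.5],
    "inducer_mM": [0.1, 1.0],
}

baseline = {"temperature_C": 37, "pH": 7.0, "inducer_mM": 0.1}

# OFAT: vary one factor at a time while holding the others at the baseline.
ofat_runs = []
for name, levels in factors.items():
    for level in levels:
        run = dict(baseline)
        run[name] = level
        ofat_runs.append(run)

# Multi-factorial: every combination of levels, so interactions between
# factors (e.g. pH x temperature) can be estimated from the same runs.
factorial_runs = [
    dict(zip(factors, combo)) for combo in product(*factors.values())
]

print(f"OFAT conditions: {len(ofat_runs)}")                 # 7, including repeated baselines
print(f"Full factorial conditions: {len(factorial_runs)}")  # 2 * 3 * 2 = 12
```

In practice, designs are usually fractional or optimal rather than exhaustive, but the principle of varying factors together rather than one at a time is the same.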
A Digital Experiment Platform (8:50): For many of Markus’s insights, the context of his work with Synthace is an essential starting point. Synthace is a Digital Experiment Platform that gives scientists a set of digital tools and capabilities to help them through the process of running a multi-factorial experiment. From knowing stock concentrations to sending instructions to the equipment, it automates the tedious elements of experiments to drive efficiency and capture critical data. It has a drag-and-drop interface that allows scientists to map out their experiment and then automate it, generating useful data and metadata at every stage. This bank of information is useful not only for individual experiments but also as a library of information when we look to conduct meta-analyses and utilize AI.
Fewer, Better Experiments (15:07): Rather than using a Digital Experiment Platform to run more experiments simultaneously, Markus explains that he sees organizations conducting fewer experiments as a result of being able to design more complex, multi-factorial experiments that provide more insight in a shorter time frame. Organizations are able to use parallelized, rather than iterative, experiments that answer questions more quickly with fewer runs. The example he gives is assay development in drug discovery: biologists can use each experimental run to comprehensively cover a multi-dimensional landscape that can be mapped out “in exquisite detail,” which allows them to answer the question of which conditions are best for the assay much faster.
Curating Better Data (21:46): How to create more relevant data points at a lower cost is one of the really interesting topics Markus discusses as part of the conversation. Design of Experiments, automation, and reducing the need for tedious manual tasks also reduce the cost of experiments and, in turn, the cost of each data point generated. A platform like Synthace’s also allows for more complex experiments that cannot reasonably be achieved by hand, generating richer and more complex data.
Creating Infrastructure for AI (32:12): The library of information generated by designing and running experiments through an engine like Synthace comes into its own when we start to talk about AI. From Markus’s perspective, for AI to reach its full potential, it needs “highly structured, very well curated, fully contextualized, high-quality datasets.” He also illustrates the importance of being more specific when we talk about AI: it only becomes meaningful when we talk about particular types of AI applied to a particular goal, for example, active learning to effectively explore high-dimensional biological landscapes. Being specific also helps us understand the data we will need for that AI to work and ensure that this is what we are generating.
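As a rough illustration of what active learning looks like in this setting, the sketch below (Python with NumPy and scikit-learn) repeatedly fits a model to the measurements gathered so far and proposes the next condition to test where the model is most uncertain. The response function, candidate conditions, and loop length are invented for illustration; this is not Synthace’s implementation.

```python
# A minimal active-learning sketch: the model proposes the next experiment to run
# by picking the condition it is most uncertain about. The "landscape" is a toy
# function standing in for a real biological response.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def run_experiment(x):
    """Stand-in for a real assay readout at condition x (toy function + noise)."""
    return np.sin(3 * x) * np.exp(-x) + rng.normal(0, 0.02)

# Candidate conditions we could test (a 1-D slice of a high-dimensional space).
candidates = np.linspace(0, 2, 200).reshape(-1, 1)

# Start with a handful of measured conditions.
X = candidates[[0, 100, 199]]
y = np.array([run_experiment(x[0]) for x in X])

for round_ in range(5):
    model = GaussianProcessRegressor().fit(X, y)
    _, std = model.predict(candidates, return_std=True)
    next_x = candidates[[np.argmax(std)]]              # most uncertain condition
    next_y = np.array([run_experiment(next_x[0, 0])])  # run that experiment
    X, y = np.vstack([X, next_x]), np.concatenate([y, next_y])
    print(f"round {round_}: measured x = {next_x[0, 0]:.2f}")
```

The loop only works if every measurement comes back with the structure and context the model expects, which is exactly why the curated, contextualized datasets Markus describes matter.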
Integrating Human Feedback into AI-Driven Experimentation Workflows (34:57): When discussing feedback loops for ML, Markus explains it is important to see where human input adds value to experiment-driven data gathering and analysis. There are some instances where the complexity of the subject means an algorithm will understand a problem better than a person, for example, in protein engineering. However, in an area like developing a bioprocess, tapping into the experience of a specialist with 30 years of research in the area is incredibly valuable because of the context-specific knowledge the bioprocess engineer has developed over the course of their career. It is not always about unleashing the power of AI alone, but about understanding where human expertise and AI reinforce each other to drive the end result.
Further reading: For those wanting a little more on the topics covered in the podcast, Markus regularly contributes to the Synthace blog. In addition, he recommends blogs by Jesse Johnson and Erika Alden DeBenedictis, as well as the Codon newsletter from Niko McCarty.
Continuing the Conversation
As with all of our guests, there is certainly more to talk about with Markus Gershater, particularly on the topic of AI. That’s why in this week’s ‘Continuing the Conversation,’ we have chosen to use Markus’s recent blog “AI won't change biology unless we change with it” as the starting point for discussion. It builds on the topics we discussed around AI: that, in a biological context, AI cannot be used to its full potential without sufficiently rich data.
The AI machine needs the correct ‘data fuel,’ and the industry needs to adapt to ensure the right material is being created. Markus talks about this in the context of how the Synthace platform can facilitate it, but are there other steps organizations within the industry can take to generate AI-ready data?
At CorrDyn, we work with a number of biotech organizations and concentrate on four key aspects to make sure the data we generate is AI-ready:
Capturing the metadata surrounding all biological manufacturing and experimentation processes: This metadata, including where the materials come from (source, supplier, lot number, origin, etc.) and when and how they are processed prior to evaluation or outcome testing, is essential to contextualizing the results of any experiment or manufacturing process. It also includes unique identifiers for each plate, well, and item your systems receive and produce, which are passed to each system that produces data related to that item (a simplified sketch of such a record follows this list).
Ensuring that outcome measures are reliable enough, with sufficient volume: Any outcome measured from a biological process has its own confidence intervals and sensitivity to procedural variation. If the outcome a business cares about cannot be measured reliably with the instrumentation in use, then AI will not be able to predict that outcome or generate reliable results. A model is only as good as its data, so data infrastructure is critical to ensuring that models yield reliable results.
Capturing, storing, and exploring observational, model-input data (e.g. sensors, images, readings, transactions) at the volume and resolution needed to capture the phenomena: This is the meat of the model development process: the features that will predict the outcome your organization cares about. Often the hardest part is getting the data into a single system where it can be cleaned, integrated, and analyzed to determine how much information it contains about the outcome and how the data can be improved to contain more.
Conducting data quality checks for each data source: Ensure that each example has the same baseline characteristics so that additional noise does not enter the modeling process.
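As a deliberately simplified illustration of the first and fourth points above, the sketch below shows one way a per-well measurement record might carry identifiers and provenance, together with a baseline check against control wells. The field names, thresholds, and control-well logic are assumptions for illustration, not CorrDyn’s actual tooling.

```python
# A minimal sketch of the kind of record and checks described above.
# Field names and thresholds are hypothetical, not a prescribed schema.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, stdev

@dataclass
class WellMeasurement:
    plate_id: str        # unique identifier for the plate
    well_id: str         # e.g. "A01"
    sample_id: str       # identifier carried across every instrument
    material_lot: str    # supplier lot number of the input material
    protocol_step: str   # how the sample was processed before readout
    instrument_id: str   # which system produced the reading
    timestamp: datetime
    readout: float       # the outcome or feature value

def check_baseline(records, control_sample_id, expected_mean, max_cv=0.15):
    """Simple quality gate: flag a batch whose control wells drift or are too noisy."""
    controls = [r.readout for r in records if r.sample_id == control_sample_id]
    if len(controls) < 3:
        return False, "not enough control wells"
    cv = stdev(controls) / mean(controls)               # coefficient of variation
    drift = abs(mean(controls) - expected_mean) / expected_mean
    ok = cv <= max_cv and drift <= 0.10
    return ok, f"cv={cv:.2f}, drift={drift:.2%}"
```

A check like this, run per batch and per instrument, catches drift and noise before they silently enter the training data.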
By strategizing about what type of data is needed for AI in each biotech context, companies can ensure that they will realize the full potential of AI, not a watered-down version as a result of insufficient ‘data fuel.’