The Evolving Healthcare Data Stack
Executive Summary
Growth of healthcare data combined with government regulation is creating new levels of complexity. Today, healthcare data accounts for over 30% of all data in the world and is growing at the fastest rate of any category. The average hospital produces roughly 50 petabytes every year, twice the amount in the Library of Congress. At the same time, this data is becoming increasingly accessible because of new government requirements and the proliferation of APIs.
Data strategy has become a competitive necessity for all healthcare organizations. Healthcare enterprises must access, store, and normalize huge amounts of data, share it with integration-heavy third-party vendors, and leverage it in decision making to stay competitive. This has become untenable at scale, even for the most sophisticated organizations. In 2023, Mulesoft estimated healthcare organizations spend $5M per year on integration alone. Our conversations with industry leaders point to numbers greater than $50M for large healthcare enterprises.
The healthcare data stack can be simplified to include 4 layers:
1. User Interface. Front end interface that the end user interacts with.
2a. Analytics. Interpretation of data to derive insights.
2b. Data storage. Tools for transformation, enrichment, cleaning, and storage of healthcare data.
3. Data infrastructure. Infrastructure enabling data sharing and consumption.
As the modern healthcare tech stack evolves, the cost of innovation will decrease. In the past, healthcare organizations spent vast amounts of time and resources to collect disparate data sets. Today, organizations can contract with vendors for nationwide access to clinical, insurance, provider and functional transactions. Accordingly, value will shift to analysis and insights (and away from data collection). As a result, we are seeing a rebundling of the healthtech stack where organizations are able to create their own custom data stacks by connecting point solutions using APIs and cloud infrastructure.
AI and ML are adding fuel to the fire. Historically, unstructured data like clinical notes or diagnostic images has been difficult or impossible to analyze at a population level. Generative AI allows for highly sophisticated interpretation of unstructured or multi-source data that far exceeds the capabilities of logic trees or knowledge graphs. This innovation further increases the importance of the healthcare backend data stack.
Company Ventures' areas of opportunity. We like companies addressing the following areas: (1) data interoperability solutions that lower the burden or automate data connections for both clinical and administrative data, (2) scheduling and discovery solutions that integrate provider licensing, insurance data, and digital calendars at scale to reduce search costs and out-of-network leakage, (3) data storage / management solutions that address distinct needs in data prep (standardization, semantics) or storage (FHIR, vector) for LLM utilization, and (4) AI / ML applications for specific clinical use cases.
How to engage with Company Ventures. If you are a founder or investor interested in this space, please reach out to us! There are multiple ways to partner with Company Ventures including:
Engage with our investment fund.
Apply for the GCT residency program (next cohort September 2024).
Connect with a portfolio company.
I. Framing The Opportunity
Healthcare data accounts for over 30% of all data in the world and is growing at the fastest rate of any category. The average hospital produces 137 terabytes of data per day, or roughly 50 petabytes every year. For perspective, that is twice the amount of data housed in the Library of Congress (source). The amount of data generated in healthcare is increasing at a rate of 36% per year, driven by proliferation of medical devices, electronic medical records, genetic testing, and patient-generated health data (source).
Healthcare data is uniquely difficult to work with for the following reasons:
Complexity and structure. ~80% of healthcare data is unstructured, including imaging data, clinical notes, discharge summaries, and radiology reports. This type of data is difficult to integrate into a database format for use in analytics and decision support tools.
Non-standardized semantics. Semantic interoperability allows organizations to share data across their internal and external ecosystems without losing meaning. For example, a system recognizes that “BP”, “systolic”, “diastolic”, and “blood pressure” are related in meaning when scanning records to produce a graph of a patient’s systolic and diastolic ranges (see the sketch after this list).
Privacy regulations. Companies must comply with HIPAA regulations at a federal level, state regulations, plus local policies, making data exchange between organizations complex. Business Associate Agreements (BAAs) and bespoke data usage rights cause massive delays. Complexity increases exponentially when data is exchanged across countries with different privacy regulations.
Data silos. Large healthcare organizations often store data across several different cloud providers and on-premise servers.
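To make the semantics problem concrete, below is a minimal Python sketch of the kind of synonym-to-concept mapping a semantic layer performs. The terminology codes are illustrative LOINC-style identifiers; a production system would rely on a full terminology service rather than a hand-built table.

```python
# A minimal sketch of semantic normalization: mapping free-text field names
# to a canonical concept so "BP", "systolic", and "blood pressure" can be
# grouped on a single patient chart. Codes shown are illustrative; a real
# system would use a terminology service (LOINC/SNOMED mappings).

# Synonym table: local / free-text labels -> canonical concept
SYNONYMS = {
    "bp": "blood_pressure_panel",
    "blood pressure": "blood_pressure_panel",
    "systolic": "systolic_bp",
    "sbp": "systolic_bp",
    "diastolic": "diastolic_bp",
    "dbp": "diastolic_bp",
}

# Canonical concepts with illustrative standard codes
CONCEPTS = {
    "blood_pressure_panel": {"system": "LOINC", "code": "85354-9"},
    "systolic_bp": {"system": "LOINC", "code": "8480-6"},
    "diastolic_bp": {"system": "LOINC", "code": "8462-4"},
}

def normalize(label: str) -> dict | None:
    """Return the canonical concept for a local label, or None if unmapped."""
    concept = SYNONYMS.get(label.strip().lower())
    return CONCEPTS.get(concept) if concept else None

# Records pulled from two systems that label the same measurement differently
records = [
    {"field": "Systolic", "value": 128},
    {"field": "DBP", "value": 82},
    {"field": "blood pressure", "value": "128/82"},
]

for r in records:
    print(r["field"], "->", normalize(r["field"]))
```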
Innovation has moved outside of the EMR, creating complex integration challenges. In wave 1 of the healthcare digital revolution (2008-2016), EMRs acted as a singular solution for medical records, RCM, scheduling, data storage, analytics, etc. In wave 2 (2017-present), new entrants have created innovative point solutions across these categories. As a result, we are starting to see the proliferation of modern data sharing modalities, driven by the private sector adopting modern APIs (e.g. RESTful, FHIR).
Government regulation is creating standardization and interoperability requirements.
Anti-Information Blocking: As part of the 21st Century Cures Act, information blocking is prohibited for health networks, HIEs, health IT vendors, and providers. Information blocking is a practice that restricts the legally permitted use and exchange of electronic health information. It can occur when providers request patient records from another institution, when migrating to a new EHR, or when patients try to access their own medical records. (AMA)
The Trusted Exchange Framework and Common Agreement: Establishes standards for nationwide interoperability to simplify healthcare connectivity and information exchange. It is a common set of principles and technical requirements designed to facilitate trust between Health Information Networks (HINs) to enable widespread information exchange. (HealthIT.gov)
United States Core Data for Interoperability (v5): A standardized set of health data for nationwide, interoperable health information exchange. Data classes are now extensive in Version 5, including health insurance information, patient demographics, lab tests, imaging, clinical notes, medications, diagnoses, etc. (HealthIT.gov)
Prior Authorization API Final Rule: Requires payers to implement prior authorization APIs to facilitate electronic prior auth processes between payers and providers. This will improve the efficiency of the prior auth process, reduce admin burden, and prevent delays in patient care. (HHS)
Fast Healthcare Interoperability Resources (FHIR): A standard describing data formats and elements (known as "resources") and an application programming interface (API) for exchanging electronic health records (EHR). A minimal exchange over the FHIR REST API is sketched below.
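For illustration, here is a minimal sketch of how a client might retrieve FHIR resources over the standard REST API. The server URL and patient ID are placeholders, not a real endpoint; the resource shapes follow the FHIR R4 Patient and Observation definitions.

```python
# A minimal sketch of how FHIR "resources" are exchanged over the standard
# REST API. The base URL and patient ID below are hypothetical placeholders.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/r4"  # hypothetical server

# Fetch a single Patient resource by id
patient = requests.get(
    f"{FHIR_BASE}/Patient/12345",
    headers={"Accept": "application/fhir+json"},
).json()
print(patient["resourceType"], patient.get("birthDate"))

# Search for that patient's blood pressure observations (LOINC 85354-9)
bundle = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "12345", "code": "85354-9"},
    headers={"Accept": "application/fhir+json"},
).json()

# Search results come back as a Bundle resource containing matching entries
for entry in bundle.get("entry", []):
    obs = entry["resource"]
    print(obs.get("effectiveDateTime"), obs.get("component", []))
```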
Data strategy has become a competitive necessity for all healthcare organizations. Healthcare enterprises must access and store massive amounts of data, normalize and share it with integration-heavy third-party vendors, and leverage it in decision making to stay competitive in today’s market. This has become untenable at scale, even for the most sophisticated organizations. In 2023, Mulesoft estimated healthcare organizations spend $5M per year on integration alone. Our conversations with industry leaders point to numbers greater than $50M for large healthcare enterprises.
II. How Will Changes in Backend Data Systems Transform Healthcare?
We believe the integration and analysis of healthcare data will continue to become more valuable and more complex because of the following:
Proliferation of APIs. There are more APIs than ever before in healthcare, offering high quality data access, functionality, and novel capabilities. Businesses that leverage these APIs, in combination with novel analytics, will differentiate themselves from those that don’t.
Data has become increasingly central to operations and care decisions. As payers and providers shift from Fee-for-Service (FFS) to Value Based Care (VBC), they must leverage technology to manage the risk of their patient populations. Pharma increasingly relies on real world evidence, now standard in FDA submissions.
Government regulations and interoperability. Governments are increasingly requiring data sharing, standardization, and interoperability.
Technological advancements. Adoption of cloud, advances in artificial intelligence, and the proliferation of technologies like Robotic Process Automation (RPA) offer meaningful value propositions for large enterprises, but they require data access, storage, and processing to enable.
The backend data stack in healthcare is evolving to include 4 layers. Across these layers we see the most opportunity in Data Infrastructure and Analytics.
The cost of innovation will decrease as interoperability becomes standardized. In the past, healthcare organizations spent vast amounts of time and resources to collect disparate data sets. Today, organizations can contract with vendors for nationwide access to clinical, insurance, provider and functional transactions. Accordingly, value will shift to analysis and insights (and away from data collection). Using large datasets to derive clinical or administrative insights is becoming increasingly valuable, while the value of simply aggregating data is declining. With federal regulations like TEFCA, we expect to see data access commoditize. Data insights will (1) increasingly drive care decisions and allow for the use of population-level insights in analyzing lab results, choosing providers, etc., and (2) leverage novel data sets like claims, SDoH, etc. to lower administrative burden in areas like billing and health equity.
The adoption of APIs is producing unique, re-bundled healthtech stacks. Healthcare organizations are increasingly creating their own custom data stacks by connecting point solutions, APIs, and infrastructure. For example, Carbon Health, a large, tech-first primary care startup, bypassed legacy EHRs altogether, famously stating: “We’re going to build our own end-to-end care platform from the ground up — without even looking at legacy EHRs.” Companies can achieve this through a series of build vs. buy decisions, stitching together a unique platform designed for their exact needs.
The Unbundled Healthcare Data Stack Landscape
Healthcare cloud adoption will drive scalability, cost savings, and interoperability - all of which allow for greater capabilities in analytics, AI/ML, and next generation applications. Healthcare organizations have been slower to adopt cloud solutions because of concerns over control and security. That is starting to change as cloud providers develop healthcare-specific offerings like AWS HealthLake or the Google Cloud Healthcare API. Accordingly, cloud revenue in healthcare is expected to grow from $39.4B in 2022 to $89.4B in 2027, a 17.8% CAGR. We believe the shift to cloud is inevitable; however, unique complexities of healthcare data remain that will need to be addressed.
Case study: The challenges of storing EHR data on cloud. Vivian Neilliy at Google wrote an interesting article on the challenges of storing healthcare data on the cloud. When health systems stream EHR data to the cloud, it must be converted from the HL7v2 format (a messaging standard for data transfer and consistency across disparate systems) to the FHIR standard (a new government-mandated standard that groups data into relational categories called “resources”, like conditions that refer to lab test results). A single, standard patient admission record is converted into 15 separate FHIR resources. For a small hospital system sending ~1M messages to the cloud each day, that means 15M FHIR resources need to be created. Interestingly, this exceeds the limits of some cloud vendors’ capabilities, showing the challenges of moving to the cloud in healthcare.
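A highly simplified sketch of that fan-out is shown below: one pipe-delimited HL7v2 admission message is parsed and mapped into several FHIR-style resources. The segment fields and mappings are abridged for illustration; real conversions cover far more fields and, as the case study notes, can yield 15 or more resources per admission record.

```python
# A simplified sketch of the HL7v2 -> FHIR fan-out described above: one
# pipe-delimited admission (ADT) message becomes multiple FHIR resources.
# Fields and mappings are abridged; real conversions are far richer.

ADT_MESSAGE = "\r".join([
    "MSH|^~\\&|ADT|HOSP|||202401010830||ADT^A01|MSG0001|P|2.5",
    "PID|1||MRN12345^^^HOSP||DOE^JANE||19800101|F",
    "PV1|1|I|ICU^01^A",
    "DG1|1||I10^Essential hypertension^ICD-10",
])

def hl7_to_fhir(message: str) -> list[dict]:
    """Convert one simplified HL7v2 ADT message into FHIR-style resources."""
    segments = {line.split("|")[0]: line.split("|") for line in message.split("\r")}
    resources = []

    # PID segment -> Patient resource
    pid = segments["PID"]
    family, given = pid[5].split("^")[:2]
    resources.append({
        "resourceType": "Patient",
        "identifier": [{"value": pid[3].split("^")[0]}],
        "name": [{"family": family, "given": [given]}],
        "birthDate": pid[7],
    })

    # PV1 segment -> Encounter resource ("I" = inpatient)
    pv1 = segments["PV1"]
    resources.append({
        "resourceType": "Encounter",
        "class": {"code": pv1[2]},
        "location": [{"location": {"display": pv1[3]}}],
    })

    # DG1 segment -> Condition resource
    dg1 = segments["DG1"]
    code, display = dg1[3].split("^")[:2]
    resources.append({
        "resourceType": "Condition",
        "code": {"coding": [{"code": code, "display": display}]},
    })

    return resources

resources = hl7_to_fhir(ADT_MESSAGE)
print(f"1 HL7v2 message -> {len(resources)} FHIR resources")
```

Even this reduced three-resource mapping, applied to the ~1M messages per day cited above, implies millions of resource writes per day; the full 15-resource expansion is what produces the scaling pressure the case study describes.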
AI / ML will allow for next-generation analysis of structured and unstructured multi-modal data. Historically, unstructured data like clinical notes or imaging data has been difficult or impossible to analyze at a population level. Generative AI opens the door to highly sophisticated interpretation of unstructured and multi-source data (EMR, claims, Rx, SDoH, etc.) into clinical and operational intelligence. The flexibility of this approach far exceeds the capabilities of logic trees or knowledge graphs. The recent acquisition of Science.io by Veradigm (Allscripts) points towards a solution in this category finding true scale.
III. Opportunity Areas
Data Interoperability. Data integration is a pressing problem for large healthcare organizations because (1) massive amounts of new healthcare data are being created each year, (2) more and more external point solutions require APIs, and (3) the government is increasingly requiring the exchange of data. Accordingly, we like solutions that lower the burden or automate data connections for both clinical and administrative data.
Scheduling and discovery. Finding and scheduling care with providers who are in network remains one of the largest pain points in healthcare. We think this is a data and business model problem. For example, state licensing databases have comprehensive information on providers practicing in a state, clearinghouses have data on insurance coverage, and providers keep their schedules in cloud-based calendars. We are looking for solutions that integrate these data sources at scale to reduce search costs and out-of-network leakage.
Data storage / management: As healthcare data continues to grow in volume and complexity, it will fuel AI innovations across the healthcare industry. This will require new thinking in data management technologies, allowing for effective utilization of massive and unstructured datasets. We like businesses that address distinct needs in data prep (standardization, semantics) or storage (FHIR, vector) for LLM utilization (a minimal retrieval sketch appears at the end of this section).
AI / ML applications for specific clinical use cases. AI / ML models can effectively analyze unstructured data (recall that ~80% of healthcare data is unstructured). For example, an AI copilot could leverage on-premise diagnostic images and the corresponding clinical notes at scale to increase the efficiency and accuracy of the provider. We think this type of business model will benefit from a scale moat over time as the AI model trains on disparate data sources across healthcare organizations.
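As a concrete (and deliberately toy) illustration of the vector-storage idea referenced above, the sketch below chunks clinical notes, embeds them, and retrieves the most relevant passages for a query so they can be passed to an LLM as grounding context. The embed() function is a stand-in for a real embedding model, and the notes are fabricated examples.

```python
# A minimal sketch of "vector storage for LLM utilization": clinical note
# chunks are embedded and stored so an LLM application can retrieve the most
# relevant passages for a query. embed() is a toy stand-in for a real
# embedding model, and the notes below are fabricated.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding (hash-seeded); swap in a real model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Build a tiny in-memory vector index over note chunks
note_chunks = [
    "Patient reports intermittent chest pain on exertion.",
    "Discharge summary: started lisinopril for hypertension.",
    "Radiology: no acute findings on chest x-ray.",
]
index = [(chunk, embed(chunk)) for chunk in note_chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed(query)
    scored = sorted(index, key=lambda item: float(item[1] @ q), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

# The retrieved chunks would be passed to an LLM as grounding context
print(retrieve("Does the patient have a history of hypertension?"))
```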