Wang Shiquan: Building a Multimodal Data Lake to Help Data Fusion in Medical Industry
Published: 2023-10-11 09:50:02
Hospital data governance currently faces many bottlenecks, including serious data redundancy and the difficulty of fusing multimodal data. How to connect data across different business scenarios, especially heterogeneous data, so that it can truly circulate is an important topic in hospital digital transformation. In July 2023, at a smart hospital seminar organized by the China Journal of Health Information Management, Wang Shiquan, a medical informatization expert in China, shared a technical route for data fusion based on building a multimodal big data platform that integrates the data lake and the data warehouse.
Wang Shiquan said that the essence of big data lies not in the scale of the data but in how specialized data is stored and processed, so that the value of data as a factor of production can be mined.
Wang Shiquan first analyzed the difficulties that hospital data governance currently faces, which fall into three main areas.
First, data duplication and redundancy are serious. Because business in the medical field is complex and diverse, there are many types of information systems, and the division of data processing responsibilities among them is unreasonable and inconsistent, which makes duplication and redundancy worse.
Second, the semantic gap created by heterogeneous data makes multimodal data hard to fuse. Data of different modalities are described inconsistently: medical data are generated by different devices, and the formats, coding methods and granularity that each device produces differ widely, so bridging the semantic gap remains a long-term task.
Third, real-time and high-concurrency data processing are difficult. The traditional data warehouse is oriented mainly toward data analysis applications; it handles unstructured data poorly and cannot quickly meet the needs of data exploration, data mining and business modeling.
Solving these data governance problems requires technological innovation in big data infrastructure. At present, architectures that run a lake and a warehouse in parallel are widely used, "even with more than one lake and one warehouse". The drawbacks of this kind of data platform are high system complexity and poor real-time performance. "Using cloud-native multimodal database technology to build a multimodal big data platform that integrates lake and warehouse is the future direction of technology development," Wang Shiquan said.
The "four unifications" of lake and warehouse integration make the data truly integrated: they break with the traditional mixed Hadoop+MPP deployment mode and unify the technical architecture of the data lake.
First, unified integration. By reviewing the existing data interface specification documents, a unified data access specification is formulated. A unified data integration platform is established that is compatible with existing acquisition interfaces, replaces the traditional mode of independent acquisition pipelines, and brings data acquisition under unified management.
Second, unified storage. Unstructured and semi-structured data in the medical field keep increasing, so a distributed data management system should be adopted to provide common storage management services for different storage engines. Concretely, structured data is standardized and stored in the archive pool, while unstructured data is indexed and the index data is stored in the archive pool. This new open data platform architecture offers unified storage management and a unified external interface. It combines the structure and governance of a data warehouse with the scalability of a data lake, which can effectively reduce operation and maintenance costs and avoid data silos.
Third, unified control. Some hospital information systems lack unified planning and data standards management, which causes many problems for upper-layer data applications. The platform provides data management and control across the whole life cycle: it manages multimodal data and metadata in a unified way and supports unified multi-tenant management, so that tenants are completely isolated at the resource layer, data layer and application layer. This addresses the problems of unclear data lineage and complex management.
Fourth, unified application. The data operation layer provides SQL syntax support, so a single interface can handle different services and different data models. This removes the interface and development-language switching caused by changing scenarios or databases, and keeps core business systems such as HIS and EMR from being overwhelmed when they share data. The platform supports different types of computing tasks, such as batch processing and stream processing, as well as fusion analysis of cross-modal data, which provides strong technical support for the construction of smart hospitals.
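To make the unified storage and unified application ideas more concrete, here is a minimal sketch and not the platform's actual implementation. It assumes that SQLite (from the Python standard library) can stand in for the platform's SQL layer; the table names lab_results and unstructured_index, the lake:// URIs and the sample rows are all hypothetical. Structured results sit in one table, unstructured files are represented only by index entries, and a single SQL statement queries across both.

```python
import sqlite3


def build_archive_pool() -> sqlite3.Connection:
    """Create an in-memory stand-in for the archive pool (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lab_results (         -- standardized structured data
            patient_id TEXT, test_code TEXT, value REAL, unit TEXT
        );
        CREATE TABLE unstructured_index (  -- index entries; the files live elsewhere
            patient_id TEXT, modality TEXT, object_uri TEXT
        );
        """
    )
    # Illustrative rows only; none of these values come from the article.
    conn.executemany(
        "INSERT INTO lab_results VALUES (?, ?, ?, ?)",
        [("P001", "GLU", 6.1, "mmol/L"), ("P002", "GLU", 5.2, "mmol/L")],
    )
    conn.executemany(
        "INSERT INTO unstructured_index VALUES (?, ?, ?)",
        [
            ("P001", "CT", "lake://imaging/P001/ct-0001.dcm"),
            ("P001", "PATHOLOGY", "lake://pathology/P001/slide-0007.svs"),
            ("P002", "MRI", "lake://imaging/P002/mri-0003.dcm"),
        ],
    )
    return conn


def find_objects_for_high_glucose(conn: sqlite3.Connection, threshold: float):
    """One SQL statement spans the structured table and the unstructured index."""
    return conn.execute(
        """
        SELECT r.patient_id, i.modality, i.object_uri
        FROM lab_results AS r
        JOIN unstructured_index AS i ON i.patient_id = r.patient_id
        WHERE r.test_code = 'GLU' AND r.value > ?
        ORDER BY i.modality
        """,
        (threshold,),
    ).fetchall()


if __name__ == "__main__":
    pool = build_archive_pool()
    for row in find_objects_for_high_glucose(pool, 6.0):
        print(row)
```

The point of the sketch is that the caller uses one query language and one connection for both data models, which is the kind of interface switching the fourth unification is meant to remove.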
The multimodal big data platform that integrates lake and warehouse supports three application scenarios.
The first is the hospital big data platform solution. Without reworking the hospital's existing information systems, the hospital data center and medical applications can be decoupled in layers, and the hospital data center can be built as a whole on the multimodal big data platform. It provides unified, on-demand data services for clinical work, management and scientific research, which avoids unnecessary repeated data acquisition and interface integration. In the clinical data center (CDR), production system data is synchronized to the CDR in real time, and data from multiple systems is associated and fused on the basis of the patient master index, enabling flexible multidimensional self-service analysis and second-level response for BI analysis. The operation data center (ODR) supports the association and fusion of massive multi-source data and meets the data analysis needs of integrating hospital business and finance. The research data center (RDR) supports unified storage and efficient retrieval of multimodal data for the diseases under study and provides visual self-service analysis and exploration tools. It also offers intelligent tools such as AI analysis, graph construction and analysis, and a distributed graph database system, which can meet the performance requirements of building large-scale medical knowledge graphs.
The second is the regional medical center solution. It builds a medical cloud platform architecture of "2 centers, 2 systems and 4 middle platforms". The "2 centers" are the "medical and health data resource center", which stores the data collected by institutions at all levels, and the "unified resource service center", which publishes data, models and services in the form of catalogs. The "2 systems" are the "data standard and specification system" and the "security, operation and maintenance, and assurance system". The "4 middle platforms" are a technology middle platform that provides the underlying computing and storage components, a data middle platform that integrates, develops, schedules, manages and models data, an AI middle platform that develops models and algorithms on demand, and a business middle platform that develops, tests and deploys application systems on demand.
The third is the clinical research platform solution. Faced with multi-source heterogeneous medical data, single-modality data analysis can no longer meet the needs of clinical research, and the field is moving toward multimodal big data fusion. A multimodal clinical research platform integrating lake and warehouse has been built. It can integrate, manage and analyze data of all modalities and types, including clinical, imaging, pathology, genetic, waveform, monitoring and brain function data, improving the hospital's precision diagnosis and treatment capability and the efficiency of clinical research innovation, and helping hospitals move quickly into the era of multimodal big data medicine. The system covers mainstream terminology systems in China and abroad, such as ICD, SNOMED CT and LOINC, and stores and manages clinical information centrally. Its data types include structured and unstructured data from multiple sources: clinical data such as electronic health records (EHR), electronic medical records (EMR), HIS data and LIS data; image data such as CT, MRI and pathology; gene sequencing data such as FASTQ and VCF; and hospital operation management data. On this platform, hospital data of all kinds can be collected, parsed, cleaned, standardized, processed with NLP, quality-checked, integrated, analyzed, mined and applied, helping turn medical data into high-quality data assets.
In addition, the multimodal big data platform that integrates lake and warehouse can strongly support business development in the following ways.
Improving disease prediction and public health event prediction. By collecting and analyzing data in the medical field, the laws and trends of diseases can be found in big data, and the occurrence, spread and trend of epidemic diseases can be predicted, improving the capacity for epidemic prevention and control.
Optimizing the allocation of medical resources. The platform can analyze how patients seek medical treatment, predict the demand for various medical resources and optimize their use, improving the efficiency of resource utilization.
Personalized medical service. By collecting and analyzing patient data, the platform can establish patient health information files, give doctors a basis for more accurate personalized medical services and improve the efficiency of diagnosis and treatment.
Improving clinical decision-making. By analyzing and mining clinical data, the platform can help doctors better analyze and judge a disease, formulate more scientific treatment plans and improve the level of diagnosis and treatment.
Promoting medical scientific research. The platform can integrate and manage all kinds of medical data, provide data support for researchers and promote the development of medical scientific research.
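As an illustration of how patient health information files might be assembled from several systems using a patient master index, the following is a minimal sketch under stated assumptions and not a description of the platform itself. The MASTER_INDEX table, the record fields system, local_id and payload, and all identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical cross-reference from (source system, local patient ID) to a
# master patient index ID. A real platform would maintain this mapping itself.
MASTER_INDEX = {
    ("HIS", "H-1001"): "MPI-0001",
    ("LIS", "L-77"): "MPI-0001",
    ("EMR", "E-5"): "MPI-0002",
}


def build_patient_files(records):
    """Group source records under their master patient index ID.

    Records whose (system, local_id) pair is not in the index are returned
    separately so they can be reviewed instead of being silently dropped.
    """
    files = defaultdict(list)
    unmatched = []
    for rec in records:
        mpi_id = MASTER_INDEX.get((rec["system"], rec["local_id"]))
        if mpi_id is None:
            unmatched.append(rec)
        else:
            files[mpi_id].append(rec)
    return dict(files), unmatched


if __name__ == "__main__":
    sample = [
        {"system": "HIS", "local_id": "H-1001", "payload": "admission record"},
        {"system": "LIS", "local_id": "L-77", "payload": "lab report"},
        {"system": "EMR", "local_id": "E-5", "payload": "progress note"},
        {"system": "LIS", "local_id": "L-99", "payload": "lab report"},
    ]
    patient_files, needs_review = build_patient_files(sample)
    print(patient_files)
    print(needs_review)
```

Keeping unmatched records in a separate list is a design choice of this sketch: it makes gaps in the index visible to data governance staff rather than hiding them.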
Note: Wang Shiquan is an IT specialist, a senior engineer at Mediway Technology Co., Ltd., and an explorer of technological innovation in HIT (Hospital Information Technology) in China. He has been deeply involved in medical and health informatization for more than 20 years.