Intro
Multimodal machine learning combines different types of data, mimicking human perception. It is being applied in healthcare, robotics, and multimedia, where diverse information must be processed together. Researchers face challenges in integrating these data types, since models must handle heterogeneous inputs while maintaining coherence. Despite the difficulties, multimodal learning's potential in areas like autonomous driving[^1] and human-computer interaction[^2] is widely recognized.
Huge credit to Liang et al., 2022[^3]; most figures below are taken from that survey unless noted otherwise.
Rapid advancements in multimodal research have created challenges in:
- Recognizing consistent patterns across past and current studies.
- Identifying the fundamental unresolved issues in this area of study.
Multimodal Research
Modality refers to a way in which a natural phenomenon is perceived or expressed. Multimodal means that multiple modalities are involved. Modalities sit along a spectrum from raw (closer to the sensor) to abstract (further from the sensor, deeper in the data-processing pipeline). In a multimodal computational study, the modalities are heterogeneous but interconnected.
Representation
Representation is about combining different types of data. Are the heterogeneity and interconnections of the modalities learnable during the process?
- Representation Fusion: Mixing information from different sources into a joint representation, effectively reducing the number of separate representations (a minimal sketch follows this list).
- Representation Coordination: Exchanging information between sources while keeping one representation per modality, so each representation is improved by the others.
- Representation Fission: Breaking information down into smaller, separate parts that reflect the data's internal structure.
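To make fusion concrete, here is a minimal sketch assuming PyTorch and pre-extracted unimodal features; the class name and dimensions (2048 for an image backbone, 768 for a text encoder) are illustrative, not from the survey:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Fuse image and text features: project each modality into a shared
    space, concatenate, and reduce to one joint vector."""

    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)  # image features -> joint space
        self.txt_proj = nn.Linear(txt_dim, joint_dim)  # text features -> joint space
        self.fuse = nn.Sequential(                     # 2 representations -> 1
            nn.Linear(2 * joint_dim, joint_dim),
            nn.ReLU(),
        )

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.fuse(z)  # single fused representation per example

# Dummy pre-extracted features for a batch of 4 image-text pairs.
fused = LateFusion()(torch.randn(4, 2048), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 512])
```

Coordination would instead keep the `img_proj` and `txt_proj` outputs as two separate representations and tie them together with a similarity objective (e.g., a contrastive loss) rather than concatenating them; fission would expand them into a larger set of factors.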
Alignment
Alignment is about finding connections between the elements of different modalities.
- Discrete Alignment: Linking specific (discrete) data elements and identifying their connections (a minimal sketch follows this list).
- Continuous Alignment: For modality elements that are not already segmented and discretized, methods based on warping and segmentation have been proposed.
- Contextualized Representations: Learning how elements of different modalities interact in order to obtain better representations.
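As a minimal sketch of (soft) discrete alignment, assuming both modalities' elements have already been embedded into a shared space; the function name, dimensions, and temperature below are illustrative:

```python
import torch
import torch.nn.functional as F

def discrete_alignment(word_emb, region_emb, temperature=0.07):
    """Soft discrete alignment: score every (word, image-region) pair by
    cosine similarity, then softmax over regions so each word gets a
    distribution over the regions it aligns to."""
    w = F.normalize(word_emb, dim=-1)    # (num_words, d)
    r = F.normalize(region_emb, dim=-1)  # (num_regions, d)
    sim = w @ r.T / temperature          # all pairwise similarity scores
    return sim.softmax(dim=-1)           # each row sums to 1

align = discrete_alignment(torch.randn(6, 256), torch.randn(10, 256))
print(align.shape)  # torch.Size([6, 10]): 6 words aligned over 10 regions
```

The same similarity-then-softmax pattern, applied inside a model, is cross-modal attention, which is one common route to the contextualized representations above.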
Reasoning
Reasoning means composing knowledge and using information to solve problems, typically through multiple inferential steps.
- Structure Modeling: Understanding how the problem is set up and how the reasoning occurs.
- Intermediate Concepts: Breaking the problem into steps.
- Inference Paradigms: Figuring out how to reach conclusions (abstract concepts are inferred from individual multimodal evidence).
- External Knowledge: Using outside information to help.
Generation
This means creating new data (raw modalities) through a generative process.
- Summarization: Condensing the important points from large amounts of multimodal data.
- Translation: Converting data from one modality to another while preserving content.
- Creation: Generating new multimodal data that is coherent across modalities.
Transference
This is about transferring knowledge from one modality to help with another.
- Cross-modal transfer: Using what we know about one modality to understand another. Models are first trained on readily available labeled or unlabeled data in a secondary modality, then adapted or fine-tuned for tasks involving the primary modality, effectively transferring knowledge across data types.
- Multimodal Co-learning: Sharing information between modalities through a shared intermediate representation space.
- Model Induction: Keeping the models and training for each modality separate while still sharing information. Separate algorithms are trained on different data views; each algorithm's predictions pseudo-label new data for the other view, expanding its training set. Information thus crosses views through predictions rather than shared representation spaces (see the co-training sketch after this list).
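A toy co-training sketch of model induction, using synthetic data and scikit-learn; the variable names and the shared label pool are simplifications (classic co-training keeps a separate training set per view):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two synthetic "views" of the same 200 items (stand-ins for, e.g., the
# image features and text features of the same examples).
n = 200
y_true = rng.integers(0, 2, size=n)
X_img = y_true[:, None] + rng.normal(size=(n, 5))  # view A
X_txt = y_true[:, None] + rng.normal(size=(n, 5))  # view B

y_lab = {i: int(y_true[i]) for i in range(20)}     # only 20 labeled examples
unlabeled = set(range(20, n))

for _ in range(5):  # a few co-training rounds
    idx = sorted(y_lab)
    y = [y_lab[i] for i in idx]
    clf_img = LogisticRegression().fit(X_img[idx], y)
    clf_txt = LogisticRegression().fit(X_txt[idx], y)

    # Each view's model pseudo-labels the example it is most confident
    # about; that label then trains both views next round, so knowledge
    # crosses views through predictions, not shared representations.
    for clf, X in ((clf_img, X_img), (clf_txt, X_txt)):
        pool = sorted(unlabeled)
        proba = clf.predict_proba(X[pool])
        best = int(np.argmax(proba.max(axis=1)))
        y_lab[pool[best]] = int(proba[best].argmax())
        unlabeled.discard(pool[best])

acc = np.mean([y_lab[i] == y_true[i] for i in y_lab])
print(f"label accuracy after co-training: {acc:.2f}")
```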
Quantification
Studying how all of the above works through empirical and theoretical analysis of multimodal models, aiming to improve understanding and, with it, model robustness, interpretability, and real-world reliability.
- Dimensions of Heterogeneity: Understanding the ways in which different types of data are unique.
- Modality Interconnections: How different types of data connect and interact
- Multimodal Learning Process: The challenges of working with heterogeneous data
Some tools / studies
Visual
Gradient-weighted Class Activation Mapping (Grad-CAM)
The Class Activation Mapping (CAM) approach [CVPR 2016] modifies image classification CNN architectures by replacing fully-connected (FC) layers with convolutional layers and global average pooling to achieve class-specific feature maps.
Grad-CAM [ICCV 2017] instead uses the gradients of any target concept (e.g., 'dog' in a classification network, or a sequence of words in a captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the image regions that matter for predicting the concept[^4]. In other words, the gradient information flowing into the last convolutional layer is used to assign an importance value to each neuron for the particular decision of interest, without modifying the architecture.
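A minimal Grad-CAM sketch, assuming PyTorch and torchvision with ResNet-50's `layer4` as the final convolutional block (the random tensor stands in for a real preprocessed image; with it the heatmap is meaningless, but the mechanics are the same):

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2").eval()  # downloads pretrained weights
feats, grads = {}, {}

# Capture the last conv block's activations and the gradients flowing back into it.
model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 224, 224)      # stand-in for a preprocessed image
logits = model(img)
logits[0, logits.argmax()].backward()  # d(predicted-class score)/d(activations)

# Grad-CAM: weight each feature map by its spatially averaged gradient,
# sum over channels, then ReLU to keep only positive evidence.
weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # (1, 2048, 1, 1)
cam = F.relu((weights * feats["a"]).sum(dim=1))      # (1, 7, 7) coarse map
cam = F.interpolate(cam[None], size=img.shape[-2:], mode="bilinear")[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]
```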
[^1]: Xiao, Y., Codevilla, F., Gurram, A., Urfalioglu, O., & López, A. M. (2020). Multimodal end-to-end autonomous driving. *IEEE Transactions on Intelligent Transportation Systems*, 23(1), 537–547.

[^2]: Muhammad, G., Alshehri, F., Karray, F., El Saddik, A., Alsulaiman, M., & Falk, T. H. (2021). A comprehensive survey on multimodal medical signals fusion for smart healthcare systems. *Information Fusion*, 76, 355–375.

[^3]: Liang, P. P., Zadeh, A., & Morency, L.-P. (2022). Foundations and trends in multimodal machine learning: Principles, challenges, and open questions. *arXiv preprint arXiv:2209.03430*.

[^4]: Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In *Proceedings of the IEEE International Conference on Computer Vision* (pp. 618–626).