Conceptual design evaluation is an indispensable part of innovation in the early stages of engineering design. Properly assessing the effectiveness of conceptual design requires a rigorous evaluation of its outputs. Traditional methods for evaluating conceptual designs are slow, expensive, and difficult to scale because they rely on human expert input. An alternative is to evaluate design concepts computationally. However, most existing computational methods have limited utility because they are constrained to unimodal design representations (e.g., text or sketches). To overcome these limitations, we propose a machine learning (ML) model based on attention-enhanced multimodal learning (AEMML) that predicts five design metrics: drawing quality, uniqueness, elegance, usefulness, and creativity. The proposed model draws on knowledge from large external datasets through transfer learning (TL), simultaneously processes text and sketch data from early-phase concepts, and fuses the multimodal information through a mutual cross-attention mechanism. To study the efficacy of multimodal learning (MML) and attention-based information fusion, we compare (1) a baseline MML model with the unimodal models and (2) the attention-enhanced models with the baseline models, in terms of their explanatory power for the variability of the design metrics. The results show that MML improves explanatory power by 0.05–0.12 and that the mutual cross-attention mechanism further increases it by 0.05–0.09, yielding the highest explanatory power of 0.44 for drawing quality, 0.60 for uniqueness, 0.45 for elegance, 0.43 for usefulness, and 0.32 for creativity. Our findings highlight the benefit of using multimodal representations for design metric assessment.
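
To make the fusion step concrete, the following is a minimal sketch of mutual cross-attention between text and sketch features, written in plain Python for readability. It is an illustration under stated assumptions, not the model evaluated here: the function names, the toy feature values, the identity projections (no learned query, key, or value weights), and the mean-pool-then-concatenate fusion are all hypothetical choices. The `r_squared` helper assumes that "explanatory power" refers to the coefficient of determination, which the abstract does not state explicitly.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over the context rows.

    Assumption: identity projections replace learned W_q, W_k, W_v matrices.
    Returns the attended features and the attention weights.
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(context))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, context), weights


def mean_pool(a):
    """Average the rows of a matrix into a single feature vector."""
    return [sum(col) / len(a) for col in zip(*a)]


def mutual_cross_attention_fusion(text_feats, sketch_feats):
    """Fuse two modalities by letting each one attend to the other.

    Assumption: the two attended sequences are mean-pooled and concatenated.
    """
    text_attended, _ = cross_attention(text_feats, sketch_feats)
    sketch_attended, _ = cross_attention(sketch_feats, text_feats)
    return mean_pool(text_attended) + mean_pool(sketch_attended)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy inputs (hypothetical): 3 text tokens and 5 sketch patches, each of dimension 4.
text_feats = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.0], [0.4, 0.4, 0.4, 0.4]]
sketch_feats = [
    [0.2, 0.1, 0.0, 0.7],
    [0.9, 0.3, 0.1, 0.0],
    [0.0, 0.8, 0.6, 0.1],
    [0.5, 0.5, 0.5, 0.5],
    [0.3, 0.0, 0.9, 0.2],
]

fused = mutual_cross_attention_fusion(text_feats, sketch_feats)
print(len(fused))  # 8: a 4-dimensional pooled vector from each direction
```

In a trained model the fused vector would feed a regression head for each design metric, and the projections would be learned rather than fixed. The sketch only shows the direction of information flow: text queries attend over sketch features, and sketch queries attend over text features.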