Structural equation models are traditionally built from manual literature reviews that define the latent variables and their causal relationships, a process that is slow, subjective, and prone to missing critical confounding factors. As the scientific literature grows rapidly, researchers need automated methods to synthesize fragmented causal claims into unified, testable theoretical frameworks. Bridging natural language processing and variance-based statistical modeling could enable the automatic extraction and validation of complex theoretical networks directly from the corpus of scientific knowledge.
Approach
We propose a framework that integrates LLM-based systematic literature filtering with Partial Least Squares Structural Equation Modeling (PLS-SEM). Using a multi-stage retrieval and extraction pipeline inspired by [Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead](/paper/art_53d63e4e3d0c4657a21637dfddd9e4de), our system first filters thousands of empirical papers to isolate studies relevant to a target domain. From each retained study it then extracts the latent constructs, their observable indicators (the measurement model), and the reported directional effects between constructs (the structural model). We aggregate these extracted relationships into a unified meta-analytic PLS-SEM model, which lets us statistically validate the LLM-synthesized theoretical framework against raw empirical data, extending the variance-based techniques used in [Examining the antecedents of Facebook acceptance via structural equation modeling: A case of CEGEP students](/paper/art_1edeb977cc52485f9e627e786163cf9b).
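The aggregation step can be illustrated with a minimal sketch. It assumes the extraction stage emits one record per reported directional effect; the `ExtractedPath` schema, the `aggregate_paths` helper, the `min_support` threshold, and the construct names are all hypothetical placeholders for illustration, not part of the proposed system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedPath:
    """One directional claim pulled from a single paper (hypothetical schema)."""
    paper_id: str
    source: str   # predictor construct
    target: str   # outcome construct
    sign: int     # +1 for a positive reported effect, -1 for a negative one


def aggregate_paths(claims, min_support=2):
    """Pool per-paper claims into candidate structural paths.

    Returns a dict mapping (source, target) to the number of distinct
    supporting papers and the net sign of the reported effects.
    The min_support cutoff is an assumed filtering rule.
    """
    papers = defaultdict(set)
    net_sign = defaultdict(int)
    for c in claims:
        key = (c.source, c.target)
        if c.paper_id not in papers[key]:  # count each paper once per path
            papers[key].add(c.paper_id)
            net_sign[key] += c.sign
    return {
        key: {"support": len(ids), "net_sign": net_sign[key]}
        for key, ids in papers.items()
        if len(ids) >= min_support
    }


# Illustrative claims only; these are not extracted results.
claims = [
    ExtractedPath("p1", "perceived_usefulness", "intention", +1),
    ExtractedPath("p2", "perceived_usefulness", "intention", +1),
    ExtractedPath("p2", "perceived_ease_of_use", "perceived_usefulness", +1),
    ExtractedPath("p3", "perceived_ease_of_use", "perceived_usefulness", +1),
    ExtractedPath("p3", "social_influence", "intention", -1),
]
model = aggregate_paths(claims)
```

Paths reported by only one paper are dropped here, so the candidate structural model keeps only relationships with some cross-study support. A fuller version might weight claims by sample size or effect size rather than counting papers.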
Experimental Plan
We evaluate our pipeline in the domain of technology acceptance, using open-source survey datasets related to digital platform adoption. An LLM systematically reviews 5,000 papers from the Semantic Scholar Open Research Corpus and automatically generates a structural equation model of user adoption. We hypothesize that the LLM-synthesized model will identify valid latent constructs and structural paths, and that its explained variance (R²) will be comparable to or higher than that of expert-crafted models. Baselines include standard, manually derived theoretical models and models generated from naive keyword-based literature searches. Performance is measured using three PLS-SEM evaluation criteria: the Standardized Root Mean Square Residual (SRMR) for overall model fit, composite reliability for the measurement model, and out-of-sample predictive power.
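Two of these criteria have simple closed forms, sketched below under stated assumptions. Composite reliability is computed from standardized indicator loadings as (Σλ)² / ((Σλ)² + Σ(1 − λ²)), and SRMR as the root mean squared difference between observed and model-implied correlations over the lower triangle including the diagonal. The function names and the example numbers are illustrative only, and a real evaluation would normally take these values from a PLS-SEM package rather than compute them by hand.

```python
import math


def composite_reliability(loadings):
    """Composite reliability from standardized outer loadings.

    rho_c = (sum l)^2 / ((sum l)^2 + sum(1 - l^2))
    """
    s = sum(loadings)
    error = sum(1.0 - l * l for l in loadings)
    return s * s / (s * s + error)


def srmr(observed, implied):
    """SRMR between two correlation matrices given as nested lists.

    Averages squared residuals over the lower triangle, diagonal included
    (one common convention; some software excludes the diagonal).
    """
    p = len(observed)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            total += (observed[i][j] - implied[i][j]) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))


# Illustrative inputs only, not results from the proposed study.
rho_c = composite_reliability([0.8, 0.8, 0.8])
fit = srmr([[1.0, 0.5], [0.5, 1.0]], [[1.0, 0.4], [0.4, 1.0]])
```

With three indicators each loading 0.8, the composite reliability is 5.76 / 6.84, roughly 0.84. In the two-variable example a single off-diagonal residual of 0.1 gives an SRMR of about 0.058.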