Automated Theory Synthesis: Constructing Partial Least Squares Structural Equation Models via LLM-Driven Literature Mining

Keywords: LLM-based literature filtering, partial least squares structural equation modeling. Novelty: 7.5

The construction of structural equation models traditionally relies on manual literature reviews to define latent variables and their causal relationships, a process that is slow, subjective, and prone to missing critical confounding factors. As scientific literature expands exponentially, researchers need automated methods to synthesize fragmented causal claims into unified, testable theoretical frameworks. Bridging natural language processing and variance-based statistical modeling enables the automatic extraction and validation of complex theoretical networks directly from the corpus of scientific knowledge.

Approach

We propose a framework that integrates LLM-based systematic literature filtering with Partial Least Squares Structural Equation Modeling (PLS-SEM). Using a multi-stage retrieval and extraction pipeline inspired by [Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead](/paper/art_53d63e4e3d0c4657a21637dfddd9e4de), our system first filters thousands of empirical papers to isolate studies relevant to a target domain. From each retained study it then extracts the latent constructs the study defines, their observable indicators (the measurement model), and the reported directional effects between constructs (the structural model). We aggregate these extracted relationships into a unified meta-analytic PLS-SEM model, which lets us statistically validate the LLM-synthesized theoretical framework against raw empirical data, extending the variance-based techniques used in [Examining the antecedents of Facebook acceptance via structural equation modeling: A case of CEGEP students](/paper/art_1edeb977cc52485f9e627e786163cf9b).
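The aggregation step can be illustrated with a minimal sketch. The proposal does not specify how extracted effects are pooled, so the sample-size weighting below is an assumption, as are the `PathClaim` record, the `pool_paths` helper, and the construct abbreviations in the usage note (all hypothetical names chosen for illustration).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PathClaim:
    """One directional effect extracted from a paper (hypothetical schema)."""
    source: str   # predictor construct
    target: str   # outcome construct
    beta: float   # reported standardized path coefficient
    n: int        # sample size of the reporting study


def pool_paths(claims):
    """Pool claims per (source, target) pair.

    Assumption: a sample-size-weighted mean is used as the pooled
    coefficient. A pair is flagged as conflicting when the studies
    disagree on the sign of the effect.
    """
    grouped = defaultdict(list)
    for claim in claims:
        grouped[(claim.source, claim.target)].append(claim)

    pooled = {}
    for pair, group in grouped.items():
        total_n = sum(c.n for c in group)
        weighted_beta = sum(c.beta * c.n for c in group) / total_n
        has_positive = any(c.beta > 0 for c in group)
        has_negative = any(c.beta < 0 for c in group)
        pooled[pair] = {
            "beta": weighted_beta,
            "studies": len(group),
            "total_n": total_n,
            "conflicting": has_positive and has_negative,
        }
    return pooled
```

A call such as `pool_paths([PathClaim("PEOU", "PU", 0.4, 100), PathClaim("PEOU", "PU", 0.6, 300)])` returns one pooled entry per directed path. Paths flagged as conflicting are candidates for closer inspection before they enter the structural model.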

Experimental Plan

We evaluate the pipeline in the technology acceptance domain, using open-source survey datasets on digital platform adoption. An LLM systematically reviews 5,000 papers from the Semantic Scholar Open Research Corpus and automatically generates a structural equation model of user adoption. We hypothesize that the LLM-synthesized model will identify valid latent constructs and structural paths, and will achieve explained variance comparable or superior to that of expert-crafted models. Baselines include standard, manually derived theoretical models and models generated via naive keyword-based literature searches. Performance is measured with the Standardized Root Mean Square Residual (SRMR) as a model fit index, composite reliability for the measurement model, and out-of-sample predictive power.
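Two of the metrics named above have simple closed forms, sketched here in plain Python. The sketch assumes standardized indicator loadings and correlation matrices as inputs. The function names are illustrative, and a dedicated PLS-SEM package may compute SRMR with slightly different conventions (for example, in how the diagonal is treated).

```python
import math


def composite_reliability(loadings):
    """Composite reliability of one construct from standardized loadings.

    rho_c = (sum(l))^2 / ((sum(l))^2 + sum(1 - l^2)),
    where 1 - l^2 is each indicator's error variance.
    """
    squared_sum = sum(loadings) ** 2
    error_variance = sum(1.0 - l ** 2 for l in loadings)
    return squared_sum / (squared_sum + error_variance)


def srmr(observed_corr, implied_corr):
    """Root mean square of the correlation residuals.

    Assumption: both inputs are square correlation matrices (lists of
    lists) and the average runs over the lower triangle including the
    diagonal, i.e. p * (p + 1) / 2 elements.
    """
    p = len(observed_corr)
    squared_residuals = [
        (observed_corr[i][j] - implied_corr[i][j]) ** 2
        for i in range(p)
        for j in range(i + 1)
    ]
    return math.sqrt(sum(squared_residuals) / len(squared_residuals))
```

Here `composite_reliability` would be called once per latent construct with that construct's loadings, while `srmr` compares the empirical indicator correlation matrix with the one implied by the fitted model.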

Open Questions

Can large language models automatically extract and synthesize valid measurement and structural models directly from unstructured scientific literature?
How does the predictive validity of an automatically generated structural equation model compare to one manually constructed by domain experts?
To what extent can automated literature filtering resolve conflicting causal claims by aggregating extracted path coefficients into a unified variance-based model?