The health impacts and genetic architecture of food liking in cardio-metabolic diseases

Epidemiological study
Studying population
This study used the epidemiological data from the UKB under the projects 90232 and 203867, which were approved by the Northwest Multicenter Research Ethics Committee in the United Kingdom, the National Information Governance Board for Health and Social Care in England and Wales, and the Community Health Index Advisory Group in Scotland. All participants provided written informed consent. From March 2006 to October 2010, a total of 502,507 participants aged 40–69 years across the UK were enrolled10,63. In this study, we included 182,087 participants who completed the food-liking questionnaire and provided valid responses. Our study adhered to the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines to ensure transparent and comprehensive reporting of observational studies.
Food-liking measurement
Food-liking data was collected via an online questionnaire ( from 182,176 participants. Liking was assessed using a 9-point hedonic scale, where 1 represents extremely dislike and 9 represents extremely like. This widely used scale has good statistical properties, discriminates well between different points, and exhibits linearity64. In this study, we included a total of 176 FLTs, consisting of 140 FLTs from the questionnaire and 36 FLTs constructed based on food similarity attributes (Fig. 1B, Supplementary Method 1 and Supplementary Data 25)10. We excluded 186 participants (0.1%) who responded never tried or do not wish to answer on more than 30% of the 176 food and beverage items, resulting in a final sample of 182,087 participants.
Food intake measurement
Dietary intake data was collected using the Oxford WebQ, a validated, web-based 24-h self-administered dietary assessment tool designed for large observational studies36. Previous large-scale cohort studies have shown that this 24-h dietary assessment method provides a reliable estimate of usual dietary intake65. Detailed information on food intake measurement can be found at https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100118.
Measurement of circulating cardiometabolic related proteins
Plasma cardiometabolic protein levels were measured using proteomic profiling on blood plasma samples with the antibody-based Olink Explore 3072 PEA, which detects 2941 protein analytes. A total of 2923 unique proteins were captured, including 732 cardiometabolic-related proteins. Detailed information for Olink technology is presented in Supplementary Method 2.
Covariates and outcome assessment
Standard covariates of no interest is used in the epidemiological study, including age (years), self-reported sex (men/women), current smoking status (yes/no), current drinking status (yes/no), BMI (kg/m2), education level, Townsend deprivation index, Metabolic Equivalent of Task (MET), systolic blood pressure (SBP), diastolic blood pressure (DBP), high density lipoprotein cholesterol (HDL-C), low density lipoprotein cholesterol (LDL-C), total cholesterol (TC), and triglycerides (TG).
The main outcomes in the epidemiological study were CMDs, including total CVDs, IHD, MI, stroke, HF, and T2D. Information on the incidence and timing of these diseases was collected through the cumulative medical records of hospital diagnoses. The International Classification of Diseases, 10th edition (ICD-10) codes were used to categorize these conditions. Total CVDs were defined as ICD-10 codes I20–I25, I46, I60, I61, I63 and I64, including IHD (I20–I25), MI (I21-I23, I24.1 and I25.1), stroke (I60, I61, I63 and I64), and HF (I50). T2D were defined as ICD-10 codes E11.
MR analysis
Data source in MR
Exposures
The summary-level GWAS data for more than 176 specific food likings were obtained from a recent study encompassing 161,625 individuals of European ancestry in the UK Biobank study10. The GWAS data for food-liking phenotypes were measured using a 9-point hedonic scale. These FLTs were validated through genetic correlation with magnetic resonance imaging brain traits or corresponding food consumption traits, which showed substantial agreement in the direction of effects across 11 independent cohorts.
The genetic instruments for each exposure were selected based on genome-wide significance (P < 5.00 × 10–8) and were independent of each other linkage disequilibrium (r2 < 0.001 within 10,000 kb). In the multi-variable MR analyses, the genetic instruments consisted of SNVs that were genome-wide significant in the GWAS for each exposure and independent of each other.
Covariates
This study selected educational attainment, household income, and physical activity as covariates because these factors have been documented to be associated with FLTs and CMDs7,15,16. These three covariates were all obtained from the UK Biobank. Household income was estimated as self-reported average total household income in 2006–2010 before tax reduction (N = 311,028)66. Educational attainment was measured by the age at which participants finished full-time education (N = 283,749)67. Physical activity was measured using average acceleration from a wrist-worn accelerometer (Axivity AX3) (N = 91,084)68.
Outcomes
The summary-level GWAS data for T2D (N = 440,735), IHD (N = 453,733), and MI (N = 406,565) were sourced from FinnGen, released in 202422. Previous studies have shown that while the Finnish population has some unique characteristics, it shares the greatest genetic similarity with other European populations22,69. The summary-level GWAS data for HF was sourced from a recent meta-analysis of GWAS data, including 977,323 individuals of European ancestry across 26 studies conducted by the Heart Failure Molecular Epidemiology for Therapeutic Targets Consortium24. Data for stroke was sourced from a meta-analysis of GWAS data, including 446,696 individuals of European ancestry, conducted by the MEGASTROKE Consortium23. Detailed information on the GWAS data for each outcome is shown in Supplementary Table 2.
Our study followed the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology by MR) guidelines to ensure transparent and comprehensive reporting of observational studies.
Statistical analysis
All epidemiological and MR analyses were conducted in R software (version 4.3.1) using the R packages Survival (version 3.5.5), rms (version 6.7.0), and limma (version 3.6.4). MR analyses were conducted using the R packages TwoSampleMR (version 0.5.7), MVMR (version 0.4), MRPRESSO (version 1.0), and MRlap (version 0.0.3.0). All genetic architecture analyses were performed after excluding SNVs in the major histocompatibility complex region (chromosome 6: 25–35 megabase [Mb]) due to its complex linkage disequilibrium structure. Analyses were restricted to biallelic SNVs with a minor allele frequency larger than 0.01.
Descriptive analysis
The baseline characteristics, including demo-graphic characteristics and anthropometric measures, were presented as mean (standard deviation, SD) for continuous variables and Numbers (proportion, %) for categorical variables, which were respectively compared by t-test or χ2 test according to the baseline status of cardiometabolic diseases. A 2-sided P-value < 0.05 was considered statistically significant.
Cox proportional hazards regression models
The Cox proportional hazards models were divided into two sections. Section 1: the models were used to evaluate the survival relationship between FLTs and the incidence of CMDs after excluding participants who had CMDs at the baseline survey. Survival time was defined as the period from participation in the UKB assessment center to the diagnosis of the main outcomes or up to February 1, 2024. A series of covariates were controlled for, including age, sex, smoking status, drinking status, BMI, education level, income, Townsend deprivation index, MET, SBP, DBP, HDL-C, LDL-C, TC, and TG. Section 2: the models were performed to examine whether circulating cardiometabolic proteins, influenced by specific FLTs, were associated with the risk of CMDs after adjustment for covariates in section 1, and a two-sided P-FDR < 0.05 was considered statistically significant. Moreover, we conducted a sensitivity analysis to evaluate the potential for reverse causation between FLTs and CMDs. Specifically, participants classified as high-risk for CMDs, based on QRISK3 risk prediction models70, were excluded. The second sensitivity analysis excluded participants whose self-reported sex did not align with their genetically determined sex, to evaluate whether disaggregating by sex and gender would influence our results. The third sensitivity analysis examined whether sex has modifying effects on the associations between FLTs and CMDs.
Identification of differentially expressed protein
We used the limma Bioconductor package to identify differentially expressed proteins by comparing groups based on their liking for a particular food: those who extremely like it (scored 9 on the hedonic scale) versus those who extremely dislike it (scored 1 on the hedonic scale). The statistical significance threshold was set at FDR-adjusted two-sided P < 0.05.
Restricted cubic spline
Restricted cubic spline analysis was used to visualize the survival dose-response relationship, the identified differential circulating proteins and CMDs by setting 5 knots at the 5th, 25th, 50th, 75th, and 95th percentiles. Analysis of variance was used to examine the linear or non-linear relationship of the spline. The covariates controlled in the restricted cubic spline analysis were similar to those in the Cox proportional hazards regression models.
Univariate MR and multi-variable MR analysis
We applied univariate MR to assess the genetic association of each FLT with each type of CMDs. Significant pairs for FLT and CMD were selected. Then, we performed multi-variable MR to evaluate these identified pairs with adjustment for household income, educational attainment, and physical activity to determine whether FLTs had independent genetic association on CMDs.
MR analysis employs genetic variants as instrumental variables to infer causal relationships between exposures and outcomes. Genetic variants, being randomly allocated at conception, are independent of self-selected behaviors and predate disease onset, reducing confounding and reverse causality. The genetic instrument effect is estimated through genotype randomization during meiosis, similar to blinding in trials, ensuring robust association with the exposure, independence from confounders, and no alternative pathways to the disease outcome.
Instrument variables used for MR need to satisfy 3 assumptions71. First, instrument variables must be robustly associated with CMDs. Second, instrumental variables should not be related to confounding factors. Third, instrumental variables should affect the outcome only through the exposure factor. The 176 FLTs were categorized into 10 major food groups, and we performed univariate MR analysis separately for each group. We calculated an FDR-adjusted two-sided P-value to test statistical significance. Then, statistically significant FLTs and disease pairs were then tested using multi-variable MR analysis, with a P-value < 0.05 considered statistically significant. The IVWs were considered genetic associations only if they had the same direction and statistical significance as at least one sensitivity analysis. MR estimates were presented as ORs with corresponding 95% CIs.
We used the random-effect IVW method as the main analysis in univariable MR72, and performed the multi-variable IVW as the main analysis in multi-variable MR73. The IVW method produces the most precise, unbiased, and efficient causal estimates if instrument variables satisfy MR assumptions74. The random-effects IVW method accounts for heterogeneity among instrument variables, while the multi-variable IVW accounts for correlations between multiple exposures.
MR sensitivity analysis
In univariable MR, we performed the Bayesian weighted, weighted median, weighted mode, and MR-Egger methods to validate the robustness of the IVW results based on different assumptions (Supplementary Method 3)75,76,77,78,79. In multi-variable MR, we performed the multi-variable MR-median, multi-variable MR-Egger, and multi-variable MR-Lasso methods to validate the robustness of the multi-variable IVW results (Supplementary Method 4)80. The F-statistics, Egger intercept, Cochran’s Q statistics, and Steiger test were applied to assess the validity and heterogeneity of the instrument variables as well as reverse causation, respectively81,82,83. Additionally, we performed bidirectional univariate MR analysis for the FLTs and CMDs pairs identified in the multi-variable MR analysis. In this analysis, CMDs were treated as exposures and FLTs as outcomes to assess whether genetically predicted CMDs could influence FLTs for detecting reverse causation.
Analysis for genome-wide genetic correlations
The LDSC and high-definition likelihood method were used to estimate genome-wide genetic correlations between FLTs and CMDs84,85. For the LDSC analysis, we used European 1000 Genomes Phase 3 data as the reference and set a filtering threshold for minor allele frequencies above 0.01 to enhance statistical power and minimize the influence of rare variants. The standard error was estimated using the leave-one-out method in LDSC. Population overlap was also estimated to provide unbiased genetic correlation estimates for improving the reliability of the results. Compared with the LDSC method, the high-definition likelihood method makes more efficient use of GWAS data, reducing the variance in genetic association estimates by about 60%, which further validates the reliability of our findings.
Identification of the genomic risk SNVs and related mapped genes
We used the PLACO method to identify shared pleiotropic SNVs between trait pairs with significant genetic correlations86. The PLACO summary statistics were processed using the FUMA SNP2GENE platform (v1.5.0) ( to identify lead and insignificant SNVs, genomic risk SNVs, gene mapping, and exploration of biological pathways, including MAGMA gene analysis, MAGMA gene-set analysis, and TSEA. We set European 1000 Genomes Phase 3 data as the reference, and a P < 1.00 × 10−5 as the threshold value. Lead SNVs were defined as those with linkage disequilibrium r² ≤ 0.1 with a distance of <250 kb. Independent significant SNVs were defined as those with linkage disequilibrium r² ≤ 0.6 with a distance of <250 kb87. Based on the linkage disequilibrium information of lead SNVs, genomic risk loci were further characterized and annotated. Detailed information is presented in Supplementary Data 13 and 14.
To confirm the biological relevance of the genomic risk SNVs, we performed functional gene mapping using two methods: (1) positional mapping, assigning SNVs to genes based on physical distances (default window of 10 kb), and (2) eQTL mapping, assigning SNVs to genes based on gene expression levels using cis-eQTLGen and GTEx Whole Blood v8 data. The P-value threshold of significant SNV-gene pairs was set at P–FDR < 0.05. The MHC region was excluded from both annotations due to its complicated linkage disequilibrium structure, defined as between the MOG and COL11A2 genes.
Phenome-wide association with the lead SNVs
We investigated the lead SNVs identified in each pair of FLT and corresponding disease to determine whether these loci had been previously associated with any clinical phenotype. Three major platforms: the EMBL-EBI GWAS Catalog (EMBL-EBI 2024) (https://www.ebi.ac.uk/gwas/home)88, IEU Open GWAS (v8.9.1) (https://gwas.mrcieu.ac.uk/)89, and GWAS Atlas (v2.0) (https://atlas.ctglab.nl/PheWAS)90 were utilized.
In the EMBL-EBI GWAS Catalog, we examined whether each lead SNV was linked to any clinical phenotype (e.g., rs11052302, query date: September 20, 2024). For IEU Open GWAS and GWAS Atlas, we conducted comprehensive searches for lead SNVs and directly extracted significantly associated clinical phenotypes (query date: September 20, 2024). We applied a uniform P threshold of <1 × 10−5 as the search criterion across all platforms14. To manage linkage disequilibrium and remove redundant associations among independent significant SNVs, we followed the procedure: If a lead SNV had clinical associations, it was considered the primary locus. If the lead SNV did not show any clinical phenotype associations, we examined independent significant SNVs that were closely correlated with the lead SNV, starting with the most significant ones, until an established association was identified14.
For visualization purposes, we categorized clinical phenotypes into 548 items, which were primarily related to CMDs and associated clinical traits, brain development and neurological disorders, inflammation indices, as well as diet and lifestyle behaviors. Keyword cloud plots were created based on the frequency of these clinical phenotypes within each FLT and CMD pair. The size of each rectangle in the keyword cloud plots reflects the total number of associations between the lead SNVs identified in our study and the clinical phenotypes reported in the literature.
MAGMA gene annotation
We performed genome-wide gene-level association analysis by the MAGMA gene analysis method. Genes located in SNVs were annotated and mapped using reference European 1000 Genomes Phase 3 data as the reference, employing the SNV-wide mean model. Gene-level P-values were obtained for traits inputted against curated protein-coding genes containing valid SNVs. The significance of pleiotropic genes was declared at P-FDR <0.0591.
Summary data-based MR analysis using reference pQTL or eQTL
The summary data-based MR analysis consisted of two sections. Section 1: a summary data-based MR analysis was conducted to investigate the potential causal relationship between plasma circulating protein levels associated with FLTs and corresponding types of CMDs. For this, we used reference pQTL data from the UKB PPP project92. Section 2: we conducted SMR analysis to examine whether the SNVs associated with both traits in each FLT-CMD pair were driven by the same gene expression with reference to eQTL data from whole blood gene expression and eQTLGen datasets. The summary data-based MR method integrates an MR framework to infer causality by testing associations between GWAS and eQTL or pQTL data93. The HEIDI test was applied to distinguish causality (or pleiotropy) from linkage, where possible, based on the available data94. This approach deepened our understanding of the underlying genetic mechanisms and provided valuable insights for drug target discovery for each trait pair. We selected significant probes common to both phenotypes from the HEIDI analysis, using the threshold of summary data-based MR estimated P-value < 0.05 and HEIDI test P-value > 0.05.
Druggability assessment
For the genes we identified, we first searched for corresponding drugs and indications in Drugcentral ( DGIdb ( DrugBank ( and Therapeutic Target Database ( We then supplemented this information with clinical trial data from ClinicalTrials.gov ( and verified whether the identified drugs have been approved by the FDA through the official FDA website ( to assess their potential druggability.
GSEA
GSEA was performed to identify potential shared biological pathways using two approaches. First, we utilized pleiotropic genes identified from the MAGMA gene analysis to explore the biological functions of shared variants95. The MAGMA gene sets were obtained from MsigDB v6.2, including 10,678 gene sets (curated gene sets: 4761, GO terms: 5917). Second, we employed Metascape Web Tools (v3.5) ( to analyze gene sets derived from both the MAGMA gene analysis and mapped genes of the genomic risk SNVs. Significantly enriched pathways were determined using a nominal threshold of two-sided P < 0.05.
TSEA
We used the MAGMA platform to perform gene property analyses for each FLT and CMD pair, using reference GTEx V8 data, which encompassed 54 tissue types. This gene property analysis converted gene-level association P-values to z scores and tested the expression value of a specific tissue’s gene against the average expression value across all tissues in a regression model. To further examine the gene properties of these pleiotropic loci, we also conducted TSEA to deTS analysis ( using two different reference panels95: GTEx and Encyclopedia of DNA Elements project (ENCODE). This methodology highlighted the tissue specificity of pleiotropic genes derived from both the MAGMA gene analysis and the mapped genes of the genomic risk SNVs. Significantly enriched pathways were determined using a nominal threshold of two-sided P < 0.05.
Ethics statement
The UK biobank was approved by the Northwest Multicenter Research Ethics Committee in the United Kingdom, the National Information Governance Board for Health and Social Care in England and Wales, and the Community Health Index Advisory Group in Scotland. All the summary-level GWAS data used in the analyses are publicly available, and therefore, this study was exempt from ethics review. Ethical approval for the GWASs can be found in the corresponding GWAS publications cited in the manuscript.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
link