Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
SYSTEMIC AUTOIMMUNE DISEASES DIAGNOSTIC AND PROGNOSTIC METHOD
Document Type and Number:
WIPO Patent Application WO/2020/260683
Kind Code:
A1
Abstract:
The present invention relates to a method for diagnosis and/or prognosis of a subject afflicted from a systemic autoimmune disease (SAD) which allows the reclassification or reorganization of patients afflicted from a SAD according to their molecular profile.

Inventors:
ALARCÓN RIQUELME MARTA EUGENIA (ES)
BARTUREN BRIÑAS GUILLERMO (ES)
TORO DOMÍNGUEZ DANIEL (ES)
MARTÍNEZ BUENO MANUEL (ES)
CARMONA SÁEZ PEDRO (ES)
CARNERO MONTORO ELENA (ES)
Application Number:
PCT/EP2020/068214
Publication Date:
December 30, 2020
Filing Date:
June 29, 2020
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
FUND PUBLICA ANDALUZA PROGRESO Y SALUD (ES)
International Classes:
G16B40/00
Other References:
G BARTUREN ET AL: "Transcriptome and Methylome Integrative Molecular Analysis Uncovers a New Systemic Autoimmune Disease Classification", ARTHRITIS RHEUMATOL, vol. 70 (suppl 10), 2018, XP055730569
G BARTUREN ET AL: "Molecular stratification of autoimmune diseases based on epigenetic profiles", LUPUS, vol. 5 (suppl 1), S1A:5, 2018, pages A1 - A1, XP055730402, DOI: 10.1136/lupus-2018-abstract.2
SU-JIN MOON ET AL: "Compendium of skin molecular signatures identifies key pathological features associated with fibrosis in systemic sclerosis", ANNALS OF THE RHEUMATIC DISEASES, vol. 78, no. 6, 5 April 2019 (2019-04-05), GB, pages 817 - 825, XP055730654, ISSN: 0003-4967, DOI: 10.1136/annrheumdis-2018-214778
GUILLERMO BARTUREN ET AL: "Integrative Analysis Reveals a Molecular Stratification of Systemic Autoimmune Diseases", MEDRXIV, 23 February 2020 (2020-02-23), XP055730413, Retrieved from the Internet [retrieved on 20200914], DOI: 10.1101/2020.02.21.20021618
S. BABAEI ET AL: "Rna-sequencing of 800 human blood samples reveals shared and unique expression profiles across seven systemic autoimmune diseases", SATURDAY, 16 JUNE 2018, June 2018 (2018-06-01), pages 225.2 - 226, XP055730389, DOI: 10.1136/annrheumdis-2018-eular.6514
LI, Y. R. ET AL.: "Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases", NATURE MEDICINE, vol. 21, 2015, pages 1018 - 1027
BAECHLER, E. C. ET AL.: "Interferon-inducible gene expression signature in peripheral blood cells of patients with severe lupus", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, vol. 100, 2003, pages 2610 - 2615
BENNETT, L. ET AL.: "Interferon and granulopoiesis signatures in systemic lupus erythematosus blood", THE JOURNAL OF EXPERIMENTAL MEDICINE, vol. 197, 2003, pages 711 - 723
TANI, C. ET AL.: "Rhupus syndrome: assessment of its prevalence and its clinical and instrumental characteristics in a prospective cohort of 103 SLE patients", AUTOIMMUNITY REVIEWS, vol. 12, 2013, pages 537 - 541
ALARCON-SEGOVIA, D.CARDIEL, M. H.: "Comparison between 3 diagnostic criteria for mixed connective tissue disease. Study of 593 patients", THE JOURNAL OF RHEUMATOLOGY, vol. 16, 1989, pages 328 - 334
SHARP, G. C.IRVIN, W. S.TAN, E. M.GOULD, R. G.HOLMAN, H. R.: "Mixed connective tissue disease--an apparently distinct rheumatic disease syndrome associated with a specific antibody to an extractable nuclear antigen (ENA", THE AMERICAN JOURNAL OF MEDICINE, vol. 52, 1972, pages 148 - 159, XP026363186, DOI: 10.1016/0002-9343(72)90064-2
SKOPOULI, F. N.DROSOS, A. A.PAPAIOANNOU, T.MOUTSOPOULOS, H. M.: "Preliminary diagnostic criteria for Sjogren's syndrome", SCANDINAVIAN JOURNAL OF RHEUMATOLOGY, 1986, pages 22 - 25
WILSON, W. A. ET AL.: "International consensus statement on preliminary classification criteria for definite antiphospholipid syndrome: report of an international workshop", ARTHRITIS AND RHEUMATISM, vol. 42, 1999, pages 1309 - 1311
PETERSON, L. S.NELSON, A. M.SU, W. P.: "Classification of morphea (localized scleroderma", MAYO CLINIC PROCEEDINGS, vol. 70, 1995, pages 1068 - 1076
MURPHY, D.MATTEY, D.HUTCHINSON, D.: "Anti-citrullinated protein antibody positive rheumatoid arthritis is primarily determined by rheumatoid factor titre and the shared epitope rather than smoking per se", PLOS ONE, vol. 12, 2017, pages e0180655
TERUEL, M.ALARCON-RIQUELME, M. E.: "Genetics of systemic lupus erythematosus and Sjogren's syndrome: an update", CURRENT OPINION IN RHEUMATOLOGY, vol. 28, 2016, pages 506 - 514
CASEY, K. A. ET AL.: "Type I interferon receptor blockade with anifrolumab corrects innate and adaptive immune perturbations of SLE", LUPUS SCIENCE & MEDICINE, vol. 5, 2018, pages e000286
WANG, B. ET AL.: "Similarity network fusion for aggregating data types on a genomic scale", NATURE METHODS, vol. 11, 2014, pages 333 - 337, XP055576311, DOI: 10.1038/nmeth.2810
LANGFELDER, P.HORVATH, S.: "WGCNA: an R package for weighted correlation network analysis", BMC BIOINFORMATICS, vol. 9, 2008, pages 559, XP021047563, DOI: 10.1186/1471-2105-9-559
CIANG, N. C.PEREIRA, N.ISENBERG, D. A.: "Mixed connective tissue disease-enigma variations?", RHEUMATOLOGY, vol. 56, 2017, pages 326 - 333
ALARCON-SEGOVIA, D: "Mixed connective tissue disease: some statements", CLINICAL RHEUMATOLOGY, vol. 1, 1982, pages 81 - 83
CHAUSSABEL, D. ET AL.: "A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus", IMMUNITY, vol. 29, 2008, pages 150 - 164
LI, S. ET AL.: "Molecular signatures of antibody responses derived from a systems biology study of five human vaccines", NATURE IMMUNOLOGY, vol. 15, 2014, pages 195 - 204, XP055199113, DOI: 10.1038/ni.2789
ZHANG, K.KAUFMAN, R. J.: "From endoplasmic-reticulum stress to the inflammatory response", NATURE, vol. 454, 2008, pages 455 - 462, XP037158788, DOI: 10.1038/nature07203
SCHLENNER, S. ET AL.: "NFIL3 mutations alter immune homeostasis and sensitise for arthritis pathology", ANNALS OF THE RHEUMATIC DISEASES, vol. 78, 2019, pages 342 - 349
SOKHI, U. K. ET AL.: "Dissection and function of autoimmunity-associated TNFAIP3 (A20) gene enhancers in humanized mouse models", NATURE COMMUNICATIONS, vol. 9, 2018, pages 658
SMITH, J. A.: "Regulation of Cytokine Production by the Unfolded Protein Response; Implications for Infection and Autoimmunity", FRONTIERS IN IMMUNOLOGY, vol. 9, 2018, pages 422
POLLACK, B. P.: "EGFR inhibitors, MHC expression and immune responses : Can EGFR inhibitors be used as immune response modifiers?", ONCOIMMUNOLOGY, vol. 1, 2012, pages 71 - 74, XP055385135, DOI: 10.4161/onci.1.1.18073
WOHNER, M. ET AL.: "Molecular functions of the transcription factors E2A and E2-2 in controlling germinal center B cell and plasma cell development", THE JOURNAL OF EXPERIMENTAL MEDICINE, vol. 213, 2016, pages 1201 - 1221
BEST, J. A. ET AL.: "Transcriptional insights into the CD8(+) T cell response to infection and memory T cell formation", NATURE IMMUNOLOGY, vol. 14, 2013, pages 404 - 412
JEFFERIES, C. A.: "Regulating IRFs in IFN Driven Disease", FRONTIERS IN IMMUNOLOGY, vol. 10, 2019, pages 325, XP055645240, DOI: 10.3389/fimmu.2019.00325
JAMIN, C. ET AL.: "Multi-center harmonization of flow cytometers in the context of the European ''PRECISESADS'' project", AUTOIMMUNITY REVIEWS, vol. 15, 2016, pages 1038 - 1045, XP029773806, DOI: 10.1016/j.autrev.2016.07.034
YARILINA, A.IVASHKIV, L. B.: "Type I interferon: a new player in TNF signaling", CURRENT DIRECTIONS IN AUTOIMMUNITY, vol. 11, 2010, pages 94 - 104
SPROSTON, N. R.ASHWORTH, J. J.: "Role of C-Reactive Protein at Sites of Inflammation and Infection", FRONTIERS IN IMMUNOLOGY, vol. 9, 2018, pages 754
MANICONE, A. M.MCGUIRE, J. K.: "Matrix metalloproteinases as modulators of inflammation", SEMINARS IN CELL & DEVELOPMENTAL BIOLOGY, vol. 19, 2008, pages 34 - 41, XP023981475, DOI: 10.1016/j.semcdb.2007.07.003
MILLER, S. ET AL.: "Hypomethylation of STAT1 and HLA-DRB1 is associated with type-I interferon-dependent HLA-DRB1 expression in lupus CD8+ T cells", ANNALS OF THE RHEUMATIC DISEASES, vol. 78, 2019, pages 519 - 528
FERNANDO, M. M. ET AL.: "Defining the role of the MHC in autoimmunity: a review and pooled analysis", PLOS GENETICS, vol. 4, 2008, pages e1000024
WU, Y. ET AL.: "Isoliquiritigenin prevents the progression of psoriasis-like symptoms by inhibiting NF-kappaB and proinflammatory cytokines", JOURNAL OF MOLECULAR MEDICINE, vol. 94, 2016, pages 195 - 206
MALFITANO, A. M. ET AL.: "Arvanil inhibits T lymphocyte activation and ameliorates autoimmune encephalomyelitis", JOURNAL OF NEUROIMMUNOLOGY, vol. 171, 2006, pages 110 - 119, XP025039223, DOI: 10.1016/j.jneuroim.2005.09.005
ALTMAN, A.KONG, K. F.: "Protein kinase C inhibitors for immune disorders", DRUG DISCOVERY TODAY, vol. 19, 2014, pages 1217 - 1221
OLEKSYN, D. ET AL.: "Protein kinase Cbeta is required for lupus development in Sle mice", ARTHRITIS AND RHEUMATISM, vol. 65, 2013, pages 1022 - 1031, XP002775956, DOI: 10.1002/art.37825
SEKHON, S.KOO, J.: "Indirubin: a novel topical agent in the treatment of psoriasis", THE BRITISH JOURNAL OF DERMATOLOGY, vol. 178, 2018, pages 21
ZHAO, Y. ET AL.: "Indirubin modulates CD4(+) T-cell homeostasis via PD1/PTEN/AKT signalling pathway in immune thrombocytopenia", JOURNAL OF CELLULAR AND MOLECULAR MEDICINE, vol. 23, 2019, pages 1885 - 1898
TORO-DOMINGUEZ, D. ET AL.: "Stratification of Systemic Lupus Erythematosus Patients Into Three Groups of Disease Activity Progression According to Longitudinal Gene Expression", ARTHRITIS & RHEUMATOLOGY, vol. 70, 2018, pages 2025 - 2035
RITCHIE, M. E. ET AL.: "limma powers differential expression analyses for RNA-sequencing and microarray studies", NUCLEIC ACIDS RESEARCH, vol. 43, 2015, pages e47
"swamp: Visualization", ANALYSIS AND ADJUSTMENT OF HIGH-DIMENSIONAL DATA IN RESPECT TO SAMPLE ANNOTATIONS, 2018
VENABLES, W. N.RIPLEY, B. D., MODERN APPLIED STATISTICS WITH S., 2002
HOADLEY, K. A. ET AL.: "Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin", CELL, vol. 158, 2014, pages 929 - 944, XP029046515, DOI: 10.1016/j.cell.2014.06.049
BRODIN, P. ET AL.: "Variation in the human immune system is largely driven by non-heritable influences", CELL, vol. 160, 2015, pages 37 - 47, XP029132654, DOI: 10.1016/j.cell.2014.12.020
LI, J.LIU, Y.KIM, T.MIN, R.ZHANG, Z.: "Gene expression variability within and between human populations and implications toward disease susceptibility", PLOS COMPUTATIONAL BIOLOGY, vol. 6, 2010
HEYN, H. ET AL.: "DNA methylation contributes to natural human variation", GENOME RESEARCH, vol. 23, 2013, pages 1363 - 1372
BARTUREN, G.BERETTA, L.CERVERA, R.VAN VOLLENHOVEN, R.ALARCON-RIQUELME, M. E.: "Moving towards a molecular taxonomy of autoimmune rheumatic diseases", NATURE REVIEWS. RHEUMATOLOGY, vol. 14, 2018, pages 75 - 93
WEINER 3RD, J.DOMASZEWSKA, T.: "tmod: an R package for general and multivariate enrichment analysis", PEERJ PREPRINTS, vol. 4, 2016, pages e2420v2421
SHABALIN, A. A.: "Matrix eQTL: ultra fast eQTL analysis via large matrix operations", BIOINFORMATICS, vol. 28, 2012, pages 1353 - 1358
HEINZ, S. ET AL.: "Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities", MOLECULAR CELL, vol. 38, 2010, pages 576 - 589
STUNNENBERG, H. G.INTERNATIONAL HUMAN EPIGENOME, C.HIRST, M.: "The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery", CELL, vol. 167, 2016, pages 1897
FERNANDEZ, J. M. ET AL.: "The BLUEPRINT Data Analysis Portal", CELL SYSTEMS, vol. 3, 2016, pages 491 - 495
MARTIN, M.: "Cutadapt removes adapter sequences from high-throughput sequencing reads", EMBNET.JOURNAL, vol. 17, 2011, pages 10 - 12, Retrieved from the Internet
HOTHORN, T.BRETZ, F.WESTFALL, P.: "Simultaneous inference in general parametric models", BIOMETRICAL JOURNAL. BIOMETRISCHE ZEITSCHRIFT, vol. 50, 2008, pages 346 - 363
JOSSE, J.HUSSON, F.: "missMDA: A Package for Handling Missing Values in Multivariate Data Analysis", JOURNAL OF STATISTICAL SOFTWARE, vol. 70, 2016, pages 1 - 31
PURCELL, S. ET AL.: "PLINK: a tool set for whole-genome association and population-based linkage analyses", AMERICAN JOURNAL OF HUMAN GENETICS, vol. 81, 2007, pages 559 - 575, XP055061306, DOI: 10.1086/519795
SUBRAMANIAN, A. ET AL.: "A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles", CELL, vol. 171, 2017, pages 1437 - 1452 e1417
BOUSFIELD, D. ET AL.: "Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources", F1000RESEARCH, vol. 5, 2016
MOSCA, M.TANI, C.VAGNANI, S.CARLI, L.BOMBARDIERI, S.: "The diagnosis and classification of undifferentiated connective tissue diseases", JOURNAL OF AUTOIMMUNITY, vol. 48-49, 2014, pages 50 - 52, XP028636658, DOI: 10.1016/j.jaut.2014.01.019
ALETAHA, D. ET AL.: "rheumatoid arthritis classification criteria: an American College of Rheumatology/European League Against Rheumatism collaborative initiative", ANNALS OF THE RHEUMATIC DISEASES, vol. 69, 2010, pages 1580 - 1588
HOCHBERG, M. C.: "Updating the American College of Rheumatology revised criteria for the classification of systemic lupus erythematosus", ARTHRITIS AND RHEUMATISM, vol. 40, 1997, pages 1725
MIYAKIS, S. ET AL.: "International consensus statement on an update of the classification criteria for definite antiphospholipid syndrome (APS", JOURNAL OF THROMBOSIS AND HAEMOSTASIS : JTH, vol. 4, 2006, pages 295 - 306
VAN DEN HOOGEN, F. ET AL.: "classification criteria for systemic sclerosis: an American college of rheumatology/European league against rheumatism collaborative initiative", ANNALS OF THE RHEUMATIC DISEASES, vol. 72, 2013, pages 1747 - 1755
VITALI, C. ET AL.: "Classification criteria for Sjogren's syndrome: a revised version of the European criteria proposed by the American-European Consensus Group", ANNALS OF THE RHEUMATIC DISEASES, vol. 61, 2002, pages 554 - 558
BOSSINI-CASTILLO, L.LOPEZ-LSAC, E.MARTIN, J.: "Immunogenetics of systemic sclerosis: Defining heritability, functional variants and shared-autoimmunity pathways", JOURNAL OF AUTOIMMUNITY, vol. 64, 2015, pages 53 - 65, XP029321356, DOI: 10.1016/j.jaut.2015.07.005
BERETTA, L. ET AL.: "2016 ACR/ARHP Annual Meeting", ARTHRITIS RHEUMATOL, vol. 68, no. 10, 2016
AGUILAR-QUESADA, R. ET AL., DESIGN AND IMPLEMENTATION OF A MODEL FOR HARMONIZED AND QUALITY SAMPLE HANDLING IN THE PRECISESADS MULTI-CENTER PROJECT, 2019
SCHEUFELE, E. ET AL.: "tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform", AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE, vol. 2014, 2014, pages 96 - 101
FRANKS, A. L.SLANSKY, J. E.: "Multiple associations between a broad spectrum of autoimmune diseases, chronic inflammatory diseases and cancer", ANTICANCER RESEARCH, vol. 32, 2012, pages 1119 - 1136
TANG, H.PENG, J.WANG, P.RISCH, N. J.: "Estimation of individual admixture: analytical and study design considerations", GENETIC EPIDEMIOLOGY, vol. 28, 2005, pages 289 - 301
THORNTON, T. ET AL.: "Estimating kinship in admixed populations", AMERICAN JOURNAL OF HUMAN GENETICS, vol. 91, 2012, pages 122 - 138, XP028425638, DOI: 10.1016/j.ajhg.2012.05.024
CHANG, C. C. ET AL.: "Second-generation PLINK: rising to the challenge of larger and richer datasets", GIGASCIENCE, vol. 4, 2015, pages 7, XP021214690, DOI: 10.1186/s13742-015-0047-8
LOH, P. R. ET AL.: "Reference-based phasing using the Haplotype Reference Consortium panel", NATURE GENETICS, vol. 48, 2016, pages 1443 - 1448
HOWIE, B.FUCHSBERGER, C.STEPHENS, M.MARCHINI, J.ABECASIS, G. R.: "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing", NATURE GENETICS, vol. 44, 2012, pages 955 - 959, XP055151106, DOI: 10.1038/ng.2354
DAS, S. ET AL.: "Next-generation genotype imputation service and methods", NATURE GENETICS, vol. 48, 2016, pages 1284 - 1287
ANDREWS, S., FASTQC: A QUALITY CONTROL TOOL FOR HIGH THROUGHPUT SEQUENCE DATA, 2010
DOBIN, A. ET AL.: "STAR: ultrafast universal RNA-seq aligner", BIOINFORMATICS, vol. 29, 2013, pages 15 - 21, XP055500895, DOI: 10.1093/bioinformatics/bts635
LI, B.DEWEY, C. N.: "RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome", BMC BIOINFORMATICS, vol. 12, 2011, pages 323, XP021104619, DOI: 10.1186/1471-2105-12-323
MIN, J. L.HEMANI, G.DAVEY SMITH, G.RELTON, C.SUDERMAN, M.: "Meffil: efficient normalization and analysis of very large DNA methylation datasets", BIOINFORMATICS, vol. 34, 2018, pages 3983 - 3989
SHERRY, S. T. ET AL.: "dbSNP: the NCBI database of genetic variation", NUCLEIC ACIDS RESEARCH, vol. 29, 2001, pages 308 - 311, XP055125042, DOI: 10.1093/nar/29.1.308
CHEN, Y. A. ET AL.: "Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray", EPIGENETICS, vol. 8, 2013, pages 203 - 209
PIDSLEY, R. ET AL.: "Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling", GENOME BIOLOGY, vol. 17, 2016, pages 208
SALMA, N. ET AL.: "Thrombotic risk assessment and analytical performance of the chemiluminescent analyzer IDS-iSYS for the detection of anti-cardiolipin and anti-beta 2 glycoprotein I autoantibodies", CLINICAL IMMUNOLOGY, vol. 194, 2018, pages 92 - 99, XP085440438, DOI: 10.1016/j.clim.2018.07.006
CAPALDO, C. ET AL.: "The active immunological profile in patients with primary Sjogren's syndrome is restricted to typically encountered autoantibodies", CLINICAL AND EXPERIMENTAL RHEUMATOLOGY, vol. 34, 2016, pages 722
BIESEN, R. ET AL.: "Anti-dsDNA-NcX ELISA: dsDNA-loaded nucleosomes improve diagnosis and monitoring of disease activity in systemic lupus erythematosus", ARTHRITIS RESEARCH & THERAPY, vol. 13, 2011, pages R26, XP021091614, DOI: 10.1186/ar3250
VAN DER AUWERA, G. A. ET AL.: "From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline", CURRENT PROTOCOLS IN BIOINFORMATICS, vol. 43, 2013, pages 11 10 11 - 33
Attorney, Agent or Firm:
ELZABURU S.L.P. (ES)
Download PDF:
Claims:
CLAIMS

1 . A method for the diagnosis and/or prognosis of a subject afflicted from a systemic autoimmune disease (SAD) comprising the following steps: a. Analyzing the transcriptome in a sample from said subject and/or b. Analyzing the DNA methylome in a sample from said subject, and c. Defining a set of clusters performing an integrated clustering analysis of the transcriptome and/or methylome data retrieved in steps (a) and/or (b), and d. Assigning said subject to one of the clusters defined in step (c).

2. The method according to the preceding claim, wherein the integrated clustering analysis is performed using the Similarity Network Fusion (SNF) algorithm for integration of the transcriptome and/or methylome data retrieved in steps (a) and/or (b).

3. The method according to any one of the preceding claims wherein 2 or 4 clusters are defined in step (c).

4. The method according to any one of the preceding claims wherein 4 clusters are defined in step (c).

5. The method according to any one of the preceding claims, wherein the integrated clustering analysis of step (c) comprises the selection of features with statistically increased variability in individuals diagnosed with a SAD compared to healthy controls.

6. The method according to the preceding claim, wherein said selection of features is made with a FDR Levene-test below 0.1 and case vs control variability fold-change higher than 0.5.

7. The method according to any one of the preceding claims, wherein a discovery set and a validation set are used for the integrated clustering analysis of step (c).

8. The method according to any one of the preceding claims, wherein the integrated clustering analysis of step (c) comprises a training step on the discovery set by 10 times nested cross- validation approach on discovery set, preferably not including healthy controls.

9. The method according to the preceding claim, wherein said training comprises inner 5-fold cross-validation for hyperparameter model optimization.

10. The method according to any one of the two preceding claims, wherein the training comprises a background structure substraction.

1 1 . The method according to any one of the three preceding claims, wherein the training comprises an outer cross-validation.

12. The method according to any one of the preceding claims, wherein said systemic autoimmune disease is selected from systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), mixed connective tissue disease (MCTD), systemic sclerosis (SSc), secondary Sjogren’s syndrome (SjS), primary antiphospholipid syndrome (PAPS), undifferentiated connective tissue disease (UCTD) and a combination thereof.

13. The method according to any one of the preceding claims, wherein steps (a) and/or (b) are performed in cell subtypes from whole blood, selected from neutrophils, eosinophils, classic monocytes, intermediate monocytes, non classic monocytes, CD3+ T cells, B cells, NK cells, NK-like cells, DCs and basophils. 14. The method according to any one of the preceding claims, wherein said sample is selected from whole blood, serum, synovia, skin and any combination thereof, preferably is whole blood and/or serum.

15. The method according to any one of the preceding claims, further comprising analyzing SAD related autoantibodies, cytokines, chemokines, inflammatory mediators and any combination thereof in a serum sample.

16. The method according to any one of the preceding claims, wherein the patient is a human patient.

17. The method according to any one of the preceding claims, further comprising the administration of a drug selected from droxinostat, exemestane, BRD-K06817181 , isoliquiritigenin, arvanil, PKCbeta-inhibitor, W-13, IB-MECA, ipsapirone, ITE, olanzapine, MDL-1 1939, veratridine, SB-216763, doxycycline, acebutolol, cholic-acid, ascorbyl-palmitate, MK-1775, rucaparib, indirubin, azaperone, spiroxatrine, alprazolam or any combination thereof.

Description:
SYSTEMIC AUTOIMMUNE DISEASES DIAGNOSTIC AND/OR PROGNOSTIC METHOD

Background

The systemic autoimmune diseases (SADs) are entities diagnosed based on different clinical and laboratory criteria. The diseases are highly heterogeneous with varied progression of disease severity. The search for specific and sensitive biomarkers that may help predict the development of tissue damage have been an important focus of clinical research with little success. In general, the time from disease onset to diagnosis can be of many years, leading to damage accrual and worst prognosis. Some individuals never fulfill the clinical criteria and remain undiagnosed for years or a lifetime. These are known as undifferentiated connective tissue disease (UCTD).

Previous genetic studies have shown that many of the SADs share susceptibility genes 1 and patients may also share, to a larger or lesser degree a gene expression signature known as the interferon signature, that is, the increased expression of interferon inducible genes 23 . Clinical features can also be shared for instance a number of patients with systemic lupus erythematosus (SLE) may develop erosions and joint deformities in hands and feet, similar to those found in rheumatoid arthritis (RA) 1 . Further, the entity known as mixed connective tissue disease (MCTD) is controversial because patients may have clinical manifestations usually observed in SLE, RA or in systemic sclerosis (SSc) While patients with SLE and RA may present with secondary Sjogren’s syndrome (SjS), there are those having what is known as the primary entity (pSjS), without evidence of RA or SLE Z . Similarly, SLE patients may have secondary antiphospholipid syndrome, but there are also patients with primary antiphospholipid syndrome (PAPS) who tend not to develop SLE (hence, primary) even after many years follow-up s .

The heterogeneous nature of these diseases is also observed at cellular and molecular levels. Not all patients with SLE have an interferon signature while the signature is more prevalent in childhood-onset disease. Some patients with SSc have a disease limited to the skin 2 , and not all patients who fulfill the diagnostic criteria for RA have anti-citrullinated peptide antibodies (-70%) 12 Patients with SLE and pSjS share the presence of anti-SSA and anti-SSB antibodies, and in this regard, also share alleles of the HLA class II gene DRB1 * 0301— .

The heterogeneity of the diseases also impedes the identification of new therapies, and worst, it has consequences in our selection of response endpoints and the overall results of clinical trials. One very clear example is the recent reported failure of Anifrolumab (anti-type I IFNR) in a phase III trial, while in a phase II it was successful when patients had been divided into IFN hi and IFN low 1312 It is clear that as we advance, those patients having an IFN signature may benefit most from this new biological, but so far we know nothing of those patients who do not have an IFN signature and remain drug orphan.

In an unprecedented study in autoimmunity, high dimensional molecular data from whole blood shows here how the seven conditions (SLE, RA, SSC, SjS, MCTD, PAPS, and UCTD) distribute among clusters of specific molecular patterns, and how for each individual the patterns are stable over time.

Summary of the invention

The present invention relates to a method for the diagnosis and/or prognosis of a subject afflicted from a systemic autoimmune disease (SAD) comprising the following steps: a. Analyzing the transcriptome in a sample from said subject and/or

b. Analyzing the DNA methylome in a sample from said subject, and

c. Defining a set of clusters performing an integrated clustering analysis of the transcriptome and/or methylome data retrieved in steps (a) and/or (b), and d. Assigning said subject to one of the clusters defined in step (c).

This method allows the reclassification or reorganization of patients afflicted from a SAD according to their molecular profile.

Description of the invention

The inventors have found that performing the method of the invention, SAD patients can be assigned to four different groups which are stable in time and three of which have a very well defined molecular signature: inflammatory, adaptative immune and interferon related.

In a preferred embodiment of the method of the invention, the integrated clustering analysis is performed using the Similarity Network Fusion (SNF) algorithm for integration of the transcriptome and/or methylome data retrieved in steps (a) and/or (b).

In a preferred embodiment of the method of the invention, 2 or 4 clusters are defined in step (c). Preferably, 4 clusters are defined in step (c).

In a preferred embodiment of the method of the invention, the integrated clustering analysis of step (c) comprises the selection of features with statistically increased variability in individuals diagnosed with a SAD compared to healthy controls. Preferably, said selection of features is made with a FDR Levene-test below 0.1 and case vs control variability fold-change higher than 0.5.

In a preferred embodiment of the method of the invention, a discovery set and a validation set are used for the integrated clustering analysis of step (c). In a preferred embodiment of the method of the invention, the integrated clustering analysis of step (c) comprises a training step on the discovery set by 10 times nested cross-validation approach on discovery set, preferably not including healthy controls. In a preferred embodiment of the method of the invention, said training comprises inner 5-fold cross-validation for hyperparameter model optimization. In a preferred embodiment of the method of the invention, the training comprises a background structure substraction. In a preferred embodiment of the method of the invention, the training comprises an outer cross-validation.

In a preferred embodiment of the method of the invention, said systemic autoimmune disease is selected from systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), mixed connective tissue disease (MCTD), systemic sclerosis (SSc), secondary Sjogren’s syndrome (SjS), primary antiphospholipid syndrome (PAPS), undifferentiated connective tissue disease (UCTD) and a combination thereof.

In a preferred embodiment of the method of the invention, steps (a) and/or (b) are performed in cell subtypes from whole blood, selected from neutrophils, eosinophils, classic monocytes, intermediate monocytes, non classic monocytes, CD3 + T cells, B cells, NK cells, NK-like cells, DCs and basophils.

In a preferred embodiment of the method of the invention, the sample is selected from whole blood, serum, synovia, skin and any combination thereof, preferably is whole blood and/or serum.

In a preferred embodiment of the method of the invention, said method further comprises analyzing SAD related autoantibodies, cytokines, chemokines, inflammatory mediators and any combination thereof in a serum sample.

In a preferred embodiment of the method of the invention, the patient is a human patient.

In a preferred embodiment of the method of the invention, said method further comprises the administration of a drug selected from droxinostat, exemestane, BRD-K06817181 , isoliquiritigenin, arvanil, PKCbeta-inhibitor, W-13, IB-MECA, ipsapirone, ITE, olanzapine, MDL-1 1939, veratridine, SB-216763, doxycycline, acebutolol, cholic-acid, ascorbyl-palmitate, MK-1775, rucaparib, indirubin, azaperone, spiroxatrine, alprazolam or any combination thereof.

In another aspect, the present invention relates to a method for defining a SAD patient subpopulation according to the patient’s molecular profile, defined using the method of the first aspect. Another aspect relates to the administration of roxinostat, exemestane, BRD- K06817181 , isoliquiritigenin, arvanil, PKCbeta-inhibitor, W-13, IB-MECA, ipsapirone, ITE, olanzapine, MDL-1 1939, veratridine, SB-216763, doxycycline, acebutolol, cholic-acid, ascorbyl-palmitate, MK-1775, rucaparib, indirubin, azaperone, spiroxatrine, alprazolam or any combination thereof for the treatment of said SAD.

Description of the drawings

Figure 1 . Molecular pattern distribution is limited to 4 clusters and does not follow the clinical diagnoses. (A) Heatmap showing the distribution of gene and CpG functional modules across the 4 autoimmune disease clusters. In columns patients are grouped by cluster assignment and in rows the functional modules of the features are shown with their scaled median values. Each subset of patients represented (discovery and validation sets) is presented separately. For the transcriptome, red represents over-expression and blue represents under-expression. For the methylome, purple represents hypo-methylation and orange represents hyper- methylation. At the top of the figure the annotation shows: two configurations of clusters for 4 (4 CLUSTERS) and 2 (2 CLUSTERS) groups, each of the treatment groups for each individual (SABIO, systemic antibiotics; STED, steroids; BIO, biologies; IMS, immunosuppressors and AM, antimalarials), recruitment centers distribution, disease activity as physician global assessment (PGA), duration of the disease since diagnosis, age at onset, gender, age and diagnosis. The abbreviations for the recruitment centers can be found in Table 1 . (B) Mosaic plot showing the distribution of diseases in each cluster (C1 -C4, clusters 1 to 4) with the number of individuals inside each block. Each disease is represented by a color in columns. (C) Association of covariates with clinical diagnosis, molecular clusters, and the associations between them. The significance of the associations is represented as a scale ranging from blue (non-significant) to red (significant). The direction of the association is shown as the z- scored beta coefficients where orange is enrichment and purple is depletion.

Figure 2. OMIC layers of information functionally characterize each of the molecular clusters. (A) Annotation of the selected features according to the hypergeometric enrichment of their modular functional assignment. The module annotations were obtained using the blood immunological signature databases of Chaussabel et al. (Ch) and Li et al (Li). Only significant enrichment results are shown (q-value < 0.01 ). Significant modules are shown in columns and their annotation in rows. (B) Association between CpG and gene modules (FDR < 0.01 was considered significant). Left and right alluvial diagrams show cis and trans associations (cis associations were considered at distances < 10 Kb). (C) Proportion of significant and non significant cis-associations across modules. Significant ones are annotated by module. (D) Enrichment of histone marks (rows) in CpG modules according to blood cell types (in columns). Significant enrichments (q-value < 0.01 ) are colored by their enrichment score (red means enriched and blue depleted). (E) Enrichment of transcription factor binding sites in the CpG modules. Only significant enrichments (q-value < 0.01 ) with absolute values of enrichment score > 1 are shown. Figure 3. Each cluster is associated with specific low-dimensional layers of information. Z- scored beta coefficients of the significant ANOVA associations (FDR < 0.01 ) for (A) serology, (B) flow-cytometry, and (C) clinical data, are shown. The serology information included autoantibodies (blue), cytokines (green), and antibodies against small lipid moieties or natural autoantibodies (red). Abbreviations are as follows: CCP2, anti-citrullinated peptide antibodies; CENTB, anti-centromere B; DNA, anti-dsDNA; SM, anti-Sm; SSB, anti-SSB or anti-La; U1 RNP, anti-U1 RNP; PFLC, Protein free light chains; SSA, anti-SSA or anti-Ro; PC.IGM, IgM anti-phosphatydilcholine. Hierarchical clustering of the results expressed as Z-scored beta coefficients are represented in the circle plots (A, B, and C). The four autoimmune molecular clusters are shown concentrically from 1 , the outtermost, to 4, the innermost circle. Clinical information was summarized using principal components (PC) and the PCs most significantly associated with each cluster are shown. (D) Only those clinical items having a significant contribution to each significant PC (observed contribution higher than five times the expected) are depicted.

Figure 4. Three of the clusters present aberrant molecular patterns that may be reverted by specific drugs. (A) Distribution of healthy individual assignments to the SADs molecular classification. The pie diagram shows that nearly 74% of controls are similar to patients in cluster 2. (B) Differentially expressed genes (DEGs) between clusters and healthy controls. Top barplot (black bars) shows shared DEGs across clusters, right barplot (colored bars) represents the number of DEGs by condition. The dot connectivity plot represents the intersections across clusters. (C) SLEDAI boxplots by molecular cluster, n=138 SLE patients. (D) ESSDAI boxplots by molecular cluster, n=79 SjS patients. In both boxplots, the lower and upper hinges correspond to the first and third quartiles, while whiskers represent the 1 .5 * interquartile ranges. (E) Heatmap showing the drugs (columns) with the highest probability of reverting the aberrant molecular pattern of each pathological cluster (rows). Only drugs with a connectivity score below -90 for at least one of the clusters are shown (Red values, related molecular profiles between clusters and drugs; green values, opposite molecular profiles between clusters and drugs). Euclidean distance and complete linkage methods were used for drug clustering. The mechanisms of action for each drug can be consulted in Figure 12.

Figure 5. Aberrant molecular patterns are stable in time and related to relapsing moments of disease. (A) The stability Jaccard index between molecular cluster assignments at baseline (rows) and at 6 months (columns). (B) Stability jaccard index between molecular cluster assignments at baseline (rows) and at 14 months (columns). (C) Patient classification considering the 3 time point cluster assignments. The definitions are as follows: stable patients, patients assigned to exactly the same cluster in the 3 time points; relapse-remission patients, patients assigned to only one pathological cluster (cluster 1 , 3 or 4) but have been at any given time point in the healthy-like cluster 2; unstable patients, patients assigned to more than one aberrant cluster at any one time point.

Figure 6. Confounder analysis. Most of the potential confounders show a statistical relationship with the clinical diagnosis of the participants (A) and with the first 10 molecular principal components: transcriptome (B) and methylome (D). After linear model correction most of the confounding signals can be removed, as can be seen in both transcriptome (C) and methylome (E) corrected associated principal component analysis. The p-values of the associations are shown in the heatmaps. Associations were tested using linear models, except for those between categorical values where a chi-squared test was used. Only significantly associated confounders are shown (p-value < 0.05). Demography and clinical information: Clinical diagnosis (STATUS), AGE, GENDER, Recruitment center (CENTER). Current use of medications: antimalarials (AM), immunosupressants (IMS), biologicals (BIO), steroids (STEDS) and systemic antibiotics (SABIO). Technical information: batch of RNA samples processed and sequenced together (POOL), RNA integrity number (RIN), methylation well position in the array (WELL), methylation plate location in the array (PLATE), methylation sentrix id (S.ID) and methylation sentrix position (S.POS).

Figure 7. Clustering analysis workflow and results. The feature selection was only performed on the discovery set, the volcano plots from the results are shown for transcriptome (A) and methylome (B) datasets. Red and blue dots are features with a significantly higher and lower variation in cases compared to healthy controls, respectively (False Discovery Rate £ 0.01 and log2 fold-change over 0.5). In both volcano plots the 20 most significant genes and genes associated with CpGs are labeled in the graphs (only the most 20.000 significant CpGs are shown). After quality control a 10 times nested cross-validation approach was used to optimize the SNF model (with 80% of the discovery set). Then the optimized model was applied to the validation set (20%) and the results were compared in order to test the stability of the clustering process. In each nested cross-validation, an inner 5-fold cross-validation was performed for each of the training sets of the outer 10-fold cross-validation (C). Outer and inner cross- validation results are represented (D and E, respectively) as the average of the stability index and the standard deviation for each combination of parameters (K-nearest neighbors, alpha hyperparameter used in constructing the network, and number of clusters). Both uncorrected values for real datasets and shuffled datasets are shown. Additionally, a corrected stability metric is also shown (shuffled values are subtracted from the real ones).

Figure 8. WGCNA module detection results. WGCNA power was chosen based on signed R- squared threshold of 0.9 for both genes (A) and CpGs modules (C). Module dendrograms identified 5 gene modules (B) and 3 CpG modules (B) correlated across patient samples included in the clustering analysis. Additionally, for both layers of information a module of unassigned features is shown (grey annotated module). WGCNA was run using default parameters with a“signed hybrid” network type, a minimum module size of 70 features, height cut-off of 0.75 and power defined as proposed by WGCNA authors.

Figure 9. Selected CpG overlap with genomic elements. (A) Percentage of selected CpG overlap with gene regions. UTR, untranslated region; TTS, transcription termination site. (B) Percentage of selected CpG overlapping CpG island (CGIs) regions.

Figure 10. Clinical features and risk allele association analysis. (A and B) Principal component (PCs) variance explained for the real dataset (green) and average of 1000 randomized dataset (blue) representations for clinical features and risk alleles. PCs with real variance explained higher than 5 times standard deviation plus the average of the randomized datasets were selected for the association analysis. First 1 1 and 64 PCs were selected for clinical features and risk associated SNPs, respectively (cutoff represented by vertical dashed red lines in A and B). (C) Percentage of informative values for clinical features (informative values were considered non-negative and non-missing values). (D) Z-scored beta coefficients of the significantly ANOVA risk associated SNPs PC.1 . (E) SNPs that contribute more than 5 times what is expected by chance to the variance explained by risk associated SNPs PC.1 . Gene associated with each SNP, and in brackets, diseases associated with the SNPs and the SNPs identifiers, are shown.

Figure 1 1 . Cluster genome-wide association analysis (GWAS). Manhattan plots for each cluster are shown: (A) cluster 1 , (B) cluster 2, (C) cluster 3 and (D) cluster 4. Associations with at least 5 markers with suggested p-values (p-value < 1 e-5) were annotated (suggested and significant associations were colored in blue and red respectively). Only one significant association was found at GWAS significant level (p-value < 1 e-8) for HLA in cluster 4. Additionally, HLA suggested or associated genes were different across clusters. Cluster 1 HLA region associated genes with at least 5 suggested markers were at HLA-C, HLA-B, MICA and HCP5. In cluster 2 HLA-B and MICA were also suggested as genetic markers. In cluster 3 there were no significant associations in the HLA region, and cluster 4 presented the most significant markers in the HLA region, being the only cluster associated with HLA class II (HLA- DRA, HLA-DRB5, HLA-DRB1 , HLA-DQA1 , HLA-DQB1 , HLA-DQA2, HLA-DQB2 and H LA- DOB).

Figure 12. Cluster-based drug repurposing analysis. (A) Heatmap showing the drugs (columns) most prone to revert the molecular patterns of the aberrant clusters (rows). Only drugs with connectivity scores below -90 for at least one of the clusters are shown (Red values, related molecular profiles between clusters and drugs. Green values, opposite molecular profiles between clusters and drugs). Euclidean distance and complete method were used for the drug clustering. (B) Mechanisms of action (MoAs) related with each selected drug. High values of MoA score (blue) represent the mechanisms where the drugs have an effect.

Examples

Integrative molecular analysis redefines the distribution of systemic autoimmune diseases into functional clusters that are independent of the diagnoses

A total of 1521 patients had information from all three OMIC types. After individual platform quality control, 86% of the samples remained. Out of these, 6% showed discrepancies in terms of genetic information between platforms or gender compared with the clinical information and were discarded (Table 1 1 ). Following complete quality control, 955 patients and 267 healthy controls were available for analysis (diagnosis distribution is shown in Table 12). For the inception cohort, of 173 patients with verified European ancestry, 148 (86%) had data from at least one of the three time points with information for all OMIC types. Three patients were lost after evaluation of discrepancies (detailed sample evaluation can be found in Table 13). The numbers of patients with complete information for the three time points and the distribution of their diagnoses can be consulted in Table 14.

With the aim of reclassifying the SADs indistinctively of the clinical diagnoses, genome wide transcriptome and methylome information was used in an unsupervised protocol design to perform an integrative molecular analysis. Cases from the cross-sectional cohort were divided into a discovery set (80%) and an independent (3 separate centers) validation set (20% of the samples). The discovery set cases were used in the next steps. The feature selection procedure resulted in 1821 genes and 4144 CpGs (Figures 7 A and 7B). Using similariry Network Fusion (SNF) algorithm—, two optimal configurations of 2 and 4 clusters with similar stability values were obtained. Next, using weighted correlation network analysis, WGCNA 15 genes were grouped into 5 significant modules, while CpGs formed 3 modules, considering that module 0 is defined as a non-significant noisy module by WGCNA developers (Figure 8). Interestingly, 32% of the CpGs were included in the CpG module 0, while only 10% of genes were grouped in gene module 0. The difference in the number of noisy features identified by WGCNA between both datasets is possibly due to the dynamic nature of DNA methylation, which might result in variation over time that could disrupt the stringent correlation needed for inclusion into the significant WGCNA modules.

The 2 and 4 cluster configurations totally agreed with patient assignments in clusters 1 and 3, but not for clusters 2 and 4, where patients presented different assignments in the 2 cluster configuration. Additionally, an independent signature defined cluster 4, and cluster 2 did not present a pattern as strong as the ones defining clusters 1 and 3. Thus, the patterns arising from the feature module analysis better fit with a 4 cluster configuration (see 2 cluster and 4 cluster annotations in Figure 1 A). Otherwise, clusters 2 and 4 might be subdivided by the 2 cluster configuration, but such further subdivision had low stability (see Figures 7D and 7E). These molecular patterns were clearly replicated in the validation set, showing that the molecular patterns, which defined the clusters, were stable in two independent datasets coming from different recruitment centers (Figure 1 A).

The clusters of patients were defined by specific molecular signatures (Figure 1 A). Clusters 1 and 3 had opposite patterns of gene modules 1 and 3 and CpG modules 1 and 2. Cluster 1 presented over-expression of gene module 3 and hypo-methylation of CpG module 1 , while cluster 3 showed over-expression of gene module 1 and hypo-methylation of CpG module 2. Additionally, cluster 3 also showed slight over-expression of gene module 5. Cluster 4 had over-expression of gene module 2 and hypo-methylation of CpG module 3. Cluster 2 however, had no discernible pattern, showing a heterogeneous signature across the modules defined by WGCNA. Cluster 2 grouped around 40% of the patients.

Patients with all clinical diagnoses could be found in all 4 clusters (Figure 1 B). Clusters 2 and 4 presented a somewhat biased distribution of some diagnoses (Figure 1 C). Cluster 2 had an increased number of RA, SSc and less expected, PAPS. Expectedly, cluster 4 was enriched in SLE, pSjS, and also MCTD (Figure 1 B). Interestingly, UCTD patients were the only ones not significantly enriched or depleted in any cluster (Figure 1 C), even though the number of UCTD patients was higher than the number of MCTD and PAPS patients (Table 12). For MCTD the results are quite remarkable. MCTD has strong clinical similarities with RA, SSc and SLE, but not pSjS, and its existence as a disease entity has been controversial. In fact, for some, MCTD patients are in a prior stage of disease and move toward any of the three disease entities 12 , for others, MCTD is a separate disease^—. That the great majority of patients fell in the interferon cluster is unexpected.

Except for systemic antibiotics (SABIO), most of the covariates associated with the transcriptome and the methylome principal components (Figure 6), were unevenly distributed across the clinical diagnoses (Figure 1 C). Importantly, most of these covariates did not show a dependency with the molecular clusters (Figure 1 C), and the covariates that remained significantly associated to the clusters (mainly 3 treatments: antimalarials, AM; biologicals, BIO and steroids, STED) were related to the enrichment of the clinical diagnoses in those clusters (Figure 1 C). For example, antimalarial (hydroxychloroquine) treated patients were enriched in cluster 4, and this association was driven by the enrichment of SLE, pSjS and MCTD (Table 3). In short, treatments did not define the clusters. Gene and CpG modules share and complement specific molecular functions that stratify the systemic autoimmune diseases

Gene and CpG modules that defined the clusters were highly concordant in their functionality. These showed a relationship, where overexpressed gene modules and hypomethylated CpG modules in the same clusters were enriched with the same functionalities as defined by Chaussabel et al. M and Li et al. 12 module enrichments (Figure 2A). Cluster 1 was defined by overexpression and hypomethylation of genes and CpGs included in inflammatory modules driven by monocytes and neutrophils (gene module 3 and CpG module 1 ). Cluster 3 was represented by T lymphocyte and natural killer (NK) cell functions (gene module 1 and CpG module 2), while cluster 4 was defined by interferon signature modules and dendritic cell functions (gene module 2 and CpG module 3). As previously shown, cluster 2, had no clear functional module signature (Figure 2A). Gene modules 4 and 5 further complemented the molecular information. Cell cycle and transcription upregulation (gene module 4) was associated with the interferon cluster (cluster 4) and B lymphocyte functions (gene module 5) were particularly observed in clusters 3 and 4 (Figure 2A).

Direct significant associations between CpGs and genes revealed different regulatory relationships between functionally-related modules. Cis significant associations linked CpG modules with their counterpart gene modules, while trans significant associations were evenly distributed without showing major relationships between homologous functional modules (Figure 2B). Nearly all cis associated CpGs (99%) in the interferon module (CpG module 3) were associated with the interferon gene module (gene module 2), 89% of the cis associations from the inflammatory gene module (gene module 3) were associated with CpGs in the inflammatory module (CpG module 1 ), and 75% of the cis associated CpGs from the T lymphocyte module (CpG module 2) were associated with its functionally-related gene module (gene module 1 ). However, the c/s associations revealed a more complex relationship between DNA methylation and gene expression, where all CpG modules presented significant cis associations with more than one gene module. In particular, the inflammatory CpG module presented cis associations with every gene module (Figure 2B), which is not surprising as the CpGs englobed in the inflammatory module were enriched in the immune generic activation module defined by Li et al., enrichment shared with three of the gene modules (Figure 1 A). However, and a major difference between clusters is that more than 80% of the CpGs included in the interferon module were associated in cis with genes included in the interferon transcriptome module, whereas most of the features in the rest of the modules had few (1 1 to 17%) significant cis associations (Figure 2C). Thus, the complex relationships shown between layers of information revealed that features selected from both datasets shared common molecular functions but were not necessarily directly associated, providing a deeper view of the molecular state of the patients than that given by a single layer.

In order to understand how DNA methylation regulates the molecular patterns that clustered different clinical diagnoses, cell-type specific histone marks and transcription factor binding sites (TFBSs) regulated by the CpG modules were analyzed. In general, the histone mark cell- type enrichments were clearly related with the functional annotations of the modules. CpG module 1 (inflammation) was enriched in active histone marks (H3K4me3 and H3K27ac) in promoters and enhancers in monocytes, neutrophils, and dendritic cells. Surprisingly, H3K4me1 , a histone mark for distal regulatory elements and enhancers, was significantly enriched in CpG module 1 for almost all cell-types included in the analysis. This difference between distal and proximal active histone marks in the CpG enrichments, suggests different regulation of the inflammatory response in different cell-types mediated by DNA methylation. On the other hand, the CpG module 2 (T cell immunity) showed a pattern opposite to that of CpG module 1 , with all active marks (including H3K4me1 ) significantly depleted for the cell- types encompassing this module and with enrichments for lymphocytes and NK cells. Both CpG module 3 (interferon) and 0 showed uniform patterns across all cell-types, but with a different consequence. The interferon module was enriched in all active histone marks of all cell-types, and depleted for the H3K27me3 gene expression repression mark. This result shows that the activation of the interferon signature is a general response for all immune cell- types. On the other hand, the CpG module 0 was depleted in active and enriched in inactive histone marks.

TFBSs significantly enriched in each CpG module were also related with the functional annotation of the modules. The TFBS for NFIL3, ATF4, CHOP, PU.1 , CEBP, and SPIB were specifically enriched for CpGs in the inflammatory module. Among these, ATF4 and CHOP have been described as regulators of the innate inflammatory response mediated via endoplasmic reticulum stress and the promotion of apoptotic processes 22 , while all of them have been shown to be regulators of inflammatory processes in different autoimmune conditions 21122 . The TFBSs for EGRF2 and TCF3 were enriched in CpG module 2, which may play a role during the regulation of adaptive immunity 2422 , while TFBS for IRF2, IRF3, and IRF8 (enriched in CpG module 3) are known regulators of interferon inducible genes and under the control of the type I IFN receptor—.

Low dimensional layers of information show specific patterns in the molecular stratification

Having validated the clusters and established their functionality, their characterization from the clinical and serological point of view was next. For this, data from pre-selected autoantibodies, cytokines, small lipid moiety (natural) autoantibodies measured, and cell surface antigen markers detected in real-time with flow-cytometry collected from the recruited individuals at the time of sampling were used 22 . The serology characterization showed four patterns of associations with clusters: anti-citrullinated peptide, anti-centromere B and IgM anti phosphatidylcholine natural autoantibodies were slightly increased in clusters 2 and 3, with a strong depletion in the interferon cluster 4; most of the autoantibodies (anti-dsDNA, anti-Sm, anti-SSA, anti-SSB, anti-U1 RNP), immunoglobulin markers (protein free light chains), and cytokine levels (IP-10, BAFF, MCP-2 and TNF-a) were enriched in the interferon cluster (cluster 4); the inflammatory cluster 1 was enriched in MMP-8 and C-reactive protein; while the enrichment of IL-1 RA and BLC was shared by the inflammatory and interferon clusters (Figure 3A). In general, the enrichment of serological markers followed the molecular functions defined for the clusters. For example, the interferon cluster 4 was enriched in cytokines regulated by type I IFNs, such as IP-10 and BAFF, but also in TNFa, which, under some situations, was shown to induce interferon type I production mediating the innate immune response 22 . On the other hand, cluster 1 patients were producing elevated levels of C-reactive protein and MMP- 8, both markers of acute inflammatory processes 22 —.

Determining the cell population composition of the clusters revealed the enrichment in neutrophils in cluster 1 , a certain enrichment of NK cells in cluster 2, T cells, B cells, NK cells and NKT cells in cluster 3, and with the exception of some B cell enrichment, cluster 4 did not show any particular cell composition being instead expressed in all cell types as shown for the functionality of the features (above) (Figure 3B), again following the previous molecular characterization of the clusters.

We then investigated if the clusters showed any differences in the genetic contribution of risk alleles for autoimmune diseases. For this, risk alleles discovered in different GWAS performed on all the four main SADs were included in the analysis (Figure 10D and 10E). This analysis showed that the interferon cluster was particularly enriched in HLA (Human Leukocyte Antigen) class II risk alleles, but no other cluster showed significant enrichment. In order to validate this result, even if the number of individuals is far from the GWAS standards, four GWAS were performed comparing each cluster and healthy controls. Again, the only signal below genome wide significance level (5e10 ~8 ) came from alleles located in HLA class II genes in the interferon cluster analysis (Figure 1 1 ). The weak HLA associations found in the other clusters were primarily HLA class I (HLA-B), and some suggestive signals were observed. Also suggestive was the association with IRF5 in the interferon cluster 4. Moreover, when GWAS were performed using the clinical diagnoses as covariates in the model, the associated signal remained, verifying that the significant signal found was independent of the disease composition of the cluster. This result implies that the usual genetic associations observed for each of the systemic autoimmune diseases actually reflect the molecular and cellular mechanisms occurring in individuals whose molecular disease pathway is the type I interferon pathway. Some evidence exists to this, as recently it was observed that HLA class II was strongly associated in patients with SLE positive for the interferon signature 32 , and additionally in a plethora of specific autoantibodies as it was shown in our interferon cluster 4 33 .

Most important is the clinical relevance of the clusters. The clinical information of the patients was summarized into principal components and differential associations were found for each molecular cluster (Figure 3C and 3D). The interferon cluster 4 was associated with some of the most extreme phenotypes in SADs such as kidney function abnormalities (including nephritis), thrombosis, nervous system involvement, and leukopenia, in addition to other minor comorbidities. Fibrosis complications in both skin and muscle skeletal organs were enriched in the inflammatory cluster 1 and in cluster 2, while these clusters were different in terms of kidney clinical features associations (enriched in the inflammatory cluster 1 ). Some comorbidities related with abnormal lipid metabolism and dyslipidemias were enriched in cluster 2. The T cell immune cluster 3, in general, presented less aggressive phenotypes than the other clusters, with also an enrichment of dyslipidemias, presence of abdominal pain, diarrhea, and constipation. Association with sicca syndrome was also found in this cluster.

Cluster 2 shows a healthy-like molecular pattern

Cluster 1 , 3 and 4 were found to be defined and characterized by specific molecular patterns, however, cluster 2 presented an undefined pattern from the molecular point of view. To gain insight on the type of patients that were being grouped in this cluster, different analyses were performed. Healthy individuals were assigned to the molecular clusters by means of the trained molecular model, and 74% of them were grouped in cluster 2, compared to a 12%, 1 1 % and 3% of assignments to 1 , 3 and 4, respectively (Figure 4A). Additionally, differential expression analysis was performed between each cluster and healthy controls. The highest number of differentially expressed genes (DEGs) was found for the inflammatory cluster 1 (2898 DEGs) with a preponderance of down-regulated genes (2477 DEGs), followed by the interferon cluster 4 (820 DEGs) and the T cell immune cluster 3 (294 DEGs), while only 9 genes were found to be differentially expressed in cluster 2. The low amount of DEGs in cluster 2 compared with healthy controls suggested a close related molecular pattern, without differentially regulated pathways between both groups of individuals. We reasoned that this could be due to two non- mutually exclusive reasons. On the one hand, some of the most enriched diseases in this cluster, RA and SSc could be undergoing processes in the target tissues (synovia and skin, respectively), and that blood was unable (with the exception of NK cells), to identify a defined molecular pattern. Potentially, cluster 2 might also be grouping patients undergoing remission or having low disease activity. To test this, disease activity was compared between clusters. Disease activity indexes were considered for each clinical diagnosis through different scores measuring clinical manifestations, traditionally associated with relapses, but are not performed across all diseases. For this analysis 138 SLEDAI-scored (out of 222 SLE patients), 79 ESSDAI-scored (out of 190 pSjS patients) and 17 DAS28-scored (out of 206 RA patients) were available. No activity score was available for SSc. In both, SLEDAI and ESSDAI higher and similar disease activity scores were shown in clusters 1 , 3, and 4 as compared to cluster 2 (Figure 4C and 4D). Significant differences were found for cluster 1 and 4 when compared with cluster 2 in the SLEDAI analysis (p=0.04 and p=0.02, respectively, Wilcoxon rank sum test), while the other showed suggestive p-values. The results suggest that in some instances, low disease activity could lie behind the undifferentiated patterns of cluster 2, but possibly also the lack of a sufficiently large cell population from which the data would be extracted when using whole blood. Nevertheless, RA and SSc are known to have prolonged periods of low activity or remission, which could also explain the results.

Specific drugs are found that might revert pathogenic molecular patterns

In order to investigate if the clustering found could have therapeutic implications, drug reposition analysis was conducted for each of the aberrant clusters (1 , 3, and 4). None of the drugs found with the lowest connectivity scores that theoretically might revert the abnormal gene expression patterns were of common use in any of the SADs (Figure 4E). Nevertheless, some of the drugs identified with the best scores were described to potentially be effective to treat autoimmune disorders. For instance, topic use of isoliquiritigenin in psoriasis 31 , arvanil in multiple sclerosis 35 , or a PKCbeta-inhibitor which was associated by GWAS analysis with inflammatory bowel disease and RA 35 . Protein kinase Ob has been also associated with SLE development in mouse models, thus an inhibitor might prevent developing the disease 32 . While a few drugs did not differentiate clusters, most drugs did have cluster specificity (Figure 4E). One of the identified drugs was shared across all three clusters, indirubin, which is a traditional Chinese medicine that was clinically used in the treatment of autoimmune disorders. Recently, some studies have been published showing its potential beneficial effects in psoriasis 33 and immune thrombocytopenia—, and some clinical trials were performed in psoriasis patients (clinicaltrials.gov). The inflammatory cluster 1 presented the highest number of significantly associated drugs, and it is also the cluster with the largest diversity of Mo As (Mechanisms of Action), without any common MoA for all the drugs (Figure 12B), probably explained by the general nature of the inflammatory processes. The T cell cluster 1 drugs presented a couple of common MoAs, including NFK and EGFR pathway inhibitors among others (in agreement with the EGRF2 TFBSs enrichment shown above, Figure 2E), while the interferon cluster drugs showed common mechanisms related to the inhibition of apoptosis pathways. Pathological molecular patterns are stable over time

It is possible that the clusters represent the disease state of the individual patients and that these could move, with time, to different clusters as disease progresses. Furthermore, although we have clearly adjusted for confounders, the long time of disease combined with years of treatment (12 years mean since diagnosis for the CS cohort) could lie behind the configuration of clusters that we observe. In order to define, and as a second validation, if the clusters could be observed by means of the same trained model in patients with newly diagnosed disease, the inception cohort was analyzed. The same clusters were observed, and each of the three sampling points was independently assigned to one molecular cluster.

In order to analyze whether the molecular assignments were stable over time, the assignment for each time point were summarized by patient. The Jaccard index for patients with complete molecular information at recruitment and 6 (n=103) or 14 months (n=78) samplings showed similar results (Figure 5A and 5B). In both comparisons, most of the patients remained in the same cluster after recruitment (62-63%) and 34-33% of them moved from a pathological cluster to the healthy-like cluster or vice versa, while only 4% of patients were assigned to a different pathological cluster. The analysis of all three time points together (n=68) showed that only 4 patients (6%) jumped between different pathological clusters, while 33 patients (48%) remained in the same cluster over all three time points (Figure 5C). The remaining patients (46%) showed a behavior that resembled a relapse-remission dynamic of SADs from a molecular point of view: their pathological cluster was stable (never assigned to a different pathological cluster) but could be assigned in a given time point to the healthy-like cluster (Figure 5C). In brief, when changing cluster, patients would only move to and from cluster 2 to a single pathological cluster. This is similar to what we observed in our recent work on longitudinal stratification of SLE^. Our results suggest that patients ' cluster assignment is stable in time and may signify that each patient has a single mechanism of disease pathology out of 3 possible, when detectable in an integrated fashion through transcriptome and methylome in blood.

First steps towards systemic autoimmune diseases precision medicine

In summary, we observe primarily 3 pathological clusters that we may describe as an acute phase inflammatory cluster 1 with a pattern coming from neutrophils and monocytes, a T cell immunity cluster 3 enriched in lymphocytes, particularly T cells, NKT cells but also B cells, and an interferon cluster 4, spread across all cell types but with a slight enrichment of NK cells. Cluster 2 had no particular molecular pattern where healthy controls were assigned to using the same trained model and indeed with very few differentially expressed genes as compared to controls. This study shows for the very first time and in an unprecedented number of individuals with several layers of high and low molecular data that systemic autoimmune disease patients from 7 different diseases share molecular clusters defined by specific molecular patterns that are stable in time. The clusters have, each, specific clinical and serological characteristics, but also quite different regulatory profiles. The results obtained in this study are a first step towards laying the foundations of the personalized medicine in systemic autoimmune diseases.

Samples, Data types and OMIC Platforms

Two cohorts of individuals with 7 different systemic autoimmune diseases (SADs) were recruited: a cross sectional (CS) cohort composed of 2003 patients and 617 healthy controls (HC) and an inception cohort of 215 patients followed and sampled at 0, 6 and 14 months. Inclusion and exclusion criteria are detailed in Table 2. Demographic information and treatment can be consulted in Tables 3-6. A detailed description of both cohorts can be found below.

Blood and serum samples were obtained from all patients in order to gather several types of molecular information. High-dimensional genome wide genotype, transcriptome, DNA methylome (OMIC datasets) and proportions of relevant cell types (using flow cytometry custom marker panels) were analyzed from whole blood samples. Low dimensional information was obtained from serum samples, including selected serology information such as SADs related autoantibodies, cytokines, chemokines and inflammatory mediators. A detailed description of all the protocols and methods can be found below.

Each OMIC dataset was quality controlled and controlled the integrity of the patient origin. Briefly, genetic information was obtained from RNA sequencing and from the single nucleotide polymorphisms (SNPs) included in the DNA methylation arrays. The concordance between all OMIC datasets was assessed, and discordant samples were discarded from further analysis. Detailed information on the sample integrity analysis can be consulted below.

Cluster Discovery and Validation Analysis

Design

The quality-controlled dataset of patients with SADs was split into a discovery set (-80% cases and healthy controls) and a validation set (-20% cases). The validation set was composed of all patients from three independent recruitment centers (France, UBO; Belgium, UCL and Germany, UKK). The discovery set was corrected for batch and potential confounder effects (gender, age, treatment effects and center of recruitment biases) using linear regression models but preserving the differences between clinical conditions using the removeBatchEffect from the limma R package^. The potential confounders were assessed using principal component association with the sample annotation. Annotations associated (linear regression p-value < 0.05) with any of the transcriptome and/or methylome first ten principal components were considered confounders and included in the models (Figure 6). Identification analysis of the confounders was performed using the swamp R packaged. After correction, samples that were identified as outliers either for transcriptome or methylome were removed from the analysis. Outliers were assessed by multidimensional scaling and a fixed threshold based on mahalanobis distance cut-off of 8 for both transcriptome and methylome (MASS R packag^). In total, from the discovery cohort, 44 samples were discarded (28 due to transcriptome, and 16 samples for methylome). Outliers were not associated with any clinical feature or meta-information.

Feature Selection

The selection of features on high-throughput datasets is aimed at selecting a subset of informative features to be used in the clustering analysis. Feature selection prevents uninformative features (that usually are the majority in high-throughput datasets) that could mask the signal of interest, and additionally the reduction of features makes the use of machine learning approaches in these large datasets feasible. In general, feature selection techniques are based on comparisons to obtain the features that maximize the differences between groups (supervised feature selection techniques).

When we began, preliminary analyses of the results to find the best way to perform feature selection and integrative clustering, we observed that the feature selection performed in all cancer studies published to date would not allow us to find any clusters or groups. In cancer reclassification studies, features with the highest variability across cancer types are chosen for reclassification purposes in an unsupervised way 44 , but the natural variability of the immune- system makes this approach inappropriate for systemic autoimmune diseases because of the dynamics of the immune system and its dependence on natural variability—. We took a completely new approach where only features with statistically increased variability in cases (being cases those individuals diagnosed with any kind of SAD included in the analysis) compared to healthy controls were chosen, allowing us to identify the clusters. Expression of genes and DNA methylation sites variability contributes to disease susceptibility 42 —, which implies that our feature selection might be able to select features which discriminate between differential molecular patterns, and at the same time remove high variable features not related with the pathologies included in the study. Additionally, we chose unsupervised feature selection techniques because supervised techniques can give features that discriminate between diagnoses, but important features that could subdivide those diagnoses into new groups would be less prone to be selected. Thus, a FDR Levene-test below 0.1 and a case- versus-control variability fold-change higher than 0.5 were used to select the informative features from transcriptome and methylome data from the cases. Importantly, confounding correction, outlier detection, and feature selection were only performed on the discovery cohort.

Clustering

An integrated clustering approach was chosen for molecular clustering analysis of the data. With this approach merging of all OMIC data together is the first aim followed by a single clustering. High-dimensional OMIC datasets were used for clustering analysis, while low dimensional molecular data was used for characterization of the molecular clusters obtained. Similarity Network Fusion (SNF)— was selected to obtain the clusters of SAD patients. SNF is based on network theory allowing to find shared and complementary patterns across layers of information (different OMICs), as it deals well with the differential dimensionality of the layers 42 . Moreover, the SNF algorithm can be used in a discovery / validation approach, as it allows the training of a mathematical model that can be validated with an independent dataset.

Transcriptome and methylome selected features were integrated using SNF and the clustering model was trained on the discovery set by means of a 10 times nested cross-validation approach (Figure 7C). After background structure substraction, two optimal configurations arised for 2 and 4 clusters of patients with similar stability values in the inner cross-validation (Figure 7E) and validated in the outer cross-validation (Figure 7D).

Training and testing of the clustering model were done as follows. Healthy controls were not used in the clustering model. Ten times nested cross-validation was run on the discovery set (without controls) in order to set up the optimal hyperparameters for the SNF algorithm. Inside each nested cross-validation, an inner 5-fold cross-validation was performed for hyperparameter model optimization with each training subset of the outer 10-fold cross- validation (Figure 7C). Multiple combinations of different values of the recommended main SNF hyperparameters were then tested: 10 to 30 neighbors in the K-nearest neighbors (K) with a step of 5, 0.3 to 0.8 hyperparameter used in constructing similarity network with a step of 0.1 and 2 to 20 numbers of clusters. The hyperparameters were optimized through maximizing the proportion of cluster assignments that can be recovered from the test datasets using the trained SNF model, and a model trained using each dataset independently (considered as the ground truth). The average proportion of labels recovered for each combination of hyperparameters were compared with the results of the randomized matrixes (random background structure), the results can be seen in Figure 7D. The clustering results obtained from the discovery set were validated using the validation set (Figure 7C). Functional Characterization of the Clusters and their Features

In order to assess the molecular signatures that drive the clusters and their relationships across layers of information, selected genes and CpGs were independently grouped into functional modules (groups of correlated features) using weighted correlation network analysis, WGCNA 15 . For genes and CpGs selected and included in the clusters, meta-features were identified using default parameters and“signed hybrid” network type 15 . Two immunological databases were used for meta-feature molecular characterization: Chaussabel et a\. and Li et al. 12 by means of the tmod R packaged. The enrichment of genes in those modules was tested and functional terms below a hyper-geometric enrichment FDR < 0.01 were selected. The genes related with CpGs were obtained from the lllumina probe annotation.

CpG Gene Associations

The association between CpG DNA methylation and gene expression was assessed using the Matrix eQTL R package implementation 52 . Associations were only tested on selected features. Briefly, methylation data was categorized in five bins (0.2 range each) and considered as loci in the analysis, while gene log10(TPM+1 ) were used as the quantitative feature (emQTL, expression-methylation Quantitative Trait Loci). Batch effects and other potential confounders were included in the linear regression model (see above). Significant associations considered were those with FDR < 0.01 and they were classified into c/s-regulatory associations (distance between CpG and associated gene < 10kb) or trans-re gulatory associations (>10kb).

Epigenetic Characterization of CpG Modules

CpGs selected were characterized in terms of transcription factor binding site motifs and histone modification co-localization. Transcription factor enrichment analysis was performed using HOMER software V4.10 51 . In order to determine the relative enrichment of known motifs in the methylation modules, known motif enrichment was tested in the DNA sequence contained in 500 bp windows around CpGs for each module. Histone mark modification enrichment analysis was performed using a subset of blood cells and histone modifications from the BLUEPRINT Consortium 52 . BLUEPRINT enriched regions for H3K4me3, H3K27ac, H3K4me1 , H3K36me3, H3K27me3 and H3K9me3 histone modifications in monocytes (CD14+ CD16-), neutrophils (CD45+ CD66b+ CD16+), naive CD4 T cells (CD3+ CD4+ CD45RA+), naive CD8 T cells (CD3+ CD8+ CD62L+ CD45RA+), naive B cells (CD19+ CD27- lgD+), dendritic cells, NK cells (CD3- CD56dim CD16+) and eosinophils (CD45+ CD66b+ CD16-) were downloaded from BLUEPRINT data analysis portal—. Histone modification co-localization enrichment was tested for the CpGs in each module compared to all CpGs selected, the enrichment significance was calculated by means of Fisher’s exact test and FDR was calculated for multiple test correction.

Healthy pattern characterization of SADs clusters

Healthy individuals were mapped to the clusters in order to define where in the model do the healthy molecular conditions fit. In this sense, the selected gene and CpG features were extracted from healthy individuals’ OMIC datasets and tested with the pre-defined model. Additionally, differential expression analysis was performed between the clusters and the healthy individuals. Transcriptome expression dataset samples and genes were filtered and cleaned as described before. Read count expression values were normalized using voom algorithm from the limma R package^ 1 . Known batch and potential confounders previously defined were included in the linear model limma package was used for differential expression analysis. An FDR < 1 e-5 and log2FC > j0.5j was the cut-off to consider transcripts as differentially expressed between healthy controls and SADs patients by cluster.

Cluster Characterization for Clinical and Low-dimensional Data

The distribution across clusters of clinical annotations (diagnosis and clinical symptoms) and low-dimensional data (auto-antibodies, serum cytokine protein levels, cell subsets, and genetic risk markers) were analyzed. Numerical and categorical low-dimensional dataset associations with clusters were assessed using linear and logistic regression models, respectively. Features with Analysis of Variance (ANOVA) p-values below 0.01 were considered unevenly distributed across clusters, and z-scored beta coefficients were used to assess enrichments and depletions. R packages car M and multcomp 55 were used in the analysis. Genetic risk loci and clinical features were reduced to Principal Components (PCs) prior to regression analysis, only PCs with variance explained higher than 5 times the standard deviation plus the average of 1000 randomizations of the original matrix were tested (First 64 and 1 1 PCs were selected for risk associated SNPs and clinical features respectively, see Figure 9A and Figure 9B). Clinical features with less than 5% informative value (non-negative and non-missing values) were filtered before the analysis (Figure 9C) and the remaining missing values were imputed using missMDA R package^. Briefly, significant PCs were assessed by means of estim ncpPCA function, and they were used for the imputation of the missing values using imputePCA function and regularized method.

Genome-wide association analyses

Genome-wide association analyses were performed for each cluster. Genotype data filtering and association analysis were conducted using PUNK .S 62 applying the following criteria: minimum total call rate per sample of 90%, minimum call rate per marker of 98%, minor allele frequency threshold of 1 %, Hardy-Weinberg Equilibrium p-value at a minimum of 1 e-5. Analysis of association for each cluster versus HC were performed by logistic regression under additive allelic model. Genetic stratification was not detected by genomic inflation factor (lambdaGC): lambdaGC cluster 1 versus HC = 1 .025, lambdaGC cluster2 versus HC = 1 .049, lambdaGC cluster3 vs HC = 1 .038 and lambdaGC cluster4 versus HC = 1 .037).

Drug repurposing analysis

Gene expression signatures from the most significant differentially expressed genes (top and bottom 150 DEGs), from the differential expression analysis against HC, for clusters 1 , 3 and 4 were queried independently into the CLUE database^. Thus, drugs that might potentially revert the aberrant molecular patterns of each cluster were assessed (drug repurposing analysis). The CLUE repurposing tool retrieves a similarity score (ranged from 100 to -100), where positive values represent related molecular patterns and negative values represent opposite molecular patterns between drug signatures and gene expression signatures. Drugs with a connectivity score below -90 were considered significant (Figure 4A and Figure 12A).

In order to acquire insight into the relationship between the molecular effects of the significant drugs, and direct links to possibly shared mechanisms of action (MoA) recorded in the CLUE database, the CLUE touchstone tool was used to obtain ranked lists of the mechanisms of action (MoAs) related with the significant drugs. Touchstone tool retrieves a ranked list of related drugs, and the MoA for each related drug was obtained parsing drug-MoA links from the CLUE database. Significant drug-MoAs ranked lists were imputed into fgsea R package in order to obtain a score of association between each significant drug and all the MoAs related. Only MoAs with p-value < 0.01 and gsea (gene set enrichment analysis) similarity score ³ 0.99 with at least one other drug were represented in Figure 12B.

Inception cohort analysis

Inception cohort time points were assigned individually to clusters by means of the SNF model trained with the cross-sectional cohort. The results were analyzed by pair of samplings and samples with complete information: patients with information for M000 and M006 samplings, n=103; M000 and M014 samplings, n=78 and complete sampling points, n=68. The classification of patients for complete samples was as follows: stable patients, patients with the same cluster assignment for all the time points; relapse - remission patients, patients with assignments only to an aberrant cluster (clusters 1 , 3 or 4) and assignments to healthy-like individual cluster (cluster 2) at any time point; unstable patients, patients with assignments to different aberrant clusters across time.

Data availability Data will be transferred to ELIXIR 52 and made available upon request and in agreement with the privacy rules of the European Union. Additionally, gene expression and DNA methylation datasets can be explored in http://bioinfo.aenvo.es/precisesadsdata/.

Participant recruitment

The PRECISESADS Consortium recruitment centers, patients’ recruitment monitoring and biobank/sample logistics participants can be found in Table 1 .

Study design

The study had two types of patient recruitment that included a cross-sectional study and an inception prospective study. The cross sectional study was a European multi center, non- randomized, and observational clinical study with recruitment performed between December 2014 and October 2017 at 19 institutions in 9 countries (Austria, Belgium, France, Germany, Hungary, Italy, Portugal, Spain and Switzerland). The target number of included patients affected by SADs was 2000 (400 by disease or group of diseases as listed below) and 666 healthy controls (HC). An ethical protocol was prepared, reached consensus across all partners, academic and industrial, translated into all participant ' s languages and approved by each of the local ethical committees of the clinical recruitment centers.

Eligible patients to the cross sectional study were men or women aged 18 years or older, and diagnosed as having one of the following systemic autoimmune diseases (SADs): rheumatoid arthritis (RA), scleroderma or systemic sclerosis (SSc), primary Sjogren’s syndrome (pSjS), systemic lupus erythematosus (SLE), primary antiphospholipid syndrome (PAPS), mixed connective tissue disease (MCTD), undifferentiated connective tissue disease (UCTD). Inclusion and exclusion criteria can be consulted in Table 2. PAPS and MCTD are relatively infrequent and for this reason the number of individuals expected was much lower and considered a separate group together with UCTD (Table 3). UCTD was selected as a denomination for patients not fulfilling criteria for any of the established diagnoses, and has been used previously®. Each patient was diagnosed according to the prevailing international classification or diagnosis criteria established for each of the diseases 61 66 . As criteria for UCTD we considered patients with clinical features of SADs not fulfilling any of the SADs criteria (RA, SSc, pSjS, SLE, PAPs, MCTD), or any other SADs criteria for at least 2 years, with presence of antinuclear antibodies (ANA) > 1 :160 with or without other specific autoantibodies; patients fulfilling 3 out of 1 1 SLE classification criteria and patients with early systemic sclerosis® were not classified as UCTD.

Patients treated with high doses of immuno-suppressants for the 3 months prior to recruitment, or cyclophosphamide or Belimumab in the past 6 months were not eligible for the study. Cross sectional patient treatments can be consulted in Table 4. Eligible HC were matched on the projected and expected profile of patients in terms of gender, age, and clinical center of origin (Table 3).

Inception patients were recruited by 6 of the centers in Spain, Belgium, France, Italy, Portugal and Switzerland. Eligible patients were diagnosed as above with any of the SADs within less than a year since diagnosis. The patients were recruited at baseline, and followed at months 6 and 14. Again, patients were aged 18 years or older and at baseline had not had high doses of immuno-suppressants, cyclophosphamide or Belimumab at least 3 months prior to recruitment. For time points at 6 and 14 months, patients could have any standard of care therapy indicated by their physician. Eligible patients who signed an informed consent form were recruited to the study. At baseline we had 215 patients, followed by 192 at 6 months and 175 at 14 months. Inception cohort’s demographic and treatment prescriptions can be consulted in Table 6 and Table 7.

Study Protocol and eCRF

Clinical data describing the disease phenotype was collected using an electronic case report form (eCRF). The challenge was the generation of an eCRF able to capture similarities and diversities without being redundant with the diagnosis itself and without being unnecessarily detailed. To define the core set of clinical items deemed sufficient to be in line with the project goals, a working group of experts on SADs was established— and the desired items were selected via a Delphi technique. A first broad set of cross-indication items (n = 130) was created and divided into 8 domains (constitutional symptoms, gastrointestinal, vascular, heart and lung, nervous system, skin and glands, musculoskeletal, therapy). After 3 tiers, convergence was reached and 28 core items were selected. T o obtain consensus, the selected items were discussed by all the clinical members of the consortium and approved. A final set of items was created, digitalized and pilot tested. The final eCRF, which proved to be flexible and easy to compile, was released along with explicit definitions.

Clinical and Biological Assessments

After the confirmation of patient inclusion, clinical data was collected including patient’s age, sex, ethnicity, dates of first disease manifestation and diagnosis, clinical and biological characteristics at baseline, the physician global assessment of disease activity, comorbidity, and current use of treatments.

Blood and urine samples were obtained from all individuals (patients and healthy controls). With the exception of samples collected for flow cytometry analysis, all samples were shipped by the clinical sites to the Central Biobank (see below) for processing, storage, and onward shipment to the analysis sites, where the various determinations were performed.

Sample Logistics

Sample collection, processing and storage were coordinated by the Andalusian Biobank System, which is integrated into the Andalusian Health system as a way of maintaining a centralized and harmonized sample processing. To ensure harmonization and traceability of prospective biological samples, a fit for purpose centralized model for management of sample collection consumables, sample storage and distribution was developed by the project biobank®. Sample management procedures were established prior to the start of the project and clinical centers were trained to ensure standardized collection, preparation and transportation of the samples across countries. Samples were collected through the different hospital centers following established standard operating procedures. Genomic DNA and RNA extraction were performed at the Central Biobank (See below).

Flow cytometry was managed at each center after a multi-center harmonization of flow cytometers to ensure mirroring of all instruments, thereby allowing subsequent integration of all the data obtained across the different sites and instruments®.

Databases

The study databases and eCRFs were designed specifically to capture data for the PRECISESADS project. The Inform□ eCRF was the tool used to collect data from each participating site. All data and analyses produced were stored in a project TransMart database® used in collaboration with the eTRIKS IMI consortium and hosted by QuartzBio/Precision Medicine. Data will be transferred to ELIXIR 22 and made available upon request and in agreement with the privacy rules of the European Union.

Ethical Approvals

The Ethical Review Boards of the 18 participating institutions approved the protocol of the CS study. In addition, the boards of the 6 sites involved approved the inception study protocol. The studies adhered to the standards set by International Conference on Harmonization and Good Clinical Practice (ICH-GCP), and to the ethical principles that have their origin in the Declaration of Helsinki (2013). The protection of the confidentiality of records that could identify the included subjects is ensured as defined by the EU Directive 2001 /20/EC and the applicable national and international requirements relating to data protection in each participating country. The CS study is registered with number NCT02890121 , and the inception study with number NCT02890134 in ClinicalTrials.gov. Demographics and other Patient Entry Characteristics

A total of 2623 individuals were included. As OMICs data were not available for 3 patients, 2620 were accessible: 2003 SADs patients and 617 HC. Baseline characteristics are presented in Table 3 and Table 5. Four patients withdrew their consent at entry and their samples were excluded from further analyses.

Mean age of the patients was between 46 and 58 years depending on the indication, HC were on average 47 years old. The majority of SADs patients (85%) and HC (78%) were female, and most were European (98%). SAD duration was on average 12 ± 9 years with a median of 10 years. Mean disease activity score was 23 ± 20, as calculated on a 100mm scale, indicating mild disease activity at the study level but some of the patients reported moderate to severe disease activity (max at 90).

Comorbidities reported across SADs patients were generally limited and with different rates according to the indication. The most frequently reported comorbidities (³ 15%) were abdominal pain for SjS (15%) and MCTD (20%), dyslipidemia for RA (32%), SLE (19%), SSc (26%), pSjS (27%), MCTD or PAPs (29%), UCTD (24%), constipation for SSc (19%) and SjS (16%) and UCTD (20%), obesity with body mass index > 30 kg/m 2 for RA (18%) and PAPs (19%), thyroiditis for SSc or SjS (15%), MCTD (17%) and UCTD (18%) groups.

Proportion of current smokers ranged from 10.5% (lowest) in pSjS to 25.7% (highest) in PAPs and do not differ with that seen in HC (15%); between 4% (UCTD) and 10.5% (SSc) patients had a history of cancer compared to 2% in HC confirming the known increased risk of cancer in SADs 23 . Coronary artery disease affected less than 4% of the overall study population with no trend in any of the diseases studied.

The therapies most frequently used across the SADs patients were antimalarials and steroids with higher rates in SLE (70.4 and 51 .6% respectively). Immuno-suppressants at a dose lower than that defined in the exclusion criteria were also frequently given, mainly in RA (75.5%) and MCTD (40.4%). The use of biologies, which includes any TNFn nblocker or inhibitor, anti-IL6 and abatacept, was almost limited to RA (44.4 %) with very few patients treated in MCTD (5.1 %), UCTD (3%) and SSc (2.5%). The distribution of therapies across diagnoses can be seen in Table 4 and Table 6.

Ciinical Symptoms

As expected, some clinical symptoms were shared across SADs. Abnormal lipid profile was found in all disease groups (³ 30%), arthritis in RA (97%), SLE (70%), SSc (30%), SjS (33%), MCTD (65%) and UCTD (52%); history of Raynaud syndrome in SLE (39%), SSc (97%), MCTD (87%) and UCTD (49%); sicca syndrome in SSc (35%), SjS (93%), MCTD (30%) and UCTD (41 %).

Conversely some symptoms were seen in only a limited number of SADs, for example history of esophageal reflux disease, 67% in SSc and 44% in MCTD; photosensitivity in SLE (58%), SjS (25%), and UCTD (40%); and telangiectasia, 61 % in SSc and 26% in MCTD.

Of the major clinical manifestations, 58% of RA patients had erosive arthritis, 41 % of SLE had biopsy proven nephritis, 31 % proteinuria/24h, 30% of SSc patients had functional ventilator restriction, 50% ischemic digital ulcers/pitting scars, 59% esophageal dysmotility, 37% pulmonary fibrosis, 20% pulmonary arterial hypertension on echography, and 72% skin fibrosis, and 78% of PAPs patients had arterial/venous thrombosis.

Biological samples acquisition and data generation

Several biological samples were acquired from the participants in the study. Molecular and cellular data (genomics, transcriptomics, methylomics, serology and flow-cytometry) was extracted from the different platforms, processed and quality controlled prior to the clustering analyses. The objectives of this preparation were to remove individuals and features with technical issues or incomplete data that could bias the final results.

Genotyping

DNA was extracted using a magnetic-bead nucleic acid isolation protocol (Chemagic DNA Blood Kit special, CHEMAGEN) automated with chemagic Magnetic Separation Module I (PerkinElmer) from K2EDTA BLOOD TUBE (lavender cap, BD Vacutainer) of 10 ml (extractions were performed on 3 ml). 2 pg of DNA were normalized to 100 ng/pg and sent for genotype analysis. DNA was genotyped using the lllumina HumanCore-24 v1 .0 and the Infinium CoreExome-24 v1 .2 genome-wide SNP genotyping platform (lllumina Inc., San Diego, CA, USA). Individuals were excluded on the basis of incorrect gender assignment, high missingness (> 10 %), non- European ancestry (< 55% using Frappe and REAP 25 ) and high relatedness (PLINK v1 q 25 , pijiat > 0.5). Genotypes were filtered before imputation due to high missingness (> 2%), HWE < 0.001 , MAF < 1 % and AT/CG changes with MAF > 40%. Following the quality control, the final directly genotyped dataset contained 223,463 SNP loci of 2,073 individuals. PLINK was used to carry out quality control (QC) measures on an initial set of 2,328 subjects and 306,670 directly genotyped SNPs, present on both platforms. Genotypes were phased using Eagle v2.3— and imputed using MinimacS 22 against the HRC v1 .1 Genomes reference panel using the Michigan Imputation Server platform 22 . Genotypes were filtered after imputation to have HWE p > 0.001 , MAF > 1 % and imputation info score > 0.7 and resulted in 6,664,685 imputed genotypes. RNA Sequencing

Total RNA was extracted from whole blood samples collected in Tempus tubes using Tempus Spin technology (Applied Biosystems). 1857 samples were processed in batches of 384, randomized to four 96-well plates with respect to patient diagnosis, recruitment center and RNA extraction date. The samples were depleted of alpha- and beta-globin mRNAs using globinCLEAR protocol (Ambion) and 1 pg of total RNA as input. Subsequently, 400 ng of globin-depleted total RNA was used for library synthesis with TruSeq Stranded mRNA HT kit (lllumina). The libraries were quantified using qPCR with PerfeCTa NGS kit (Quanta Biosciences), and equimolar amounts of samples from the same 96-well plate were pooled. Four pools were clustered on a high output flow cell (two lanes per pool) using HiSeq SR Cluster kit v4 and the cBot instrument (lllumina). Subsequently, 50 cycles of single-read sequencing were performed on a HiSeq2500 instrument using and HiSeq SBS kit v4 (lllumina). The clustering and sequencing steps were repeated for a total of three runs in order to generate sufficient number of reads per sample. The raw sequencing data for each run were preprocessed using bcl2fastq software and the quality was assessed using FastQC tools—. Cutadapt was used to remove 3’ end nucleotides below 20 Phred quality score and extraneous adapters, additionally reads below 25 nucleotides after trimming were discarded. Reads were then processed and aligned to the UCSC Homo sapiens reference genome (Build hg19) using STAR v2.5.2b S2 . 2-pass mapping with default alignment parameters were used. To produce the quantification data we used RSEM v1 .2.31 resulting in gene level expression estimates (T ranscripts Per Million, TPM and read counts). A sample would pass the RNA single QC if: (i) the number of reads mapped to the genes was more than 7 million and (ii) the RIN (RNA integrity number) value was higher than 7. Finally, we performed a pre-filtering to remove genes that have less than 5 reads in at least 500 samples. 15.476 genes remain after quality control and pre-filtering steps.

Methylation Assay

DNA was extracted from blood samples using the same protocol described in the Genotype assay, 2 pg of DNA were sent for DNA methylation assay. Of the samples, 540 were analyzed using Infinium HumanMethylation450 BeadChips (lllumina, Inc., San Diego, CA, USA) and 1364 participants using Infinium MethylationEPIC BeadChips, due to the discontinuation of the former by the manufacturer. DNA samples were bisulfite-converted using the EZ DNA methylation kit (Zymo Research, Orange, CA, USA). After bisulfite treatment, the remaining assay steps were performed following the specifications and using the reagents supplied and recommended by the manufacturer. The array was hybridized using a temperature gradient program, and arrays were imaged using a BeadArray Reader (lllumina Inc., San Diego, CA, USA). Samples QC and functional normalization was completed using meffil R package 21 . Briefly, during QC steps we removed subjects based on outliers for methylated vs unmethylated signals, deviation from mean values at control probes, and high proportion of undetected probes (using meffil default parameters). DNA methylation probes that overlapped with SNPs (dbSNPs v147 M ), located in sexual chromosomes or considered cross-reactive 86 · 87 were removed. Additionally, only probes quality controlled and shared between both arrays were used in the subsequent analysis (367.309 probes).

Serology Measurements

Autoantibodies were measured in serum using an automated chemiluminescent immunoanalyzer (IDS-iSYS)— . The generated chemiluminescent signal is expressed in Relative Light Units (RLU) and is then converted to un its/m L based on the calibration curves. The final result is indicative of the concentration of the specific autoantibody present in the sample, in the calibration standards, and in the controls. Polyclonal human rheumatoid factors (RF), complement C3c, C4, and individualized (kappa, lambda and kappa/lambda ratio) free light chains (Combilite and freelight, respectively) were measured in serum using a turbidimetric immunoassay method according to manufacturer’s recommendations (SPAPLUS analyser) 22 . Additionally, based on important variations regarding C3c and C4 values obtained for each center, a corrective factor was calculated in order to normalize the data between the centers (Table 7). For anti-chromatin/nucleosome determination, the anti- dsDNA-NcX ELISA assay was performed in serum according to the manufacturer recommendations 22 . Trisodium citrate blood tubes of 2.7 ml were collected for lupus anticoagulant (LA) analysis. The Russell viper venom test was performed with the reagent STA Staclot DRW and processed on a Stago STA-R3027 coagulation analyzer. A normalized ratio (NR)> 1 .2 was used to assess the presence of LA. Cut-offs can be found in Table 8.

Cytokines, chemokines and inflammatory mediators were measured on serum samples. Up to 88 analytes were initially measured using the Luminex™ xMAP technology in 288 CS individuals. After analysis, 14 were selected for determination on the remaining of the samples based on their non-redundancy and differences between cases and controls. BLC, FAS Ligand, GDF-15, IP-10, MCP-2, MCP-4, MIP-1 , MMP-8, TARC, IL-1 Rll, TNF Rl and IL1 -Ra were measured using the Luminex system. The 12-analyte customized panel was built using human pre-mixed multi-analyte Luminex assay (R&D Systems). Samples were thawed on the day of analysis and tested in batches. The concentration of each mediator was calculated from specific standard curves using the software Bio-Plex Manager v6.0 and expressed as pg/ml. Soluble MMP-2, CRP, TNFa, IL-6, BAFF and TGFp were measured by the quantitative sandwich enzyme immunoassay technique, according to the manufacturer’s instructions. Results were calculated referring to a standard curve created using a four-parameter logistic curve fit and were expressed as pg/ml. The sensitivity and the standard curve range of the assays are reported in Table 9.

Flo w-Cytometry

Blood samples were acquired with K2EDTA tubes of 2 ml for flow-cytometry assays and processed immediately. Multi-parameter flow cytometry analyses were performed in eleven centers (participants and centers can be consulted in Table 10). Therefore, the integration of all data in common with bioinformatical and biostatistical analyses has required a fine mirroring of all instruments. The calibration procedure elaborated to achieve this prerequisite and the antibody panels used have been previously described—. To increase the accuracy of the detection of the phenotypes, AltraBio (Lyon, France) developed a strategy to avoid any redundancy in the different cell subsets. Briefly, the generated automatons were validated in a pilot study with 300 patients where we compared data from the automated gating to data manually gated by the same operator (coefficient of correlation 0.9996). The gating strategy was as follows: after exclusion of debris, dead cells or doublets, frequency of CD15 hi CD16 hi neutrophils, CD15 hi CD16 + eosinophils, CD14 + CD15 hi LDGs, CD14 hi CD16 ~ classic monocytes, CD14 +/h 'CD16 + intermediate monocytes, CD14 CD16 + non classic monocytes, CD3 + T cells (with CD4 + CD8-, CD4 + CD8 + , CD4 CD8 + , CD4 CD8 T cell subsets), CD19 + B cells, CD3 CD56 + NK cells (with CD16 CD56 hi and CD16 hi CD56'° NK cell subsets), CD3 + CD56 + NK-like cells, Lin HLA-DR + DCs (with CD1 1 cCD123 hi pDCs, CD1 1 c + CD123 i0 mDCs (with CD14TCD1 c + mDC1 , CD141 + CD1 1 s mDC2 and CD141 CD1 s mDC subsets)) and CD123 + HLA-DR ~ basophils were automatically extracted from FCS and LMD files and sent in an excel flow cytometry workflow.

OMIC Data Quality Control

Each high-dimensional OMIC dataset (genomics, transcriptomics and methylomics) was quality controlled in order to discard samples and features with low quality due to technical issues, and only individuals considered Europeans in the demographic record were kept. After platform-specific data quality controls were performed, samples with information for all OMIC datasets were selected, and shared information across platforms was compared (a detailed summary of the entire quality control process can be found in Table 1 1 and Table 13). In order to ensure the integrity of the origin of the molecular information, two tests were performed:

(i) Gender Agreement: Each OMIC dataset was used to predict the gender of the samples and those predictions were compared with the gender recorded in the eCRF. Samples with any disagreement between gender predictions and the demographic information were discarded from the analysis. Briefly, transcriptome gender prediction was performed on the basis of specific sexual chromosome gene fractions, methylome gender prediction was performed using meffil software and genotype gender prediction outcome was taken from PUNK 22 .

(il) Genotype Agreement: genotypes were obtained from the RNA-Seq datasets using GATK best practices for RNA-Seq variant calling 21 . Also, SNPs included in the lllumina methylation platforms, were used. In average, -1500 SNPs obtained from RNA-Seq variant calling procedure were also present in the lllumina HumanCore-24 v1 .0 and the Infinium CoreExome-24 v1.2. The concordance between the genotype and RNA-Seq variant calling was sought to be the highest across all potential pairs of comparisons in order to pass the quality criteria (in general samples that fulfilled the criteria presented more than 1.8 times concordance than the next second pair). In the case of the methylation platform, the criteria was more stringent as the number of SNPs shared between genotype and methylation platforms was only of 65 for the Infinium HumanMethylation450K array, and 49 for the Infinium Methylation EPIC array. Only samples with concordance higher than 0.8 (80% of agreement) passed the quality criteria.

Tables

Table 1. PRECISESADS Consortium recruitment centers, patients recruitment monitoring and biobank/sample logistics participants.

Participants Center Abbreviation / Country

Lorenzo Beretta Referral Center for Systemic IRCCS / Italy (IT)

Barbara Vigone Autoimmune Diseases,

Fondazione IRCCS Ca’

Granda Ospedale Maggiore

Policlinico di Milano

Jacques-Olivier Pers Centre Hospitalier UBO / France (FR)

Alain Saraux Universitaire de Brest,

Valerie Devauchelle-Pensec Hospital de la Cavale

Divi Cornec Blanche

Sandrine Jousse-Joulin

Bernard Lauwerys Pole de pathologies UCL / Belgium (BE)

Julie Ducreux rhumatismales systemiques

Anne-Lise Maudoux et inflammatoires, Institut de

Recherche Experimental et

Clinique, Universite

catholique de Louvain

Carlos Vasconcelos Centro Hospitalar do Porto CHP / Portugal (PT)

Ana Tavares

Esmeralda Neves

Raquel Faria

Mariana Brandao

Ana Campar Antonio Marinho

Fatima Farinha

Isabel Almeida

Miguel Angel Gonzalez-Gay Servicio Cantabro de Salud, SCS / Spain (ES)

Mantecon Hospital Universitario

Ricardo Blanco Alonso Marques de Valdecilla

Alfonso Corrales Martinez

Ricard Cervera Hospital Clinic I Provicia, IDIBAPS / Spain (ES) Ignasi Rodriguez-Pinto Institut d’lnvestigacions

Gerard Espinosa Biomediques August Pi i

Sunyer

Rik Lories Katholieke Universiteit KU.LEUVEN / Belgium (BE) Ellen De Langhe Leuven

Nicolas Hunzelmann Klinikum der Universitaet zu UKK / Germany (DE) Doreen Belz Koeln, Cologne

Torsten Witte Medizinische Hochschule MHH / Germany (DE) Niklas Baerlecken Hannover

Georg Stummvoll Medical University Vienna MUW / Austria (AT) Michael Zauner

Michaela Lehner

Eduardo Collantes Servicio Andaluz de Salud, SAS.Co / Spain (ES) Rafaela Ortega-Castro Hospital Universitario Reina

M a Angeles Aguirre- Sofia

Zamorano

Alejandro Escudero- Contreras

M a Carmen Castro-Villegas

Norberto Ortego Servicio Andaluz de Salud, SAS.GrE / Spain (ES) Maria Concepcion Complejo hospitalario

Fernandez Roldan Universitario de Granada

(Hospital Universitario San

Cecilio)

Enrique Raya Servicio Andaluz de Salud, SAS.GrN / Spain (ES)

Inmaculada Jimenez Moleon Complejo hospitalario

Universitario de Granada

(Hospital Virgen de las

Nieves)

Enrique de Ramon Servicio Andaluz de Salud, SAS.Ma / Spain (ES) Isabel Diaz Quintero Hospital Regional

Universitario de Malaga

Pier Luigi Meroni Universita degli studi di UNIMI / Italy (IT)

Maria Gerosa Milano

Tommaso Schioppo

Carolina Artusi

Carlo Chizzolini Hospitaux Universitaires de UNIGE / Switzerland (CHE) Aleksandra Zuber Geneve

Donatienne Wynar

Laszlo Kovacs University of Szeged USC / Hungary (HU) Attila Balog

Magdolna Deak

Marta Bocskai

Sonja Dulic

Gabriella Kadar

Falk Hiepe Charite DRFZ / Germany (DE) Velia Gerl

Silvia Thiel

Manuel Rodriguez Maresca Andalusian Public Health BIOBANK / Spain (ES) Antonio Lopez-Berrio System Biobank

Rocio Aguilar-Quesada

Hector Navarro-Linares

Yiannis loannou LJCB Pharma UCB / United Kingdom (UK) Chris Chamberlain (PRECISESADS Project

Jacqueline Marovac office)

Marta Alarcon Riquelme Department of Medical GENYO / Spain (ES) Tania Gomes Anjos Genomics, Center for

Genomics and Oncological

Research

(PRECISESADS Project

office)

Table 2. Main inclusion and exclusion criteria.

Inclusion criteria Exclusion criteria

For all patients For all patients

- Male or female, aged ³ 18 years with no - Neonatal lupus

upper limit - Drug-induced lupus

- Diagnosed according the prevailing - Severe nephrotic syndrome with

criteria for one of the following proteinuria ³ 3,5 g/day

autoimmune diseases

- Patients with stable doses of steroids

Rheumatoid arthritis (RA) >15 mg/day for the last 3 months or with

Scleroderma or systemic sclerosis IV corticosteroids in the last 3 months (SSc) - Patients under immunosuppressant

Primary Sjogren’s syndrome (SjS) treatment in the last 3 months prior to recruitment and patients with combined

Systemic lupus erythematosus (SLE)

therapy using two or more

Primary antiphospholipid syndrome immunosuppressants

(PAPS) • Methotrexate >25mg/week

Mixed connective tissue disease • Azathioprine

(MCTD) >2.5mg/kg/day

Patients with undifferentiated • Cyclosporine

connective tissue disease (UCTD) for A>3 mg/kg/d ay

over 1 year and that do not fulfill the • Mycophenolate Mofetil > diagnosis of any of the above 2gr/day

diseases. - Treatment with cyclophosphamide (any

- Informed consent signed dose or route of administration) or

belimumab in the past 6 months

- Patients on depletative therapy such as rituximab in the last year

- Chronic HBV or HCV infection

For controls

- Individuals on chronic medication - Individuals suffering from any inflammatory autoimmune, allergic or infectious condition, and with a history of autoimmune disease, particularly thyroid disease or other diseases that may modify cellular profiles in blood.

Table 3. Cross-sectional cohort demographic and other entry characteristics.

RA SLE SSc SjS MCTD PAPS UCTD HC

(N=376 (N=469 (N=402 (N=385 (N=99) (N=106 (N=166 (N=617

) ) ) ) ) ) )

Age (years)

58 ±12 46± 14 58± 13 58 ± 13 50 ± 15 48± 12 53 ± 15 47± 13

Female, n 296 435 339 152 480

85 (86) 76 (72)

(%) (79) (93) (84) (94) (92) (78)

Race, n (%)

105 164 604

European (gg)

(96) (98) (98) 91 ^ (99) (99) (98) Asian 4(<1) 3 (< 1 ) 2 (< 1) 4(4) 1 (d) 8(1)

Black 1 (< 1) 9(2) 1 (d) 1 (1)

Other 2 (< 1) 8(2) 5(1) 3 (< 1) 3(3) 1 (< 1) 1 (< 1) 5(d)

SAD 12.9 ± 14.4 ± 10.9 ± 10.7 ± 11.3 ± 10.0 ± 8.9 ± duration * 9.9 9.7 9.0 7.7 9.7 7.6 7.7

RA: rheumatoid arthritis, SLE: systemic lupus erythematosus, SSc: systemic sclerosis, SjS: Sjogren’s syndrome, MCTD: mixed connective tissue disease, PAPs: primary antiphospholipid antibody syndrome, UCTD: undifferentiated connective tissue disease HC: healthy controls, * mean ± SD in years, N: number of patients with assessable values,

% = (n/N) * 100.

Table 4. Cross-sectional cohort treatment by drug classes.

Treatment RA SLE SSc SjS MCTD PAPS UCTD

(N=376) (N=469) (N=402) (N=385) (N=99) (N-106) (N=166)

Biologies 166 10 (3) . 5 (5) 5(3)

(44)

Abatacept 26 (7) 1 (<1) Anti-TNF (Any) 98 (26) 4 (4) 2(1) Tocilizumab 42 (11) 1 (1) 2(1) Antimalarials 61 (16) 330 30 (8) 142 45 (46) 26 (25) 79 (48)

(70) (37)

Immunosuppresant 284 163 106 62 (16) 40 (40) 4 (4) 24 (15) s (76) (35) (26)

Steroids 284 163 106 62 (16) 40 (40) 4 (4) 24 (15)

(76) (35) (26)

Systemic 167 242 99 (25) 81 (21 ) 49 (50) 8 (8) 51 (31 ) antibiotics (44) (52)

RA: rheumatoid arthritis, SLE: systemic lupus erythematosus, SSc: systemic sclerosis, SjS: Sjogren’s syndrome, MCTD: mixed connective tissue disease, PAPs: primary antiphospholipid antibody syndrome, UCTD: undifferentiated connective tissue disease, N: number of patients with assessable values, % = (n/N) * 100.

Table 5. Inception cohort demographic and other entry characteristics.

RA SLE SSc SjS MCTD PAPS UCTD

Number of

samples, n

M000 41 45 42 43 4 4 36

M006 35 37 39 40 4 3 33

M014 34 37 36 40 3 3 31

Age (years) *

M000 55 ± 1 1 37 ± 12 50 ± 13 53 ± 14 42 ± 7 41 ± 18 49 ± 15

M006 55 ± 1 1 36 ± 13 49 ± 12 52 ± 13 42 ± 7 33 ± 1 1 48 ± 15

M014 55 ± 1 1 36 ± 13 50 ± 12 52 ± 13 44 ± 8 33 ± 1 1 49 ± 15

Female, n (%)

M000 32 (78) 35 (78) 36 (86) 42 (98) 3 (75) 3 (75) 30 (83)

M006 27 (77) 29 (78) 33 (85) 39 (98) 3 (75) 3 (100) 27 (82)

M014 26 (76) 30 (81 ) 30 (83) 39 (98) 2 (67) 3 (100) 25 (81 )

Race, n (%)

European (M000) 40 (98) 38 (84) 39 (93) 42 (98) 4 (100) 4 (100) 33 (92)

Asian (M000) 1 (2) 2 (5)

Black (M000) 3 (7) 1 (3)

Other (M000) 4 (9) 1 (2) 1 (2) 2

35 31 (84) 36 (92) 39 (98) 4 (100) 3 (100) 30 (91 )

European (M006) Asian (M006) 2 (5)

Black (M006) 3 (8) 1 (3)

Other (M006) 3 (8) 1 (3) 1 (3) 2 (6)

34 31 (84) 33 (92) 39 (98) 3 (100) 3 (100) 29 (94)

European (M014) / i nm

( 1 UU)

Asian (M014) 2 (6)

Black (M014) 3 (8)

Other (M014) 3 (8) 1 (3) 1 (3) 2 (6)

SAD duration *

M000 0.32 ± 0.48 ± 0.39 ± 0.9 ± 0.26 ± 0.63 ± 1 .1 ±

0.4 0.6 0.6 2.4 0.3 0.2 2.1

M006 0.29 ± 0.49 ± 0.37 ± 0.89 ± 0.26 ± 0.72 ± 1 .2 ±

0.3 0.6 0.5 2.5 0.3 0.2 2.2

M014 0.3 ± 0.46 ± 0.43 ± 0.87 ± 0.26 ± 0.72 ± 1 .2 ±

0.3 0.5 0.6 2.5 0.3 0.2 2.2

RA: rheumatoid arthritis, SLE: systemic lupus erythematosus, SSc: systemic sclerosis, SjS: Sjogren’s syndrome, MCTD: mixed connective tissue disease, PAPS: primary antiphospholipid antibody syndrome, UCTD: undifferentiated connective tissue disease, % = (number of patients with current treatments/N) * 100.

Table 6. Inception cohort treatments by drug classes.

T reatment (M000)

First line:

19 12

Corticosteroids 5 (12) 5 (12) (5 " 0) 7 (19)

(46) (27)

(Hidrox-) chloroquine 5 (12)

Immunosuppresors:

1

Azathioprine

(25)

Cyclophosphamide

Cyclosporine 1 (3)

Leflunomide

Methothrexate (RA first

4 (10) 1 (3) line) 1 (2) Mycophenolic acid

Sulfasalazine

Biologies:

Abatacept

Anti-TNFa (Antagonist)

Belimumab

Rituximab

Tocilizumab

T reatment (M006)

First line:

Corticosteroids

35 1 1

(Hidrox-) chloroquine 8 (23)

(95) 4 < 10 > («) (50) (33)

Immunosuppresors:

Azathioprine 1 (3) 1 (3) 2 (6)

Cyclophosphamide 2 (5) 2 (5)

Cyclosporine 1 (3)

Leflunomide 1 (3) 1 (3) 1 (3)

Methothrexate (RA first 24

2 (5) 5 (13) 1 (3) {5 2 0) 4 (12) line) (69)

Mycophenolic acid 2 (5) 5 (13)

Sulfasalazine

Biologies:

Abatacept 1 (3)

Anti-TNFa (Antagonist) 1 (3) . 1 (3)

Belimumab

Rituximab . 1 (3) 2 (5)

Tocilizumab 1 (3) T reatment (M014)

First line:

13

Corticosteroids 7 (21) 8(26)

(35) (39) 7 (18) (67)

34

(Hidrox-) chloroquine 9 (26) 4 /1-n 18 1 10

(92) 1 (45) (33) (32)

Immunosuppresors:

Azathioprine 4(11) 1 (3) 2(6)

Cyclophosphamide 2(6)

Cyclosporine 1(3)

Leflunomide 1 (3)

Methothrexate (RA first 20

4(11) 6(17) 2(5) {6 2 7) 3(10) line) (59)

Mycophenolic acid 3(8) 6(17)

Sulfasalazine 1 (3)

Biologies:

Abatacept 1 (3)

Anti-TNFa (Antagonist) 4 (12) 1 (3)

Belimumab 1 (3)

Rituximab 4(11)

Tocilizumab 1 (3) 1 (3) 1 (3)

RA: rheumatoid arthritis, SLE: systemic lupus erythematosus, SSc: systemic sclerosis, SjS: Sjogren’s syndrome, MCTD: mixed connective tissue disease, PAPs: primary antiphospholipid antibody syndrome, UCTD: undifferentiated connective tissue disease, N: number of patients with assessable values, % = (n/N) * 100. M000: baseline recruitment, M006: six month sampling, M014: fourteen month sampling.

Table 7. Normalization factor for complement assays by center.

Center C3c C3 Correction C4 C4 Correction

KU. LEUVEN 1 ,08410,3935 1,077 0,16810,8198 1,434 UCL 1,0510,3206 1,112 0,1708+0,06756 1,411

UNIGE 1,01210,3502 1,154 0,1631+0,08716 1,477

SASCo 1,00110,4407 1,167 0,1713+0,08995 1,407

UBO 1,00810,4041 1,167 0,1341+0,07832 1,797

IDIBAPS 0,9714+0,3307 1,202 0,1778+0,06992 1,355

DRFZ 0,913810,4085 1,278 0,1425+0,077 1,691

IRCCS 0,7710,3617 1,517 0,116110,07177 2,076

BIOBANK 0,7349+0,1159 1,589 0,1475+0,04718 1,633

SASGr 0,7335+0,3967 1,592 0,1263+0,07041 1,908

MHH 0,6932+0,2555 1,684 0,119+0,0561 2,025

0,09911+0,0638

SASMa 0,5913+0,37 1,975 2,432

1

UKK 0,540410,1472 2,161 0,1003+0,03894 2,403

0,09135+0,0406

SCS 0,49610,169 2,355 2,638

4

mean±SD (95 th percentile range)

Table 8. Positive cut-offs for serology quantification. cut-off

Denominations

(Customer)

ENA : mix

SSA+SSB+Sm+U1 +Scl7 >1

0+Jo1

anti-SSA/Ro 52+60 >10

anti-SSA/Ro 52 >10

anti-SSA/Ro 60 >10

anti-SSB/La ³10

anti-Sm >10

anti-U1-RNP ³10

anti-Scl70 ³10

anti-Jo1 >10

anti-dsDNA >40

anti-cardioiipin IgG >20

anti-beta 2 GPI IgG >20 anti-cardiolipin IgM >20

anti-beta 2 GPI IgM ³20

anti-CCP2 >10

anti-MPO >20

anti-PR3 >40

anti-centromere B >10

Rheumatoid factor >20

1,168 (0,811-1,570)

Complement C3

g/L

0,241 (0,129-0,392)

Complement C4

g/L

Kappa free light chains 3.3-19.4 mg/L

Lambda free light

5.7-26.3 mg/L

chains

Ratio lambda/kappa 0,26-1,65

Polyclonal

kappa+lambda 7.93-35.78 mg/L

free light chains

Anti-DNA-NcX >100

Table 9. Features of Luminex multiplex panel and ELISA kits. , Sensitivit Standard Curve

Analyte y* Range Serum

Dilution

(pg/ml) (pg/ml)

BLC 11.5 4.2 - 3060 1:2

FAS

1.2 5.67-4130 1:2

Ligand

GDF15 1.2 10.69-7790 1:2

IP-10 118 0.96 - 700 1:2

MCP-2 1.8 5.49 - 4000 1:2

MCP-4 0.4 2.15- 1570 1:2

MIP-1 5.8 44.38 - 32350 1:2

MMP-8 34.2 151.5- 110440 1:2

TARC 8.9 28.52 - 20780 1:2

IL1-RII 18.0 91.51 - 66710 1:2

TNF-RI 41.0 53.54 - 39030 1:2 I L 1 -ra 18.0 9.18 - 6690 1 :2

MMP-2 0.033 0.50 - 32 1 :20

GRP 0.010 0.78 - 50 1 :100

TNFa 0.049 0.156 - 10 None / 1 :2 #

IL-6 0.09 0.156 - 10 None

BAFF 6.44 62.5 - 4000 None / 1 :2 #

TGF i 3.4 23.4 - 1500 1 :200

* Mean minimum detectable dose

#Samples were diluted 1:2 in case of insufficient volume

available

Table 10. PRECISESADS Consortium Flow-Cytometry.

Participants Center

Christophe Jamin U1227, Universite de Brest, Inserm, Labex IGO,

Lucas Le Lann CHU de Brest, Brest, France

Quentin Simon

Benedicte Rouviere

Jacques-Olivier Pers

Concepcion Marahon GENYO, Centre for Genomics and Oncological

Nieves Varela Research Pfizer, University of Granada, Andalusian

Brian Muchmore Regional Government, PTS GRANADA, Granada,

_ Spain _

Aleksandra Dufour Immunology & Allergy, University Hospital and

Montserrat Alvarez School of Medicine, Geneva, Switzerland

Carlo Chizzolini _

Jonathan Cremer Laboratory of Clinical Immunology, Department of

Microbiology and Immunology, KU Leuven,

Leuven, Belgium

Ellen De Langhe Division of Rheumatology, University Hospitals

Leuven and Skeletal Biology and Engineering Research Center, KU Leuven, Leuven, Belgium

Nuria Barbarroja IMIBIC, Reina Sofia Hospital, University of Chary Lopez-Pedrera Cordoba, Cordoba, Spain

Velia Gerl Department of Rheumatology and Clinical

Laleh Khodadadi Immunology, Charite University Hospital, Berlin,

Qingyu Cheng Germany

Anne Buttgereit Bayer AG, Berlin, Germany

Zuzanna Makowska

Aurelie De Groof Pole de Pathologies Rhumatismales

Julie Ducreux Inflammatoires et Systemiques, Institut de Recherche Experimental et Clinique, Universite catholique de Louvain, Brussels, Belgium

Elena Trombetta Laboratorio di Analisi Chimico Cliniche e

Microbiologia - Servizio di Citofluorimetria,

Fondazione IRCCS Ca' Granda Ospedale

Maggiore Policlinico di Milano, Milano, Italy

Tianlu Li Chromatin and Disease Group, Bellvitge

Damiana Alvarez- Errico Biomedical Research Institute (IDIBELL),

Barcelona, Spain

Torsten Witte Klinik fur Immunologie und Rheumatologie, Medical

Katja Kniesch _ University Hannover, Hannover, Germany _

Nancy Azevedo Servigo de Imunologia EX-CICAP, Centro

Esmeralda Neves _ Hospitalar e Universitario do Porto, Porto, Portugal

Sambasiva Rao _ Sanofi Genzyme, Framingham, MA, USA _

Pierre-Emmanuel Jouve AltraBio SAS, Lyon, France

Table 1 1 . Cross-sectional samples quality control summary.

Genotype Transcriptome Methylome

European processed samples (by 9f)99

1857 1756 platform)

Quality control(by platform) 1868 1740 1686

Quality control 1302

Samples integrity (by platform) - 1236 1282

Samples integrity 1222

Table 12. Distribution of cross-sectional clinical diagnosis after quality control.

Clinical Diagnosis # individuals

Healthy Individuals (HC) 267

Systemic Lupus Erythematosus (SLE) 222

Rheumatoid Arthritis (RA) 206

Sjogren’s Syndrome (pSjS) 190

Systemic Scleroderma (SSc) 200

Mixed Connective Tissue Disease (MCTD) 39

Primary Antiphospholipid Syndrome (PAPS) 34

Undifferentiated Connective Tissue Disease 64

(UCTD)

TOTAL 1222 Table 13. Inception samples quality control summary.

Genotype Transcriptome Methylome

European processed samples

173 * 482 496

(by platform)

Quality control (by platform) 483 449 486

Quality control 362

Samples integrity (by platform) - 435 477

Samples integrity 342

Time points complete _„ **

samples

* Inception samples were only genotyped at the baseline.

** Number of individuals with the 3 time points completed and quality controlled.

Table 14. Distribution of inception clinical diagnosis after quality control.

#

#

Clinical Diagnosis # M000 & # M000 & Complete

Individual

M006 M014 individual s

s

Systemic Lupus Erythematosus 23 17 10 9

(SLE)

Rheumatoid Arthritis (RA) 25 19 17 14

Sjogren’s Syndrome (pSjS) 35 25 19 16

Systemic Scleroderma (SSc) 29 20 16 15

Mixed Connective Tissue Disease 3 3 0 0

(MCTD)

Primary Antiphospholipid Syndrome 2 2 2 2

(PAPS)

Undifferentiated Connective Tissue 28 17 14 12

Disease (UCTD)

TOTAL 145 103 78 68

Bibliography

1. Li, Y. R. et al. Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nature medicine 21 , 1018-1027, doi:10.1038/nm.3933 (2015). 2. Baechler, E. C. et al. Interferon-inducible gene expression signature in peripheral blood cells of patients with severe lupus. Proceedings of the National Academy of Sciences of the United States of America 100, 2610-2615, doi:10.1073/pnas.0337679100 (2003).

3. Bennett, L. et al. Interferon and granulopoiesis signatures in systemic lupus erythematosus blood. The Journal of experimental medicine 197, 71 1 -723, doi:10.1084/jem.20021553 (2003).

4. Tani, C. et al. Rhupus syndrome: assessment of its prevalence and its clinical and instrumental characteristics in a prospective cohort of 103 SLE patients. Autoimmunity reviews 12, 537-541 , doi:10.1016/j.autrev.2012.09.004 (2013).

5. Alarcon-Segovia, D. & Cardiel, M. H. Comparison between 3 diagnostic criteria for mixed connective tissue disease. Study of 593 patients. The Journal of rheumatology 16, 328-334 (1989).

6. Sharp, G. C., Irvin, W. S., Tan, E. M., Gould, R. G. & Holman, H. R. Mixed connective tissue disease-an apparently distinct rheumatic disease syndrome associated with a specific antibody to an extractable nuclear antigen (ENA). The American journal of medicine 52, 148- 159 (1972).

7. Skopouli, F. N., Drosos, A. A., Papaioannou, T. & Moutsopoulos, H. M. Preliminary diagnostic criteria for Sjogren's syndrome. Scandinavian journal of rheumatology. Supplement 61 , 22-25 (1986).

8. Wilson, W. A. et al. International consensus statement on preliminary classification criteria for definite antiphospholipid syndrome: report of an international workshop. Arthritis and rheumatism 42, 1309-131 1 , doi:10.1002/1529-0131 (199907)42:7<1309::AID-

ANR1 >3.0.CO;2-F (1999).

9. Peterson, L. S., Nelson, A. M. & Su, W. P. Classification of morphea (localized scleroderma). Mayo Clinic proceedings 70, 1068-1076, doi:10.4065/70.1 1 .1068 (1995).

10. Murphy, D., Mattey, D. & Hutchinson, D. Anti-citrullinated protein antibody positive rheumatoid arthritis is primarily determined by rheumatoid factor titre and the shared epitope rather than smoking per se. PloS one 12, e0180655, doi:10.1371/journal.pone.0180655 (2017).

H . Teruel, M. & Alarcon-Riquelme, M. E. Genetics of systemic lupus erythematosus and Sjogren's syndrome: an update. Current opinion in rheumatology 28, 506-514, doi:10.1097/BOR.0000000000000310 (2016). 12. Vukelic, M., Li, Y. & Kyttaris, V. C. Novel Treatments in Lupus. Frontiers in immunology 9, 2658, doi:10.3389/fimmu.2018.02658 (2018).

13. Casey, K. A. et al. Type I interferon receptor blockade with anifrolumab corrects innate and adaptive immune perturbations of SLE. Lupus science & medicine 5, e000286, doi:10.1 136/lupus-2018-000286 (2018).

14. Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nature methods 1 1 , 333-337, doi:10.1038/nmeth.2810 (2014).

15. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics 9, 559, doi:10.1 186/1471 -2105-9-559 (2008).

16. Ciang, N. C., Pereira, N. & Isenberg, D. A. Mixed connective tissue disease-enigma variations? Rheumatology 56, 326-333, doi:10.1093/rheumatology/kew265 (2017).

17. Alarcon-Segovia, D. Mixed connective tissue disease: some statements. Clinical rheumatology 1 , 81 -83 (1982).

18. Chaussabel, D. et al. A modular analysis framework for blood genomics studies: application to systemic lupus erythematosus. Immunity 29, 150-164, doi:10.1016/j.immuni.2008.05.012 (2008).

19. Li, S. et al. Molecular signatures of antibody responses derived from a systems biology study of five human vaccines. Nature immunology 15, 195-204, doi:10.1038/ni.2789 (2014).

20. Zhang, K. & Kaufman, R. J. From endoplasmic-reticulum stress to the inflammatory response. Nature 454, 455-462, doi:10.1038/nature07203 (2008).

21 . Schlenner, S. et al. NFIL3 mutations alter immune homeostasis and sensitise for arthritis pathology. Annals of the rheumatic diseases 78, 342-349, doi:10.1 136/annrheumdis-2018- 213764 (2019).

22. Sokhi, U. K. et al. Dissection and function of autoimmunity-associated TNFAIP3 (A20) gene enhancers in humanized mouse models. Nature communications 9, 658, doi:10.1038/s41467- 018-03081 -7 (2018).

23. Smith, J. A. Regulation of Cytokine Production by the Unfolded Protein Response; Implications for Infection and Autoimmunity. Frontiers in immunology 9, 422, doi:10.3389/fimmu.2018.00422 (2018). 24. Pollack, B. P. EGFR inhibitors, MHC expression and immune responses : Can EGFR inhibitors be used as immune response modifiers? Oncoimmunology 1 , 71 -74, doi:10.4161/onci.1 .1 .18073 (2012).

25. Wohner, M. et al. Molecular functions of the transcription factors E2A and E2-2 in controlling germinal center B cell and plasma cell development. The Journal of experimental medicine 213, 1201 -1221 , doi:10.1084/jem.20152002 (2016).

26. Best, J. A. et al. Transcriptional insights into the CD8(+) T cell response to infection and memory T cell formation. Nature immunology 14, 404-412, doi:10.1038/ni.2536 (2013).

27. Jefferies, C. A. Regulating IRFs in IFN Driven Disease. Frontiers in immunology 10, 325, doi:10.3389/fimmu.2019.00325 (2019).

28. Jamin, C. et al. Multi-center harmonization of flow cytometers in the context of the

European "PRECISESADS" project. Autoimmunity reviews 15, 1038-1045, doi:10.1016/j.autrev.2016.07.034 (2016).

29. Yarilina, A. & Ivashkiv, L. B. Type I interferon: a new player in TNF signaling. Current directions in autoimmunity 1 1 , 94-104, doi:10.1 159/000289199 (2010).

30. Sproston, N. R. & Ashworth, J. J. Role of C- Reactive Protein at Sites of Inflammation and Infection. Frontiers in immunology 9, 754, doi:10.3389/fimmu.2018.00754 (2018).

31 . Manicone, A. M. & McGuire, J. K. Matrix metalloproteinases as modulators of inflammation. Seminars in cell & developmental biology 19, 34-41 , doi:10.1016/j.semcdb.2007.07.003 (2008).

32. Miller, S. et al. Hypomethylation of STAT1 and HLA-DRB1 is associated with type- 1 interferon-dependent HLA-DRB1 expression in lupus CD8+ T cells. Annals of the rheumatic diseases 78, 519-528, doi:10.1 136/annrheumdis-2018-214323 (2019).

33. Fernando, M. M. et al. Defining the role of the MHC in autoimmunity: a review and pooled analysis. PLoS genetics 4, e1000024, doi:10.1371 /journal.pgen.1000024 (2008).

34. Wu, Y. et al. Isoliquiritigenin prevents the progression of psoriasis-like symptoms by inhibiting NF-kappaB and proinflammatory cytokines. Journal of molecular medicine 94, 195- 206, doi:10.1007/S00109-015-1338-3 (2016).

35. Malfitano, A. M. et al. Arvanil inhibits T lymphocyte activation and ameliorates autoimmune encephalomyelitis. Journal of neuroimmunology 171 , 1 10-1 19, doi:10.1016/j.jneuroim.2005.09.005 (2006). 36. Altman, A. & Kong, K. F. Protein kinase C inhibitors for immune disorders. Drug discovery today 19, 1217-1221 , doi:10.1016/j.drudis.2014.05.008 (2014).

37. Oleksyn, D. et al. Protein kinase Cbeta is required for lupus development in Sle mice. Arthritis and rheumatism 65, 1022-1031 , doi:10.1002/art.37825 (2013).

38. Sekhon, S. & Koo, J. Indirubin: a novel topical agent in the treatment of psoriasis. The British journal of dermatology 178, 21 , doi:10.1 1 1 1 /bjd.16074 (2018).

39. Zhao, Y. et al. Indirubin modulates CD4(+) T-cell homeostasis via PD1 /PTEN/AKT signalling pathway in immune thrombocytopenia. Journal of cellular and molecular medicine 23, 1885-1898, doi:10.1 1 1 1 /jcmm.14089 (2019).

40. Toro-Dominguez, D. et al. Stratification of Systemic Lupus Erythematosus Patients Into Three Groups of Disease Activity Progression According to Longitudinal Gene Expression. Arthritis & rheumatology 70, 2025-2035, doi:10.1002/art.40653 (2018).

41 . Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic acids research 43, e47, doi:10.1093/nar/gkv007 (2015).

42. swamp: Visualization, Analysis and Adjustment of High-Dimensional Data in Respect to Sample Annotations. (2018).

43. Venables, W. N. & Ripley, B. D. Modern Applied Statistics with S. (2002).

44. Hoadley, K. A. et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell 158, 929-944, doi:10.1016/j.cell.2014.06.049 (2014).

45. Brodin, P. et al. Variation in the human immune system is largely driven by non-heritable influences. Cell 160, 37-47, doi:10.1016/j.cell.2014.12.020 (2015).

46. Li, J., Liu, Y., Kim, T., Min, R. & Zhang, Z. Gene expression variability within and between human populations and implications toward disease susceptibility. PLoS computational biology 6, doi:10.1371/journal.pcbi.1000910 (2010).

47. Heyn, H. et al. DNA methylation contributes to natural human variation. Genome research 23, 1363-1372, doi:10.1 101 /gr.154187.1 12 (2013).

48. Barturen, G., Beretta, L., Cervera, R., Van Vollenhoven, R. & Alarcon-Riquelme, M. E. Moving towards a molecular taxonomy of autoimmune rheumatic diseases. Nature reviews. Rheumatology 14, 75-93, doi:10.1038/nrrheum.2017.220 (2018). 49. Weiner 3rd, J. & Domaszewska, T. tmod: an R package for general and multivariate enrichment analysis. PeerJ Preprints 4, e2420v2421 , doi:10.7287/peerj. preprints.2420v1 (2016).

50. Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353-1358, doi:10.1093/bioinformatics/bts163 (2012).

51 . Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis- regulatory elements required for macrophage and B cell identities. Molecular cell 38, 576-589, doi:10.1016/j.molcel.2010.05.004 (2010).

52. Stunnenberg, H. G., International Human Epigenome, C. & Hirst, M. The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery. Cell 167, 1897, doi:10.1016/j.cell.2016.12.002 (2016).

53. Fernandez, J. M. et al. The BLUEPRINT Data Analysis Portal. Cell systems 3, 491 -495 e495, doi:10.1016/j.cels.2016.10.021 (2016).

54. Fox, J. & Weisberg, S. An {R} Companion to Applied Regression. 2 edn, (Sage, 201 1 ).

55. Hothorn, T., Bretz, F. & Westfall, P. Simultaneous inference in general parametric models. Biometrical journal. Biometrische Zeitschrift 50, 346-363, doi:10.1002/bimj.200810425 (2008).

56. Josse, J. & Husson, F. missMDA: A Package for Handling Missing Values in Multivariate Data Analysis. Journal of Statistical Software 70, 1 -31 , doi:doi:10.18637/jss.v070.i01 (2016).

57. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics 81 , 559-575, doi:10.1086/519795 (2007).

58. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1 ,000,000 Profiles. Cell 171 , 1437-1452 e1417, doi:10.1016/j.cell.2017.10.049 (2017).

59. Bousfield, D. et al. Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources. F1000Research 5, doi:10.12688/M OOOresearch.791 1 .1 (2016).

60. Mosca, M., Tani, C., Vagnani, S., Carli, L. & Bombardieri, S. The diagnosis and classification of undifferentiated connective tissue diseases. Journal of autoimmunity 48-49, 50-52, doi:10.1016/j.jaut.2014.01 .019 (2014). 61. Alarcon-Segovia, D. & Cardiel, M. H. Comparison between 3 diagnostic criteria for mixed connective tissue disease. Study of 593 patients. The Journal of rheumatology 16, 328-334 (1989).

62. Aletaha, D. et al. 2010 rheumatoid arthritis classification criteria: an American College of Rheumatology/European League Against Rheumatism collaborative initiative. Annals of the rheumatic diseases 69, 1580-1588, doi:10.1 136/ard.2010.138461 (2010).

63. Hochberg, M. C. Updating the American College of Rheumatology revised criteria for the classification of systemic lupus erythematosus. Arthritis and rheumatism 40, 1725, doi:10.1002/1529-0131 (199709)40:9<1725::AID-ART29>3.0.CO;2-Y (1997).

64. Miyakis, S. et al. International consensus statement on an update of the classification criteria for definite antiphospholipid syndrome (APS). Journal of thrombosis and haemostasis : JTH 4, 295-306, doi:10.1 1 1 1 /j.1538-7836.2006.01753.x (2006).

65. van den Hoogen, F. et al. 2013 classification criteria for systemic sclerosis: an American college of rheumatology/European league against rheumatism collaborative initiative. Annals of the rheumatic diseases 72, 1747-1755, doi:10.1 136/annrheumdis-2013-204424 (2013).

66. Vitali, C. et al. Classification criteria for Sjogren's syndrome: a revised version of the European criteria proposed by the American-European Consensus Group. Annals of the rheumatic diseases 61 , 554-558 (2002).

67. Bossini-Castillo, L., Lopez-lsac, E. & Martin, J. Immunogenetics of systemic sclerosis: Defining heritability, functional variants and shared-autoimmunity pathways. Journal of autoimmunity 64, 53-65, doi:10.1016/j.jaut.2015.07.005 (2015).

68. Beretta, L. et al. in 2016 ACR/ARHP Annual Meeting Vol. 68 (suppl 10) (Arthritis Rheumatol, 2016).

69. Aguilar-Quesada, R. et al. Design and implementation of a model for harmonized and quality sample handling in the PRECISESADS multi-center project. In press (2019).

70. Jamin, C. et al. Multi-center harmonization of flow cytometers in the context of the

European "PRECISESADS" project. Autoimmunity reviews 15, 1038-1045, doi:10.1016/j.autrev.2016.07.034 (2016).

71. Scheufele, E. et al. tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform. AM I A Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science 2014, 96-101 (2014). 72. Bousfield, D. et al. Patterns of database citation in articles and patents indicate long-term scientific and industry value of biological data resources. F1000Research 5, doi:10.12688/M OOOresearch.791 1 .1 (2016).

73. Franks, A. L. & Slansky, J. E. Multiple associations between a broad spectrum of autoimmune diseases, chronic inflammatory diseases and cancer. Anticancer research 32, 1 1 19-1 136 (2012).

74. Tang, H., Peng, J., Wang, P. & Risch, N. J. Estimation of individual admixture: analytical and study design considerations. Genetic epidemiology 28, 289-301 , doi:10.1002/gepi.20064 (2005).

75. Thornton, T. et al. Estimating kinship in admixed populations. American journal of human genetics 91 , 122-138, doi:10.1016/j.ajhg.2012.05.024 (2012).

76. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7, doi:10.1 186/s13742-015-0047-8 (2015).

77. Loh, P. R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nature genetics 48, 1443-1448, doi:10.1038/ng.3679 (2016).

78. Howie, B., Fuchsberger, C., Stephens, M., Marchini, J. & Abecasis, G. R. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature genetics 44, 955-959, doi:10.1038/ng.2354 (2012).

79. Das, S. et al. Next-generation genotype imputation service and methods. Nature genetics 48, 1284-1287, doi:10.1038/ng.3656 (2016).

80. Andrews, S. FastQC: a quality control tool for high throughput sequence data. , 2010).

81 . Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10-12, doi :http://dx.doi.org/10.14806/ej.17.1 .200 (201 1 ).

82. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 , doi:10.1093/bioinformatics/bts635 (2013).

83. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics 12, 323, doi:10.1 186/1471 -2105-12-323 (201 1 ). 84. Min, J. L, Hemani, G., Davey Smith, G., Relton, C. & Suderman, M. Meffil: efficient normalization and analysis of very large DNA methylation datasets. Bioinformatics 34, 3983- 3989, doi:10.1093/bioinformatics/bty476 (2018).

85. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research 29, 308-31 1 (2001 ).

86. Chen, Y. A. et al. Discovery of cross-reactive probes and polymorphic CpGs in the lllumina Infinium HumanMethylation450 microarray. Epigenetics 8, 203-209, doi:10.4161/epi.23470 (2013).

87. Pidsley, R. et al. Critical evaluation of the lllumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome biology 17, 208, doi:10.1 186/s13059- 016-1066-1 (2016).

88. Salma, N. et al. Thrombotic risk assessment and analytical performance of the chemiluminescent analyzer IDS-iSYS for the detection of anti-cardiolipin and anti-beta 2 glycoprotein I autoantibodies. Clinical immunology 194, 92-99, doi:10.1016/j.clim.2018.07.006 (2018).

89. Capaldo, C. et al. The active immunological profile in patients with primary Sjogren's syndrome is restricted to typically encountered autoantibodies. Clinical and experimental rheumatology 34, 722 (2016).

90. Biesen, R. et al. Anti-dsDNA-NcX ELISA: dsDNA-loaded nucleosomes improve diagnosis and monitoring of disease activity in systemic lupus erythematosus. Arthritis research & therapy 13, R26, doi:10.1 186/ar3250 (201 1 ).

91 . Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Current protocols in bioinformatics 43, 1 1 10 1 1 -33, doi:10.1002/0471250953.bi1 1 10s43 (2013).