Title:
PKHD1, THE POLYCYSTIC KIDNEY AND HEPATIC DISEASE 1 GENE
Document Type and Number:
WIPO Patent Application WO/2003/085088
Kind Code:
A2
Abstract:
Autosomal recessive polycystic kidney disease (ARPKD) is a severe form of polycystic kidney disease that presents primarily in infancy and childhood and is characterized by enlarged kidneys and congenital hepatic fibrosis. We have identified PKHD1, the gene mutated in ARPKD. PKHD1 extends over 469 kb, is primarily expressed in fetal and adult kidney and includes at least 86 exons that are variably assembled into a number of alternatively spliced transcripts.

Inventors:
GERMINO GREGORY G
ONUCHIC LUIZ F
NAGASAWA YASU
GUAY-WOODFORD LISA M
SOMOLO STEFAN
FURU VASILE M
Application Number:
PCT/US2003/003410
Publication Date:
October 16, 2003
Filing Date:
February 03, 2003
Assignee:
UAB RESEARCH FOUNDATION (US)
International Classes:
C07K14/47; (IPC1-7): C12N/
Other References:
ONUCHIC ET AL: 'PKHD1, the polycystic kidney and hepatic disease 1 gene, encodes a novel large protein containing multiple immunoglobulin-like plexin-transcription-factor domains and parallel beta-helix 1 repeats' AM. J. HUM. GENET. vol. 70, May 2002, pages 1305 - 1317, XP002978611
XIONG ET AL: 'A Novel Gene Encoding a TIG Multiple Domain Protein Is a Positional Candidate for Autosomal Recessive Polycystic Kidney Disease' GENOMICS vol. 80, no. 1, July 2002, pages 96 - 104, XP004468740
TEFFERI ET AL: 'Primer on medical genomics - Part II: Background principles and methods in molecular genetics' MAYO CLIN. PROC. vol. 77, no. 8, August 2002, pages 785 - 808, XP008010354
JOHNSON ET AL: 'Molecular pathology and genetics of congenital hepatorenal fibrocystic syndromes' J. MED. GENET. vol. 40, 2003, pages 311 - 319, XP002978612
Attorney, Agent or Firm:
Peterson, Gregory T. (1819 Fifth Avenue North Birmingham, AL, US)
Claims:
We claim:
1. An isolated polynucleotide encoding a PKHD1 polypeptide or functional fragments thereof.
2. An isolated polynucleotide encoding a polypeptide of about 4074 amino acid residues, wherein the polypeptide is associated with ARPKD, and wherein the polynucleotide includes about 86 exons and about 469 kb.
3. An isolated oligonucleotide that specifically hybridizes to regions of PKHD1 under stringent conditions and which allows detection of regions of mutation as compared to wild type genes.
Description:
Summary

Autosomal recessive polycystic kidney disease (ARPKD) is a severe form of polycystic kidney disease that presents primarily in infancy and childhood and is characterized by enlarged kidneys and congenital hepatic fibrosis. We have identified PKHD1, the gene mutated in ARPKD. PKHD1 extends over 469 kb, is primarily expressed in fetal and adult kidney and includes at least 86 exons that are variably assembled into a number of alternatively spliced transcripts. The longest continuous open reading frame encodes a 4074 amino acid protein, polyductin, that is predicted to have a single transmembrane spanning domain (TM) near its carboxyl terminus and IPT domains and PbH1 repeats in its amino terminus. Several transcripts encode truncated products that lack the TM and may be secreted if translated. The PKHD1 gene products are members of a novel class of proteins that share structural features with hepatocyte growth factor receptor and plexins and belong to a superfamily of proteins involved in regulation of cellular adhesion and repulsion and of cell proliferation.

Introduction

Autosomal recessive polycystic kidney disease (ARPKD) (MIM 263200) is a hereditary and severe form of polycystic kidney disease affecting the kidneys and biliary tract with an estimated incidence of 1 in 20,000 live births (Zerres et al., 1998). The clinical spectrum is widely variable with most cases presenting in infancy (Guay-Woodford, 1996). The fetal phenotypic features classically include enlarged and echogenic kidneys, as well as oligohydramnios secondary to a poor urine output (Reuss, 1990). Up to 50% of the affected neonates die shortly after birth as a result of severe pulmonary hypoplasia and secondary respiratory insufficiency. Those who survive the perinatal period express widely variable disease phenotypes. In this patient subset, morbidity and mortality are mainly related to severe systemic hypertension, renal insufficiency, and portal hypertension due to portal tract fibrosis (Zerres et al., 1996).

Mutations at a single locus, PKHD1 (Polycystic Kidney and Hepatic Disease 1), are responsible for all typical forms of ARPKD. We have previously mapped PKHD1 to 6p21.1-p12 (Zerres et al., 1994; Guay-Woodford et al., 1995). We subsequently constructed a series of physical and genetic maps that refined the localization of PKHD1 to a candidate region of ~1 cM, delimited by D6S1714 and D6S1024 as respective telomeric and centromeric flanking markers (Lens et al., 1997; Mucher et al., 1998; Park et al., 1999). More recent recombination mapping studies have further reduced the size of the interval to 834 kb with KIAA0057 (CA)28 as the new centromeric boundary (Onuchic et al., in revision).

In the current report, we describe the identification of PKHD1, a novel gene encoded in a minimum of 86 exons that are assembled in a complex pattern of alternative splice variants. The predicted translation products are novel proteins that share homology to a superfamily of proteins involved in the regulation of cell proliferation, and cellular adhesion and repulsion.

Materials and Methods

Patient samples

The patient databases used in this study are from University of Alabama at Birmingham and RWTH University, Aachen, Germany. The diagnostic criteria were the same as previously reported (Zerres et al., 1998). The group of patients studied had clinical features representative of the entire ARPKD clinical spectrum. Pedigrees were recruited and blood samples were obtained under informed consent for ARPKD patients and members of their families. Control DNA from 40 individuals was also obtained after informed consent and an additional 20 control DNA samples were purchased from the Coriell Cell Repository. DNA was extracted as previously described (Eggermann et al., 1993).

Transcription map

Database searches included a systematic surveillance of the UniGene, Sanger, TIGR, Celera (public domain) and GenBank web sites. The gene prediction algorithms FGenesh (Salamov and Solovyev, 2000) and GenScan (Burge and Karlin, 1997; Burset and Guigo, 1996) were used to annotate genomic sequence as it became available.

Expressed sequences were confirmed by RT-PCR across putative splice junctions using adult kidney mRNA as template, PCR-amplification using a panel of multiple tissue cDNA samples as template (Origene), and by Northern-blot analysis using human multiple tissue blots (Clontech).

PKHD1 cDNA isolation

Most of the PKHD1 cDNA products were amplified using adult human kidney double-stranded cDNA purchased from Clontech (Marathon Ready cDNA) as template. A second set of products was generated by RT-PCR using either 20 ng of adult human kidney mRNA (Clontech) or 1.5-4.0 µg of human adult kidney total RNA as template. The total RNA was extracted using Trizol reagent (Invitrogen) and reverse transcribed using random hexamer primers and Superscript reverse transcriptase (Gibco BRL). A final set of cDNA products was amplified using a 1:20 dilution of an oligo-dT primed human adult kidney cDNA library (Gibco BRL). The 5' RACE and 3' RACE experiments were performed according to the manufacturer's instructions (Clontech). All primer sequences used to amplify the set of cDNA products are shown in Supplementary Information Table 1.

Mutation detection

PCR primers flanking individual exons and offset by at least 20 bp from intron-exon junctions (Supplemental Table 2) were designed using the program Primer 3 and used to amplify 20 ng of genomic DNA from patients and controls. In cases of exons larger than 400 bp, several overlapping primers were designed to keep amplicons less than 500 bp in size. Mutation detection was performed using the Transgenomic Wave denaturing high-performance liquid chromatography system (Transgenomic Inc.). PCR products were denatured at 98°C for 4 minutes and allowed to reanneal; 8-12 µl of each amplicon was injected onto the column and eluted with a linear acetonitrile gradient at a flow rate of 0.9 ml/min. The mobile phase consisted of a mixture of buffers A (0.1 M TEAA and 1 mM EDTA) and B (25% acetonitrile in 0.1 M TEAA). The buffer gradient for each amplicon was determined according to the Wavemaker ver. 3.3 (and later ver. 4.1; Transgenomic Inc.) system control software, as was the optimum denaturing temperature required for successful resolution of heteroduplexes. If the resolution was not adequate, a second temperature, typically either 2°C above or below the first, was used to improve resolution. Samples showing altered elution properties not present in controls were sequenced in both directions and sequence variations were identified by visual inspection and comparison of the resulting electropherograms. When amplicons had altered elution properties in both control and patient DNA, they were sequenced in both samples to confirm their identity at the sequence level.
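The tiling of large exons into overlapping amplicons described above can be illustrated with a minimal sketch. This is not the authors' primer design procedure (they used Primer 3); the function name `tile_amplicons`, the 50 bp overlap and the coordinate convention are assumptions made for illustration. Only the 20 bp offset from the intron-exon junction and the requirement that amplicons stay under 500 bp come from the text.

```python
def tile_amplicons(exon_start, exon_end, flank=20, max_len=499, overlap=50):
    """Cover an exon plus flanking intron with overlapping amplicons.

    Coordinates are 1-based and inclusive (an assumed convention).
    flank:   minimum intronic padding on each side (20 bp, from the text).
    max_len: longest allowed amplicon (499 keeps amplicons under 500 bp).
    overlap: bases shared by neighbouring amplicons (hypothetical value).
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    target_start = exon_start - flank
    target_end = exon_end + flank
    amplicons = []
    start = target_start
    while True:
        end = min(start + max_len - 1, target_end)
        amplicons.append((start, end))
        if end == target_end:
            break
        # Step back so the next amplicon shares `overlap` bases with this one.
        start = end - overlap + 1
    return amplicons


# A 101 bp exon fits in one amplicon; a 1200 bp exon needs several.
small = tile_amplicons(1000, 1100)
large = tile_amplicons(1000, 2199)
```

In this sketch the small exon yields a single amplicon, while the 1200 bp exon is split into three pieces, each shorter than 500 bp and each sharing 50 bp with its neighbour, so that a variant near an amplicon boundary is still covered by one of the adjacent products.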

Sequence analysis and protein modeling

The genomic structure and the gene orientation were established by aligning the confirmed expressed sequences with the interval genomic sequence using BLAST2. Sequence homologies were identified using the BLASTP/N/X programs. SMART (Simple Modular Architecture Research Tool) (Schultz et al., 1998; Letunic et al., 2002) and PROSITE were used to identify domain architecture and protein motifs. All analyses were performed using the default parameters.

Northern Analysis

Probes were amplified using cloned gene fragments as template, gel purified, 32P-labeled using the multi-prime method, and hybridized to human adult and fetal MTN blots (Clontech). Hybridizations were performed at either 68°C using ExpressHyb (Clontech) or 42°C using a formamide-based buffer and washed under stringent conditions (68°C for 1 h in 0.1x SSC and 0.1% SDS). Images were obtained using a Phosphoimager (Molecular Dynamics).

Results

Transcription map of the minimal interval

We assembled a transcription map of the minimal interval using database searches, cDNA library screening, genomic sequence annotation with gene structure prediction programs, RT-PCR, and Northern analyses. A total of 35 non-overlapping sets of genes, cDNA clones, ESTs and RT-PCR products were mapped to the critical region (fig. 1A). The genes KIAA0057 (Onuchic et al., 1999), FLJ10466 (Onuchic et al., in revision), MCM3 (Hoffmann, 2000), Interleukin-17 (Rouvier et al., 1993), and ML-1 (Kawaguchi et al., 2001) have been previously described. Each of these genes either has been excluded as a candidate for PKHD1 or was deemed to be an unlikely candidate based on its known function.

Transcripts with kidney expression were preferentially targeted for mutation analysis. One such transcript, a novel 1.82-kb expressed sequence identified by gene structure prediction programs and confirmed by RT-PCR from kidney mRNA, mapped to BAC 442L12, within the distal portion of the critical interval (fig. 1B, C). A second novel transcript, initially identified using similar methods and mapped to BACs 771D21 and 374E4 (fig. 1B, C), was subsequently found to share some, though not all, of its exons with hCT1642763 and a single human EST (BF822430) derived from a kidney tumor library. The 1.03 kb transcript appeared to have a mouse ortholog in UniGene cluster Mm.25855, comprised of ESTs obtained from kidney and liver libraries.

The PKHD1 transcript and genomic organization

As putative expressed sequences were identified in the PKHD1 region, they were systematically analyzed for mutations by screening 20 unrelated affected patients and 20 normal controls using DHPLC (see below). Virtually simultaneously, we discovered variants in these two apparently unrelated expressed sequences that only appeared in our patient samples, not in controls. Among hundreds of amplicons analyzed in the region up to that point, none had given a pattern of variation exclusive to affected individuals. We analyzed an additional 20 control individuals to confirm that neither of these variants was present among 80 control chromosomes. Subsequent RT-PCR studies using kidney mRNA as template established that the BAC 442L12- and hCT1642763-related transcripts were part of the same gene. We went on to determine the structure of the complete transcript and its genomic organization.

Northern analyses (see below) suggested that the PKHD1 transcript was considerably longer and involved more complex splicing variations than was initially suggested by the BAC 442L12- and hCT1642763-related transcripts. We used a PCR-based approach with primers strategically positioned within the BAC 442L12-related and hCT1642763-related transcripts to determine the sequence of the longest open reading frame (ORF) of PKHD1, to elucidate its genomic structure and to define the complex pattern of exon assembly. Human kidney mRNA, an adult human kidney cDNA library and adult human kidney double-stranded cDNA were used as templates (fig. 2). A number of primer combinations and end-clone amplifications, as well as 5' RACE and 3' RACE reactions, were required to establish the composite sequence of the full length gene (Supplementary Information, fig. 1). An adult human kidney cDNA library, human kidney mRNA and total RNA and double-stranded cDNA served as templates for these studies. The sequences of all exons and splicing junctions were determined both by double strand sequencing of PCR products and by comparison with the publicly available genomic sequence of the interval (fig. 2).

These studies provided rigorous confirmation that the BAC 442L12- and hCT1642763-related transcripts were part of the same gene, but they also yielded a number of unanticipated results. We discovered that hCT1646988 (Venter JC et al., 2001), previously reported as an independent gene, was actually part of PKHD1. However, regardless of method used, we were not successful in linking the last three exons of hCT1646988 to the remainder of the PKHD1 transcript. We encountered similar problems with hCT1642763, as RT-PCR, cDNA amplification, or 5' RACE could not confirm the existence of exons 1-3 within the PKHD1 transcript. These same methods did, however, identify two previously unknown exons at the putative 5' end of PKHD1, one of which contained the predicted translation start site. We also identified a small number of errors in the sequence available in public databases. While most were relatively minor, one would be predicted to alter the reading frame of the protein. We have verified the accuracy of our sequence for the regions in question by determining the sequence of both DNA strands in several templates.

These data indicate that PKHD1 spans at least 469 kb of genomic sequence and possibly as much as 643 kb if the unverified 5' and 3' exons of hCT1642763 and hCT1646988, respectively, are included. These results also show that the 3' end of PKHD1 is positioned at least 74 kb distal to the flanking genetic marker, D6S1714 (fig. 1), thus explaining why we found so few meiotic recombinations between these markers and the disease phenotype. The total number of exons we identified in PKHD1 transcripts is 86 (fig. 2B). This is probably a conservative estimate, as it is likely that at least some of the unverified exons in hCT1642763 (exons 1-3, 9, 19, 23, 32, and 33) and hCT1646988 (exons 9-11), in the EST database, or predicted by computer algorithms will ultimately be demonstrated if mRNA or cDNA from a suitable tissue is used as template.

In our attempt to assemble a complete cDNA, we identified a large number of distinct transcripts that had unique combinations of PKHD1 exons. Representative examples are presented in fig. 2B. The absolute number of differentially spliced products is almost certainly far higher since we did not perform an exhaustive analysis of every primer combination. In numerous cases, a PCR reaction that appeared to yield a single amplification product was discovered after cloning and sequencing to include multiple differentially spliced products of nearly identical size. We believe these data provide a likely molecular explanation for the lack of a discrete message seen on Northern blot by labeled PKHD1 gene segments (see below).

Mutation analyses

We performed mutation detection using DHPLC across the 67 exons comprising the longest potential ORF (fig. 2; Supplemental Information, Table 2). We expanded our patient group to 25 individuals (50 disease chromosomes) and our control group to 60 individuals (120 chromosomes). We focused most of our mutation detection efforts on individuals for whom we had family material enabling us to confirm segregation; the individuals represented diverse nationalities and the complete spectrum of clinical disease (Table 1). In all cases where segregation could be established, the mutant alleles resided on separate chromosomes (fig. 3). A minimum of 67 exons (longest ORF) were screened in each individual and in all cases, either 0, 1 or 2 pathogenic variants were found. We identified mutations in 21 of 50 disease chromosomes (42%). Mutations were defined as variants seen in our patients by DHPLC and confirmed by sequencing and not seen by DHPLC in any of the 120 control chromosomes. The finding of these mutations establishes this gene as PKHD1 (Table 1).
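The chromosome counts and the working definition of a mutation given above can be restated as a short sketch. The variable and function names are hypothetical and chosen only for illustration; the numbers (25 patients, 60 controls, 21 mutant chromosomes) and the three criteria are taken from the paragraph above.

```python
# Each individual contributes two copies of chromosome 6.
patients = 25
controls = 60
disease_chromosomes = patients * 2   # 50
control_chromosomes = controls * 2   # 120

# Chromosomes on which a mutation was identified (from the text).
mutant_chromosomes = 21
detection_rate = mutant_chromosomes / disease_chromosomes


def is_reported_mutation(seen_in_patient, confirmed_by_sequencing, control_hits):
    """Apply the stated definition: a variant counts as a mutation only if it
    was seen in a patient by DHPLC, confirmed by sequencing, and absent from
    every control chromosome screened."""
    return seen_in_patient and confirmed_by_sequencing and control_hits == 0


print(f"{detection_rate:.0%} of disease chromosomes carried a detected mutation")
```

Under this definition a variant that also appears on even one of the 120 control chromosomes is not counted, which is why the 42% figure should be read as a lower bound on the true mutation frequency in this cohort.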

Eight different non-conservative missense changes accounted for mutations in 12 of 21 disease chromosomes for which we found mutations. Among these, G9337T (D3139Y) corresponded to the initial variant discovered in the BAC 442L12-related transcript and C329T (T39M) corresponded to the initial variant found in the hCT1642763-related transcript (see above). Six different frame shifting mutations accounted for the remaining 9 disease chromosomes. One individual, 340/1395 (Table 1), had frame shifting mutations in both alleles, consistent with the notion that PKHD1 causes disease by a loss-of-function mechanism. One frame shifting variant, A6117insA, occurred in two unrelated individuals (AL 48 and 340/1395) from geographically distinct locales. Two missense variants also recurred. A886G (I22V) was seen in two individuals without known relation and of distinct national origins. C5092T (R1624W), on the other hand, was identified in a family with known consanguinity and in two other individuals from the same geographical region. This variant may represent a founder allele in this population. With the exception of the consanguineous family, all other individuals are compound heterozygous for mutations where both variants were identified. In individuals in whom only one or no mutations were found, it is likely that the DHPLC screen as applied has failed to detect the second (or either) variant.

The panel of patient material used included both the severe perinatal and milder, later onset disease phenotypes (Table 1). The limited sample size precludes firm conclusions regarding genotype-phenotype correlations, but does permit a few preliminary hypotheses. Among individuals in whom both mutations were identified, the only individual with two chain terminating mutations (340/1395) had the severe phenotype. The three individuals with missense mutations on both alleles (AL 1, AL 47 and AL 52) had the late onset phenotype. Among the five individuals with a chain terminating frame shifting mutation and a missense mutation, two had older onset and three had the severe form; none shared the same missense allele. It is possible that not all missense variants are functionally equivalent; some may result in a hypomorphic allele that allows for a clinically milder course. An expanded study of genotype-phenotype correlations should clarify this point. In light of the complex splicing pattern and multiple transcripts, identification of pathogenic mutations is one means of identifying exons whose presence in a transcript is essential for the function of PKHD1 in the kidney and liver. The current analysis suggests that exons 3, 9, 11, 18, 22, 29, 32, 36, 58, 59 and 61 (of the 67 in the longest ORF) are essential for normal polyductin function. These data also highlight the lack of evidence for clustering of mutations in any one region of the gene or in any functional domain of the putative protein.

Expression features of PKHD1

Commercially acquired human adult and human fetal multiple tissue Northern blots were hybridized with two different probes (exon 59; exons 66-70) to determine the expression pattern of PKHD1. Rather than a single, discrete message, both probes detected a smear that ranged from approximately 8.5 to ~13 kb (fig. 4). The highest level of expression was observed in the fetal and adult kidney samples, consistent with a role of this gene in kidney development and function. In the adult specimen, the peak signal was observed as two diffuse bands of about 9 kb and 12 kb. In fetal kidney, the size distribution of the transcripts appeared to be somewhat lower and more uniform. PKHD1 is also present in the pancreas but at much lower levels. PKHD1 is barely detectable in fetal and adult liver. The remaining tissue samples had no visible signal. Given the diffuse signal from the transcripts, however, we cannot exclude a low level of expression in other organs. Hybridization of the identical blots with a probe recognizing exons 43-46 of human PKD1 yielded discrete, high molecular weight bands of the correct size, excluding non-specific degradation of high molecular weight mRNA in the samples as an explanation for our unusual results.

The PKHD1 gene product: a membrane-anchored protein with multiple IPT domains and PbH1 repeats

The composite cDNA that yielded the longest continuous ORF, amplified from kidney cDNA as two overlapping fragments, is 12.6 kb in length, includes 67 exons and is predicted to encode a protein of 4074 amino acids (Supplementary Information, fig. 2). This novel protein, which we have named polyductin, is predicted by SMART to be an integral membrane protein with a 3857 amino acid extracellular amino terminus, a single transmembrane domain and a very short carboxyl terminus (fig. 5).
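The relationship between the reported protein length and the composite cDNA length can be checked with simple codon arithmetic. The sketch below is illustrative only: the helper name is hypothetical, the 12.6 kb figure is treated as a rounded value, and the split of the non-extracellular residues between the transmembrane domain and the cytoplasmic tail is not given in the text and is not assumed here.

```python
CODON_LENGTH = 3  # nucleotides per amino acid residue


def orf_length_nt(n_residues, include_stop=True):
    """Nucleotides needed to encode n_residues, optionally with a stop codon."""
    return (n_residues + (1 if include_stop else 0)) * CODON_LENGTH


protein_aa = 4074         # predicted polyductin length (from the text)
cdna_nt = 12_600          # composite cDNA, reported as 12.6 kb (rounded)
extracellular_aa = 3857   # predicted extracellular amino terminus

orf_nt = orf_length_nt(protein_aa)           # coding sequence incl. stop codon
noncoding_nt = cdna_nt - orf_nt              # approximate untranslated sequence
remaining_aa = protein_aa - extracellular_aa # TM domain plus carboxyl terminus

print(orf_nt, noncoding_nt, remaining_aa)
```

The open reading frame therefore accounts for roughly 12.2 kb of the 12.6 kb cDNA, leaving only a few hundred nucleotides of untranslated sequence, and about 95% of the predicted residues lie in the extracellular segment.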

BLASTP analysis revealed that polyductin has highest homology (E value = 3e-45) to murine protein D86 (Sato and Taniguchi, in press). The region of homology begins near the amino terminus of both molecules and stretches over most of the full length of D86 (1944 aa). The function of D86 is not known, but it is described in GenBank as a novel protein secreted from lymphocytes. Significant homologies (E value = 9e-09 and 1e-08, respectively) were also observed for two expressed sequences (KIAA1412 and transmembrane protein 2 [AAF21348]) that encode the same novel protein of unknown function (Nagase et al., 2000; Scott et al., 2000).

Several short segments of polyductin were found to have very weak homology to a number of other proteins whose functions are known, including the HGF receptor (HGFR, P08581), and several plexins. Using SMART, we determined that these sequences encoded IPT domains (Ig-like, Plexin, Transcription factor). The structure of several IPT-containing proteins has been determined (Cramer et al., 1997), but their function remains unknown.

IPT domains consist of an immunoglobulin-like fold and proteins that contain IPT domains generally belong to one of two classes: intracellular DNA transcription factors (the Rel family) or single-pass cell surface receptors that are members of the Sema superfamily of proteins (HGFR, Ron, and the large family of plexins). While all members of the Rel family have a single IPT unit that is involved in DNA binding, virtually all of the receptor proteins contain multiple IPT domains, often tandemly positioned. Topology predictions indicate that polyductin contains six IPT domains within its extracellular segment (fig. 5). The overall similarity between polyductin and receptor molecules such as HGFR and plexin 3A suggests a similar function for polyductin. However, there are differences in structure that suggest polyductin is unique. Polyductin lacks Sema and PSI domains, common to all other members of the Sema superfamily. It also lacks an intracellular kinase domain present in HGFR and other conserved cytoplasmic sequences present in plexin subclasses.

SMART analysis identified a second motif within polyductin that might provide additional functional clues. The program revealed a minimum of nine (and possibly a tenth) parallel beta-helix (PbH1) repeats clustered within three groups between the last IPT domain and the TM domain (fig. 5). PbH1 repeats are most commonly associated with polysaccharidases, and within this enzyme class, bacterial polysaccharidases are the most extensively studied. These bacterial enzymes serve as important virulence factors for plant pathogens as they allow bacteria to degrade plant cell wall polysaccharides. The PbH1 repeats are essential for enzyme function, forming both the ligand-binding and catalytic sites. The presence of multiple PbH1 domains within polyductin suggests that polyductin may have similar catalytic properties.

Motif analysis with the PROSITE program identified multiple potential N-glycosylation sites and a single arginine-glycine-aspartate (RGD) domain. This motif is found in fibronectin and numerous other proteins where it has been shown to play a role in cell adhesion. In addition, several putative cAMP/cGMP phosphorylation sites were identified within the cytoplasmic carboxyl terminus. No tyrosine phosphorylation consensus sites were recognized within this cytoplasmic tail, further distinguishing polyductin from members of the plexin family.

Finally, we examined how the various splicing arrangements might affect the structure of the protein(s). If, in fact, some of the alternatively spliced products are also translated, then the gene products are predicted to fall into two broad groups. One group, which includes the longest continuous ORF but may also include molecules lacking some middle domains, has a single TM element and is likely to be associated with the plasma membrane (polyductin-M). The other set lacks a TM domain and thus its members may be secreted (polyductin-S) (fig. 5).

Discussion

We report the initial description and characterization of a novel gene, PKHD1, implicated in all typical forms of ARPKD. Multiple lines of evidence strongly support the pathogenic role of mutations in this gene in ARPKD. First, the genomic structure of this candidate extends over nearly 50% of the critical PKHD1 interval defined by recombination mapping. This observation alone provides a high prior probability that this gene is the disease-susceptibility locus. Secondly, we have shown that this gene is expressed predominantly in the kidney, an organ invariably involved in this disorder. Thirdly, we have identified a large number of missense and protein truncating mutations that are found only in affected individuals and not in a large number of controls. For individuals in whom we have identified two mutations, we have shown that the mutations occur on separate haplotypes and segregate with the disease chromosome. In no case did we identify an individual who had more than two putative pathogenic variants.
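The statement that the gene extends over nearly 50% of the critical interval can be reconciled with the sizes reported earlier by a rough calculation. This is a sketch under stated assumptions, not a figure given by the authors: it assumes that the portion of the gene lying distal to D6S1714 (at least 74 kb) falls outside the 834 kb interval and that the remainder of the minimum 469 kb span lies inside it. All three input values are quoted from the text, where two of them are given as lower bounds.

```python
interval_kb = 834        # critical interval from recombination mapping
gene_span_kb = 469       # minimum genomic span of PKHD1
beyond_marker_kb = 74    # minimum distance of the 3' end distal to D6S1714

# Assumed: everything distal to the flanking marker lies outside the interval.
gene_inside_interval_kb = gene_span_kb - beyond_marker_kb
fraction_of_interval = gene_inside_interval_kb / interval_kb

print(f"{gene_inside_interval_kb} kb inside the interval "
      f"({fraction_of_interval:.1%} of {interval_kb} kb)")
```

On these assumptions roughly 395 kb of the gene lies within the interval, a little under half of it, which is consistent with the "nearly 50%" stated above.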

The PKHD1 gene and its translation products have several distinctive features that warrant special note. First, with a minimum genomic size of 469 kb, PKHD1 is among the largest human genes characterized to date. Second, the gene encodes a complex and extensive array of splice variants discovered by RT-PCR and cDNA cloning and confirmed by Northern blotting. We excluded the possibility that the diffuse signal observed on Northern blots results from degradation of RNA by the presence of an intact 14 kb PKD1 transcript on the same blots. Moreover, the multiplicity of different transcripts discovered in public databases, revealed by RT-PCR of kidney mRNA, and amplified from aliquots of double stranded cDNA as well as a cDNA library, correlates well with the Northern results. It is important to emphasize that almost all of the exons exhibit consensus donor and acceptor splice sites, further supporting the conclusion that these are legitimate transcripts.

The multiplicity of splicing variants observed for PKHD1 is an uncommon feature of mammalian genes. Preliminary studies of mouse tissue suggest that the complicated splicing pattern is likely conserved (YN, unpublished observations), indicating a functional role for this property. The abundance of these splice variants in poly-A enriched samples indicates that many, if not all, are fully processed to include a poly-A tail. It is not presently known how many of the transcripts are actually translated into protein. In the event that most of the mRNAs are translated, it could mean that this single gene might encode numerous distinct polypeptides.

Similar findings have been reported for the neurexin family of genes (Missler and Sudhof, 1998). Just three genes may encode more than 1000 isoforms that differ in size and amino acid sequence through alternative splicing. Interestingly, the general structure of the largest gene products is similar to that of polyductin-M. Likewise, a subset of the transcripts is predicted to contain stop codons and produce secreted proteins without a transmembrane region. The neurexin family of proteins is expressed in neurons where they function as receptors important for neuronal cell recognition.

Such a complicated pattern of splicing poses a significant challenge in predicting the functional consequences of putative pathogenic mutations. For most genes, the implications of protein-truncating mutations are relatively easily defined. Loss of critical domains usually results either in constitutive activation or functional loss. In the case of PKHD1, many of the normal splicing products are predicted to yield truncated proteins that lack critical domains including the transmembrane region and cytoplasmic tail of polyductin. Similar outcomes are predicted for many of the mutations described herein, yet the disease caused by these mutations is a de facto bioassay for normal polyductin function. We suggest two potential explanations to reconcile these findings. First, all of the observed mutations are predicted to alter the sequence of the largest ORF. This may suggest that a critical amount of the full-length protein is necessary for normal function. An alternative possibility is that mutations disrupt a critical functional stoichiometric or temporal balance between the different protein products that is normally maintained by elaborate, tightly regulated splicing patterns.

The Northern data suggest that PKHD1 is predominantly expressed in the kidney, consistent with the observed phenotype in ARPKD. Much lower transcript expression was detected in liver, not an unexpected finding given that biliary ductules, which are abnormal in ARPKD, comprise only a small fraction of the total tissue. The fetal expression pattern of PKHD1 is consistent both with the observation that renal and hepatic abnormalities develop in utero and the hypothesis that disease pathogenesis involves a defect in terminal epithelial differentiation (Calvet, 1993). Continued expression of PKHD1 in adult tissues suggests an additional, undefined role for its gene product in mature, terminally differentiated organs. PKHD1 expression was also observed in the pancreas and placenta, at levels slightly greater than and lower than that observed in the liver, respectively. A disease-associated phenotype has not been described in either organ of ARPKD patients. This situation is not dissimilar to that found in dominant polycystic kidney disease, where pancreatic cysts were an under-appreciated manifestation of the disease until the role of the polycystin genes in pancreatic development became apparent from mouse studies (Lu et al., 1997; Wu et al., 1998). Pancreatic cysts do not result in clinical symptoms in dominant polycystic disease (Nicolau et al., 2000).

We propose that the transcript with the longest ORF is the likeliest gene product of PKHD1 since it is the only transcript that would be altered by all of the mutations we have described. Polyductin shares some structural features with the Ron class of tyrosine kinase receptors and the plexin superfamily and thus may also function to regulate cell-cell recognition or cell motility. However, the PKHD1 gene product(s) lack key structural elements of these protein classes, suggesting that the mechanism of action will differ from the others. The presence of multiple PbH1 repeats in polyductin suggests a possible role for this molecule in carbohydrate recognition and modification. Targets for binding could include carbohydrate moieties present either in glycoproteins on the cell surface or in the matrix of the basement membrane; interactions with polyductin may modulate cell-cell or cell-matrix attachments. One intriguing possibility is that the variable number of IPT and PbH1 domains encoded by some of the shorter transcripts could result in products with different specificities or binding affinities for target factors, as has been postulated for the neurexins (Missler and Sudhof, 1998). We presently are unable to determine whether polyductin serves primarily as a receptor, ligand or membrane-associated enzyme.
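The argument that only the longest-ORF transcript is altered by every reported mutation can be expressed as a simple set-containment check. The sketch below is illustrative only: the mutated exon numbers are those listed in Table 1, but the two shorter transcripts and their exon content are hypothetical, and the check simplifies "altered by a mutation" to "contains the mutated exon".

```python
# Exons (numbered on the 67-exon longest-ORF transcript) that carry the
# variants listed in Table 1.
MUTATED_EXONS = {3, 9, 11, 18, 22, 29, 32, 36, 58, 59, 61}

# The longest ORF uses exons 1-67 by definition of the numbering scheme.
# The two shorter transcripts are HYPOTHETICAL and exist only to illustrate
# the check; they are not exon maps of real PKHD1 splice variants.
TRANSCRIPTS = {
    "longest_ORF": set(range(1, 68)),
    "hypothetical_short_A": set(range(1, 35)),
    "hypothetical_short_B": set(range(1, 20)) | set(range(40, 68)),
}


def transcripts_hit_by_all(transcripts, mutated_exons):
    """Return the names of transcripts that contain every mutated exon."""
    return sorted(
        name for name, exons in transcripts.items() if mutated_exons <= exons
    )


def exons_missed(transcripts, mutated_exons):
    """For each transcript, list the mutated exons it does not contain."""
    return {
        name: sorted(mutated_exons - exons)
        for name, exons in transcripts.items()
    }


if __name__ == "__main__":
    print(transcripts_hit_by_all(TRANSCRIPTS, MUTATED_EXONS))
    print(exons_missed(TRANSCRIPTS, MUTATED_EXONS))
```

With real exon maps for each splice variant in place of the hypothetical ones, the same check would show which transcripts escape at least one of the reported mutations.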

As noted above, polyductin has a unique combination of structural features not previously observed in a single molecule. The discovery of a second protein, D86, with a very similar pattern suggests that the gene products of the D86 and PKHD1 loci may be prototypes of a novel class of proteins. From a structural perspective, D86 is most similar to the polyductin-S family of polypeptides. By analogy with polyductin, we propose the possible existence of a larger, membrane-associated form of D86. The fact that D86 is described as a secreted protein further supports our hypothesis that polyductin-S may have the same properties.

We have identified the gene responsible for autosomal recessive polycystic kidney disease and determined that it is a novel, large and complex gene. While this complexity may pose some challenges with respect to the implementation of DNA-based diagnostic testing, the discovery of PKHD1 should provide important biologic insights into epithelial differentiation and organogenesis. In addition, these new insights should help establish a platform for developing targeted therapeutic interventions for patients with this often devastating disease.

Acknowledgments

The authors would like to thank the many families and their health care providers who have cooperated with these studies. This work was supported by NIH R01 DK51259, FAPESP 2000/00280-3 and the Deutsche Forschungsgemeinschaft. We thank Gabi Muecher, Jutta Becker and Kirstin Mangasser-Stephan for their work.

Electronic-Database Information

Accession numbers and URLs for data in this article are as follows:

OMIM, http://www3.ncbi.nlm.nih.gov/Omim/
GenBank, http://www.ncbi.nlm.nih.gov/Genbank/ (for the sequences of all 86 exons and the composite cDNA with the longest ORF; accession number pending)
UniGene, http://www.ncbi.nlm.nih.gov/UniGene/
TIGR Databases, http://www.tigr.org/tdb/
Celera, http://public.celera.com/cds/login.cfm
BLASTP/N/X/2, http://www.ncbi.nlm.nih.gov/blast/
SMART, http://smart.embl-heidelberg.de/
PROSITE, http://ca.expasy.org/
FGenesh, http://genomic.sanger.ac.uk/gf/gf.shtml
GenScan, http://bioweb.pasteur.fr/seqanal/interfaces/genscan.html
Sanger Centre Sequence Database, http://www.sanger.ac.uk/HGP/sequence/
Primer 3, http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi

Accession Numbers: PbH1 repeat, SMART #SM0710; Sema domain, SMART #SM0630; PSI domain, SMART #SM0423; IPT domain, SMART #SM0429; KIAA1412 and transmembrane protein 2, Accession #AAF21348; HGFR, Accession #P08581; Plexin A3, Accession #P51805.

References

Burge C, Karlin S (1997). Prediction of complete gene structures in human genomic DNA. J Mol Biol 268: 78-94.

Burset M, Guigo R (1996). Evaluation of gene structure prediction programs. Genomics 34: 353-367.

Calvet JP (1993). Polycystic kidney disease: primary extracellular matrix abnormality or defective cellular differentiation? Kidney Int 43: 101-8.

Cramer P, Larson CJ, Verdine GL, Muller CW (1997). Structure of the human NF kappaB p52 homodimer-DNA complex at 2.1 A resolution. EMBO J 16: 7078-90.

Eggermann T, Nothen MM, Propping P, Schwanitz G (1993). Molecular diagnosis of trisomy 18 using DNA recovered from paraffin embedded tissues and possible implications for genetic counselling. Ann Genet 36: 214-6.

Guay-Woodford LM (1996). Autosomal recessive disease: clinical and genetic profiles. In: Polycystic Kidney Disease. V. Torres and M. Watson (eds.). Oxford, Oxford University Press: 237-267.

Guay-Woodford LM, Muecher G, Hopkins SD, Avner ED, Germino GG, Guillot AP, Herrin J, Holleman R, Irons DA, Primack W, et al. (1995). The severe perinatal form of autosomal recessive polycystic kidney disease (ARPKD) maps to chromosome 6p21.1-p12: implications for genetic counseling. Am J Hum Genet 56: 1101-1107.

Hoffmann Y, Becker J, Wright F, Avner E, Mrug M, Guay-Woodford L, Somlo S, Zerres K, Germino GG, Onuchic LF (2000). Genomic structure of the gene for the human P1-protein (MCM3) and its exclusion as a candidate gene for Autosomal Recessive Polycystic Kidney Disease. Eur J Hum Genet 8: 163-166.

Kawaguchi M, Onuchic LF, Li X-D, Essayan DM, Schroeder J, Xiao H-Q, Liu MC, Germino G, Huang SK (2001). Identification of a novel cytokine and its expression in subjects with asthma. J Immunol 167: 4430-5.

Lens XM, Onuchic LF, Wu G, Hayashi T, Daoust M, Mochizuki T, Santarina LB, Stockwin JM, Mucher G, Becker J, et al. (1997). An integrated genetic and physical map of the autosomal recessive polycystic kidney disease region. Genomics 41: 463-466.

Letunic I, Goodstadt L, Dickens NJ, Doerks T, Schultz J, Mott R, Ciccarelli F, Copley RR, Ponting CP, Bork P (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Res 30: 242-244.

Lu W, Peissel B, Babakhanlou H, Pavlova A, Geng L, Fan X, Larson C, Brent G, Zhou J (1997). Perinatal lethality with kidney and pancreas defects in mice with a targeted Pkd1 mutation. Nat Genet 17: 179-181.

Missler M, Sudhof TC (1998). Neurexins: three genes and 1001 products. Trends Genet 14: 20-26.

Muecher G, Becker J, Knapp M, Buttner R, Moser M, Rudnik-Schoneborn S, Somlo S, Germino G, Onuchic L, Avner E, et al. (1998). Fine mapping of the autosomal recessive polycystic kidney disease locus (PKHD1) and the genes MUT, RDS, CSNK2 beta, and GSTA1 at 6p21.1-p12. Genomics 48: 40-45.

Nagase T, Kikuno R, Ishikawa KI, Hirosawa M, Ohara O (2000). Prediction of the coding sequences of unidentified human genes. XVI. The complete sequences of 150 new cDNA clones from brain which code for large proteins in vitro. DNA Res 7: 65-73.

Nicolau C, Torra R, Bianchi L, Vilana R, Gilabert R, Darnell A, Bru C (2000). Abdominal sonographic study of autosomal dominant polycystic kidney disease. J Clin Ultrasound 28: 277-82.

Onuchic LF, Mrug M, Hou X, Nagasawa Y, Furu L, Eggermann T, Bergmann C, Muecher G, Avner ED, Zerres K, Somlo S, Germino GG, Guay-Woodford LM. Refinement of the autosomal recessive polycystic kidney disease (PKHD1) interval and exclusion of an EF hand-containing gene as PKHD1 candidate gene. In revision by Am J Med Genet.

Onuchic LF, Mrug M, Lakings AL, Muecher G, Becker J, Zerres K, Avner ED, Dixit M, Somlo S, Germino GG, Guay-Woodford LM (1999). Genomic organization of the KIAA0057 gene that encodes a TRAM-like protein and its exclusion as a Polycystic Kidney and Hepatic Disease 1 (PKHD1) candidate gene. Mamm Genome 10: 1175-1178.

Park JH, Dixit MP, Onuchic LF, Wu G, Goncharuk AN, Kneitz S, Santarina LB, Hayashi T, Avner ED, Guay-Woodford L, et al. (1999). A 1 Mb BAC/PAC-based physical map of the autosomal recessive polycystic kidney disease gene (PKHD1) region on chromosome 6. Genomics 57: 249-255.

Reuss A, Wladimiroff JW, Stewart PA, Niermeijer MF (1990). Prenatal diagnosis by ultrasound in pregnancies at risk for autosomal recessive polycystic kidney disease. Ultrasound Med Biol 16: 355-9.

Rouvier E, Luciani MF, Mattei MG, Denizot F, Golstein P (1993). CTLA-8, cloned from an activated T cell, bearing AU-rich messenger RNA instability sequences, and homologous to a herpesvirus saimiri gene. J Immunol 150: 5445-56.

Salamov AA, Solovyev VV (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10: 516-22.

Sato H, Taniguchi M. Novel protein secreted from lymphocytes. In press.

Schultz J, Milpetz F, Bork P, Ponting CP (1998). SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci USA 95: 5857-64.

Scott DA, Drury S, Sundstrom RA, Bishop J, Swiderski RE, Carmi R, Ramesh A, Elbedour K, Srikumari Srisailapathy CR, Keats BJ, Sheffield VC, Smith RJ (2000). Refining the DFNB7-DFNB11 deafness locus using intragenic polymorphisms in a novel gene, TMEM2. Gene 246: 265-274.

Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, et al. (2001). The sequence of the human genome. Science 291: 1304-51.

Wu G, D'Agati V, Cai Y, Markowitz G, Park JH, Reynolds DM, Maeda Y, Le TC, Hou H Jr, Kucherlapati R, Edelmann W, Somlo S (1998). Somatic inactivation of Pkd2 results in polycystic kidney disease. Cell 93: 177-188.

Zerres K, Mucher G, Bachner L, Deschennes G, Eggermann T, Kaariainen H, et al. (1994). Mapping of the gene for autosomal recessive polycystic kidney disease (ARPKD) to chromosome 6p21-cen. Nature Genet 7: 429-32.

Zerres K, Mucher G, Becker J, Steinkamm C, Rudnik-Schoneborn S, Heikkila P, Rapola J, Salonen R, Germino GG, Onuchic L, et al. (1998). Prenatal diagnosis of autosomal recessive polycystic kidney disease (ARPKD): molecular genetics, clinical experience, and fetal morphology. Am J Med Genet 76: 137-44.

Zerres K, Rudnik-Schoneborn S, Deget F, Holtkamp U, Brodehl J, Geisert J, Scharer K (1996). Autosomal recessive polycystic kidney disease in 115 children: clinical presentation, course and influence of gender. Acta Paediatr 85: 437-445.

Figure legends

Figure 1. Chromosomal localization and genomic organization of PKHD1.

A. Schematic representation of chromosome 6p12. Currently known genes are identified on the far left (italics) while STSs/polymorphic markers are on the right. The closest flanking genetic markers that define the minimal PKHD1 interval are indicated in bold. Not all of the 35 overlapping sets of expressed sequences described in the text are shown.

B. Genomic organization of PKHD1. Exons identified by numbers have been shown to be part of PKHD1 transcripts. Letters indicate exons that belong to hCT1642763 or hCT1646988 that have not been confirmed by our analyses. The symbol "-" identifies overlapping exons of hCT1642763 whose boundaries differ from ours. The arrows indicate the positions of the BAC442L12-related and hCT1642763-related transcripts described in the text.

C. BACs and PACs sequenced by the Sanger Centre that cover the interval.

Figure 2. Structure of full-length PKHD1 and its splicing variants.

A. The set of 71 non-overlapping exons that spans the full length of PKHD1 is shown in the top row. Fifteen additional overlapping exons (gray boxes) that use different splice sites are presented below. Exons that are not present in the cDNA that encodes the longest ORF are indicated with hatched boxes. The positions of important protein domains are as indicated.

B. The approximate location of each primer set used to amplify various cDNAs is shown, and a representative set of amplified products is indicated below each schema. White boxes indicate non-coding exons in the corresponding transcripts while gray boxes identify exons with alternative boundaries (fig. 2A). The templates used for each amplification are as follows: a) adult human kidney double-stranded cDNA for primer sets 1-4, 6 and 8; b) human kidney mRNA for primer sets 5 and 7; and c) adult human kidney cDNA library for primer set 9. "SC" identifies the approximate location of stop codons while "ORF" indicates that an open reading frame extends throughout the length of the fragment.

C. The longest ORF identified by RT-PCR/cDNA amplification is shown. It is the composite sequence of products 2.1 and 4.1 of fig. 2B and includes a total of 67 exons.

Figure 3. Segregation of PKHD1 mutations in families.

Representative family segregation analyses of PKHD1 mutations for patients AL 1, AL 11, AL 36, AL 48 and AL 52 (see Table 1). Sequence electropherograms showing wild type and mutant sequences for amplicons containing the respective variants in each family are given to the right of each pedigree figure. Traces labeled "mutant" show heterozygous alterations in genomic PCR products. Segregation of the mutant allele (denoted with Ex followed by the exon number) is shown for each individual studied.

AL 1 has two missense changes, while AL 11, AL 36 and AL 48 each have a missense and a frame-shifting mutation. AL 52 and her sibling, products of a consanguineous union, are homozygous for the Ex32 mutation (trace labeled "patients"). The trace labeled "mutant" is from the heterozygous father. Filled symbols, affected individuals; open symbols, not affected.

Figure 4. PKHD1 expression profile.

A. Human adult multiple tissue Northern blot probed with PKHD1 exon 59. Lane 1: Pancreas; 2: Kidney; 3: Skeletal muscle; 4: Liver; 5: Lung; 6: Placenta; 7: Brain; 8: Heart.

B. Same blot as in A, probed with PKD1. The arrow indicates the position of a known splicing variant of PKD1.

C. Human fetal multiple tissue Northern blot probed with PKHD1. Lane 1: Kidney; 2: Liver; 3: Lung; 4: Brain.

D. Same blot as in C, probed with PKD1.

Figure 5. Structure of polyductin and related proteins.

Multiple tandemly repeated IPT domains are a common feature of the group. Polyductin-M shares the general structure of the HGF receptor and Plexin A3 in having a long extracellular domain, a single TM domain, and a short cytoplasmic carboxyl terminus, while Polyductin-S is more like D86.

Table 1: Mutations in PKHD1.

| Patient ID | ARPKD phenotype | Nucleotide change (a) | ORF change (a) | Exon (a) | Parent (b) | Country |
|---|---|---|---|---|---|---|
| AL 1 | older onset | C329T | T36M | 3 | F | USA |
| | | A886G | I222V | 9 | M | |
| AL 11 | perinatal onset | C3982delCCinsG | fs (1254) (c), termination codon 1301 | 32 | F | USA |
| | | G3586A | G1122S | 29 | M | |
| AL 18 | older onset | C9052insC | fs (2944), termination codon 2949 | 58 | F | South African Afrikaner |
| | | A886G | I222V | 9 | M | |
| AL 36 | perinatal onset | T9092C | I2957T | 58 | F | USA |
| | | A6117insA | fs (1965), termination codon 1969 | 36 | M | |
| AL 45 | older onset | C5092T | R1624W | 32 | F | Saudi Arabia |
| AL 47 | older onset | G2501A | R760H | 22 | F | Saudi Arabia |
| | | C5092T | R1624W | 32 | M | |
| AL 48 | older onset | A6117insA | fs (1965), termination codon 1969 | 36 | F | USA |
| | | T979C | F253L | 11 | M | |
| AL 52 (d) | older onset | C5092T | R1624W | 32 | F | Saudi Arabia |
| | | C5092T | R1624W | 32 | M | |
| 376/1559 | perinatal onset | G1842insAGTT | fs (541), termination codon 556 | 18 | F | Germany |
| | | G9637T | D3139Y | 59 | M | |
| 340/1395 | perinatal onset | A6117insA | fs (1965), termination codon 1969 | 36 | F | UK |
| | | delAATG933-936 | fs (237), termination codon 244 | 11 | M | |
| 306/1272 | unknown | delG10297 | fs (3359), termination codon 3399 | 61 | M | Turkey |
| 291/1207 | older onset | delT3528 | fs (1101), termination codon 1102 | 29 | M | Turkey |

(a) Nucleotide, codon and exon numbers are all based on the predicted 67-exon transcript of the longest ORF. The first nucleotide of the composite cDNA sequence (Supplementary Information fig. 1) is nucleotide 1 and the initiation methionine is codon 1. The initiation codon occurs in exon 2 of the 67-exon predicted transcript (see text and methods). None of these variants were identified in 120 control chromosomes.
(b) Refers to the parent in which the specific variant is identified. F=father, M=mother.
(c) fs=frame shift (codon affected); termination codon refers to the altered reading frame.
(d) Consanguineous union.
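
As a consistency check on the numbering convention in footnote a, the following sketch maps a cDNA nucleotide position to a codon number. The position of the first nucleotide of the initiation codon (223) is not stated in the text; it is an assumption inferred here from the missense entries of Table 1, and the function and constant names are ours.

```python
import re

# ASSUMPTION: cDNA position of the "A" of the initiating ATG. This value is
# inferred from the missense rows of Table 1 and is not given in the source.
START_CODON_POSITION = 223

# (nucleotide change, ORF change) pairs copied from the missense rows of Table 1.
MISSENSE_ROWS = [
    ("C329T", "T36M"),
    ("A886G", "I222V"),
    ("G3586A", "G1122S"),
    ("T9092C", "I2957T"),
    ("C5092T", "R1624W"),
    ("G2501A", "R760H"),
    ("T979C", "F253L"),
    ("G9637T", "D3139Y"),
]


def codon_number(cdna_position, start=START_CODON_POSITION):
    """Return the 1-based codon that contains a 1-based cDNA position."""
    if cdna_position < start:
        raise ValueError("position lies upstream of the initiation codon")
    return (cdna_position - start) // 3 + 1


def parse_position(change):
    """Extract the first integer from a label such as 'C329T' or 'T36M'."""
    match = re.search(r"\d+", change)
    if match is None:
        raise ValueError(f"no position found in {change!r}")
    return int(match.group())


def check_rows(rows):
    """Compare the computed codon with the codon reported for each row."""
    return [
        (nt, aa, codon_number(parse_position(nt)) == parse_position(aa))
        for nt, aa in rows
    ]


if __name__ == "__main__":
    for nt, aa, consistent in check_rows(MISSENSE_ROWS):
        print(nt, aa, "consistent" if consistent else "inconsistent")
```

Insertions and deletions are left out of the check because the table does not state whether the reported codon is the one containing the listed nucleotide or the first codon whose reading frame changes.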