

Title:
HYBRID DNA SYNTHESIS OF EPIDERMAL GROWTH FACTOR
Document Type and Number:
WIPO Patent Application WO/1985/000369
Kind Code:
A1
Abstract:
DNA sequences and methods of obtaining DNA sequences which include a sequence encoding for mammalian epidermal growth factor. The DNA sequences may be used in cloning and expression vectors for production of DNA and RNA for producing polypeptides including mammalian epidermal growth factor.

Inventors:
BELL GRAEME I (US)
Application Number:
PCT/US1984/001050
Publication Date:
January 31, 1985
Filing Date:
July 02, 1984
Assignee:
CHIRON CORP (US)
International Classes:
C12N15/00; A61K38/00; C07H1/00; C07K14/485; C12N15/09; C12P21/02; (IPC1-7): C07H21/04; C12N1/00
Foreign References:
US4394443A 1983-07-19
EP0046039A1 1982-02-17
Other References:
SAVAGE et al., CHEMICAL ABSTRACTS, Volume 78, 1973, Abstract No. 39662c, J. Biol. Chem., Vol 247, 1972, pages 7612-7621
RUTTER et al., CHEMICAL ABSTRACTS, Volume 101, 1984, Abstract No. 18357k, Biochem. Clin. Aspects Neuropept., Synth., Process, Gene Struct., 1983, KOCH et al., (ED.), Academic Press, Orlando, FL, pages 293-308
SCOTT et al., Science, Volume 221, 1983 pages 236-240
GRAY et al., Nature, Volume 303, 1983, pages 722-725
HOUGHTON et al., Nucleic Acids Research Volume 8, 1980, pages 2885-2894
ULLRICH et al., CHEMICAL ABSTRACTS, Volume 101, 1984, Abstract No. 18356j, Biochem. Clin. Aspects Neuropept., Synth., Process, Gene Struct., 1983, KOCH et al., (ED.), Academic Press, Orlando, FL, pages 277-291
See also references of EP 0148922A4
Claims:
WHAT IS CLAIMED IS:
1. A mammalian DNA sequence having a portion encoding for EGF and including at least a portion of the flanking coding regions in reading phase therewith.
2. A DNA sequence according to Claim 1, wherein said mammal is mouse.
3. A DNA sequence according to Claim 1, wherein said mammal is human.
4. A DNA fragment of at least about 60 nucleotides being a portion of the DNA sequence according to Claim 1 and including a sequence encoding for other than EGF.
5. A functional episomal element comprising a replication system and a DNA sequence or fragment of at least 60 nucleotides thereof according to Claim 1.
6. A DNA sequence having fewer than 5000 base pairs and greater than 159 base pairs and comprising at least 50% of the nucleotides of the gene encoding for a polypeptide including the amino acid sequence of human EGF.
7. A DNA sequence according to Claim 6, having at least about 1000 nucleotides in an open reading frame.
8. A DNA sequence according to Claim 7, wherein said open reading frame includes a portion of said sequence encoding for EGF.
Description:
HYBRID DNA SYNTHESIS OF EPIDERMAL GROWTH FACTOR

BACKGROUND OF THE INVENTION

Field of the Invention

Epidermal growth factor (EGF) is a polypeptide of 53 amino acids that has been characterized in both mice and humans. It is a potent mitogen for a variety of cells, such as fibroblasts, glia, epithelial, endothelial and epidermal cells, both cultured and in vivo. EGF is also a potent inhibitor of gastric acid secretion. EGF was first isolated from male mouse submaxillary glands, where it exists in inexplicably high levels.

In glandular homogenates, EGF is found as a 74,000 dalton complex of two molecules of EGF (Mr 6045) and two molecules of a binding protein (Mr 29,300), a kallikrein-like arginyl enteropeptidase. The amino acid sequence of mouse submaxillary EGF has been determined, and the synthesis of EGF and of a larger 9000 dalton precursor with a carboxy terminal extension has been demonstrated in cultured submaxillary glands.

Human EGF, which appears to be similar if not identical to urogastrone, is also found in urine in larger forms of 28,000 and 30,000 daltons that do not dissociate on sodium dodecyl sulfate-polyacrylamide gel electrophoresis.

Isolating the DNA or RNA encoding for EGF, and particularly for a putative EGF precursor protein, is extremely difficult for a number of reasons. Even where the peptide is abundant, the amount of messenger RNA is extremely small. Immunoprecipitation of in vitro translation products, even under strongly denaturing conditions, fails to detect a precursor protein, possibly due to the huge size of the precursor and/or the masking of its antigenic determinants on the native peptide to which antibodies were made. Because of the physiological importance of EGF there is substantial interest in being able to obtain DNA sequences encoding for EGF and the EGF polypeptide precursor. In addition, since it is known that a number of hormones are generated by proteolytic processing from larger precursors, the cDNA and derived amino acid sequence of the EGF precursor could reveal "cryptic", previously unknown polypeptide hormones and/or growth factors.

Description of the Prior Art

A human genomic DNA library in bacteriophage λ is described in Lawn et al., Cell (1978) 15:1157-1174. Savage et al., J. Biol. Chem. (1972) 247:7612-7621 report the amino acid sequence of mouse EGF. Sporn et al., Science (1983) 219:1329-1331 and Assoian et al., J. Biol. Chem. (1983) 258:7155-7160 describe transforming growth factors (TGF). See particularly Gray et al., Nature (1983) 303:722-725.

SUMMARY OF THE INVENTION

DNA and RNA are provided encoding for mammalian EGF, polypeptide precursors thereof, and numerous other polypeptides also encoded for by the DNA sequence which includes the segment encoding for EGF. The DNA sequences may be used for production of mammalian EGF, precursors of mammalian EGF, and associated related polypeptides.

Employing radiolabeled hybridization probes, messenger RNA encoding for the mouse EGF precursor is detected, isolated and used for the production of cDNA. The cDNA is sequenced and a fragment employed for hybridization with human DNA under conditions where mismatched heterologous hybrids can be detected. In this manner, a DNA sequence encoding for a large precursor peptide encompassing human EGF or urogastrone and numerous related polypeptides is detected and isolated.

DESCRIPTION OF SPECIFIC EMBODIMENTS

In accordance with the subject invention, DNA and RNA sequences are provided which include a segment encoding for EGF, particularly mouse and human, as well as the expression products of these sequences, including propolypeptides and peptides, particularly peptides having one or more physiological functions of EGF, or peptides having one or more other hormonal or growth factor regulatory functions.

The DNA sequences of interest are single or double stranded, ranging from about 60 bases or base pairs (bp) to about 5 kilobase pairs (kbp), where sequences encoding for a physiologically active polypeptide will generally range from about 60bp to about 1000bp, which may include introns. Generally, the DNA sequences will have open reading frames (involving one or more exons) encoding for polypeptides ranging from about 20 amino acids to about 1000 amino acids, where the sequence will include at least 2, usually at least 5, more usually at least 10bp, outside the segment encoding EGF. Polypeptides of particular interest will generally be of from about 20 to 250 amino acids, more usually from about 20 to 175 amino acids, particularly 30 to 60 amino acids.

The DNA segments encoding for polypeptides of interest or mature polypeptides may be located in any region in the DNA sequences described in this invention. Of particular interest are sequences bordered by basic amino acids, i.e. arginine and lysine, more particularly when joined to a second basic amino acid, or alanine, leucine, aspartic or glutamic acid or amide thereof. Amongst these sequences of interest are seven previously unknown EGF-like polypeptides with amino acid sequences homologous to EGF. The DNA sequences obtained in accordance with this invention were obtained by the following experimental design.

A mammalian cDNA or genomic DNA library is screened with a plurality of radiolabeled hybridization probes for detection of a sequence encoding for an amino acid sequence present in EGF. A plurality of probes are employed providing for the various possible redundant codons encoding for the oligopeptide. In the subject method a cDNA library from mouse submaxillary gland cells was probed. Plasmids binding strongly to the probes are isolated and the several overlapping cDNA inserts sequenced. The mouse EGF encoding cDNA has about 4800 bases. Mouse EGF is encoded for by nucleotides 3281-3440±5, with an open reading frame encoding for 1217±5 amino acid residues and a protein of approximately 130-140 kilodaltons (kdal), particularly about 133kdal.
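The figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the mean residue mass of about 110 daltons is a common rule of thumb assumed here, not a value taken from this document.

```python
# Figures quoted in the text for the mouse EGF precursor cDNA.
EGF_RESIDUES = 53                        # mature mouse EGF
EGF_NT_START, EGF_NT_END = 3281, 3440    # quoted with a tolerance of +/- 5
PRECURSOR_RESIDUES = 1217                # quoted with a tolerance of +/- 5

# Assumption (not from the text): average mass of one amino acid residue.
MEAN_RESIDUE_DALTONS = 110


def coding_length(residues: int) -> int:
    """Nucleotides needed to encode `residues` amino acids (stop codon excluded)."""
    return residues * 3


def span_length(start: int, end: int) -> int:
    """Length of an inclusive, 1-based nucleotide interval."""
    return end - start + 1


def estimated_mass_kdal(residues: int) -> float:
    """Rough protein mass in kilodaltons from the residue count."""
    return residues * MEAN_RESIDUE_DALTONS / 1000


if __name__ == "__main__":
    print("EGF coding length:", coding_length(EGF_RESIDUES))
    print("Quoted EGF span:", span_length(EGF_NT_START, EGF_NT_END))
    print("Precursor coding length:", coding_length(PRECURSOR_RESIDUES))
    print("Estimated precursor mass (kdal):",
          round(estimated_mass_kdal(PRECURSOR_RESIDUES), 1))
```

Under these assumptions the quoted span is one nucleotide longer than the 159 needed for 53 residues, which is within the stated ±5, and the mass estimate falls inside the quoted 130-140 kdal range.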

The mouse cDNA may then be used to probe a human cDNA or genomic DNA bank. Conveniently a restriction fragment may be employed of about 500 to 1500bp. Particularly, a BstEII-PvuII fragment of about 1213±5bp, may be employed. The hybridization is carried out under conditions which facilitate the detection of mismatched, heterologous hybrids. The BstEII-PvuII fragment encodes mouse EGF (53 amino acids) and 286 amino acids before and 66 amino acids after the EGF-moiety. Clones which hybridized to the probe were isolated and the human DNA inserts characterized.
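As a consistency check on the fragment described above, the residue counts can be converted to base pairs and compared with the quoted fragment size. This is a minimal sketch with hypothetical helper names; it assumes the fragment consists essentially of coding sequence in a single reading frame.

```python
# Residue counts quoted for the BstEII-PvuII fragment.
UPSTREAM_RESIDUES = 286      # before the EGF moiety
EGF_RESIDUES = 53            # mouse EGF
DOWNSTREAM_RESIDUES = 66     # after the EGF moiety

QUOTED_FRAGMENT_BP = 1213    # quoted size of the fragment
QUOTED_TOLERANCE_BP = 5      # quoted tolerance


def encoded_residues() -> int:
    """Total residues encoded by the fragment according to the text."""
    return UPSTREAM_RESIDUES + EGF_RESIDUES + DOWNSTREAM_RESIDUES


def implied_base_pairs() -> int:
    """Base pairs implied by the residue count (three per residue)."""
    return 3 * encoded_residues()


def is_consistent() -> bool:
    """True when the implied size lies within the quoted tolerance."""
    return abs(implied_base_pairs() - QUOTED_FRAGMENT_BP) <= QUOTED_TOLERANCE_BP


if __name__ == "__main__":
    print(encoded_residues(), implied_base_pairs(), is_consistent())
```

The 405 residues imply 1215bp, which agrees with the quoted 1213±5bp.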

Once the DNA sequence is isolated, it can be used in a variety of ways: for production of synthetic DNA sequences, either in whole or in part, for replication, or for the production of messenger RNA or expression of the precursor protein incorporating EGF, fragments of such protein or of EGF, or analogs of EGF, differing by one or more amino acids, usually by not more than about five amino acids from the naturally occurring EGF amino acid sequence.

Various DNA sequences are of particular interest in encoding polypeptides which can be obtained from the cDNA sequence. These DNA sequences are set forth in the argument map in the experimental section, along with the human EGF sequence. The polypeptide sequences of interest include, but are not limited to, seven previously undescribed, EGF-like polypeptides identified on the basis of the homology of their amino acid sequences to EGF, especially the positional relationship(s) of the several cysteine residues. (See diagram in experimental section.) These sequences are frequently bounded by one or more basic amino acids.
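One way to look for segments bounded by basic residues is to scan a protein sequence, in one-letter code, for adjacent lysine (K) and arginine (R) residues. The sketch below is only an illustration of that idea: the function names and the toy sequence are hypothetical and are not taken from the precursor sequence in this document.

```python
import re


def dibasic_sites(protein: str) -> list[int]:
    """0-based start positions of adjacent basic pairs (KK, KR, RK, RR)."""
    # A lookahead is used so that overlapping pairs are all reported.
    return [m.start() for m in re.finditer(r"(?=[KR][KR])", protein)]


def segments_between_dibasic(protein: str, min_len: int = 1) -> list[str]:
    """Pieces of the sequence lying between runs of two or more basic residues."""
    pieces = re.split(r"[KR]{2,}", protein)
    return [p for p in pieces if len(p) >= min_len]


if __name__ == "__main__":
    toy = "AAAKRGGGGRRLLLKSSS"   # hypothetical sequence for illustration
    print(dibasic_sites(toy))
    print(segments_between_dibasic(toy))
```

A single basic residue, such as the lone K in the toy sequence, is deliberately not treated as a boundary here; relaxing the pattern to `[KR]+` would include such sites.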

Once the desired DNA sequence encoding for a protein or peptide of interest, e.g. EGF or its homologues, has been isolated, it can be joined with other DNA sequences for replication and expression. A wide variety of vectors are available for unicellular microorganisms, particularly for bacteria and fungi, where the DNA sequence encoding the poly(amino acid) of interest may be replicated and/or expressed.

Various hosts of interest include E. coli, S. cerevisiae, B. subtilis, mouse 3T3 cells, or the like. Conventional vectors include replication systems derived from R6-5, ColE1, the 2μm plasmid from yeast, RK plasmids, or the like. Alternative replication systems may be derived from viruses or phage, such as lambda, SV40, etc. In some instances, it will be desirable to have two different replication systems, where different functions may be achieved in different hosts. These vectors, referred to as shuttle vectors, frequently employ a replication system for E. coli and a replication system for a higher organism, e.g. yeast, so that amplification of the gene or cloning may be achieved in the bacterium, while expression may be achieved in the higher organism with appropriate processing, e.g. glycosylation.

Conveniently included with the replication system is at least one marker, which allows for selection or selective pressure to maintain the DNA construct containing the subject DNA sequence in the host. Convenient markers include biocidal resistance, e.g. antibiotics, heavy metals and toxins; complementation in an auxotrophic host; immunity; etc.

The DNA sequence including the fragment encoding for a polypeptide having epidermal growth factor physiological properties or fragments of such sequence may be replicated in a cloning vector, which is capable of replication in a unicellular microorganism, such as bacteria and yeast. The DNA may also be used in an expression vector for expression of a polypeptide of interest, which may be mammalian EGF, particularly mouse or human, other physiologically active polypeptides present in the sequence, e.g. other hormones or growth factors, fragments thereof or analogs thereof differing by from about one to five amino acids.

The open reading frame of the DNA sequence allows for the production of a large polypeptide. The large polypeptide may be treated in a variety of ways. It may be partially digested with a variety of proteases, either individually or in combination. Illustrative endopeptidases include trypsin, pepsin, membrane dipeptidases, esteropeptidases or the like. The resulting fragments may then be separated by charge and/or molecular weight by any conventional means, e.g. filtration, sedimentation, chromatography, electrophoresis, or the like and then tested for physiological activity. Of particular interest are growth factors acting as mitogens or differentiation regulators. Based on the activities observed, the various fractions may be further purified by bioassays to obtain pure active factors.

The DNA sequences of this invention can be used in a variety of ways. Fragments can be used as probes for detecting complementary sequences in genomic DNA or in messenger RNA for detecting mutations and/or deletions in genomic DNA of hosts. The sequences can be used for expressing the polypeptides encoded for by the sequence.

The following examples are offered by way of illustration and not by way of limitation:

EXPERIMENTAL

Methods

cDNA Synthesis and Construction of Recombinant Plasmids

To construct the cDNA library, polyA-containing RNA was isolated from the submaxillary glands of 60-day-old male Swiss-Webster mice. ds cDNAs were prepared and inserted into the PstI site of a pBR322 derivative using the dGdC tailing technique (Chirgwin et al., Biochemistry (1979) 18:5294-5299; Goodman and MacDonald, Methods in Enzymol. (1980) 68:75-90). Resultant tetracycline-resistant transformants of E. coli HB101 were stored at -70°C in microtiter dishes (Gergen et al., Nucl. Acids Res. (1979) 7:2115-2136; Ish-Horowicz and Burke, ibid. (1981) 9:2989-2998).

Oligonucleotides were then synthesized by solid-phase phosphoramidite methodology as described in copending application, Serial No. 457,412, followed by isolation from 20% acrylamide gels, a modification of the method described in Beaucage and Caruthers, Tetrahedron Lett. (1981) 22:1859-1862. Eicosamers were prepared which were complementary to the strand coding for amino acids 17 to 23 (lacking the last 5'-nucleotide) of mouse EGF cDNA. The fractions had the following sequences, where after the addition of the eleventh nucleotide, two pools were prepared, one terminating in A and the other terminating in G, and after addition of the seventeenth nucleotide, the two pools were further divided with addition of the eighteenth nucleotide, with two of the pools now terminating in G and the other two pools terminating in A. In this manner, a total of four pools were obtained, where each pool had a plurality of eicosamers of differing compositions at positions 3, 6 and 9.

[Probe sequence diagram: the mixed eicosamer written 3' to 5', positions numbered 1 to 20, with the alternative bases shown beneath the mixed positions.]

The different sequences are required because of the uncertainty as to the specific codon usage due to the redundancy of the genetic code for amino acids.
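The effect of codon redundancy on probe design can be illustrated by enumerating every coding sequence for a short peptide and writing out its complement. The sketch below assumes that residues 17 to 23 of mouse EGF are Gly-Gly-Val-Cys-Met-His-Ile, as read from the EGF sequence given later in this document; the function names are hypothetical, and the codon table is restricted to the amino acids needed.

```python
from itertools import product

# Standard genetic code entries (DNA alphabet, 5'->3') for the residues used here.
CODONS = {
    "Gly": ["GGA", "GGC", "GGG", "GGT"],
    "Val": ["GTA", "GTC", "GTG", "GTT"],
    "Cys": ["TGC", "TGT"],
    "Met": ["ATG"],
    "His": ["CAC", "CAT"],
    "Ile": ["ATA", "ATC", "ATT"],
}

# Assumed residues 17-23 of mouse EGF (see lead-in).
PEPTIDE = ("Gly", "Gly", "Val", "Cys", "Met", "His", "Ile")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def coding_variants(peptide, drop_3prime=1):
    """All distinct coding sequences, shortened by `drop_3prime` bases at the 3' end."""
    keep = 3 * len(peptide) - drop_3prime
    variants = {
        "".join(codons)[:keep]
        for codons in product(*(CODONS[aa] for aa in peptide))
    }
    return sorted(variants)


def probe_3_to_5(coding_5_to_3):
    """Complementary strand written 3'->5', aligned base for base with the coding strand."""
    return coding_5_to_3.translate(COMPLEMENT)


def mixed_positions(probes):
    """1-based positions at which the probes differ from one another."""
    return [i + 1 for i in range(len(probes[0])) if len({p[i] for p in probes}) > 1]


def split_into_pools(probes, positions=(12, 18)):
    """Group probes by the bases found at the given 1-based positions."""
    pools = {}
    for probe in probes:
        key = tuple(probe[p - 1] for p in positions)
        pools.setdefault(key, []).append(probe)
    return pools


if __name__ == "__main__":
    probes = [probe_3_to_5(c) for c in coding_variants(PEPTIDE)]
    print(len(probes), "probes; mixed positions:", mixed_positions(probes))
    for key, members in sorted(split_into_pools(probes).items()):
        print(key, len(members))
```

Under this assumption the enumeration gives 20-nucleotide probes that vary at positions 3, 6, 9, 12 and 18, and splitting on positions 12 and 18 yields four pools, matching the pooling scheme described above. The codons UGC and CAU that appear for Cys and His in the mRNA listing would pair with G at position 12 and A at position 18.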

Synthetic oligonucleotides were labeled with adenosine 5'-(γ-³²P) triphosphate (ICN, crude preparation, 7000Ci/mmol, 1Ci=3.7x10¹⁰Bq) by a polynucleotide kinase reaction (Wallace et al., Nucl. Acids Res. (1981) 9:879-894). The labeled oligonucleotides were separated from unincorporated (γ-³²P) triphosphates by chromatography on a C-18 Sep-Pak™ column (Waters Associates, Inc.) as follows: the crude labeling mixture was applied (disposable syringe) to the Sep-Pak cartridge, which was then washed with 20ml of water to elute the unincorporated adenosine 5'-(γ-³²P) triphosphate. The radiolabeled oligonucleotide was then eluted with 1:1 (v/v) methanol:0.1M triethylammonium acetate (pH 7.3) and the eluate evaporated to dryness. The specific activity of the probe was of the order of 10⁸-10⁹ cpm/μg.

Transformants were grown on Whatman 541 filter paper, the plasmids amplified in situ with chloramphenicol and the DNA immobilized on the filters (Gergen et al., supra). ³²P end-labeled probes were used to search the library. Additional screening was with nick translated cloned cDNA, with the same filters being used repeatedly.

The sequences of cDNA inserts were determined by the Maxam and Gilbert method.

mRNA Size Estimation

Glyoxylated total RNA from male and female mouse submaxillary glands was separated on 2% agarose gels, transferred to nitrocellulose and hybridized with a nick translated ³²P-labeled PstI-PstI fragment of the EGF cDNA insert. After washing, the RNA was autoradiographed at -70°C using an intensifying screen. Glyoxylated HindIII λ and ΦX174 RF HaeIII digested DNA fragments were used as size markers.

Results

Screening of the cDNA Library

5000 transformants were initially screened, where 11 colonies yielded strong signals with pool 4 probes, the pools having the nucleotides at positions 12 and 18 of G and A, respectively. Weaker, but definitely positive signals were obtained with pool 3 which had the nucleotides G and G, respectively. Pools 1 and 2 gave no positive signals. The largest clone was 1800bp. Terminal restriction and other fragments of this clone were used to screen the original 5000 plus 7500 additional colonies (12,500 total) and yielded additional overlapping cDNA colonies which did not contain the EGF sequence. Since it was subsequently determined by DNA sequence analysis (vide infra) that these overlapping clones lacked the 5'-terminal region of the RNA, another cDNA library was synthesized using an oligonucleotide primer complementary to nucleotides 1032-1051 (see argument map, infra) as follows:

3'-CCGCTTCCTTCGGTGCGAAT-5'

and this library was then screened as described above. The relative abundance of the cDNA clones in the initial library suggests that EGF mRNA comprises about 0.2% of the polyA mRNA from this tissue.

mRNA Sequence

The size of mouse EGF mRNA was determined by Northern analysis of mRNA from adult male and female glands to be about the same size as 28S ribosomal RNA, approximately 4800 bases. The mRNA in the male gland was at least ten-fold greater in abundance than in the female gland. The nucleotide sequence of overlapping cDNA clones provided 4750bp of sequence as follows:

AAAAAAGCAGAAGGGAUUCCUAUCUGUΛUΛUΛGGCΛAGGAAUCCUΛUCUGCÎ ›UAUUUCGUUGUUΛGCΛCCAUCCCUCAUCCCGGUGCGCUUGCAACUUUCCAUCAAUUC UUUCCUGUCU

CGUUUCUCUUUCAUCCUUUGCCUGCUUCUCCCUGUCUCAGGGAGAAAUCACUCACCU GCAGGCCUUGCΛGGCCUCUUACGCUCUGGGAAAUUUCUCAUACCGCUCUCΛGGUACUU CUUA

1

Mot

UUGCU Q UCCAAAGCGAAAAAAAAAGUGACACAAAGAACUCUCCCGGAGCCUUUCCGGCUG CACUCAGAGGCUCUCGAGAGCUGCACCAGGACCUGGAAACCCACCUAAAUAAAAG AUG

A 10 20 The 30

Pro Trp Gly Arg Arg Pro Thr Trp Leu Lou Leu Ala Phe Leu Leu Val Phe Leu Lya lie ' Ser lie Leu Ser Val Thr Ala Trp Gin Thr CCC UGG CGC CGA AGG CCA ACC UGG UUG UUG CUC GCC UUC CUG CUG GUG UUU UUA ΛAG AUU AGC AUA CUC ACC GUC ACA CCA UGG CAG ACC

C U

40 50 60

Cly Asn Cya Gin Pro, Cly Pro Leu Glu Arg Ser Glu Arg Ser Gly Thr Cya Ala Gly Pro Ala Pro Phe Leu Val Phe Ser Gin Gly Lya CGG AAC UGU CAG CCA CGU CCU CUC CAG .AGA AGC GAG AGA AGC GGG ACU UCU CCC CGU CCU CCC CCC UUC CUA GUU UUC UCA CAA CCA AΛG f'3 70 ( ' ' 80 90

<!ÏŠ Ser lie Ser Arg lie Asp Pro Aap Cly Thr Aan Ilia Cln Cln Lou Val Val Aap Ala Gly lie Ser Ala A 3 Mot Asp He Ilia Tyr Lya

«... ACC AUC UCU CGG AUU GAC CCA CAU CGA ACA AAU CAC CAG CAA UUG CUG GUG GAU CCU CGC AUC UCA GCA CAC AUG GAU AUU CAU UAU AAA

•' 100 110 120

Lya Glu Arg Leu Tyr Trp Val Asp Val Glu Arg Gin Val Leu Leu Arg Val Phe Leu Aan Cly Thr Cly Leu Clu Lys Val Cy 3 Asn Val AAA CAG AGA CUC UAU UGG GUG CAU GUA CAA AGA CAA GUU UUG CUA AGA CUU UUC CUU AAC GGG ACA GCA CUA GAG AAA CUG UGC AAU GUA

130- 140 * 150

Glu Arg Lya Vαl Ser Gly Leu Ala He Aap Trp He Asp Asp Glu Val Leu Trp Val Asp Cln Gin Asn Gly Val He Thr Val Thr Asp CAG AGG AAG CUG UCU CGG CUG CCC AUA CAC UGG AUA GAU GAU CAA GUU CUC UGG CUA GAC CAA CAG AAC GGA GUC AUC ACC GUA ACA CAU

160 170 Aan 100

Hot Thr Cly Lys Aan Sor Arg Vnl Lou I.ou Sor Sor Lou l.yo Ilia Pro Sor Ann I o Ala Vnl Aop Pro Ho Clu Arg ou Hot rim Trp AUG ACA GGG AAA AAU UCC CGA GUU CUU CUA ACU UCC UUA AAA CAU CCU UCA AAU AUA CCA CUG GAU CCA AUA CAG AGG UUG AUU UUU UGG

A 190 .200 210

Sor Ser Glu Vnl Thr Gly Ser Leu Ilia Arg Λlft Ilia Lou Lya Gly Val Asp Vnl Lya Thr Leu ou Glu Thr Cly Gly lie Ser Vnl Lou UCU UCA CAG CUG ACC GCC AGC CUU CAC ACA CCA CAC CUC AAA CCU GUU GAU CUA AAA ACA CUG CUG CAG ACA CGG GGA AUA UCC CUC CUG

Cly CGU

Lys AAA

Leu CUG

310 320 , • ,..' 330

Mot Vnl Vnl Ilia Pro Arg Ala Gin Pro Arg Thr Clu Aop Aln Λlα Ly 0 Aap ro Asp Pro Clu Leu Leu Lya Cln Arg Cly Arg Pro Cy3 AUG GUA CUA CAC CCU CGU GCA CAG CCC ACG ACA CAG GAC. CCU CCU AAG GAU CCU GAC CCC CAA CUU CUC AAA CAG ACG CGA ACA CCA UCC 1346

340 350 360

Arg Pho Gly Leu Cya Glu Arg Asp Pro Lya Ser Ilia Ser Ser Aln Cys Ala Glu, Gly Tyr Thr Leu Ser Arg Aap Ar Lya Tyr Cya Glu CGC UUC CGU CUC UGU GAC qOΛ GAC CCC ΛΛO UCC CAC UCC AGC GCA UCC GCU CAG CCC UAC ACG UUA ACC CGA CAC CGG AAC UAC UCC CAA 143

370 , 300 . 390

Aap Val Aan Glu Cys Ala Thr Cln Asn His Gly Cya Thr Leu Cly Cya Clu Aan Thr Pro Gly Ser Tyr Ilia Cya Thr Cys Pro Thr Cly CAU CUC AAU CAA UGU GCC ACU CAG AAU CAC CCC UCU ACU CUU CCC UCU CAA AAC ACC CCU CGA UCC UAU CAC UGC ACA UGC CCC ACA CGA 152

400 ; 410 420

Phe Val Leu Lou Pro Asp Gly Lys Gin Cya Ilia Glu Lou Vnl Sor Cya Pro Gly Aan Val Sor Lya Cya Ser Ilia Cly Cya Val Leu Thr UUU GUU CUG CUU CCU GAU GCG AAA CAA UGU CAC GΛA CUU CUU UCC UCC CCA GCC AAC GUA UCA AAG UGC AGU CAU CGC UGU CUC CUC ACA 16) 0 •430 • 440 450

Ser Asp Gly Pro Arg Cya He Cya Pro Ala Cly Ser Val Leu Cly Arg Aap Cly Lya Thr Cya Thr Gly Cya Ser Ser Pro Asp Asn Cly UCA CAU CGU CCC CGC UCC AUC UGU CCU CCA CGU UCA CUG CUU CCC ACA CAU CGG AΛG ACU UGC ACU CGU UCU UCA UCC CCU CAC AAU CCU 1706

.. 460- 470 400

Gly Cya Ser Cln Ho Cya Leu Pro Leu Arg Pro Cly Ser Trp Glu Cya Asp Cya Pho Pro Gly Tyr Aap Leu Cln Ser Asp Arg Lya Ser CGA UGC AGC CAG AUC UGU CUU CCU CUC 'AGO CCA GCA UCC UCG GAA UGU CAU UGC UUU CCU GCC UAU CAC CUA CAG UCA CAC OCA AAC AGC 1796

490 • 500 • . io

Cya Ala Ala Ser Cly Pro Gin Pro Lou Lou Leu Pho Ala Aan Ser Gin Aap He Arg Hia Met Ilia Phe Asp Gly Thr Asp Tyr Lya Val φ UCU CCA GCU UCA CGA CCA CAG CCA CUU UUA CUG UUU CCA AAU UCC CAG CAC AUC CGA CAC AUG CAU UUU CAU CCA ACA GAC UAC AAA CUU 1006

520 530 540

Leu Leu Ser Arg Gin Mot Gly Mot Vnl Phe Ala Lou Aap Tyr Aβp Pro Vnl Clu Sor Lya He Tyr Pha Ala Cln Thr Ala Leu Lya Trp pi CUG CUC AGO CGC CAC AUO GGA AUG CUU UUU CCC UUG CAU UAU GAC CCU CUG GΛA AGC AAG AUA UAU UUU CCA CAG ACA CCC CUG AAG UGG 1976

Hi

550 , 560 570

He Glu Arg Aln Aan Met Asp Gly Sor Gin Arg Clu Arg Leu He Thr Glu Cly Vnl Asp Thr Leu Clu Cly Leu Ala Leu Asp Trp He AUA CAG AGG CCU AAU AUG CAU CGC UCC CAG CGA GAA AGA CUG AUC ACA GΛA CCA CUA GAU ACG CUU GAA GCU CUU CCC CUC GAC UCC AUU 2066 00 590 600

Gly Arg Arg He Tyr Trp Thr Aap Ser Cly Lya Ser Vnl Vnl Gly Gly Ser Asp Leu Ser Gly Lya Ilia Ilia Arg He He He Cln Clu

CCC CCG ACA AUC UAC UGG ACA GAC AGU CGG AAG UCU GUU GUU CGA GGG AGO GAU " CUG AGC GGG AAG CAU CAU CGA AUA AUC AUC CAC CAG 2156

610 620 630

Arg Ho Ser Arg Pro Arg Gly He Ala Val Hia Pro Arg Ala Arg Arg Leu Phe Trp Thr Aap Val Gly Met Ser Pro Arg He Clu Ser

AGA AUC UCG AGG CCG CCA CCA AUA GCU GUC CAU CCA AGG CCC ACG AGA CUC UUC UCC ACG CAC CUA CGG AUG UCU CCA CCG AUU GAA AGC 224

640 650 . 660

Ala Ser Leu Cln Cly Ser Aap Arg Vnl Leu Ho Ala Ser Ser Aan Leu Leu Glu Pro Ser Cly He Thr He Aap Tyr Leu Thr Asp Thr

JO GCU UCC CUU CAA GCU UCC CAC CCG CUG CUG AUA GCC AGC UCC AAU CUA CUG GAA CCC ACU CGA AUC ACG AUU CAC UAC UUA ACA GAC ACU 233

*_ 2

670 680 . ., . . 690

Leu Tyr Trp Cya Asp Thr Lya Arg Ser Val He Glu Mat Ala Asn Leu Aap Cly Ser Lya Arg Arg Arg Leu He Gin Aan Asp Val Cly UUG UAC UGG UGU CAC ACC AΛG AGC UCU CUG AUU CAA AUG GCC AAU CUG GAU CGC UCC AAA CGC CGA AGA CUU AUC CAG AAC GAC GUA CGU 24

700 710 720

His Pro Phe Ser Leu Ala Val Pho Glu Aap Ilia Leu Trp Val Ser Aap Trp Ala He Pro Ser Val He Arg Val Asn Lys Arg Thr Gly CAC CCC UUC UCU CUA CCC GUG UUU CAG CAU CAC CUG UGG CUC UCG CAU UCG CCU AUC CCA UCG CUA AUA AGG CUC AAC AAG ACG ACU GGC 25

730 740 750

Gin Asn Arg Val Arg Leu Gin Gly Ser Mat Leu Lya Pro Ser Ser Leu Val Val Val Ilia Pro Leu Ala Lya Pro Gly Ala Asp Pro Cya CAA AAC AGG CUA CGU CUU CAA CGC ACC AUG CUC AΛG CCC UCG UCA CUG GUU GUG GUC CAU CCA UUG CCA AAA CCA CGU CCA GAU CCC UCC 26

760 ' 770 " 780

Lou Tyr Arg Asn Cly Gly Cya Glu Ilia Ho Cys Gin Clu Ser Leu Gly Thr Ala Λrg Cya Leu Cya Arg Glu Cly Pho Val Lya Ala Trp UUA UAC AGG AAU CGA GGC UGU CAA CAC AUC UCC CAA GAG AGC CUG GGC ACA GCU CCG UGU UUG UGU CGU CAA CGU UUU CUG AAG CCC UGG 26

790 000 810

Asp Gly Lys Met Cys Leu Pro Gin Aap Tyr Pro lie Leu Ser Cly Glu Asn Ala Aap Leu Ser Lys Clu Val Thr Sor Leu Ser Asn Ser

GAU GGG AAA AUG UGU CUC CCU CAG GAU UAU CCA AUC CUG UCA'CGU GΛA AAU CCU CAU CUU AGU AAA CAG GUG ACA UCA CUG AGC AAC UCC 27

• ' • 820 030 840

Thr Gin Aln Clu Val Pro Asp Aap Asp Gly Thr Glu Ser Ser Thr Leu Val Ala Glu He Met Val Ser Cly Met Asn Tyr Glu Asp Asp

ACU CAG GCU CAA GUA CCA GAC GAU GAU CGG ACA GAA UCU UCC ACA CUA GUG CCU GAA AUC AUG GUG UCA CGC AUG AAC UAU GAA GAU CAC 20

850 â–  860 070 " Cys Cl 'Pro Cly Gly Cya Cly Sor Ilia Ala Arg Cya Vnl Ser Asp Gly Clu Thr Ala Glu Cya "Gin Cys Leu Lys Cly Phe /la Arg Asp i UCU CGU CCC CGG CGG UGU GGA AGC CAU GCU CGA UGC GUU UCA CAC CGA CAG ACU GCU GAG UGU CAG UGU CUG AAA GGG UUU GCC AGG GAU 29 I 800 890 900

Cly Aan Leu Cya Ser Aap He Aap Glu Cys Val Leu Ala Arg Ser Asp Cya Pro Ser Thr Ser'Ser Arg Cya He Aan Thr Glu Gly Gly CGA AAC CUG UGU UCU GAU AUA GAU CAG UGU GUC CUG GCU AGA UCG CAC UGC CCC AGC ACC UCG UCC AGG UGC AUC AAC ACU CAA CCU GCC 30

910 920 • 930

Tyr Vnl Cya Arg Cys Ser Glu Gly Tyr Glu Gly Aap Gly He Ser Cya Phe Asp He Aap Glu Cys Cln Arg Cly Ala Ilia Aan Cya Ala UAC CUC UCC AGA UGC UCA CAA CCC UAC CAA CGA CAC GGG AUC UCC UCU UUC CΛU AUU GΛC GAG UGC CAC CGC CGC CCG CΛC AAC UCC GCU 3»

940 950 960

Glu Aan Ala Ala Cys Thr Asn Thr Clu Cly Gly Tyr Aan Cya Thr Cya Ala Cly Λrg Pro Ser Ser Pro Cly Arg Ser Cys Pro Asp Ser GAG AAU CCC GCC UGC ACC AAC ACC CAG GGA GGC UAC AAC UCC ACC UGC CCA CGC CCC CCA UCC UCG CCC CGA CGG AGU UGC CCU GAC UCU 3

Human Gene GAC TCT Aap Ser

970 Epidermal- Growth Factor 990

Thr Ala Pro Ser Leu Leu Gly Glu Asp Gly His Ilia Leu Asp ::: Arg Asn Ser Tyr Pro Gly Cya Pro Ser Ser Tyr Asp Cly Tyr Cya Leu

ACC CCA CCC UCU CUC CUU GGG GAA GAU GGC CΛC CAU UUG GΛC ::: CGA AAU ACU UΛU CCA CGA UGC CCA UCC UCA UAU CAU CCA UAC UGC CUC

ACT CCA CCC CCG CAC CTC ACG GAA CAT CAC CAC CAC TAT TCC GTA AGA AAT ACT GΛC TCT CΛA TGT CCC CTG TCC CAC CAT CCG TΛC TCC CTC

Thr Pro Pro Pro Ilia Leu Arg Glu Asp Aop Ilia Ilia Tyr Ser Val Arg Aan Ser Aap Ser Glu Cys Pro Lou Ser Hi3 Aβ Gly Tyr Cya Leu

1000 1010 1020

Aan Gly Cly Vnl Cys Met Ilia Ho Clu Ser Leu Aap Sor Tyr Thr Cya Aan Cya Vnl He Gly ' Tyr Ser Cly Aap Arg Cya Cln Thr Arg

AAU GCU GGC CUG UGC AUG CAU AUU GAA UCA CUG CΛC ACC UAC ACA UGC AΛC UCU CUU ΛUU GGC UΛU UCU GCG GAU CGA UGU CΛC ACU CGA 341

CAT CAT GGT GTG TGC ATG TAT ATT CAA CCA TTG GΛC AΛG ' TΛT CCA TGC AAC TGT GTT GTT GGC TAC ATC CGG CAG CGA TCT CAG TAC CGA

Ilia Aap Gly Val Cya Met Tyr He Glu Ala Leu Aap Lya Tyr Ala Cya Aan Cya Vnl Val Gly Tyr He Cly Glu Arg Cya Gin Tyr Arg

1030 1040 1050

Asp Leu Arg Trp Trp Glu Lou Arg His Ala Cly Tyr Cly Cln Lya Ilia Λap He Mot Val Val Aln Val Cys Met Vnl Ala Leu Val Leu

OAC CUA CGA UGG UGG CAG CUG CGU CAU CCU GGC UAC CCG CΛG AΛG CAU GΛC ΛUC AUG CUG GUG CCU GUC UCC AUG CUG GCA CUG CUC CUG 350

CAC CTG AΛG TCO TGG CAA CTG CGC CAC CCT CCC CΛC CGG CAG CΛC CΛC AΛG GTC ATC GTG CTC CCT CTC TCC CTG CTC CTC CTT CTC ATG

Aap Lou Lya Trp Trp Glu Leu Arg Ilia Ala Gly Ilia Gly Gin Gin Gin Lya Val He Val Val Ala Vnl Cya Vnl Val Val Leu Val Met

1060 1070 ' 1000

Lou Lou Leu Lou Cly Met Trp Cly Thr Tyr Tyr Tyr Arg Thr Arg Lya Gin Lou Sor Ann Pro Pro Lya Aan Pro Cya Asp Clu Pro Ser

CUG CUC CUC UUG GCG ΛUG UGG GGG ACU UAC UAC UAC ACG ACU CGG AΛG CΛC CUA UCA AAC CCC CCA AAG AAC CCU UGU CAU CΛC CCΛ ACC 359 CTG CTC CTC CTG ACC CTG TCG GGG GCC CAC TAC TAC ACG Lou Lou Lou Lou Sor Leu Trp Cly Ala Ilia Tyr Tyr Arg ,

1090 , 1100 1110

Gly Ser Vol Ser Ser Ser Gly Pro Aap Ser Ser Ser Gly Ala Ala Val Ala Ser Cyβ Pro Gin Pro Trp Pho Vol Val Leu Clu Lys Ilia CGA AGU GUG AGC AGC ACC GGG CCC GAC AGC AGC AGC CCG QCA CCU CUG GCU UCU UGU CCC CAA CCU UGG UUU CUG GUC CUA GAG AAA CAC 360

' * ' 1120 ' 1130 1140

Cln Asp Pro Lya Aan Cly Ser Leu Pro " Aln Aap Cly Thr Asn Gly Ala Val Vnl Asp Ala Gly Leu Ser Pro Ser Leu Cln Leu Gly Ser CAA CAC CCC AAG AAU CGG AGU CUG CCU CCG CAU CGU ACG AAU CGU GCA CUA CUA CAU CCU CGC CUG UCU CCC UCC CUG CAG CUC CGG UCA 377

1150 1160 1170

Vol Ilia Leu Thr Ser Trp Arg Gin Lya Pro Hia He Aap Gly Met Gly Thr Gly Gin Ser Cya Trp He Pro Pro Ser Ser Asp Arg Gly GUG CAU CUG ACU UCA UGG AGA ' CAG AΛG CCC CAC AUA GAU GGA AUG CCC ACA CCG CAA ΛGC UGC UGG AUU CCA CCA UCA AGU CAC AGA GGA 306

1100 1190 1200

Pro Gin Glu Ho Glu Gly Aan Ser Ilia Lou Pro Sor Tyr Arg Pro Vnl Gly Pro Clu Lya Lou Ilia Ser Leu Cln Ser Ala Asn Cly Ser CCC CAG GAA AUA CΛG CGA AAC UCC CΛC CUA CCC UCC UAC ACA CCU GUC GGG CCG GAC AAG CUG CAU UCU CUC CAG UCA GCU AAU GGA UCG 395

1210 1217

Cya Ilia Glu Arg Ala Pro Λap Lou Pro Arg Gin Thr Glu Pro Val Lya ΛM

UGU CAC CAA AGC CCU CCA GΛC CUG CCA CCG CAG ACA CΛG CCA CUU AAG UAC ΛAΛCUCCCΛGUACΛCΛCAΛCCUACΛCΛACCC ' AΛAAUΛΛCΛAACCACCCUCAUGA 406

UGGUΛGAGUGCUACΛGACUUGGUACUCCACUUUCCACCCCUΛΛUCACUGCUCG CUCAGCCUCCUGΛAGΛUΛCCUGCACAGCUCCAGACCUCCΛCΛCCCGGAUACCUCC GACUUUUCCUUC 410

UIIGCUUUAACCΛGUUCCACUGAΛGAUACUCAAΛΛCAGΛAGUGGAGΛΛΛ AUCAUUAGΛΛΛCCAΛΛCUCAΛGACAUUCAUAUAUAACCUCUGUCUUCUUCACUG GACCGUUUGCCUCUUUUC 430


This includes the exact 53 amino acid residue sequence of mouse EGF (nucleotides 3281-3440), a translational start codon AUG (nucleotides 354-356) and a stop codon TAG (nucleotides 4005-4007). An open reading frame extends between these codons, encoding 1217 amino acid residues and a protein of approximately 133 kdal. Also, seven additional EGF-like polypeptides are identified on the basis of the homology of their amino acid sequences to EGF, especially the positional relationship of their cysteine residues, as shown below:

Arg Lys Tyr Cys Glu Asp Val Asn Glu Cys Ala Thr Gln Asn His Gly Cys Thr Gln Cys His Glu Leu Val Ser Cys Pro Gly Asn Val Ser Lys Cys Ser

Thr Cys Thr Gly Cys Ser Ser Pro Asp Asn Gly Gly Cys Ser Gln

Lys Pro Gly Ala Asp Pro Cys Leu Tyr Arg Asn Gly Gly Cys Glu Met Val Ser Gly Met Asn Tyr Glu Asp Asp Cys Gly Pro Gly Gly Cys Gly Ser His

Ser Asp Cys Pro Ser

Gly Ala His Asn Cys Ala Glu Asn

Asn Ser Tyr Pro Gly Cys Pro Ser Ser Tyr Asp Gly Tyr Cys Leu Asn Gly Gly Val

Leu Gly Cys Glu Asn Thr Pro Gly Ser Tyr His Cys Thr Cys Pro Thr Gly Phe Val Leu His Gly Cys Val Leu Thr Ser Asp Gly Pro Arg Cys Ile Cys Pro Ala Gly Ser Val Leu

Ile Cys Leu Pro Leu Arg Pro Gly Ser Trp Glu Cys Asp Cys Phe Pro Gly Tyr Asp Leu

His Ile Cys Gln Glu Ser Leu Gly Thr Ala Arg Cys Leu Cys Arg Glu Gly Phe Val Lys

Ala Arg Cys Val Ser Asp Gly Glu Thr Ala Glu Cys Gln Cys Leu Lys Gly Phe Ala Arg

Ser Arg Cys Ile Asn Thr Glu Gly Gly Tyr Val Cys Arg Cys Ser Glu Gly Tyr Glu Gly

Ala Ala Cys Thr Asn Thr Glu Gly Gly Tyr Asn Cys Thr Cys Ala Gly Arg Pro Ser Ser Cys Met His Ile Glu Ser Leu Asp Ser Tyr Thr Cys Asn Cys Val Ile Gly Tyr Ser Gly

Asp Gly Asn Leu Cys Ser Asp Ile Asp Glu Cys Val Leu Ala Arg Asp Gly Ile Ser Cys Phe Asp Ile Asp Glu Cys Gln Arg Pro Gly Arg Ser Cys Pro Asp Ser Thr Ala Pro Ser Leu Leu Gly Glu Asp Gly His His Leu Asp Arg

Asp Arg Cys Gln Thr Arg Asp Leu Arg Trp Trp Glu Leu Arg
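The positional relationship of the cysteine residues can be made concrete by listing their positions and the number of residues between consecutive cysteines. The sketch below applies this to the 53-residue mouse EGF sequence; the one-letter string is a transcription, made for this example, of the three-letter EGF sequence shown above, and the function names are hypothetical.

```python
# Mouse EGF in one-letter code, transcribed from the three-letter sequence above.
MOUSE_EGF = (
    "NSYPGCPSSYDGYCLNGGV"      # residues 1-19
    "CMHIESLDSYTCNCVIGYSG"     # residues 20-39
    "DRCQTRDLRWWELR"           # residues 40-53
)


def cysteine_positions(protein: str) -> list[int]:
    """1-based positions of every cysteine (C) in the sequence."""
    return [i + 1 for i, residue in enumerate(protein) if residue == "C"]


def spacings(positions: list[int]) -> list[int]:
    """Number of residues lying between each pair of consecutive positions."""
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


if __name__ == "__main__":
    positions = cysteine_positions(MOUSE_EGF)
    print(len(MOUSE_EGF), positions, spacings(positions))
```

Applying the same two functions to a candidate EGF-like segment and comparing the spacing lists gives a simple, if crude, measure of how closely its cysteine pattern follows that of EGF.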


Human EGF Gene

A ³²P-labeled (O'Farrell, Focus (1981) 3:1) 1,213bp BstEII-PvuII fragment of mouse submaxillary EGF cDNA clone, pmEGF10, was hybridized to a human genomic DNA library (Lawn et al., Cell (1978) 15:1157-1174) in bacteriophage λ (available from Dr. T. Maniatis, Harvard University) using conditions which facilitate the detection of mismatched, heterologous hybrids. The BstEII-PvuII fragment of pmEGF10 encoded mouse EGF (53 amino acids) and 286 amino acids amino terminal to and 66 amino acids carboxy terminal to the EGF moiety. The hybridization conditions were 50% formamide, 5X SSC, 10% dextran sulfate, 20mM sodium phosphate, pH 6.5, 100μg/ml sonicated, denatured salmon testes DNA, and 0.1% sodium dodecyl sulfate at 30°C (Wahl et al., Proc. Natl. Acad. Sci. USA (1979) 76:3683-3687). The filters were washed for one hour at 50°C in 1M NaCl (Perler et al., Cell (1980) 20:555-566) before autoradiography. Four of the approximately 10 phage screened hybridized to the probe. Characterization of the human DNA inserts in these phage indicated that they represented overlapping DNA segments from the same region of the human genome. The partial sequence of the human DNA in λhEGF35, corresponding to the exons encoding EGF or urogastrone, indicated that these phage contained portions of the human EGF gene.

The human EGF gene was sequenced and the mouse and human sequences compared. The amino acid sequences of EGF from the two species are described by Carpenter, In: Tissue Growth Factors, Handbook of Experimental Pharmacology, R. Baserga (ed.), Vol. 57, Springer-Verlag, Berlin, 1981, p. 94.

In accordance with the subject invention, polynucleotide sequences are provided which encode for a large polypeptide which includes the amino acid sequence of EGF. The large polypeptide can be used as a source of polypeptides having physiological activity.


In particular, seven additional EGF-like polypeptides are identified. The DNA sequences can be used for production of the large polypeptide or fragments thereof by employing recombinant DNA technology and inserting the polypeptide sequence downstream from an appropriate promoter in a functioning episomal element. The episomal element may then be introduced into an appropriate host for replication and expression of the desired polypeptide. Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be obvious that certain changes and modifications may be practiced within the scope of the appended claims.