Title:
NON-REDUNDANT GENE CLUSTERING METHOD AND SYSTEM, AND ELECTRONIC DEVICE
Document Type and Number:
WIPO Patent Application WO/2020/211466
Kind Code:
A1
Abstract:
The present application relates to a non-redundant gene clustering method and system, and an electronic device. The method comprises: step a: performing an alignment operation on an original gene set to acquire the gene pairs in that set that meet a similarity threshold; step b: constructing a union-find forest on the basis of the acquired gene pairs; step c: obtaining the gene clustering results for all classes in the original gene set on the basis of the union-find forest; and step d: on the basis of the gene clustering results, selecting the longest sequence in each class as the representative sequence of that class in order to obtain a non-redundant reference genome. By using BLAT alignment and a union-find data structure to cluster the gene set, the present application takes the similarity between more genes into account and improves the precision of redundancy removal.
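
Steps b to d of the abstract can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the function names (find, cluster_genes, pick_representatives) and the toy sequences are hypothetical, and the list of similar pairs is assumed to have been produced and filtered beforehand by the alignment of step a (BLAT in the abstract), which is not reproduced here.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def cluster_genes(sequences, similar_pairs):
    """Steps b and c: build a union-find forest, then read off the classes."""
    parent = {gene: gene for gene in sequences}  # every gene starts as its own tree
    for a, b in similar_pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two trees
    clusters = defaultdict(list)
    for gene in sequences:
        clusters[find(parent, gene)].append(gene)
    return list(clusters.values())


def pick_representatives(sequences, clusters):
    """Step d: keep the longest sequence of each class."""
    return [max(cluster, key=lambda g: len(sequences[g])) for cluster in clusters]


# Toy input (hypothetical): gene id -> sequence, plus pairs assumed to have
# passed the similarity threshold in the alignment step.
sequences = {
    "g1": "ATGCATGC",
    "g2": "ATGCATGCAA",
    "g3": "ATGCAT",
    "g4": "GGGTTTCCC",
    "g5": "GGGTTT",
}
similar_pairs = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]

clusters = cluster_genes(sequences, similar_pairs)
representatives = pick_representatives(sequences, clusters)
```

In this toy input g1 and g3 are never paired directly, yet they fall into the same class because both are linked to g2. This transitive merging is what a union-find forest provides.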

Inventors:
ZHENG ZHICHUN (CN)
GUO NING (CN)
WEI YANJIE (CN)
Application Number:
PCT/CN2019/130563
Publication Date:
October 22, 2020
Filing Date:
December 31, 2019
Assignee:
SHENZHEN INST OF ADV TECH CAS (CN)
International Classes:
G16B40/00
Domestic Patent References:
WO2018186740A1    2018-10-11
Foreign References:
CN108846259A    2018-11-20
CN108197434A    2018-06-22
CN110060740A    2019-07-26
Other References:
HOBOHM U, SCHARF M, SCHNEIDER R ET AL.: "Selection of representative protein data sets", [J]. PROTEIN SCIENCE, vol. 1, no. 3, 2010, pages 409 - 417
HOBOHM U, SANDER C: "Enlarged representative set of protein structures", [J]. PROTEIN SCIENCE, vol. 3, no. 3, 2010, pages 522 - 524
HOLM L, SANDER C: "Removing near-neighbour redundancy from large protein sequence collections", [J]. BIOINFORMATICS, vol. 14, no. 5, 1998, pages 423 - 429, XP003000272, DOI: 10.1093/bioinformatics/14.5.423
LI W, JAROSZEWSKI L, GODZIK A: "Clustering of highly homologous sequences to reduce the size of large protein databases", [J]. BIOINFORMATICS, vol. 17, no. 3, 2001, pages 282 - 283
LI W, JAROSZEWSKI L, GODZIK A: "Tolerating some Redundancy Significantly Speeds up Clustering of Large Protein Databases", [J]. BIOINFORMATICS, vol. 18, no. 1, 2002, pages 77 - 82
LI W: "Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences", [M]. SPRINGER US, 2015
WANG G, DUNBRACK R L JR: "PISCES: a protein sequence culling server", [J]. BIOINFORMATICS, vol. 19, no. 12, 2003, pages 1589
Attorney, Agent or Firm:
SHENZHEN KEJIN INTELLECTUAL PROPERTY OFFICE (CN)