Search & lookup terms#
Entities and ontologies can be complex with many different identifiers.
Here we show Bionty’s lookup model for species, genes, proteins and cell markers. You’ll see how to access the reference table via .df(), look up an entity term with autocompletion via .lookup(), and search for an entity term with free text via .search().
import bionty as bt
.fields: fields of an ontology reference#
gene_bionty = bt.Gene()
gene_bionty
Gene
Species: human
Source: ensembl, release-109
📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object
gene_bionty.fields
{'biotype',
'description',
'ensembl_gene_id',
'hgnc_id',
'ncbi_gene_id',
'symbol',
'synonyms'}
Fields can be accessed as attributes for autocompletion. You can pass them to the field parameter of any bionty function instead of strings:
gene_bionty.ncbi_gene_id
ncbi_gene_id
.df(): reference table#
Data scientists love DataFrames, and every entity has a reference table containing all the fields.
df = gene_bionty.df()
df.head()
| ensembl_gene_id | symbol | ncbi_gene_id | hgnc_id | biotype | description | synonyms |
---|---|---|---|---|---|---|---|
0 | ENSG00000000003 | TSPAN6 | 7105 | HGNC:11858 | protein_coding | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | TSPAN-6|T245|TM4SF6 |
1 | ENSG00000000005 | TNMD | 64102 | HGNC:17757 | protein_coding | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | tendin|ChM1L|TEM|myodulin|BRICD4 |
2 | ENSG00000000419 | DPM1 | 8813 | HGNC:3005 | protein_coding | dolichyl-phosphate mannosyltransferase subunit... | CDGIE|MPDS |
3 | ENSG00000000457 | SCYL3 | 57147 | HGNC:19285 | protein_coding | SCY1 like pseudokinase 3 [Source:HGNC Symbol;A... | PACE1|PACE-1 |
4 | ENSG00000000460 | C1orf112 | 55732 | HGNC:25565 | protein_coding | chromosome 1 open reading frame 112 [Source:HG... | FLJ10706 |
To retrieve the records for several gene symbols at once, set symbol as the index and select the corresponding rows with pandas:
df.set_index("symbol").loc[["LMNA", "TCF7", "BRCA1"]]
symbol | ensembl_gene_id | ncbi_gene_id | hgnc_id | biotype | description | synonyms |
---|---|---|---|---|---|---|
LMNA | ENSG00000160789 | 4000 | HGNC:6636 | protein_coding | lamin A/C [Source:HGNC Symbol;Acc:HGNC:6636] | PRO1|LMNL1|MADA|CMD1A|HGPS|LMN1|LGMD1B |
TCF7 | ENSG00000081059 | 6932 | HGNC:11639 | protein_coding | transcription factor 7 [Source:HGNC Symbol;Acc... | TCF-1 |
BRCA1 | ENSG00000012048 | 672 | HGNC:1100 | protein_coding | BRCA1 DNA repair associated [Source:HGNC Symbo... | PPP1R53|RNF53|FANCS|BRCC1 |
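The same selection can be tried without downloading the reference table. The sketch below is only an illustration: it rebuilds three rows by hand from the values shown above (a subset of the columns) and contrasts .loc with isin(). The symbol "NOT_A_GENE" is a made-up placeholder for an identifier that is absent from the table.

```python
import pandas as pd

# Hand-built stand-in for gene_bionty.df(); values copied from the tables above.
df = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG00000160789", "ENSG00000081059", "ENSG00000012048"],
        "symbol": ["LMNA", "TCF7", "BRCA1"],
        "ncbi_gene_id": ["4000", "6932", "672"],
    }
)

by_symbol = df.set_index("symbol")

# .loc with a list returns rows in the requested order,
# but raises KeyError if any requested label is missing.
selected = by_symbol.loc[["TCF7", "LMNA"]]

# isin() skips unknown symbols and keeps the table's own row order.
wanted = ["BRCA1", "NOT_A_GENE", "LMNA"]
tolerant = df[df["symbol"].isin(wanted)]
```

Use .loc when every symbol is expected to exist and a missing one should fail loudly; use isin() when the input list may contain unmapped symbols.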
.lookup(): Lookup terms and records with autocompletion#
Terms can be looked up with autocompletion using a lookup object.
lookup = gene_bionty.lookup()
We provide a dot (.) accessor for normalized terms (lower case, containing only alphanumeric characters and underscores):
lookup.tcf7
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
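How Bionty derives these normalized keys is not shown in this section. As a rough sketch, the hypothetical helper to_lookup_key below implements one normalization that is consistent with the keys appearing on this page (tcf7 and hgnc_10478); it is an assumption for illustration, not the library's code.

```python
import re
from types import SimpleNamespace


def to_lookup_key(term):
    """Lower-case a term and collapse every run of other characters into "_"."""
    return re.sub(r"[^0-9a-z]+", "_", term.lower())


# Build a tiny dot-accessible lookup: normalized key -> original string.
symbols = ["TCF7", "LMNA", "BRCA1"]
lookup_sketch = SimpleNamespace(**{to_lookup_key(s): s for s in symbols})

original = lookup_sketch.tcf7            # the original string "TCF7"
hgnc_key = to_lookup_key("HGNC:10478")   # the key used for an HGNC identifier
```

Normalization is what makes dot access possible at all, since strings such as "HGNC:10478" are not valid Python attribute names.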
To look up the exact original strings, convert the lookup object to a dict and use the bracket ([]) accessor for autocompletion:
lookup_dict = lookup.dict()
lookup_dict["TCF7"]
Gene(ensembl_gene_id='ENSG00000081059', symbol='TCF7', ncbi_gene_id='6932', hgnc_id='HGNC:11639', biotype='protein_coding', description='transcription factor 7 [Source:HGNC Symbol;Acc:HGNC:11639]', synonyms='TCF-1')
By default, the name field is used to generate lookup keys. Gene has no name field (see .fields above), so the keys shown so far come from symbol. You can specify another field to look up:
lookup = gene_bionty.lookup(gene_bionty.hgnc_id)
If multiple entries are matched, they are returned as a list:
lookup.hgnc_10478
[Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
lookup_dict = lookup.dict()
lookup_dict["HGNC:10478"]
[Gene(ensembl_gene_id='ENSG00000227322', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000228333', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000204231', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000231321', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000206289', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP'),
Gene(ensembl_gene_id='ENSG00000235712', symbol='RXRB', ncbi_gene_id='6257', hgnc_id='HGNC:10478', biotype='protein_coding', description='retinoid X receptor beta [Source:HGNC Symbol;Acc:HGNC:10478]', synonyms='NR2B2|RXRbeta|RCoR-1|RXR-beta|H-2RIIBP')]
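Because a key can map to either one record or a list of records, calling code has to handle both shapes. The sketch below shows one way such a lookup could be built from plain dicts; build_lookup is a hypothetical helper and the records are a hand-copied subset of the output above, not Bionty's implementation.

```python
from collections import defaultdict

# A few records copied from the outputs above (subset of fields).
records = [
    {"ensembl_gene_id": "ENSG00000227322", "symbol": "RXRB", "hgnc_id": "HGNC:10478"},
    {"ensembl_gene_id": "ENSG00000228333", "symbol": "RXRB", "hgnc_id": "HGNC:10478"},
    {"ensembl_gene_id": "ENSG00000081059", "symbol": "TCF7", "hgnc_id": "HGNC:11639"},
]


def build_lookup(records, field):
    """Map each value of `field` to its record, or to a list when several records share it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[field]].append(record)
    # Unwrap single matches so unique keys map straight to one record.
    return {key: hits[0] if len(hits) == 1 else hits for key, hits in grouped.items()}


by_hgnc = build_lookup(records, "hgnc_id")
shared = by_hgnc["HGNC:10478"]   # list of two records
unique = by_hgnc["HGNC:11639"]   # single record
```

A caller that always wants a list can normalize with `hits if isinstance(hits, list) else [hits]`.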
.search(): Search a term against a field#
celltype_bionty = bt.CellType()
celltype_bionty.search("cytotoxic T cells").head(2)
name | ontology_id | definition | synonyms | children | __ratio__ |
---|---|---|---|---|---|
cytotoxic T cell | CL:0000910 | A Mature T Cell That Differentiated And Acquir... | cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... | [] | 96.969697 |
Tc2 cell | CL:0000918 | A Cd8-Positive, Alpha-Beta Positive T Cell Exp... | CD8-positive Th2 cell|Tc2 T lymphocyte|Th2 CD8... | [] | 76.190476 |
By default, search also matches against each of the synonyms:
celltype_bionty.search("P cell").head(2)
name | ontology_id | definition | synonyms | children | __ratio__ |
---|---|---|---|---|---|
nodal myocyte | CL:0002072 | A Specialized Cardiac Myocyte In The Sinoatria... | cardiac pacemaker cell|myocytus nodalis|P cell | [CL:1000409, CL:1000410] | 100.000000 |
double-positive, alpha-beta thymocyte | CL:0000809 | A Thymocyte Expressing The Alpha-Beta T Cell R... | DP cell|DP thymocyte|double-positive, alpha-be... | [CL:0002430, CL:0002427, CL:0002431, CL:000242... | 92.307692 |
You can turn off synonym matching with synonyms_field=None:
celltype_bionty.search("P cell", synonyms_field=None).head(2)
name | ontology_id | definition | synonyms | children | __ratio__ |
---|---|---|---|---|---|
PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0002680] | 92.307692 |
GIP cell | CL:0002278 | An Enteroendocrine Cell Of Duodenum And Jejunu... | type K enteroendocrine cell | [] | 85.714286 |
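To make the effect of synonym matching concrete, here is a minimal sketch that scores a query against each name and, optionally, each pipe-separated synonym. It uses the standard library's difflib as a stand-in scorer; which scorer Bionty uses is not stated here, and the function names and the use_synonyms flag are assumptions for illustration. The records are copied from the tables above.

```python
from difflib import SequenceMatcher

# Names and synonyms copied from the result tables above.
CELL_TYPES = [
    {"name": "nodal myocyte", "synonyms": "cardiac pacemaker cell|myocytus nodalis|P cell"},
    {"name": "PP cell", "synonyms": "type F enteroendocrine cell"},
    {"name": "GIP cell", "synonyms": "type K enteroendocrine cell"},
]


def ratio(query, candidate):
    """Similarity on a 0-100 scale (difflib stand-in, not necessarily Bionty's scorer)."""
    return 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def search_sketch(query, records, use_synonyms=True):
    """Rank records by their best score across the name and, optionally, each synonym."""
    scored = []
    for record in records:
        candidates = [record["name"]]
        if use_synonyms:
            candidates += record["synonyms"].split("|")
        scored.append((max(ratio(query, c) for c in candidates), record["name"]))
    return sorted(scored, reverse=True)


with_synonyms = search_sketch("P cell", CELL_TYPES)
names_only = search_sketch("P cell", CELL_TYPES, use_synonyms=False)
```

With synonyms enabled, nodal myocyte ranks first because one of its synonyms is an exact match; with names only, PP cell takes the top spot, mirroring the two tables above.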
Match against another field (default is “name”):
celltype_bionty.search(
"CD8+ alpha beta T cells", field=celltype_bionty.definition
).head(2)
definition | ontology_id | name | synonyms | children | __ratio__ |
---|---|---|---|---|---|
A T Cell That Expresses An Alpha-Beta T Cell Receptor Complex. | CL:0000789 | alpha-beta T cell | alpha-beta T-cell|alpha-beta T-lymphocyte|alph... | [CL:0000790, CL:0000791] | 75.00000 |
A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor. | CL:0000625 | CD8-positive, alpha-beta T cell | CD8-positive, alpha-beta T-cell|CD8-positive, ... | [CL:0002437, CL:0001052, CL:0000900, CL:000079... | 70.37037 |
Pass top_hit=True to return only the best match as a record instead of a DataFrame ranked by matching ratios:
celltype_bionty.search("P cell", top_hit=True)
CellType(name='nodal myocyte', ontology_id='CL:0002072', definition='A Specialized Cardiac Myocyte In The Sinoatrial And Atrioventricular Nodes. The Cell Is Slender And Fusiform Confined To The Nodal Center, Circumferentially Arranged Around The Nodal Artery.', synonyms='cardiac pacemaker cell|myocytus nodalis|P cell', children=array(['CL:1000409', 'CL:1000410'], dtype=object))
Tied results are all returned as top hits:
celltype_bionty.search("A cell", top_hit=True, synonyms_field=None)
[CellType(name='T cell', ontology_id='CL:0000084', definition='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', synonyms='T-cell|T lymphocyte|T-lymphocyte', children=array(['CL:0000798', 'CL:0002420', 'CL:0002419', 'CL:0000789'],
dtype=object)),
CellType(name='B cell', ontology_id='CL:0000236', definition='A Lymphocyte Of B Lineage That Is Capable Of B Cell Mediated Immunity.', synonyms='B lymphocyte|B-lymphocyte|B-cell', children=array(['CL:0009114', 'CL:0001201'], dtype=object))]
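The tie behaviour means the return type depends on the data: one record for a unique best match, a list when several candidates share the best score. A minimal sketch of that logic, again using difflib as an assumed stand-in scorer and a hypothetical top_hit_sketch helper working on plain names:

```python
from difflib import SequenceMatcher


def ratio(query, candidate):
    """Similarity on a 0-100 scale (difflib stand-in, not necessarily Bionty's scorer)."""
    return 100 * SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def top_hit_sketch(query, names):
    """Return the best-scoring name, or a list of names when several tie for first."""
    scores = {name: ratio(query, name) for name in names}
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else winners


tied = top_hit_sketch("A cell", ["T cell", "B cell", "PP cell"])
single = top_hit_sketch("P cell", ["PP cell", "GIP cell"])
```

Code that consumes a top hit should therefore check whether it received a list before accessing record attributes.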