Version 1.3.0 (May 30, 2024)
In this version, we…
- Updated the version of PubMed used
- Updated the version of the Human Phenotype Ontology used
- Added a feature describing whether a gene is evolutionarily constrained
- Added a feature describing the number of publications a gene has according to gene2pubmed alone
- Added a feature describing the number of PDB entries that are associated with each gene
- Added COVID-19 as a disease subfield
Gene information
Homo sapiens gene information was downloaded from NCBI Gene on Aug 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/All_Data.gene_info.gz]. Only genes with an unambiguous mapping of Entrez ID to Ensembl ID were used (n=36,035). Number of gene synonyms, protein-coding status, and official gene symbol were derived from this dataset. A gene symbol was considered undefined if the gene’s entry for HGNC gene symbol was ‘-’.
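The filtering rules above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code. It assumes that 'unambiguous' means a one-to-one pairing between Entrez and Ensembl identifiers, and that rows are already parsed into dictionaries keyed by the gene_info column names. The helper names and the toy identifiers are hypothetical.

```python
from collections import defaultdict


def ensembl_ids(db_xrefs):
    """Extract Ensembl IDs from a pipe-separated dbXrefs field ('-' means empty)."""
    if db_xrefs == "-":
        return []
    return [x.split(":", 1)[1] for x in db_xrefs.split("|") if x.startswith("Ensembl:")]


def unambiguous_genes(rows):
    """Keep rows whose Entrez ID pairs with exactly one Ensembl ID and vice versa.

    Assumption: 'unambiguous' is read as a one-to-one mapping.
    """
    entrez_to_ensembl = defaultdict(set)
    ensembl_to_entrez = defaultdict(set)
    for row in rows:
        for ens in ensembl_ids(row["dbXrefs"]):
            entrez_to_ensembl[row["GeneID"]].add(ens)
            ensembl_to_entrez[ens].add(row["GeneID"])
    kept = []
    for row in rows:
        ens = entrez_to_ensembl.get(row["GeneID"], set())
        if len(ens) == 1 and len(ensembl_to_entrez[next(iter(ens))]) == 1:
            kept.append(row)
    return kept


def n_synonyms(row):
    """Number of gene synonyms; '-' marks a gene without synonyms."""
    return 0 if row["Synonyms"] == "-" else len(row["Synonyms"].split("|"))


def symbol_undefined(row):
    """A gene symbol is undefined if the nomenclature-authority entry is '-'."""
    return row["Symbol_from_nomenclature_authority"] == "-"


# Toy rows with made-up identifiers (not real genes).
ROWS = [
    {"GeneID": "1", "Synonyms": "A|B", "dbXrefs": "HGNC:HGNC:5|Ensembl:ENSG_A",
     "Symbol_from_nomenclature_authority": "GENE1"},
    {"GeneID": "2", "Synonyms": "-", "dbXrefs": "Ensembl:ENSG_B|Ensembl:ENSG_C",
     "Symbol_from_nomenclature_authority": "GENE2"},
    {"GeneID": "3", "Synonyms": "-", "dbXrefs": "Ensembl:ENSG_D",
     "Symbol_from_nomenclature_authority": "GENE3"},
    {"GeneID": "4", "Synonyms": "-", "dbXrefs": "Ensembl:ENSG_D",
     "Symbol_from_nomenclature_authority": "GENE4"},
    {"GeneID": "5", "Synonyms": "-", "dbXrefs": "-",
     "Symbol_from_nomenclature_authority": "-"},
]

kept_ids = [row["GeneID"] for row in unambiguous_genes(ROWS)]
```

In this toy input only gene 1 survives: gene 2 maps to two Ensembl IDs, genes 3 and 4 share one Ensembl ID, and gene 5 has none.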
Genes in title/abstract of primary research articles
Homo sapiens gene information was downloaded from NCBI Gene on Aug 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz]. gene2pubmed was downloaded from NCBI Gene on August 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz] (Maglott et al., 2007). PubTator gene annotations were downloaded from NIH-NLM on July 12, 2022 [https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/] (Maglott et al., 2007; Wei et al., 2019). PubMed was downloaded on January 14, 2023 [https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/]. Using only PMIDs annotated as primary research articles, a human gene was considered mentioned in the title/abstract of a publication if the gene was annotated as being in the title/abstract by PubTator and the article appeared in gene2pubmed. The feature ‘Number of articles about gene (gene2pubmed)’ shows counts using only gene2pubmed articles with fewer than 100 annotated genes. gene2pubmed annotates a gene even if the article is not about the gene but merely mentions it somewhere in the article or in its associated datasets.
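The two counting rules above can be sketched as follows. This is a minimal illustration, assuming the annotations have already been loaded as sets of (PMID, gene ID) pairs. The function names and toy data are hypothetical. Whether the primary-research filter is also applied to the gene2pubmed-only feature is not stated above, so the sketch leaves it out.

```python
from collections import Counter, defaultdict


def title_abstract_counts(pubtator_pairs, gene2pubmed_pairs, research_pmids):
    """Count articles per gene that are supported by both PubTator and gene2pubmed.

    Only PMIDs in research_pmids (primary research articles) are counted.
    """
    shared = pubtator_pairs & gene2pubmed_pairs
    return Counter(gene for pmid, gene in shared if pmid in research_pmids)


def gene2pubmed_counts(gene2pubmed_pairs, max_genes=100):
    """Count gene2pubmed articles per gene, using only articles with fewer than max_genes genes."""
    genes_per_pmid = defaultdict(set)
    for pmid, gene in gene2pubmed_pairs:
        genes_per_pmid[pmid].add(gene)
    counts = Counter()
    for genes in genes_per_pmid.values():
        if len(genes) < max_genes:
            counts.update(genes)
    return counts


# Toy data with made-up PMIDs and gene IDs.
PUBTATOR = {(1, "A"), (2, "A"), (3, "B")}
GENE2PUBMED = {(1, "A"), (3, "B"), (4, "A"), (5, "A"), (5, "B"), (5, "C")}
RESEARCH = {1, 2, 3}

both = title_abstract_counts(PUBTATOR, GENE2PUBMED, RESEARCH)
# A small cutoff stands in for the threshold of 100 so the toy data can show the effect.
g2p_only = gene2pubmed_counts(GENE2PUBMED, max_genes=3)
```

Requiring agreement between the two sources keeps a gene only when an article both mentions it in the title/abstract and is curated to it, which is stricter than either source alone.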
Functional annotations
Mapping of genes to Gene Ontology / Protein Interaction Database / WikiPathways / Reactome / Kyoto Encyclopedia of Genes and Genomes / Human Phenotype Ontology / BioCarta categories was derived from MSigDB v7.5 Entrez ID.gmt files, downloaded on April 12, 2022 [http://www.gsea-msigdb.org/gsea/downloads_archive.jsp].
Between-species homology
HomoloGene Build 68 was used to determine interspecies homology [https://ftp.ncbi.nih.gov/pub/HomoloGene/build68/]. Human = taxid:9606, mouse = taxid:10090, rat = taxid:10116, C. elegans = taxid:6239, D. melanogaster = taxid:7227, yeast = taxid:559292, zebrafish = taxid:7955.
Primate specificity
Human genes were considered primate-specific if the only other members of their homology group belonged to primate genomes. Primate taxonomy ids were downloaded from NCBI Taxonomy on September 20, 2022 [https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid9443[Subtree]].
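A minimal sketch of the primate-specificity rule, assuming homology groups are available as (group ID, taxid, gene ID) tuples. The function name and the toy rows are hypothetical. The sketch also assumes that a group containing only a human gene does not count as primate-specific, which the description above does not settle.

```python
from collections import defaultdict

HUMAN_TAXID = 9606


def primate_specific_genes(homology_rows, primate_taxids):
    """Return human gene IDs whose homology group has only primate members besides human.

    Assumption: at least one non-human member must be present in the group.
    """
    groups = defaultdict(list)
    for group_id, taxid, gene_id in homology_rows:
        groups[group_id].append((taxid, gene_id))
    result = set()
    for members in groups.values():
        other_taxids = [taxid for taxid, _ in members if taxid != HUMAN_TAXID]
        if other_taxids and all(taxid in primate_taxids for taxid in other_taxids):
            result.update(gene for taxid, gene in members if taxid == HUMAN_TAXID)
    return result


# Toy subset of primate taxids: chimpanzee (9598) and rhesus macaque (9544).
PRIMATES = {9598, 9544}
ROWS = [
    (1, 9606, "h1"), (1, 9598, "c1"),                    # human + chimpanzee only
    (2, 9606, "h2"), (2, 9598, "c2"), (2, 10090, "m2"),  # also has a mouse member
    (3, 9606, "h3"),                                     # human only
]

specific = primate_specific_genes(ROWS, PRIMATES)
```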
Number of publications in model organisms
Gene information was downloaded from NCBI Gene on August 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/All_Data.gene_info.gz]. Using only PMIDs annotated as primary research articles, a gene was considered mentioned in the title/abstract of a publication if the gene was annotated as being in the title/abstract by PubTator and the article appeared in gene2pubmed. Genes in model organisms were mapped to human genes, and the number of articles on those mapping to human genes was counted. If a model organism’s gene had homology to human but no associated publications, the number of publications was resolved to zero. Otherwise, counts were listed as NA.
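The zero-versus-NA rule can be sketched as below. This is an illustration under stated assumptions: NA is read as 'no homolog in that organism', `None` stands in for NA, and counts are summed when a human gene has several homologs. All names and toy identifiers are hypothetical.

```python
def model_organism_counts(human_genes, homologs, article_counts):
    """Map each human gene to the article count of its model-organism homologs.

    homologs: human gene -> list of model-organism gene IDs
    article_counts: model-organism gene -> number of articles
    Returns None (standing in for NA) when a human gene has no homolog,
    and 0 when a homolog exists but has no associated publications.
    """
    result = {}
    for gene in human_genes:
        partners = homologs.get(gene, [])
        if not partners:
            result[gene] = None
        else:
            # Assumption: counts are summed over all homologs of the human gene.
            result[gene] = sum(article_counts.get(p, 0) for p in partners)
    return result


counts = model_organism_counts(
    human_genes=["h1", "h2", "h3"],
    homologs={"h1": ["m1"], "h2": ["m2"]},
    article_counts={"m1": 12},
)
```

Keeping zero and NA distinct separates 'studied nowhere in this organism' from 'cannot be studied in this organism because no homolog exists'.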
Mouse phenotype hits
International Mouse Phenotyping Consortium data release 17.0 was downloaded on August 18, 2022 [https://www.mousephenotype.org/data/release]. Mouse genes were matched to human genes with HomoloGene.
Gene expression atlas
EBI-GXA release 36 was downloaded on September 15, 2020 [https://web.archive.org/web/20201022184159/https://www.ebi.ac.uk/gxa/download]. This is the most recent release of EBI-GXA available as a bulk download. For the probability of differential expression (DE), only RNA-seq comparisons were considered and DE was called at Benjamini-Hochberg q<0.05.
Global RNA expression
RNA consensus tissue gene data from HPA release 21.1 was downloaded on September 20, 2022 [https://www.proteinatlas.org/about/download]. Global RNA expression was estimated by taking the median expression (nTPM) across tissues for each gene and the proportion of tissues with detectable (≥1 nTPM) expression for each gene.
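The two summary statistics above can be sketched for a single gene as follows. The function name and the toy nTPM values are hypothetical.

```python
from statistics import median


def global_expression(ntpm_by_tissue, detection_threshold=1.0):
    """Return (median nTPM across tissues, proportion of tissues with nTPM >= threshold)."""
    values = list(ntpm_by_tissue.values())
    detected = sum(1 for value in values if value >= detection_threshold)
    return median(values), detected / len(values)


# Toy expression values for one gene.
summary = global_expression({"liver": 0.5, "brain": 2.0, "lung": 4.0, "skin": 1.0})
```

For these four tissues the median is 1.5 nTPM and three of four tissues meet the detection threshold, giving a proportion of 0.75.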
Expression in HeLa cells
RNA cell line gene data from HPA release 21.1 was downloaded on September 20, 2022 [https://www.proteinatlas.org/about/download]. Expression is in nTPM.
Previous patent activity
Genes with patent activity were defined from Table S1 of Rosenfeld and Mason, 2013. Genes were mapped with their HGNC symbol. This analysis aligned sequences in patents to the human genome to estimate patent coverage of human coding sequences. Although this does not necessarily reflect whether the mapped genes were claimed directly by the patent holder, as noted by others (Tu et al., 2014), this analysis remains the most comprehensive available for determining patent coverage of the human genome.
Druggability
Druggable genes were identified from Table S1 of Finan et al., 2017. Genes were mapped with their Ensembl identifier.
Gene length
GenBank was downloaded in spring 2017 (genome version GRCh38.p10). Gene length is defined here as the span of the longest transcript on the chromosome. This aligns with the model of gene length used in Stoeger et al., 2018.
Solubility
SwissProt protein sequences and mapping tables to Entrez GeneIDs were downloaded from UniProt in spring 2017. Protein GRAVY score (ignoring pyrrolysine and selenocysteine) was estimated with Biopython (Cock et al., 2009).
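For readers unfamiliar with the GRAVY score, it is the mean Kyte-Doolittle hydropathy value over all residues of a protein. The sketch below computes it in plain Python so that it runs without extra dependencies. It is not the code used here, and it assumes that 'ignoring' pyrrolysine (O) and selenocysteine (U) means dropping those residues before averaging. Biopython offers a comparable calculation through `ProteinAnalysis(...).gravy()` in `Bio.SeqUtils.ProtParam`.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence, ignored="UO"):
    """Mean hydropathy of a protein sequence, skipping residues listed in `ignored`.

    Assumption: ignored residues are removed before averaging.
    """
    residues = [aa for aa in sequence.upper() if aa not in ignored]
    if not residues:
        raise ValueError("sequence has no scorable residues")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


hydrophobic = gravy("IL")     # two strongly hydrophobic residues
with_sec = gravy("AUR")       # U is skipped, leaving A and R
```

Positive scores indicate a more hydrophobic protein and negative scores a more hydrophilic one.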
Loss-of-function intolerance
Data was obtained from Karczewski et al., 2020. pLI scores >0.9 on main transcripts, as flagged by authors, were considered as highly loss-of-function intolerant as described by Lek et al., 2016.
Number of GWAS hits
The EBI GWAS catalog (Buniello et al., 2019; associations and studies) was downloaded on August 17, 2022 [https://www.ebi.ac.uk/gwas/docs/file-downloads]. Loci were mapped to the nearest gene.
Status as an understudied protein
The Illuminating the Druggable Genome understudied protein list was downloaded on September 20, 2022 [https://github.com/druggablegenome/IDGTargets/blob/master/IDG_TargetList_CurrentVersion.json].
Human protein atlas
HPA release 21.1 was downloaded on September 20, 2022 [https://www.proteinatlas.org/search]. Evidence for a protein’s existence, as determined by NeXtProt, HPA, or UniProt, was resolved as True if the respective evidence entry was annotated as ‘Evidence at protein level’. Status as a membrane protein was determined by whether the ‘Protein class’ column contained the string ‘membrane protein’. Antibodies were considered available for each protein if the protein’s entry in the ‘Antibody’ column was not null.
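The three rules above can be sketched for a single table row as follows. The evidence column names (`NeXtProt evidence`, `HPA evidence`, `UniProt evidence`) are assumptions about the downloaded table, the output keys are hypothetical, and the toy values are made up. Empty strings are treated as null alongside missing values, which is also an assumption.

```python
PROTEIN_LEVEL = "Evidence at protein level"


def is_null(value):
    """Treat missing values and blank strings as null (an assumption of this sketch)."""
    return value is None or str(value).strip() == ""


def hpa_flags(row):
    """Derive boolean features from one row of the HPA table."""
    return {
        "evidence_nextprot": row.get("NeXtProt evidence") == PROTEIN_LEVEL,
        "evidence_hpa": row.get("HPA evidence") == PROTEIN_LEVEL,
        "evidence_uniprot": row.get("UniProt evidence") == PROTEIN_LEVEL,
        # Case-sensitive substring match, as described in the text.
        "membrane_protein": "membrane protein" in (row.get("Protein class") or ""),
        "antibody_available": not is_null(row.get("Antibody")),
    }


# Toy rows with made-up values.
flags_a = hpa_flags({
    "NeXtProt evidence": "Evidence at protein level",
    "HPA evidence": "Evidence at transcript level",
    "UniProt evidence": "Evidence at protein level",
    "Protein class": "Predicted membrane proteins, Enzymes",
    "Antibody": "HPA000001",
})
flags_b = hpa_flags({"Protein class": None, "Antibody": ""})
```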
Availability of plasmids
The AddGene plasmid catalog was downloaded on August 12, 2022 [https://www.addgene.org/browse/gene/gene-list-data/?_=1666368044314].
Availability of compounds
The catalog of gene targets was downloaded from ChEMBL on September 20, 2022 [https://www.ebi.ac.uk/chembl/g/#browse/targets]. UniProt IDs were converted to Entrez IDs to identify which human genes were affected by any compound.
Mendelian inheritance
Gene-phenotype associations were downloaded from the Human Phenotype Ontology website on May 23, 2024 [https://hpo.jax.org/data/annotations]. Genes associated with autosomal dominant [https://hpo.jax.org/app/browse/term/HP:0000006] and autosomal recessive [https://hpo.jax.org/app/browse/term/HP:0000007] inheritance were considered to have evidence of Mendelian inheritance.
MeSH terms
Using our dataset of genes in the titles/abstracts of articles, we applied MeSH terms and picked out subject-specific gene article counts for the 200 most popular disease annotations (now including COVID-19).
Evolutionary constraints
Using Table S2 from Sun et al., 2024, any gene with a heterozygous selection coefficient greater than 0.073 (the mean across all tested genes) was labelled as evolutionarily constrained. ‘Evolutionarily constrained’ is analogous to ‘Loss of function intolerant’, but based on a different measure.
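The labelling rule amounts to a strict threshold on the per-gene selection coefficient, as in this minimal sketch. The function name and the toy coefficients are hypothetical. The threshold value is the one stated above.

```python
S_HET_THRESHOLD = 0.073  # mean heterozygous selection coefficient across tested genes


def label_constrained(s_het_by_gene, threshold=S_HET_THRESHOLD):
    """Label a gene as evolutionarily constrained if its coefficient exceeds the threshold."""
    return {gene: s_het > threshold for gene, s_het in s_het_by_gene.items()}


# Toy coefficients for made-up genes; a value equal to the threshold is not constrained.
labels = label_constrained({"geneA": 0.20, "geneB": 0.01, "geneC": 0.073})
```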
PDB availability
All structures annotated as being from human (taxid=9606) were downloaded from the RCSB PDB sequence search on May 23, 2024 [https://www.rcsb.org/search/advanced]. UniProt identifiers were converted to NCBI gene identifiers using ID mappings from UniProt version 2022_03 [https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/]. A gene may be annotated to a PDB entry despite that PDB entry only containing a single domain from that gene.
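Counting PDB entries per gene through the UniProt mapping can be sketched as follows. The input shapes, function name and toy identifiers are assumptions for illustration. Each PDB entry is counted once per gene, even if several of its chains map to that gene.

```python
from collections import defaultdict


def pdb_entries_per_gene(pdb_to_uniprot, uniprot_to_gene):
    """Count distinct PDB entries associated with each gene.

    pdb_to_uniprot: PDB ID -> iterable of UniProt accessions found in that entry
    uniprot_to_gene: UniProt accession -> iterable of NCBI gene IDs
    """
    entries_by_gene = defaultdict(set)
    for pdb_id, accessions in pdb_to_uniprot.items():
        for accession in accessions:
            # Accessions without a gene mapping are skipped.
            for gene_id in uniprot_to_gene.get(accession, ()):
                entries_by_gene[gene_id].add(pdb_id)
    return {gene_id: len(pdb_ids) for gene_id, pdb_ids in entries_by_gene.items()}


# Toy data with made-up identifiers.
pdb_counts = pdb_entries_per_gene(
    pdb_to_uniprot={"1AAA": ["P1", "P1"], "2BBB": ["P1", "P2"], "3CCC": ["P9"]},
    uniprot_to_gene={"P1": ["10"], "P2": ["20"]},
)
```

Genes without any mapped entry are absent from the result and can be filled with zero downstream.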
Version 1.2.0 (April 5, 2024)
Gene information
Homo sapiens gene information was downloaded from NCBI Gene on Aug 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/All_Data.gene_info.gz]. Only genes with an unambiguous mapping of Entrez ID to Ensembl ID were used (n=36,035). Number of gene synonyms, protein-coding status, and official gene symbol were derived from this dataset. A gene symbol was considered undefined if the gene’s entry for HGNC gene symbol was ‘-’.
Genes in title/abstract of primary research articles
Homo sapiens gene information was downloaded from NCBI Gene on Aug 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz]. gene2pubmed was downloaded from NCBI Gene on August 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz] (Maglott et al., 2007). PubTator gene annotations were downloaded from NIH-NLM on July 12, 2022 [https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/] (Maglott et al., 2007; Wei et al., 2019). PubMed was downloaded on December 17, 2021 [https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/]. Using only PMIDs annotated as primary research articles, a human gene was considered mentioned in the title/abstract of a publication if the gene was annotated as being in the title/abstract by PubTator and the article appeared in gene2pubmed.
Functional annotations
Mapping of genes to Gene Ontology / Protein Interaction Database / WikiPathways / Reactome / Kyoto Encyclopedia of Genes and Genomes / Human Phenotype Ontology / BioCarta categories was derived from MSigDB v7.5 Entrez ID.gmt files, downloaded on April 12, 2022 [http://www.gsea-msigdb.org/gsea/downloads_archive.jsp].
Between-species homology
HomoloGene Build 68 was used to determine interspecies homology [https://ftp.ncbi.nih.gov/pub/HomoloGene/build68/]. Human = taxid:9606, mouse = taxid:10090, rat = taxid:10116, C. elegans = taxid:6239, D. melanogaster = taxid:7227, yeast = taxid:559292, zebrafish = taxid:7955.
Primate specificity
Human genes were considered primate-specific if the only other members of their homology group belonged to primate genomes. Primate taxonomy ids were downloaded from NCBI Taxonomy on September 20, 2022 [https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid9443[Subtree]].
Number of publications in model organisms
Gene information was downloaded from NCBI Gene on August 16, 2022 [https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/All_Data.gene_info.gz]. Using only PMIDs annotated as primary research articles, a gene was considered mentioned in the title/abstract of a publication if the gene was annotated as being in the title/abstract by PubTator and the article appeared in gene2pubmed. Genes in model organisms were mapped to human genes, and the number of articles on those mapping to human genes was counted. If a model organism’s gene had homology to human but no associated publications, the number of publications was resolved to zero. Otherwise, counts were listed as NA.
Mouse phenotype hits
International Mouse Phenotyping Consortium data release 17.0 was downloaded on August 18, 2022 [https://www.mousephenotype.org/data/release]. Mouse genes were matched to human genes with HomoloGene.
Gene expression atlas
EBI-GXA release 36 was downloaded on September 15, 2020 [https://web.archive.org/web/20201022184159/https://www.ebi.ac.uk/gxa/download]. This is the most recent release of EBI-GXA available as a bulk download. For the probability of differential expression (DE), only RNA-seq comparisons were considered and DE was called at Benjamini-Hochberg q<0.05.
Global RNA expression
RNA consensus tissue gene data from HPA release 21.1 was downloaded on September 20, 2022 [https://www.proteinatlas.org/about/download]. Global RNA expression was estimated by taking the median expression (nTPM) across tissues for each gene and the proportion of tissues with detectable (≥1 nTPM) expression for each gene.
Expression in HeLa cells
RNA cell line gene data from HPA release 21.1 was downloaded on September 20, 2022 [https://www.proteinatlas.org/about/download]. Expression is in nTPM.
Previous patent activity
Genes with patent activity were defined from Table S1 of Rosenfeld and Mason, 2013. Genes were mapped with their HGNC symbol. This analysis aligned sequences in patents to the human genome to estimate patent coverage of human coding sequences. Although this does not necessarily reflect whether the mapped genes were claimed directly by the patent holder, as noted by others (Tu et al., 2014), this analysis remains the most comprehensive available for determining patent coverage of the human genome.
Druggability
Druggable genes were identified from Table S1 of Finan et al., 2017. Genes were mapped with their Ensembl identifier.
Gene length
GenBank was downloaded in spring 2017 (genome version GRCh38.p10). Gene length is defined here as the span of the longest transcript on the chromosome. This aligns with the model of gene length used in Stoeger et al., 2018.
Solubility
SwissProt protein sequences and mapping tables to Entrez GeneIDs were downloaded from UniProt in spring 2017. Protein GRAVY score (ignoring pyrrolysine and selenocysteine) was estimated with Biopython (Cock et al., 2009).
Loss-of-function intolerance
Data was obtained from Karczewski et al., 2020. pLI scores >0.9 on main transcripts, as flagged by authors, were considered as highly loss-of-function intolerant as described by Lek et al., 2016.
Number of GWAS hits
The EBI GWAS catalog (Buniello et al., 2019; associations and studies) was downloaded on August 17, 2022 [https://www.ebi.ac.uk/gwas/docs/file-downloads]. Loci were mapped to the nearest gene.
Status as an understudied protein
The Illuminating the Druggable Genome understudied protein list was downloaded on September 20, 2022 [https://github.com/druggablegenome/IDGTargets/blob/master/IDG_TargetList_CurrentVersion.json].
Human protein atlas
HPA release 21.1 was downloaded on September 20, 2022 [https://www.proteinatlas.org/search]. Evidence for a protein’s existence, as determined by NeXtProt, HPA, or UniProt, was resolved as True if the respective evidence entry was annotated as ‘Evidence at protein level’. Status as a membrane protein was determined by whether the ‘Protein class’ column contained the string ‘membrane protein’. Antibodies were considered available for each protein if the protein’s entry in the ‘Antibody’ column was not null.
Availability of plasmids
The AddGene plasmid catalog was downloaded on August 12, 2022 [https://www.addgene.org/browse/gene/gene-list-data/?_=1666368044314].
Availability of compounds
The catalog of gene targets was downloaded from ChEMBL on September 20, 2022 [https://www.ebi.ac.uk/chembl/g/#browse/targets]. UniProt IDs were converted to Entrez IDs to identify which human genes were affected by any compound.
Mendelian inheritance
Autosomal dominant [https://hpo.jax.org/app/browse/term/HP:0000006] and autosomal recessive [https://hpo.jax.org/app/browse/term/HP:0000007] inherited disease-gene associations were downloaded from the Human Phenotype Ontology on September 20, 2022. Genes were considered to have evidence of Mendelian inheritance if they appeared in these lists of associations.