Overview of SeedMatchR Annotation Databases
Tareian Cazares
SeedMatchR_Annotation_Databases.Rmd
Introduction
This article describes the annotations available for the three model
species supported by SeedMatchR (human, mouse, and rat). SeedMatchR
requires a GTF that maps genes to transcripts, as well as the sequences
that correspond to the features being searched. Users can generate these
independently or load them with the SeedMatchR::load_annotations
function, which loads the predefined annotations based on the species of
interest, the filtering criteria, and the features of interest.
NOTES:
- To use rnor7, you must have R 4.3.0 or greater installed, because AnnotationHub ≥ 3.8.0 is required.
- rnor6 and mm10 cannot be used with ensembldb::TxIsCanonicalFilter, so the canonical argument of SeedMatchR::load_annotations must be set to FALSE.
- rnor6 and rnor7 cannot be used with ensembldb::TxSupportLevelFilter, so no transcript support level should be passed to SeedMatchR::load_annotations when working with the rat transcriptome.
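The constraints in these notes can be checked before calling load_annotations. The following is a minimal sketch in base R, not part of SeedMatchR: the helper name check_filter_args and its use_support_level argument are hypothetical, and the two lookup vectors encode only what the notes above state.

```r
# Hypothetical helper (not a SeedMatchR function): encodes the notes above.
no_canonical_filter <- c("rnor6", "mm10")   # TxIsCanonicalFilter unavailable
no_support_filter   <- c("rnor6", "rnor7")  # TxSupportLevelFilter unavailable

check_filter_args <- function(genome, canonical = FALSE,
                              use_support_level = FALSE) {
  problems <- character(0)
  if (canonical && genome %in% no_canonical_filter) {
    problems <- c(problems,
                  sprintf("canonical must be FALSE for %s", genome))
  }
  if (use_support_level && genome %in% no_support_filter) {
    problems <- c(problems,
                  sprintf("do not request a transcript support level for %s",
                          genome))
  }
  problems  # empty character vector when the combination is allowed
}

# Both filters are unavailable for rnor6, so two messages are returned.
check_filter_args("rnor6", canonical = TRUE, use_support_level = TRUE)

# Nothing in the notes restricts hg38, so no messages are returned.
check_filter_args("hg38", canonical = TRUE)
```

Returning messages rather than stopping lets a calling script decide whether to adjust the arguments or abort.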
Current Annotation Options
Species | SeedMatchR arg | Reference Build | ENSEMBL Build | Release Date | 2bit ID | ensembldb |
---|---|---|---|---|---|---|
Human | hg38 | GRCh38.p13 | 109 | February 2023 | AH106283 | AH109606 |
Rat | rnor6 | Rnor_v6.0 | 104 | May 2021 | AH93578 | AH95846 |
Rat | rnor7 | Rnor_v7.2 | 109 | February 2023 | AH106786 | AH109732 |
Rat | rnor7.113 | Rnor_v7.2 | 113 | October 2024 | AH106786 | AH119437 |
Mouse | mm39 | GRCm39 | 109 | February 2023 | AH106440 | AH109655 |
Mouse | mm10 | GRCm38 | 102 | October 2020 | AH88475 | AH89211 |
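For scripted use, the table can be held as a small data frame and queried by the SeedMatchR argument. This is only an illustrative sketch: the object annotation_ids and the helper lookup_ids are hypothetical names, and the values are copied from the table above.

```r
# Values copied from the table above; object and function names are hypothetical.
annotation_ids <- data.frame(
  arg           = c("hg38", "rnor6", "rnor7", "rnor7.113", "mm39", "mm10"),
  ensembl_build = c(109, 104, 109, 113, 109, 102),
  twobit_id     = c("AH106283", "AH93578", "AH106786",
                    "AH106786", "AH106440", "AH88475"),
  ensdb_id      = c("AH109606", "AH95846", "AH109732",
                    "AH119437", "AH109655", "AH89211"),
  stringsAsFactors = FALSE
)

# Return the single row for a genome argument, or stop if it is not listed.
lookup_ids <- function(genome) {
  hit <- annotation_ids[annotation_ids$arg == genome, ]
  if (nrow(hit) != 1) stop("Unknown genome argument: ", genome)
  hit
}

lookup_ids("rnor7")$ensdb_id
lookup_ids("mm10")$twobit_id
```

The returned identifiers are AnnotationHub record names, so each can be used to retrieve its record from an AnnotationHub object with the double-bracket syntax shown in the query output later in this article.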
Functions for working with annotations and building transcriptomes for queries
Load data for Rnor7
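# Load exon annotations for rnor7 without the protein-coding or canonical-transcript filters
annodb <- load_annotations("rnor7", feature.type = "exons", protein.coding = FALSE, canonical = FALSE, return_gene_name = FALSE)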
#> Build AnnotationFilter for transcript features based on the following parameters:
#> Keep only standard chroms: TRUE
#> Remove rows with NA in transcript ID: TRUE
#> Keep only protein coding genes and transcripts: FALSE
#> Filtering for transcripts with support level: FALSE
#> Keep only the ENSEMBL canonical transcript: FALSE
#> Filtering for specific genes: FALSE
#> Filtering for specific transcripts: FALSE
#> Filtering for specific gene symbols: FALSE
#> Filtering for specific entrez id: FALSE
#> Loading annotations from AnnotationHub for rnor7
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("rtracklayer")
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> require("ensembldb")
#> Extracting exons from ensembldb object.
#> Extracting sequences for each feature.
#> Keeping sequences that are >= 8
How annotations were retrieved
Selecting GTF
The GTF files were selected to have the full DNA sequence visible
without any type of masking of repeats. These files have names like
Mus_musculus.GRCm39.109.gtf.
Selecting .2bit DNA sequences
The .2bit files were selected to have the full DNA sequence visible
without any type of masking of repeats. These files have names like
Homo_sapiens.GRCh38.dna.primary_assembly.2bit. For rat, no
dna.primary_assembly file was available, so we instead chose the
dna_sm.toplevel file, which is the soft-masked version of the genome
that accounts for repeats.
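The selection rule described above can be expressed as a small filter over record titles. This is a sketch of that reasoning, not the code used to build the package: pick_twobit is a hypothetical helper, and the example titles are copied from the AnnotationHub listings in this article.

```r
# Hypothetical helper: prefer the unmasked primary assembly, otherwise
# fall back to a soft-masked (dna_sm) genome file.
pick_twobit <- function(titles) {
  unmasked <- grep("\\.dna\\.primary_assembly\\.2bit$", titles, value = TRUE)
  if (length(unmasked) > 0) return(unmasked[1])
  soft <- grep("\\.dna_sm\\.(primary_assembly|toplevel)\\.2bit$",
               titles, value = TRUE)
  if (length(soft) > 0) return(soft[1])
  NA_character_  # no suitable genome file among the titles
}

human <- c("Homo_sapiens.GRCh38.cdna.all.2bit",
           "Homo_sapiens.GRCh38.dna.primary_assembly.2bit",
           "Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit",
           "Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit",
           "Homo_sapiens.GRCh38.ncrna.2bit")

rat <- c("Rattus_norvegicus.mRatBN7.2.cdna.all.2bit",
         "Rattus_norvegicus.mRatBN7.2.dna_rm.toplevel.2bit",
         "Rattus_norvegicus.mRatBN7.2.dna_sm.toplevel.2bit",
         "Rattus_norvegicus.mRatBN7.2.ncrna.2bit")

pick_twobit(human)  # the unmasked primary assembly
pick_twobit(rat)    # the soft-masked fallback
```

The hard-masked dna_rm files are never selected, because masking replaces repeat sequence with N and would hide potential matches.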
Code for querying AnnotationHub for references
Hg38
.2bit
ah <- AnnotationHub::AnnotationHub()  # hub object queried throughout this section
AnnotationHub::query(ah, c("GRCh38", "2bit"))
#> AnnotationHub with 120 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens, homo sapiens
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH49722"]]'
#>
#> title
#> AH49722 | Homo_sapiens.GRCh38.cdna.all.2bit
#> AH49723 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit
#> AH49724 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#> AH49725 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#> AH49726 | Homo_sapiens.GRCh38.ncrna.2bit
#> ... ...
#> AH106282 | Homo_sapiens.GRCh38.cdna.all.2bit
#> AH106283 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit
#> AH106284 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
#> AH106285 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
#> AH106286 | Homo_sapiens.GRCh38.ncrna.2bit
ensembldb
Version 110 for GRCh38.p13
AnnotationHub::query(ah, c("GRCh38", "110"))
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-04-08
#> # names(): AH113665
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: EnsDb
#> # $rdatadateadded: 2023-04-25
#> # $title: Ensembl 110 EnsDb for Homo sapiens
#> # $description: Gene and protein annotations for Homo sapiens based on Ensem...
#> # $taxonomyid: 9606
#> # $genome: GRCh38
#> # $sourcetype: ensembl
#> # $sourceurl: http://www.ensembl.org
#> # $sourcesize: NA
#> # $tags: c("110", "Annotation", "AnnotationHubSoftware", "Coverage",
#> # "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#> # "Transcript")
#> # retrieve record with 'object[["AH113665"]]'
Version 80 for GRCh38.p2
AnnotationHub::query(ah, c("GRCh38", "80"))
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-04-08
#> # names(): AH47066
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # $rdatadateadded: 2015-05-22
#> # $title: Homo_sapiens.GRCh38.80.gtf
#> # $description: Gene Annotation for Homo sapiens
#> # $taxonomyid: 9606
#> # $genome: GRCh38
#> # $sourcetype: GTF
#> # $sourceurl: ftp://ftp.ensembl.org/pub/release-80/gtf/homo_sapiens/Homo_sap...
#> # $sourcesize: 44733199
#> # $tags: c("GTF", "ensembl", "Gene", "Transcript", "Annotation")
#> # retrieve record with 'object[["AH47066"]]'
Rat
.2bit
Rnor6
query(ah, c("Rnor", "release-104"))
#> AnnotationHub with 7 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: rattus norvegicus
#> # $rdataclass: TwoBitFile, GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH92418"]]'
#>
#> title
#> AH92418 | Rattus_norvegicus.Rnor_6.0.104.abinitio.gtf
#> AH92419 | Rattus_norvegicus.Rnor_6.0.104.chr.gtf
#> AH92420 | Rattus_norvegicus.Rnor_6.0.104.gtf
#> AH93576 | Rattus_norvegicus.Rnor_6.0.cdna.all.2bit
#> AH93577 | Rattus_norvegicus.Rnor_6.0.dna_rm.toplevel.2bit
#> AH93578 | Rattus_norvegicus.Rnor_6.0.dna_sm.toplevel.2bit
#> AH93579 | Rattus_norvegicus.Rnor_6.0.ncrna.2bit
Rnor7
query(ah, c("Rattus", ".2bit"))
#> AnnotationHub with 106 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl, UCSC
#> # $species: Rattus norvegicus, rattus norvegicus
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH14021"]]'
#>
#> title
#> AH14021 | rn6.2bit
#> AH14022 | rn5.2bit
#> AH14023 | rn4.2bit
#> AH49867 | Rattus_norvegicus.Rnor_6.0.cdna.all.2bit
#> AH49868 | Rattus_norvegicus.Rnor_6.0.dna_rm.toplevel.2bit
#> ... ...
#> AH104348 | Rattus_norvegicus.mRatBN7.2.ncrna.2bit
#> AH106784 | Rattus_norvegicus.mRatBN7.2.cdna.all.2bit
#> AH106785 | Rattus_norvegicus.mRatBN7.2.dna_rm.toplevel.2bit
#> AH106786 | Rattus_norvegicus.mRatBN7.2.dna_sm.toplevel.2bit
#> AH106787 | Rattus_norvegicus.mRatBN7.2.ncrna.2bit
ensembldb
Rnor6
query(ah, c("rat", 104, "ensdb"))
#> AnnotationHub with 6 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Sparus aurata, Rattus norvegicus, Mesocricetus auratus, Kryptole...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH95677"]]'
#>
#> title
#> AH95677 | Ensembl 104 EnsDb for Carassius auratus
#> AH95725 | Ensembl 104 EnsDb for Echeneis naucrates
#> AH95749 | Ensembl 104 EnsDb for Kryptolebias marmoratus
#> AH95762 | Ensembl 104 EnsDb for Mesocricetus auratus
#> AH95846 | Ensembl 104 EnsDb for Rattus norvegicus
#> AH95850 | Ensembl 104 EnsDb for Sparus aurata
Rnor7
query(ah, c("rat", 109, "ensdb"))
#> AnnotationHub with 6 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Sparus aurata, Rattus norvegicus, Mesocricetus auratus, Kryptole...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH109518"]]'
#>
#> title
#> AH109518 | Ensembl 109 EnsDb for Carassius auratus
#> AH109583 | Ensembl 109 EnsDb for Echeneis naucrates
#> AH109611 | Ensembl 109 EnsDb for Kryptolebias marmoratus
#> AH109625 | Ensembl 109 EnsDb for Mesocricetus auratus
#> AH109732 | Ensembl 109 EnsDb for Rattus norvegicus
#> AH109736 | Ensembl 109 EnsDb for Sparus aurata
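The rat EnsDb queries above return six records rather than one, apparently because the term "rat" is matched as a pattern and also hits species names such as "auratus" and "aurata". The base R sketch below reproduces that behaviour on the titles copied from the Ensembl 109 listing and then narrows the result to the exact species. It works on a plain character vector, not on the AnnotationHub object itself.

```r
# Titles copied from the Ensembl 109 listing above, named by record ID.
titles <- c(
  AH109518 = "Ensembl 109 EnsDb for Carassius auratus",
  AH109583 = "Ensembl 109 EnsDb for Echeneis naucrates",
  AH109611 = "Ensembl 109 EnsDb for Kryptolebias marmoratus",
  AH109625 = "Ensembl 109 EnsDb for Mesocricetus auratus",
  AH109732 = "Ensembl 109 EnsDb for Rattus norvegicus",
  AH109736 = "Ensembl 109 EnsDb for Sparus aurata"
)

# A loose, case-insensitive pattern match on "rat" hits every title.
loose <- grepl("rat", titles, ignore.case = TRUE)
all(loose)

# Strip the fixed prefix and compare the species name exactly instead.
species <- sub("^Ensembl [0-9]+ EnsDb for ", "", titles)
exact <- titles[species == "Rattus norvegicus"]
names(exact)
```

Searching with a more specific term such as "Rattus" in the query itself would likely have the same effect, but exact comparison avoids relying on how the pattern happens to match.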
Mouse
.2bit
mm39 or GRCm39
query(ah, c("GRCm39", ".2bit"))
#> AnnotationHub with 25 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: mus musculus
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH90962"]]'
#>
#> title
#> AH90962 | Mus_musculus.GRCm39.cdna.all.2bit
#> AH90963 | Mus_musculus.GRCm39.dna.primary_assembly.2bit
#> AH90964 | Mus_musculus.GRCm39.dna_rm.primary_assembly.2bit
#> AH90965 | Mus_musculus.GRCm39.dna_sm.primary_assembly.2bit
#> AH90966 | Mus_musculus.GRCm39.ncrna.2bit
#> ... ...
#> AH106439 | Mus_musculus.GRCm39.cdna.all.2bit
#> AH106440 | Mus_musculus.GRCm39.dna.primary_assembly.2bit
#> AH106441 | Mus_musculus.GRCm39.dna_rm.primary_assembly.2bit
#> AH106442 | Mus_musculus.GRCm39.dna_sm.primary_assembly.2bit
#> AH106443 | Mus_musculus.GRCm39.ncrna.2bit
mm10 or GRCm38
query(ah, c("GRCm38", ".2bit"))
#> AnnotationHub with 95 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Mus musculus, mus musculus
#> # $rdataclass: TwoBitFile
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH49772"]]'
#>
#> title
#> AH49772 | Mus_musculus.GRCm38.cdna.all.2bit
#> AH49773 | Mus_musculus.GRCm38.dna.primary_assembly.2bit
#> AH49774 | Mus_musculus.GRCm38.dna_rm.primary_assembly.2bit
#> AH49775 | Mus_musculus.GRCm38.dna_sm.primary_assembly.2bit
#> AH49776 | Mus_musculus.GRCm38.ncrna.2bit
#> ... ...
#> AH88474 | Mus_musculus.GRCm38.cdna.all.2bit
#> AH88475 | Mus_musculus.GRCm38.dna.primary_assembly.2bit
#> AH88476 | Mus_musculus.GRCm38.dna_rm.primary_assembly.2bit
#> AH88477 | Mus_musculus.GRCm38.dna_sm.primary_assembly.2bit
#> AH88478 | Mus_musculus.GRCm38.ncrna.2bit
ensembldb
mm39 or GRCm39
query(ah, c("mus", 104, "ensdb"))
#> AnnotationHub with 11 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Ursus maritimus, Scophthalmus maximus, Prolemur simus, Periophth...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH95670"]]'
#>
#> title
#> AH95670 | Ensembl 104 EnsDb for Balaenoptera musculus
#> AH95763 | Ensembl 104 EnsDb for Mus caroli
#> AH95775 | Ensembl 104 EnsDb for Mus musculus
#> AH95778 | Ensembl 104 EnsDb for Mus pahari
#> AH95779 | Ensembl 104 EnsDb for Mus spicilegus
#> ... ...
#> AH95790 | Ensembl 104 EnsDb for Neogobius melanostomus
#> AH95825 | Ensembl 104 EnsDb for Periophthalmus magnuspinnatus
#> AH95836 | Ensembl 104 EnsDb for Prolemur simus
#> AH95861 | Ensembl 104 EnsDb for Scophthalmus maximus
#> AH95879 | Ensembl 104 EnsDb for Ursus maritimus
mm10 or GRCm38
query(ah, c("mus", 109, "ensdb"))
#> AnnotationHub with 27 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Mus musculus, Ursus maritimus, Scophthalmus maximus, Prolemur si...
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH109509"]]'
#>
#> title
#> AH109509 | Ensembl 109 EnsDb for Balaenoptera musculus
#> AH109626 | Ensembl 109 EnsDb for Mus caroli
#> AH109640 | Ensembl 109 EnsDb for Mus musculus
#> AH109641 | Ensembl 109 EnsDb for Mus musculus
#> AH109642 | Ensembl 109 EnsDb for Mus musculus
#> ... ...
#> AH109671 | Ensembl 109 EnsDb for Neogobius melanostomus
#> AH109709 | Ensembl 109 EnsDb for Periophthalmus magnuspinnatus
#> AH109721 | Ensembl 109 EnsDb for Prolemur simus
#> AH109750 | Ensembl 109 EnsDb for Scophthalmus maximus
#> AH109783 | Ensembl 109 EnsDb for Ursus maritimus
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ensembldb_2.32.0 AnnotationFilter_1.32.0 GenomicFeatures_1.60.0
#> [4] AnnotationDbi_1.70.0 Biobase_2.68.0 rtracklayer_1.68.0
#> [7] GenomicRanges_1.60.0 GenomeInfoDb_1.44.0 IRanges_2.42.0
#> [10] S4Vectors_0.46.0 SeedMatchR_2.0.0 AnnotationHub_3.16.0
#> [13] BiocFileCache_2.16.0 dbplyr_2.5.0 BiocGenerics_0.54.0
#> [16] generics_0.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4
#> [3] blob_1.2.4 filelock_1.0.3
#> [5] Biostrings_2.76.0 bitops_1.0-9
#> [7] fastmap_1.2.0 lazyeval_0.2.2
#> [9] RCurl_1.98-1.17 GenomicAlignments_1.44.0
#> [11] XML_3.99-0.18 digest_0.6.37
#> [13] mime_0.13 lifecycle_1.0.4
#> [15] ProtGenerics_1.40.0 KEGGREST_1.48.1
#> [17] RSQLite_2.4.1 magrittr_2.0.3
#> [19] compiler_4.5.1 rlang_1.1.6
#> [21] sass_0.4.10 tools_4.5.1
#> [23] yaml_2.3.10 knitr_1.50
#> [25] S4Arrays_1.8.1 htmlwidgets_1.6.4
#> [27] bit_4.6.0 curl_6.4.0
#> [29] DelayedArray_0.34.1 abind_1.4-8
#> [31] BiocParallel_1.42.1 withr_3.0.2
#> [33] purrr_1.1.0 desc_1.4.3
#> [35] grid_4.5.1 SummarizedExperiment_1.38.1
#> [37] cli_3.6.5 rmarkdown_2.29
#> [39] crayon_1.5.3 ragg_1.4.0
#> [41] httr_1.4.7 rjson_0.2.23
#> [43] DBI_1.2.3 cachem_1.1.0
#> [45] parallel_4.5.1 BiocManager_1.30.26
#> [47] XVector_0.48.0 restfulr_0.0.16
#> [49] matrixStats_1.5.0 vctrs_0.6.5
#> [51] Matrix_1.7-3 jsonlite_2.0.0
#> [53] bit64_4.6.0-1 systemfonts_1.2.3
#> [55] jquerylib_0.1.4 glue_1.8.0
#> [57] pkgdown_2.1.3 codetools_0.2-20
#> [59] BiocVersion_3.21.1 BiocIO_1.18.0
#> [61] UCSC.utils_1.4.0 tibble_3.3.0
#> [63] pillar_1.11.0 rappdirs_0.3.3
#> [65] htmltools_0.5.8.1 GenomeInfoDbData_1.2.14
#> [67] R6_2.6.1 textshaping_1.0.1
#> [69] lattice_0.22-7 evaluate_1.0.4
#> [71] png_0.1-8 Rsamtools_2.24.0
#> [73] memoise_2.0.1 bslib_0.9.0
#> [75] SparseArray_1.8.0 xfun_0.52
#> [77] fs_1.6.6 MatrixGenerics_1.20.0
#> [79] pkgconfig_2.0.3