While browsing the GRCh38 patch 4 build of the human genome on NCBI the other day, I became confused about the differences between the GenBank FTP site and the RefSeq FTP site. The filenames were very similar, but the sizes of some of the files were very different. To resolve my confusion, I sent an email to NCBI. Below is their (partially edited) response; my questions are in bold. I hope this helps other people with similar questions.
What is the difference between GCA_$ and GCF_$?
- The GCA indicates the GenBank copy of the assembly, and the GCF indicates the RefSeq copy of the assembly.
- The GenBank copy is the assembly as the submitter provided it to GenBank. The RefSeq assembly is a copy of the GenBank assembly that serves as the basis for RefSeq annotation. (This follows historical precedent: RefSeq does not annotate GenBank sequences, only RefSeq sequences; see http://www.ncbi.nlm.nih.gov/books/NBK50679/.)
- The GenBank copy of the human reference assembly is devoid of annotation. The RefSeq copy contains the NCBI-provided annotation.
- There are no sequence differences between GenBank and RefSeq versions of the equivalent human assembly release (e.g. GRCh38.p4).
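As a small illustration of the naming convention described above, an assembly accession can be split into its prefix, numeric part, and version. This is only a sketch: the function name `parse_assembly_accession` is hypothetical, and the pattern (GCA or GCF, an underscore, nine digits, a dot, and a version number) is an assumption inferred from the accessions quoted in this post, not an NCBI-provided API.

```python
import re

# Assumed pattern, inferred from accessions such as GCA_000001405.19:
# prefix (GCA or GCF), underscore, nine digits, dot, version number.
_ACCESSION_RE = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")

# GCA marks the GenBank copy, GCF marks the RefSeq copy.
_SOURCE_BY_PREFIX = {"GCA": "GenBank", "GCF": "RefSeq"}


def parse_assembly_accession(accession):
    """Return (source, numeric_id, version) for an assembly accession.

    Hypothetical helper for illustration; raises ValueError when the
    string does not match the assumed pattern.
    """
    match = _ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a recognised assembly accession: {accession!r}")
    prefix, numeric_id, version = match.groups()
    return _SOURCE_BY_PREFIX[prefix], numeric_id, int(version)


if __name__ == "__main__":
    for acc in ("GCA_000001405.19", "GCF_000001405.30"):
        print(acc, parse_assembly_accession(acc))
```

Note that the version numbers of the two copies are tracked separately (.19 and .30 for GRCh38.p4), so they cannot be compared directly to decide whether two accessions describe the same release.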
What is the difference between GCA_000001405.19 and GCF_000001405.30?
- These are both GRCh38.p4: the former is the GenBank copy and the latter is the RefSeq copy.
- The sequence report provides the mappings between the sequence identifiers in both assemblies (e.g. chromosome 1 is called CM000663.2 in GenBank and NC_000001.11 in RefSeq).
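The identifier mapping mentioned above could be read out of a sequence report with a few lines of Python. This is a minimal sketch under stated assumptions: the report is taken to be tab-delimited text whose comment lines start with `#`, with the GenBank accession in the fifth column and the RefSeq accession in the seventh, and with `na` marking a missing value. The column positions, the function name `genbank_to_refseq`, and the sample rows (the second of which is invented) are illustrative, so check them against the header of the actual file before relying on this.

```python
# Assumed zero-based column positions in the tab-delimited report.
GENBANK_COL = 4
REFSEQ_COL = 6

# Illustrative sample; the second data row is hypothetical and only
# shows how a sequence without a RefSeq counterpart is skipped.
SAMPLE_REPORT = (
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\t"
    "NC_000001.11\tPrimary Assembly\t248956422\tchr1\n"
    "EXAMPLE_SCAFFOLD\tunplaced-scaffold\tna\tna\tXX000000.1\t<>\t"
    "na\tPrimary Assembly\t1000\tna\n"
)


def genbank_to_refseq(report_text):
    """Build a {GenBank accession: RefSeq accession} dictionary."""
    mapping = {}
    for line in report_text.splitlines():
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) <= REFSEQ_COL:
            continue  # malformed or truncated row
        genbank, refseq = fields[GENBANK_COL], fields[REFSEQ_COL]
        # Keep only rows where both identifiers are present.
        if genbank != "na" and refseq != "na":
            mapping[genbank] = refseq
    return mapping


if __name__ == "__main__":
    print(genbank_to_refseq(SAMPLE_REPORT))
```

For a downloaded report, the same function can be applied to the file contents, for example `genbank_to_refseq(open(path).read())`, where `path` is wherever you saved the file.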
Why are there differences in the sizes of the GFF files?
- The GFF file associated with the GenBank version of the assembly is small because it contains only the annotations on the GenBank sequences. The GRC does not provide gene annotation for the reference assembly.
- The RefSeq GFF file is much larger because it contains the annotation for the reference assembly that is provided by RefSeq.
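One way to see where the size difference comes from is to tally the feature types in each GFF file. The sketch below assumes standard GFF3 layout (nine tab-separated columns with the feature type in the third, and comment or directive lines starting with `#`). The function name `count_feature_types` and the sample records are hypothetical and are not taken from either NCBI file.

```python
from collections import Counter

# Hypothetical GFF3 records, used only to demonstrate the function.
SAMPLE_GFF = [
    "##gff-version 3",
    "NC_000001.11\tExample\tregion\t1\t1000\t.\t+\t.\tID=id0",
    "NC_000001.11\tExample\tgene\t100\t900\t.\t+\t.\tID=gene0",
    "NC_000001.11\tExample\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0",
    "NC_000001.11\tExample\texon\t100\t300\t.\t+\t.\tID=exon0;Parent=rna0",
    "NC_000001.11\tExample\texon\t600\t900\t.\t+\t.\tID=exon1;Parent=rna0",
]


def count_feature_types(gff_lines):
    """Count GFF3 records by feature type (third column)."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments, and directives such as ##gff-version.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 record
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    for feature_type, n in count_feature_types(SAMPLE_GFF).most_common():
        print(feature_type, n)
```

Because the function accepts any iterable of lines, it can be run directly on an open file handle, or on `gzip.open(path, "rt")` for a compressed download. Comparing the tallies for the GenBank and RefSeq files should show which feature types account for the extra size.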
Can you explain, in general, the difference between the GenBank and RefSeq FTP sites?
- The GRC points to the GenBank version of the assembly because it is the assembly that the GRC submitted to GenBank. The RefSeq annotation is an NCBI product. Other resources (e.g. Ensembl, GENCODE, UCSC) may also provide assembly annotations that can be accessed via their web or FTP sites. The GRC does not promote a particular annotation (the GRC comprises members from multiple institutions); thus, we point to our GenBank submission.
- The use of separate accessions for the GenBank and RefSeq copies of the assembly does cause user confusion. If you are only interested in the bare sequences, the GenBank FTP copy of the assembly will suffice. If you are looking for NCBI-provided annotation, you will want the RefSeq FTP copy. Both can be accessed via the NCBI Assembly resource, where you will also find additional information.