Building a bin map from resequencing data (SNPbinner, Gonda et al., 2019)

SNPbinner

SNPbinner is a Python 2.7 package and command line utility for the generation of genotype binmaps based on SNP genotype data across populations of recombinant inbred lines (RILs). Analysis using SNPbinner is performed in three parts: crosspoints, bins, and visualize.

Citing

SNPbinner can be cited as:

Gonda, I., H. Ashrafi, D.A. Lyon, S.R. Strickler, A.M. Hulse-Kemp, Q. Ma, H. Sun, K. Stoffel, A.F. Powell, S. Futrell, T.W. Thannhauser, Z. Fei, A.E. Van Deynze, L.A. Mueller, J.J. Giovannoni, and M.R. Foolad. 2019. Sequencing-based bin map construction of a tomato mapping population, facilitating high-resolution quantitative trait loci detection. Plant Genome 12:180010. doi:10.3835/plantgenome2018.02.0010

Installation and Usage

SNPbinner requires Python 2.7. Python 3 is currently not supported.

The only non‑standard dependency of SNPbinner is Pillow, a PIL fork.

To install the SNPbinner utility, download or clone the repository and run

$ pip install REPO-PATH

Once installed, one can execute any of the commands below like so

$ snpbinner COMMAND [ARGS...]

Alternatively, without installing the package, one can execute any of the commands below using

$ python REPO-PATH/snpbinner COMMAND [ARGS...]

Step 1: crosspoints

crosspoints uses genotyped SNP data to identify likely crossover points. First, the script uses a pair of hidden Markov models (HMM) to predict genotype regions along the chromosome both with (3‑state) and without (2‑state) heterozygous regions. Then, the script identifies groupings of regions which are too short (based on a minimum distance between crosspoints set by the user). After that it follows the rules below to find crosspoints and merge away regions which are too short. The script then outputs the crosspoints for each RIL and the genotyped regions between them to a CSV file.
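To make the HMM step more concrete, here is a minimal Python sketch of Viterbi decoding over a 3-state model (a, h, b). It is not SNPbinner's code: the stay probability, the call accuracy, and the use of "-" for a missing call are all assumptions chosen for illustration.

```python
import math

STATES = ("a", "h", "b")  # homozygous a, heterozygous, homozygous b


def viterbi_genotypes(calls, p_stay=0.99, p_correct=0.9):
    """Return the most likely hidden genotype at each marker.

    p_stay and p_correct are made-up illustration values, not
    SNPbinner defaults. "-" is assumed to mark a missing call.
    """
    if not calls:
        return ""
    n = len(STATES)
    log_stay = math.log(p_stay)
    log_switch = math.log((1.0 - p_stay) / (n - 1))
    log_hit = math.log(p_correct)
    log_miss = math.log((1.0 - p_correct) / (n - 1))

    def emit(state, call):
        if call == "-":  # a missing call carries no information
            return 0.0
        return log_hit if call == state else log_miss

    def trans(i, j):
        return log_stay if i == j else log_switch

    # Start from a uniform prior over the three states.
    score = [math.log(1.0 / n) + emit(s, calls[0]) for s in STATES]
    back = []
    for call in calls[1:]:
        new_score, pointers = [], []
        for j, state in enumerate(STATES):
            best_i = max(range(n), key=lambda i: score[i] + trans(i, j))
            new_score.append(score[best_i] + trans(best_i, j) + emit(state, call))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace the best path back from the last marker.
    current = max(range(n), key=lambda i: score[i])
    path = [current]
    for pointers in reversed(back):
        current = pointers[current]
        path.append(current)
    path.reverse()
    return "".join(STATES[i] for i in path)


smoothed = viterbi_genotypes("aaaabaaaabbbbbbb")
print(smoothed)
```

With these made-up parameters, the single stray b call inside the run of a is absorbed, while the sustained switch to b is kept as a genotype change.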

Running the crosspoints command requires an input path, output path, and a minimum size argument. There are also three optional arguments which can be found in the table below.

$ snpbinner crosspoints --input PATH --output PATH (--min-length INT | --min-ratio FLOAT) [optional args]

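Values used for rice in my own research: r = 0.01 and m = 5 kb. Because --min-length and --min-ratio cannot be used together, supply one of them per run:

$ snpbinner crosspoints -i /.../in.file -o /.../out.file -m 5000

$ snpbinner crosspoints -i /.../in.file -o /.../out.file -r 0.01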

Required Arguments

-i, --input PATH: Path to a SNP TSV, multiple paths, or a glob (e.g. myGenome.chr*.tsv).

-o, --output PATH: Path for the output CSV when there is a single input, or for a folder when there are multiple.

-m, --min-length INT: Minimum distance between crosspoints in base pairs. Cannot be used with --min-ratio.

-r, --min-ratio FLOAT: Minimum distance between crosspoints as a ratio. (0.01 would be 1% of the chromosome.) Cannot be used with --min-length.
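The two minimum-distance options can imply very different thresholds, so it helps to compare them in base pairs. The Python sketch below builds the argument list for one run while enforcing the "exactly one of the two" rule, and converts a ratio into base pairs. The file names and the 40 Mb chromosome length are hypothetical, and the helper names are mine, not part of SNPbinner.

```python
def crosspoints_args(input_path, output_path, min_length=None, min_ratio=None):
    """Build a `snpbinner crosspoints` argument list.

    Exactly one of min_length (bp) or min_ratio must be given,
    mirroring the mutually exclusive options documented above.
    """
    if (min_length is None) == (min_ratio is None):
        raise ValueError("pass exactly one of min_length or min_ratio")
    args = ["snpbinner", "crosspoints",
            "--input", input_path, "--output", output_path]
    if min_length is not None:
        args += ["--min-length", str(int(min_length))]
    else:
        args += ["--min-ratio", str(min_ratio)]
    return args


def ratio_to_bp(chrom_len_bp, ratio):
    """Distance in bp that a given ratio represents on one chromosome."""
    return int(round(chrom_len_bp * ratio))


# Hypothetical 40 Mb chromosome: a ratio of 0.01 is far larger than 5 kb.
print(ratio_to_bp(40000000, 0.01))
print(crosspoints_args("in.tsv", "out.csv", min_length=5000))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with paths.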

Input Format

Input should be formatted as a tab‑separated value (TSV) file with the following columns.

Column 0: The SNP marker ID.

Column 1: The position of the marker in base pairs from the start of the chromosome.

Columns 2+: RIL ID (header) and the called genotype of the RIL at each position.
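Before running crosspoints, it can be worth checking that a file really has this shape. The following Python sketch reads a TSV and reports the RIL IDs and marker positions. The header labels for the first two columns, the marker names, and the a/b/h genotype codes in the sample are assumptions for illustration, since the column layout above does not spell them out.

```python
import csv
import io


def summarize_snp_tsv(text):
    """Return (RIL IDs, marker positions) from SNP TSV text.

    Assumes the first line is a header whose cells from column 2
    onward are the RIL IDs, as described in the column list above.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, records = rows[0], rows[1:]
    if len(header) < 3:
        raise ValueError("expected marker ID, position, and at least one RIL column")
    positions = []
    for record in records:
        if len(record) != len(header):
            raise ValueError("marker %r has %d cells, expected %d"
                             % (record[0], len(record), len(header)))
        positions.append(int(record[1]))  # position in bp
    return header[2:], positions


# Hypothetical two-marker, two-RIL file.
sample = "marker\tposition\tRIL1\tRIL2\nsnp1\t100\ta\tb\nsnp2\t250\ta\th\n"
print(summarize_snp_tsv(sample))
```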

Output Format

Output is formatted as a comma‑separated value (CSV) file with the following columns.

Column 0: The RIL ID.

Odd columns: Location of a crosspoint. (Empty after the chromosome ends.)

Even columns: Genotype in between the surrounding crosspoints. (Empty after the chromosome ends.)
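Because locations and genotypes alternate within each row, a row is easiest to work with once it is turned into (start, end, genotype) segments. The Python sketch below does that for one row. The sample row is invented, and it assumes that locations are integers and that the first and last locations mark the two ends of the chromosome.

```python
import csv
import io


def parse_crosspoints_row(row):
    """Split one crosspoints CSV row into (RIL ID, segments).

    Each segment is (start_bp, end_bp, genotype). Trailing empty
    cells, which pad rows after the chromosome ends, are dropped.
    """
    ril_id = row[0]
    cells = list(row[1:])
    while cells and cells[-1] == "":
        cells.pop()
    positions = [int(c) for c in cells[0::2]]  # odd columns of the row
    genotypes = cells[1::2]                    # even columns of the row
    if len(positions) != len(genotypes) + 1:
        raise ValueError("row %r does not alternate location/genotype" % ril_id)
    segments = [(positions[i], positions[i + 1], genotypes[i])
                for i in range(len(genotypes))]
    return ril_id, segments


# Hypothetical row with two padding cells at the end.
sample_csv = "RIL1,0,a,1200,h,5000,b,9000,,\n"
for row in csv.reader(io.StringIO(sample_csv)):
    print(parse_crosspoints_row(row))
```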

Step 2: bins

$ snpbinner bins -i /.../in.file -l 5000 -o /.../out.file

bins takes the crosspoints predicted for each RIL and combines similar crosspoint locations to create a combined map of all crossover points across the RILs at a specified resolution. It then projects the genotype regions of each RIL back onto the map and outputs the average genotype of each RIL in each bin on the map. The procedure is as follows. Note that, to ensure the changes are obvious, the illustrations below show a map with very low resolution (a large bin size), so a significant amount of information is lost. A smaller bin size would create a more accurate map.
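The projection step can be pictured with a small Python sketch. This is a simplification rather than SNPbinner's implementation: it assumes the bin boundaries are already known, and it approximates the "average genotype" by the genotype that covers the most base pairs in each bin, which may differ from the rule SNPbinner applies. The segments and bins are invented.

```python
def project_onto_bins(segments, bins):
    """Pick, for each bin, the genotype covering the most base pairs.

    segments: list of (start_bp, end_bp, genotype) for one RIL.
    bins: list of (start_bp, end_bp).
    Majority-by-overlap is an assumption made for this sketch.
    """
    result = []
    for bin_start, bin_end in bins:
        coverage = {}
        for seg_start, seg_end, genotype in segments:
            overlap = min(bin_end, seg_end) - max(bin_start, seg_start)
            if overlap > 0:
                coverage[genotype] = coverage.get(genotype, 0) + overlap
        # "-" marks a bin that no segment overlaps.
        result.append(max(coverage, key=coverage.get) if coverage else "-")
    return result


# Hypothetical RIL with two genotype changes, projected onto three bins.
ril_segments = [(0, 1200, "a"), (1200, 5000, "h"), (5000, 9000, "b")]
bin_map = [(0, 1000), (1000, 5200), (5200, 9000)]
print(project_onto_bins(ril_segments, bin_map))
```

The middle bin overlaps all three segments, but the heterozygous segment dominates it, which shows how a coarse bin hides the short flanking regions.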

Running the bins command requires an input path, output path, and a minimum size argument. Optionally, a binmap ID may also be provided.

$ snpbinner bins --input PATH --output PATH --min-bin-size INT [--binmap-id ID]

Required Arguments

-i, --input PATH: Path to a crosspoints CSV, multiple paths, or a glob (e.g. myGenome.chr*.crosp.csv).

-o, --output PATH: Path for the output CSV when there is a single input, or for a folder when there are multiple.

-l, --min-bin-size INT: Sets the minimum size (in bp) of each bin.

Optional Arguments

-n, --binmap-id ID: If a binmap ID is provided, a header row will be added and each column labeled with the given string.

Input Format

bins uses the output from crosspoints.

For details, see the crosspoints Output Format section.

Output Format

Output is formatted as a comma‑separated value (CSV) file and has the following rows.

Row 0: (Optional) The binmap ID.

Row 1: The start of each bin (in base pairs).

Row 2: The end of each bin (in base pairs).

Row 3: The center of each bin (in base pairs).

Rows 4+: RIL ID in the first cell, then the genotypes of each bin for that RIL.
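Since this file is organized by rows rather than columns, loading it takes a little care. The Python sketch below reads the bin boundaries and the per-RIL genotypes. It assumes the first cell of every row is a label, which the row list above only states for the RIL rows, and the sample content and label text are invented.

```python
import csv
import io


def parse_bins_csv(text, has_binmap_id=False):
    """Return (bins, genotypes) from bins CSV text.

    bins: list of (start_bp, end_bp, center_bp).
    genotypes: dict mapping RIL ID to its per-bin genotype list.
    Assumes each row starts with a label cell (an assumption here).
    """
    rows = [r for r in csv.reader(io.StringIO(text)) if r]
    if has_binmap_id:
        rows = rows[1:]  # skip the optional binmap ID row
    starts = [int(c) for c in rows[0][1:]]
    ends = [int(c) for c in rows[1][1:]]
    centers = [float(c) for c in rows[2][1:]]  # a center may fall on a half bp
    bins = list(zip(starts, ends, centers))
    genotypes = {row[0]: row[1:] for row in rows[3:]}
    return bins, genotypes


# Hypothetical three-bin map for two RILs, without a binmap ID row.
sample_bins = (
    "bin start,0,1000,5200\n"
    "bin end,1000,5200,9000\n"
    "bin center,500,3100,7100\n"
    "RIL1,a,h,b\n"
    "RIL2,b,b,a\n"
)
bins, genotypes = parse_bins_csv(sample_bins)
print(bins)
print(genotypes["RIL1"])
```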

Step 3: visualize

Description

visualize plots the inputs and outputs of bins and crosspoints. It can be used to visually check the results of the above commands to help determine the best values for each of the parameters. It accepts three file types (SNP input TSV, crosspoint CSV, and bin CSV), in any combination and any number of files per type. It parses the files and groups the data by RIL, creating an image for each. In each row of the resulting images, regions are colored red, green, or blue for genotype a, heterozygous, or genotype b, respectively. The binmap is represented in gray, with adjacent bins alternating dark and light.

$ snpbinner visualize --out PATH [--bins PATH]... [--crosspoints PATH]... [--snps PATH]...

Required Arguments

-o, --out PATH: Folder to which the resulting images should be saved.

Optional Arguments

-b, --bins PATH: bins output file to be added to the visualization.

-c, --crosspoints PATH: crosspoints output file to be added to the visualization.

-s, --snps PATH: SNP file (the crosspoints input) to be added to the visualization.
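When several files of each type are involved, the repeated flags are tedious to type by hand. The Python sketch below assembles the argument list, repeating each flag once per path as the usage line indicates. The helper name and all file names are hypothetical.

```python
def visualize_args(out_dir, bins=(), crosspoints=(), snps=()):
    """Build a `snpbinner visualize` argument list.

    Each flag is repeated once per path, following the
    `[--bins PATH]...` pattern in the usage line above.
    """
    args = ["snpbinner", "visualize", "--out", out_dir]
    groups = (("--bins", bins), ("--crosspoints", crosspoints), ("--snps", snps))
    for flag, paths in groups:
        for path in paths:
            args += [flag, path]
    return args


# Hypothetical files for one chromosome.
cmd = visualize_args(
    "images",
    bins=["chr1.bins.csv"],
    crosspoints=["chr1.crosp.csv"],
    snps=["chr1.tsv"],
)
print(" ".join(cmd))
```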
