Command line interface

New in pyfastx 0.5.0

$ pyfastx -h

usage: pyfastx COMMAND [OPTIONS]

A command line tool for FASTA/Q file manipulation

optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program version number and exit

Commands:

    index        build index for fasta/q file
    stat         show detailed statistics information of fasta/q file
    split        split fasta/q file into multiple files
    fq2fa        convert fastq file to fasta file
    subseq       get subsequences from fasta file by region
    sample       randomly sample sequences from fasta or fastq file
    extract      extract full sequences or reads from fasta/q file

Build index

New in pyfastx 0.6.10

$ pyfastx index -h

usage: pyfastx index [-h] [-f] fastx [fastx ...]

positional arguments:
  fastx       fasta or fastq file, gzip support

optional arguments:
  -h, --help  show this help message and exit
  -f, --full  build full index, base composition will be calculated

The –full option was used to count bases in FASTA/Q file and speedup calculation of GC content.

Show statistics information

$ pyfastx stat -h

usage: pyfastx stat [-h] fastx [fastx ...]

positional arguments:
  fastx       input fasta or fastq file, gzip support

optional arguments:
  -h, --help  show this help message and exit

For example:

$ pyfastx info tests/data/*.fa*

fileName        seqType seqCounts       totalBases         GC%   avgLen medianLen       maxLen  minLen  N50     L50
protein.fa      protein        17             2265           -   133.24      80.0          419      23  263       4
rna.fa              RNA         2              720      65.283    360.0     360.0          360     360  360       1
test.fa             DNA       211            86262      43.529   408.82     386.0          821     118  516      66
test.fa.gz          DNA       211            86262      43.529   408.82     386.0          821     118  516      66

seqType: sequence type (DNA, RNA, or protein); seqCounts: total sequence counts; totalBases: total number of bases; GC%: GC content; avgLen: average sequence length; medianLen: median sequence length; maxLen: maximum sequence length; minLen: minimum sequence length; N50: N50 length; L50: L50 sequence counts.

$ pyfastx info tests/data/*.fq*

fileName    readCounts  totalBases     GC%  avgLen  maxLen  minLen  maxQual minQual                     qualEncodingSystem
test.fq            800      120000  66.175   150.0     150     150       70      35 Sanger Phred+33,Illumina 1.8+ Phred+33
test.fq.gz         800      120000  66.175   150.0     150     150       70      35 Sanger Phred+33,Illumina 1.8+ Phred+33

readCounts: total read counts; totalBases: total number of bases; GC%: GC content; avgLen: average sequence length; maxLen: maximum sequence length; minLen: minimum sequence length; maxQual: maximum quality score; minQual: minimum quality score; qualEncodingSystem: quality encoding system.

Split FASTA/Q file

$ pyfastx split -h

usage: pyfastx split [-h] (-n int | -c int) [-o str] fastx

positional arguments:
  fastx                 fasta or fastq file, gzip support

optional arguments:
  -h, --help            show this help message and exit
  -n int                split a fasta/q file into N new files with even size
  -c int                split a fasta/q file into multiple files containing the same sequence counts
  -o str, --out-dir str
                        output directory, default is current folder

Convert FASTQ to FASTA file

$ pyfastx fq2fa -h

usage: pyfastx fq2fa [-h] [-o str] fastx

positional arguments:
  fastx                 input fastq file, gzip support

optional arguments:
  -h, --help            show this help message and exit
  -o str, --out-file str
                        output file, default: output to stdout

Get subsequence with region

$ pyfastx subseq -h

usage: pyfastx subseq [-h] [-r str | -b str] [-o str]
                      fastx [region [region ...]]

positional arguments:
  fastx                 input fasta file, gzip support
  region                format is chr:start-end, start and end position is
                        1-based, multiple regions were separated by space

optional arguments:
  -h, --help            show this help message and exit
  -r str, --region-file str
                        tab-delimited file, one region per line, both start
                        and end position are 1-based
  -b str, --bed-file str
                        tab-delimited BED file, 0-based start position and
                        1-based end position
  -o str, --out-file str
                        output file, default: output to stdout

Sample sequences

$ pyfastx sample -h

usage: pyfastx sample [-h] (-n int | -p float) [-s int] [--sequential-read]
                      [-o str]
                      fastx

positional arguments:
  fastx                 fasta or fastq file, gzip support

optional arguments:
  -h, --help            show this help message and exit
  -n int                number of sequences to be sampled
  -p float              proportion of sequences to be sampled, 0~1
  -s int, --seed int    random seed, default is the current system time
  --sequential-read     start sequential reading, particularly suitable for
                        sampling large numbers of sequences
  -o str, --out-file str
                        output file, default: output to stdout

Extract sequences

New in pyfastx 0.6.10

$ pyfastx extract -h

usage: pyfastx extract [-h] [-l str] [--reverse-complement] [--out-fasta]
                       [-o str] [--sequential-read]
                       fastx [name [name ...]]

positional arguments:
  fastx                 fasta or fastq file, gzip support
  name                  sequence name or read name, multiple names were
                        separated by space

optional arguments:
  -h, --help            show this help message and exit
  -l str, --list-file str
                        a file containing sequence or read names, one name per
                        line
  --reverse-complement  output reverse complement sequence
  --out-fasta           output fasta format when extract reads from fastq,
                        default output fastq format
  -o str, --out-file str
                        output file, default: output to stdout
  --sequential-read     start sequential reading, particularly suitable for
                        extracting large numbers of sequences