API Reference

pyfastx.version

pyfastx.version(debug=False)

Get the current version of pyfastx.

Parameters: debug (bool) – if True, return the versions of pyfastx, zlib, sqlite3 and zran.
Returns: version of pyfastx
Return type: str

pyfastx.gzip_check(file_name)

New in pyfastx 0.5.4

Check whether a file is gzip compressed.

Parameters: file_name (str) – the path of the input file
Returns: True if the file is gzip compressed, else False
Return type: bool
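
As an illustration of what such a check involves, here is a minimal pure-Python sketch that looks for the two gzip magic bytes at the start of a file. The helper name looks_like_gzip and the temporary file names are hypothetical, and this is not necessarily how pyfastx.gzip_check is implemented.

```python
import gzip
import os
import tempfile


def looks_like_gzip(file_name):
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(file_name, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


# Build two small throwaway files to exercise the check.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.fa")
gz_path = os.path.join(tmp_dir, "packed.fa.gz")

with open(plain_path, "w") as handle:
    handle.write(">seq1\nACGT\n")
with gzip.open(gz_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

print(looks_like_gzip(plain_path))
print(looks_like_gzip(gz_path))
```

With pyfastx installed, the documented call would be pyfastx.gzip_check(gz_path).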

pyfastx.reverse_complement(seq)

New in pyfastx 2.0.0

Get the reverse complement of a given DNA sequence.

Parameters: seq (str) – DNA sequence
Returns: reverse complement sequence
Return type: str
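
The operation itself can be sketched in a few lines of plain Python: complement every base, then reverse the string. The name reverse_complement_sketch is hypothetical, and the handling of N and of lowercase bases shown here is an assumption rather than documented pyfastx behaviour.

```python
# Translation table mapping each base to its complement; N is left as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_sketch(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("AAGCT"))  # AGCTT
```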

pyfastx.Fasta

class pyfastx.Fasta(file_name, index_file=None, uppercase=True, build_index=True, full_index=False, full_name=False, memory_index=False, key_func=None)

Read and parse FASTA files. A Fasta object can be used like a dict or a list: use an integer index or a sequence name to get a sequence object, e.g. fasta[0], fasta['seq1'].

Parameters:

file_name (str) – the path of the input FASTA file
index_file (str) – the index file of the FASTA file; by default, an index file with the extension .fxi in the same directory as the FASTA file is used. New in 2.0.0
uppercase (bool) – always output uppercase sequence, default: True
build_index (bool) – build an index for random access to FASTA sequences, default: True. If build_index is False, iteration returns a tuple (name, seq); if build_index is True, iteration returns a sequence object.
full_index (bool) – calculate character (e.g. A, T, G, C) composition when building the index. This speeds up GC content extraction but makes index building slower, default: False
full_name (bool) – use the full header line, instead of the part before the first whitespace, as the sequence identifier, even when no index is built. New in 0.6.14, default: False
memory_index (bool) – if True, keep the FASTA index in memory and do not generate an index file, default: False
key_func (function) – a function, generally a lambda expression, that splits the header to obtain a shortened identifier. New in 0.5.1, default: None

Returns:

Fasta object
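
To make the identifier options concrete, the following pure-Python sketch mimics index-free iteration, yielding (name, seq) tuples and deriving the name in the three ways described by full_name and key_func. The generator iter_fasta and the sample header are hypothetical. It assumes that key_func receives the header line without the leading '>' and takes precedence over full_name, neither of which is stated above.

```python
def iter_fasta(lines, uppercase=True, full_name=False, key_func=None):
    """Yield (name, seq) tuples from an iterable of FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:]
            if key_func is not None:
                # Assumption: key_func sees the header without the '>'.
                name = key_func(header)
            elif full_name:
                name = header
            else:
                # Default: the part of the header before the first whitespace.
                parts = header.split(None, 1)
                name = parts[0] if parts else ""
            chunks = []
        elif name is not None and line:
            chunks.append(line.upper() if uppercase else line)
    if name is not None:
        yield name, "".join(chunks)


records = [
    ">sp|P12345|DEMO example protein",
    "acgt",
    "ttga",
]

print(list(iter_fasta(records)))
print(list(iter_fasta(records, full_name=True)))
print(list(iter_fasta(records, key_func=lambda header: header.split("|")[1])))
```

The same lambda could be passed as key_func when constructing a Fasta object.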

file_name: FASTA file path

size: total length of sequences in FASTA file

type

New in pyfastx 0.5.4

get fasta type, return DNA, RNA, protein, or unknown

is_gzip

New in pyfastx 0.5.0

return True if fasta is gzip compressed else return False

gc_content: GC content of whole sequences in FASTA file, return a float value

gc_skew

GC skew of all sequences in the FASTA file, computed as (G - C) / (G + C)

New in pyfastx 0.3.8
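
GC skew is conventionally defined as (G - C) / (G + C). A minimal sketch of that formula follows. The function name gc_skew_sketch is hypothetical, and returning 0.0 for a sequence with no G or C is an assumption made here to avoid division by zero.

```python
def gc_skew_sketch(seq):
    """Return (G - C) / (G + C) for a nucleotide string."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        # Assumption: report zero skew when there is no G or C at all.
        return 0.0
    return (g - c) / (g + c)


print(gc_skew_sketch("GGGC"))  # 0.5
```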

composition: nucleotide composition in FASTA file, a dict contains counts of A, T, G, C and N (unkown nucleotide base)

longest

get longest sequence in FASTA file, return a Sequence object

New in pyfastx 0.3.0

shortest

get shortest sequence in FASTA file, return a Sequence object

New in pyfastx 0.3.0

mean

get average length of sequences in FASTA file

New in pyfastx 0.3.0

median

get median length of sequences in FASTA file

New in pyfastx 0.3.0

fetch(chrom, intervals, strand='+')

Extract subsequences from a given sequence by a start and end coordinate or a list of coordinates. This function caches the full sequence in memory and is suited to extracting large numbers of subsequences from the specified sequence.

Parameters:

chrom (str) – chromosome name or sequence name
intervals (list/tuple) – list of [start, end] coordinates
strand (str) – sequence strand, + indicates the sense strand, - indicates the antisense strand, default: '+'

Note

intervals can be a list or tuple holding a single start and end position, e.g. (10, 20), or a list or tuple of multiple coordinate pairs, e.g. [(10, 20), (50, 70)].

Returns: sliced subsequences
Return type: str
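
The interval handling can be sketched on a plain string. This sketch assumes 1-based inclusive coordinates, that multiple intervals are joined into one string (consistent with the str return type), and that the minus strand is the reverse complement of the joined result. None of these details is spelled out above, so treat them as assumptions. The name fetch_sketch is hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def fetch_sketch(sequence, intervals, strand="+"):
    """Slice one or more 1-based inclusive intervals out of a sequence string."""
    # Accept a single (start, end) pair as well as a list of pairs.
    if len(intervals) == 2 and all(isinstance(v, int) for v in intervals):
        intervals = [intervals]
    joined = "".join(sequence[start - 1:end] for start, end in intervals)
    if strand == "-":
        # Assumption: the antisense result is the reverse complement.
        joined = joined.translate(_COMPLEMENT)[::-1]
    return joined


chrom = "ACGTACGTAAGGCCTT"
print(fetch_sketch(chrom, (1, 4)))
print(fetch_sketch(chrom, [(1, 4), (9, 12)]))
print(fetch_sketch(chrom, (9, 12), strand="-"))
```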

flank(chrom, start, end, flank_length=50, use_cache=False)

Get the flank sequence of given subsequence with start and end. New in 0.7.0

Parameters:

chrom (str) – chromosome name or sequence name
start (int) – 1-based start position of subsequence on chrom
end (int) – 1-based end position of subsequence on chrom
flank_length (int) – length of flank sequence, default 50
use_cache (bool) – cache the whole sequence, default: False

Note

If you want to extract flank sequences for large numbers of subsequences from the same sequence, using use_cache=True will greatly improve the speed.

Returns: left flank and right flank sequence
Return type: tuple
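
The coordinate arithmetic behind flank extraction can be sketched on a plain string with 1-based start and end positions, as documented above. Clipping the flanks at the ends of the sequence is an assumption of this sketch, and the name flank_sketch is hypothetical.

```python
def flank_sketch(sequence, start, end, flank_length=50):
    """Return (left, right) flanks around the 1-based inclusive range start..end."""
    # Assumption: flanks are clipped at the sequence boundaries.
    left_begin = max(start - 1 - flank_length, 0)
    left = sequence[left_begin:start - 1]
    right = sequence[end:end + flank_length]
    return left, right


chrom = "AAAACCCCGGGGTTTT"
print(flank_sketch(chrom, 5, 8, flank_length=3))  # ('AAA', 'GGG')
```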

build_index(): build index for FASTA file

keys()

get all names of sequences

Returns: a FastaKeys object

count(n)

get counts of sequences whose length >= n bp

New in pyfastx 0.3.0

Parameters: n (int) – number of bases
Returns: sequence counts
Return type: int

nl(quantile)

calculate assembly N50 and L50

New in pyfastx 0.3.0

Parameters: quantile (int) – a number between 0 and 100, default 50
Returns: (N50, L50)
Return type: tuple
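
Both statistics can be sketched from a plain list of sequence lengths. The sketch uses the standard definition of N50 and L50: sort lengths from longest to shortest and accumulate until the running total reaches the requested share of all bases. The function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def count_at_least(lengths, n):
    """Count sequences whose length is >= n."""
    return sum(1 for length in lengths if length >= n)


def nl_sketch(lengths, quantile=50):
    """Return (Nx, Lx) for the given quantile, e.g. (N50, L50) for 50."""
    threshold = sum(lengths) * quantile / 100
    running = 0
    for rank, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            # Nx is the length at which the threshold is reached,
            # Lx is how many sequences were needed to get there.
            return length, rank
    return 0, 0  # empty input


lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(nl_sketch(lengths))          # (8, 3)
print(count_at_least(lengths, 8))  # 3
```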

pyfastx.Sequence

class pyfastx.Sequence

Read-only sequence object generated by a Fasta object. A Sequence can be treated as a list and supports slicing, e.g. seq[10:20]

id: sequence id or order number in FASTA file

name: sequence name

description

Get sequence description after name in sequence header

New in pyfastx 0.3.1

start: start position of sequence

end: end position of sequence

gc_content: GC content of current sequence, return a float value

gc_skew: GC skew of current sequence, learn more about GC skew

composition: nucleotide composition of sequence, a dict contains counts of A, T, G, C and N (unkown nucleotide base)

raw

get the raw string (with header line and sequence lines) of sequence as it appeared in file

New in pyfastx 0.6.3

seq: get the string of sequence in sense strand

reverse: get the string of reversed sequence

complement: get the string of complement sequence

antisense: get the string of sequence in antisense strand, corresponding to reversed and complement sequence

search(subseq, strand='+')

Search for subsequence from given sequence and get the start position of the first occurrence

New in pyfastx 0.3.6

Parameters:

subseq (str) – a subsequence for search
strand (str) – sequence strand + or -, default +

Returns:

the one-based start position if the subsequence is found, otherwise None

Return type:

int or None
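
The return convention (one-based position or None) can be sketched on a plain string. This sketch covers the sense strand only, because the exact handling of strand='-' is not described above. The name search_sketch is hypothetical.

```python
def search_sketch(sequence, subseq):
    """Return the one-based start of the first occurrence of subseq, or None."""
    position = sequence.find(subseq)  # zero-based, -1 when absent
    return position + 1 if position != -1 else None


print(search_sketch("ACGTACGT", "GTA"))  # 3
print(search_sketch("ACGTACGT", "TTT"))  # None
```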

pyfastx.Fastq

New in pyfastx 0.4.0

class pyfastx.Fastq(file_name, index_file=None, phred=0, build_index=True, full_index=False)

Read and parse FASTQ files

Parameters:

file_name (str) – input FASTQ file path
index_file (str) – the index file of the FASTQ file; by default, an index file with the extension .fxi in the same directory as the FASTQ file is used. New in 2.0.0
build_index (bool) – build an index for random access to FASTQ reads, default: True. If build_index is False, iteration returns a tuple (name, seq, qual); if build_index is True, iteration returns a read object
full_index (bool) – calculate character (e.g. A, T, G, C) composition when building the index. This speeds up GC content extraction but makes index building slower, default: False
phred (int) – offset used to convert quality ASCII characters to integer quality values, usually 33 or 64, default 33

Returns:

Fastq object

file_name: FASTQ file path

size: total bases in FASTQ file

is_gzip

New in pyfastx 0.5.0

return True if the FASTQ file is gzip compressed else return False

gc_content: GC content of whole FASTQ file

avglen

New in pyfastx 0.6.10

get average length of reads

maxlen

New in pyfastx 0.6.10

get maximum length of reads

minlen

New in pyfastx 0.6.10

get minimum length of reads

maxqual

New in pyfastx 0.6.10

get maximum quality score of bases

minqual

New in pyfastx 0.6.10

get minimum quality score of bases

composition: base composition in FASTQ file, a dict contains counts of A, T, G, C and N (unkown nucleotide base)

phred: get phred value

encoding_type

New in pyfastx 0.4.1

Guess the quality encoding type used by FASTQ sequence file
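
One common way to guess the encoding is to inspect the range of ASCII codes in the quality strings. The rough heuristic below is only an illustration and is not pyfastx's own detection logic. The function name and the cut-offs are assumptions: codes below 59 are taken to indicate an offset of 33, codes above 74 an offset of 64.

```python
def guess_phred_offset(quality_strings):
    """Guess the phred offset (33 or 64) from quality strings, or None if unclear."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters this low do not occur in the +64 encodings.
        return 33
    if max(codes) > 74:
        # Characters this high are assumed to come from a +64 encoding.
        return 64
    return None  # the observed range fits both families


print(guess_phred_offset(["II5!"]))
print(guess_phred_offset(["hhhdd"]))
print(guess_phred_offset(["ABCDE"]))
```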

build_index(): Build index for fastq file when build_index set to False

keys()

New in pyfastx 0.8.0

Get all the names of reads in fastq file

Returns: a FastqKeys object

pyfastx.Read

New in pyfastx 0.4.0

class pyfastx.Read

Read-only read object for obtaining read information, generated by a Fastq object

id: read id or order number in FASTQ file

name: read name excluding ‘@’

description: get the full header line of read

raw

get the raw string (with header, sequence, comment and quality lines) of read as it appeared in file

New in pyfastx 0.6.3

seq: get read sequence string

qual: get read quality ascii string

quali: get read quality integer value (ascii - phred), return a list

pyfastx.Fastx

class pyfastx.Fastx(file_name, format='auto', uppercase=False)

New in pyfastx 0.8.0. A Python binding of kseq.h that provides a simple API for iterating over sequences in FASTA/FASTQ files

Parameters:

file_name (str) – input fasta or fastq file path
format (str) – the input file format, can be “fasta” or “fastq”, default: “auto”, which detects the format of the sequence file automatically
uppercase (bool) – always output uppercase sequence, only works for FASTA files, default: False

Returns:

Fastx object

pyfastx.FastaKeys

class pyfastx.FastaKeys

FastaKeys is a read-only, list-like object that contains all names of sequences

sort(by='id', reverse=False)

Sort keys by sequence id, name or length for iteration

New in pyfastx 0.5.0

Parameters:

by (str) – order by id, name, or length, default is id
reverse (bool) – used to flag descending sorts, default is False

Returns:

FastaKeys object itself

filter(*filters)

Filter keys by sequence name and length for iteration

Parameters: filters (list) – filters generated by comparisons like ids > 500 or ids % ‘seq1’, where ids is an Identifier object
Returns: FastaKeys object itself

reset()

Clear all filters and sort order

Returns: FastaKeys object itself
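
The effect of filtering and sorting can be mimicked on a plain list of (name, length) pairs. This is only an analogy, not the FastaKeys implementation. It assumes that several filters combine with AND and that the % comparison means "name contains", neither of which is stated above. The records are hypothetical.

```python
# Hypothetical (name, length) records standing in for an indexed FASTA file.
records = [("seq1", 1200), ("seq2", 300), ("chrM", 800), ("seq10", 650)]

# Rough analogue of filter(ids > 500, ids % 'seq'):
# keep sequences longer than 500 whose name contains 'seq'.
selected = [(name, length) for name, length in records
            if length > 500 and "seq" in name]

# Rough analogue of sort(by='length', reverse=True).
selected.sort(key=lambda item: item[1], reverse=True)

names = [name for name, _ in selected]
print(names)  # ['seq1', 'seq10']
```

Because sort, filter and reset each return the FastaKeys object itself, the corresponding calls can be chained.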

pyfastx.FastqKeys

class pyfastx.FastqKeys: New in pyfastx 0.8.0. FastqKeys is a readonly and list-like object, contains all names of reads