API Reference

pyfastx.version

pyfastx.version(debug=False)

Get current version of pyfastx

Parameters:

debug (bool) – if True, return the versions of pyfastx, zlib, sqlite3 and zran, default: False

Returns:

version of pyfastx

Return type:

str

pyfastx.gzip_check(file_name)

New in pyfastx 0.5.4

Check whether a file is gzip compressed

Parameters:

file_name (str) – the path of input file

Returns:

True if the file is gzip compressed, otherwise False

Return type:

bool
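
As an illustration of what such a check involves, the sketch below tests the two-byte gzip magic number at the start of a file. The helper name looks_gzipped is hypothetical, and this is not necessarily how pyfastx.gzip_check is implemented.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream (RFC 1952)

def looks_gzipped(file_name):
    """Hypothetical stand-in for a gzip check: compare the leading magic bytes."""
    with open(file_name, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC

# Write one plain and one gzip-compressed FASTA file to try the check on.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "demo.fa")
gz_path = os.path.join(tmp_dir, "demo.fa.gz")
with open(plain_path, "w") as handle:
    handle.write(">seq1\nACGT\n")
with gzip.open(gz_path, "wt") as handle:
    handle.write(">seq1\nACGT\n")

plain_result = looks_gzipped(plain_path)
gz_result = looks_gzipped(gz_path)
```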

pyfastx.reverse_complement(seq)

New in pyfastx 2.0.0

Get the reverse complement of a given DNA sequence

Parameters:

seq (str) – DNA sequence

Returns:

reverse complement sequence

Return type:

str
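
The operation itself can be sketched in pure Python as follows. The function name reverse_complement_sketch and the handling of N and lowercase bases are assumptions made for illustration; pyfastx's own implementation may treat ambiguous bases differently.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement_sketch(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

result = reverse_complement_sketch("ATGCN")
```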

pyfastx.Fasta

class pyfastx.Fasta(file_name, index_file=None, uppercase=True, build_index=True, full_index=False, full_name=False, memory_index=False, key_func=None)

Read and parse FASTA files. A Fasta object can be used like a dict or a list: a sequence object can be retrieved by index or by sequence name, e.g. fasta[0], fasta['seq1']

Parameters:
  • file_name (str) – the file path of input FASTA file

  • index_file (str) – the index file of the FASTA file. By default, an index file with the extension .fxi in the same directory as the FASTA file is used. New in 2.0.0

  • uppercase (bool) – always output uppercase sequence, default: True

  • build_index (bool) – build an index for random access to FASTA sequences, default: True. If build_index is False, iteration returns a tuple (name, seq); if build_index is True, iteration returns a sequence object.

  • full_index (bool) – calculate the character (e.g. A, T, G, C) composition when building the index. This speeds up GC content calculation but makes index building slower, default: False

  • full_name (bool) – use the full header line, instead of the part before the first whitespace, as the sequence identifier, even when no index is built. New in 0.6.14, default: False

  • memory_index (bool) – if memory_index is True, the FASTA index is kept in memory and no index file is generated, default: False

  • key_func (function) – new in 0.5.1, a key function, generally a lambda expression, that splits the header to obtain a shortened identifier, default: None

Returns:

Fasta object
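
The effect of full_name and key_func on the sequence identifier can be pictured with the sketch below. The helper sequence_name is hypothetical, it assumes key_func receives the header line without the leading '>', and the precedence of key_func over full_name is an assumption rather than documented behaviour.

```python
def sequence_name(header, full_name=False, key_func=None):
    """Hypothetical helper: derive an identifier from a header line (no '>')."""
    if key_func is not None:
        # Assumption: a user-supplied key function takes precedence.
        return key_func(header)
    if full_name:
        return header
    # Default: the part of the header before the first whitespace.
    return header.split(None, 1)[0]

header = "sp|P12345|ALBU_HUMAN Serum albumin"
default_name = sequence_name(header)
whole_header = sequence_name(header, full_name=True)
short_name = sequence_name(header, key_func=lambda h: h.split("|")[1])
```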

file_name

FASTA file path

size

total length of sequences in FASTA file

type

New in pyfastx 0.5.4

Get the sequence type of the FASTA file: DNA, RNA, protein, or unknown

is_gzip

New in pyfastx 0.5.0

Return True if the FASTA file is gzip compressed, otherwise False

gc_content

GC content of all sequences in the FASTA file, returned as a float

gc_skew

GC skew of all sequences in the FASTA file

New in pyfastx 0.3.8

composition

Nucleotide composition of the FASTA file: a dict containing the counts of A, T, G, C and N (unknown nucleotide base)
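
The three statistics above are related: GC content and GC skew can both be derived from the composition counts. The sketch below uses the standard GC skew definition, (G - C) / (G + C). The function names are hypothetical, and expressing GC content as a percentage that excludes N is an assumption; this reference does not state which convention pyfastx uses.

```python
from collections import Counter

def composition_sketch(seq):
    """Count A, T, G, C and N in a sequence, ignoring case."""
    counts = Counter(seq.upper())
    return {base: counts[base] for base in "ATGCN"}

def gc_content_sketch(comp):
    """GC as a percentage of the A+T+G+C total (excluding N is an assumption)."""
    total = comp["A"] + comp["T"] + comp["G"] + comp["C"]
    return 100.0 * (comp["G"] + comp["C"]) / total if total else 0.0

def gc_skew_sketch(comp):
    """Standard GC skew: (G - C) / (G + C)."""
    g, c = comp["G"], comp["C"]
    return (g - c) / (g + c) if g + c else 0.0

comp = composition_sketch("GGGCATTN")
gc_percent = gc_content_sketch(comp)
gc_skew = gc_skew_sketch(comp)
```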

longest

get longest sequence in FASTA file, return a Sequence object

New in pyfastx 0.3.0

shortest

get shortest sequence in FASTA file, return a Sequence object

New in pyfastx 0.3.0

mean

get average length of sequences in FASTA file

New in pyfastx 0.3.0

median

get median length of sequences in FASTA file

New in pyfastx 0.3.0
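
The length statistics above (longest, shortest, mean, median) can be reproduced for a small set of sequence lengths with the standard library. The dictionary of names and lengths is made up for illustration, and the tie-breaking shown for the shortest sequence is a property of this sketch, not of pyfastx.

```python
import statistics

# Made-up sequence names and lengths, for illustration only.
lengths = {"seq1": 120, "seq2": 80, "seq3": 200, "seq4": 80}

longest_name = max(lengths, key=lengths.get)
shortest_name = min(lengths, key=lengths.get)  # first one wins on ties here
mean_length = statistics.mean(list(lengths.values()))
median_length = statistics.median(list(lengths.values()))
```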

fetch(chrom, intervals, strand='+')

Extract one or more subsequences from a given sequence using a start and end coordinate or a list of coordinates. This method caches the full sequence in memory, so it is suited to extracting large numbers of subsequences from the same sequence.

Parameters:
  • chrom (str) – chromosome name or sequence name

  • intervals (list/tuple) – list of [start, end] coordinates

  • strand (str) – sequence strand, + indicates sense strand, - indicates antisense strand, default: ‘+’

Note

intervals can be a list or tuple holding a single start and end position, e.g. (10, 20), or a list or tuple of several such coordinate pairs, e.g. [(10, 20), (50, 70)]

Returns:

sliced subsequences

Return type:

str
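
The two accepted shapes of intervals can be illustrated with a pure-Python sketch on an in-memory string. The function fetch_sketch is hypothetical. It assumes 1-based inclusive coordinates, concatenation of multiple intervals in the given order, and that strand='-' returns the reverse complement of the joined result; none of these details is spelled out in this reference.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def fetch_sketch(sequence, intervals, strand="+"):
    """Slice one or more 1-based inclusive intervals from a string (assumed semantics)."""
    # Accept a single (start, end) pair as well as a list of pairs.
    if len(intervals) == 2 and all(isinstance(x, int) for x in intervals):
        intervals = [intervals]
    joined = "".join(sequence[start - 1:end] for start, end in intervals)
    if strand == "-":
        # Assumption: the antisense result is the reverse complement.
        joined = joined.translate(_COMPLEMENT)[::-1]
    return joined

chrom_seq = "ACGTACGTAC"
single = fetch_sketch(chrom_seq, (2, 5))
multiple = fetch_sketch(chrom_seq, [(1, 3), (8, 10)])
antisense = fetch_sketch(chrom_seq, (2, 5), strand="-")
```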

flank(chrom, start, end, flank_length=50, use_cache=False)

Get the flank sequence of given subsequence with start and end. New in 0.7.0

Parameters:
  • chrom (str) – chromosome name or sequence name

  • start (int) – 1-based start position of subsequence on chrom

  • end (int) – 1-based end position of subsequence on chrom

  • flank_length (int) – length of flank sequence, default 50

  • use_cache (bool) – cache the whole sequence in memory, default: False

Note

If you want to extract flank sequences for large numbers of subsequences from the same sequence, setting use_cache=True will greatly improve the speed

Returns:

left flank and right flank sequence

Return type:

tuple
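
The coordinate arithmetic behind a flank lookup can be sketched on a plain string. The function flank_sketch is hypothetical, and clipping the flanks at the ends of the sequence is an assumption about edge-case behaviour.

```python
def flank_sketch(sequence, start, end, flank_length=50):
    """Return (left, right) flanks of the 1-based inclusive region start..end."""
    # Clip at the sequence start so the slice never wraps around (assumption).
    left_begin = max(0, start - 1 - flank_length)
    left = sequence[left_begin:start - 1]
    # Python slicing already clips at the sequence end.
    right = sequence[end:end + flank_length]
    return left, right

demo_seq = "AAAACCCCGGGGTTTT"
short_flanks = flank_sketch(demo_seq, 5, 8, flank_length=3)
clipped_flanks = flank_sketch(demo_seq, 5, 8, flank_length=10)
```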

build_index()

build index for FASTA file

keys()

get all names of sequences

Returns:

a FastaKeys object

count(n)

get counts of sequences whose length >= n bp

New in pyfastx 0.3.0

Parameters:

n (int) – number of bases

Returns:

sequence counts

Return type:

int

nl(quantile)

Calculate the assembly N50 and L50

New in pyfastx 0.3.0

Parameters:

quantile (int) – a number between 0 and 100, default 50

Returns:

(N50, L50)

Return type:

tuple
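
Both count and nl work on the list of sequence lengths. The sketch below uses the usual N50/L50 definition: sort lengths in descending order and accumulate until the running total reaches the chosen percentage of the total length. The function names are hypothetical, and the treatment of the boundary case (reaching the threshold exactly) is an assumption.

```python
def count_sketch(lengths, n):
    """Number of sequences whose length is at least n."""
    return sum(1 for length in lengths if length >= n)

def nl_sketch(lengths, quantile=50):
    """Return (N, L) for the given quantile, e.g. (N50, L50) for quantile=50."""
    threshold = sum(lengths) * quantile / 100
    running = 0
    for rank, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            # N is the length at which the threshold is reached,
            # L is how many sequences were needed to reach it.
            return length, rank
    return 0, 0

contig_lengths = [80, 70, 50, 40, 30, 20]
n50_l50 = nl_sketch(contig_lengths)
n90_l90 = nl_sketch(contig_lengths, quantile=90)
at_least_50 = count_sketch(contig_lengths, 50)
```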

pyfastx.Sequence

class pyfastx.Sequence

Read-only sequence object generated by a Fasta object. A Sequence can be treated as a list and supports slicing, e.g. seq[10:20]

id

sequence id or order number in FASTA file

name

sequence name

description

Get sequence description after name in sequence header

New in pyfastx 0.3.1

start

start position of sequence

end

end position of sequence

gc_content

GC content of current sequence, return a float value

gc_skew

GC skew of current sequence

composition

Nucleotide composition of the sequence: a dict containing the counts of A, T, G, C and N (unknown nucleotide base)

raw

get the raw string (with header line and sequence lines) of sequence as it appeared in file

New in pyfastx 0.6.3

seq

get the string of sequence in sense strand

reverse

get the string of reversed sequence

complement

get the string of complement sequence

antisense

get the string of sequence in antisense strand, i.e. the reversed and complemented sequence

search(subseq, strand='+')

Search for subsequence from given sequence and get the start position of the first occurrence

New in pyfastx 0.3.6

Parameters:
  • subseq (str) – a subsequence for search

  • strand (str) – sequence strand + or -, default +

Returns:

the one-based start position if the subsequence is found, otherwise None

Return type:

int or None
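
The return convention (one-based position or None) can be sketched with str.find. The function search_sketch is hypothetical and covers only the sense strand, because this reference does not state how positions are reported for strand='-'.

```python
def search_sketch(sequence, subseq):
    """One-based start of the first occurrence of subseq, or None if absent."""
    index = sequence.find(subseq)  # zero-based, -1 when not found
    return index + 1 if index >= 0 else None

found = search_sketch("ACGTACGT", "GTA")
missing = search_sketch("ACGTACGT", "TTT")
```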

pyfastx.Fastq

New in pyfastx 0.4.0

class pyfastx.Fastq(file_name, index_file=None, phred=0, build_index=True, full_index=False)

Read and parse fastq file

Parameters:
  • file_name (str) – input FASTQ file path

  • index_file (str) – the index file of the FASTQ file. By default, an index file with the extension .fxi in the same directory as the FASTQ file is used. New in 2.0.0

  • build_index (bool) – build an index for random access to FASTQ reads, default: True. If build_index is False, iteration returns a tuple (name, seq, qual); if build_index is True, iteration returns a read object

  • full_index (bool) – calculate the character (e.g. A, T, G, C) composition when building the index. This speeds up GC content calculation but makes index building slower, default: False

  • phred (int) – offset used to convert quality ASCII characters to integer quality values, usually 33 or 64, default 33

Returns:

Fastq object
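
The (name, seq, qual) tuples produced when iterating without an index can be pictured with a minimal four-line-record parser. The generator iter_fastq_sketch is hypothetical, assumes each record occupies exactly four lines, and takes the name to be the header text before the first whitespace; it is not the parser pyfastx uses.

```python
import io

def iter_fastq_sketch(handle):
    """Yield (name, seq, qual) from a text stream of 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        # Drop the leading '@' and anything after the first whitespace.
        name = header[1:].split(None, 1)[0]
        yield name, seq, qual

demo = io.StringIO("@read1 sample=A\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
records = list(iter_fastq_sketch(demo))
```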

file_name

FASTQ file path

size

total bases in FASTQ file

is_gzip

New in pyfastx 0.5.0

Return True if the FASTQ file is gzip compressed, otherwise False

gc_content

GC content of whole FASTQ file

avglen

New in pyfastx 0.6.10

get average length of reads

maxlen

New in pyfastx 0.6.10

get maximum length of reads

minlen

New in pyfastx 0.6.10

get minimum length of reads

maxqual

New in pyfastx 0.6.10

get maximum quality score of bases

minqual

New in pyfastx 0.6.10

get minimum quality score of bases

composition

Base composition of the FASTQ file: a dict containing the counts of A, T, G, C and N (unknown nucleotide base)

phred

get phred value

encoding_type

New in pyfastx 0.4.1

Guess the quality encoding type used by FASTQ sequence file

build_index()

Build the index for the FASTQ file; use this when the object was created with build_index set to False

keys()

New in pyfastx 0.8.0

Get all the names of reads in fastq file

Returns:

a FastqKeys object

pyfastx.Read

New in pyfastx 0.4.0

class pyfastx.Read

Readonly read object for obtaining read information, generated by fastq object

id

read id or order number in FASTQ file

name

read name excluding ‘@’

description

get the full header line of read

raw

get the raw string (with header, sequence, comment and quality lines) of read as it appeared in file

New in pyfastx 0.6.3

seq

get read sequence string

qual

get read quality ascii string

quali

Get the read's quality scores as a list of integers (ASCII value - phred)
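
The conversion from the ASCII quality string to integer scores is a per-character subtraction of the phred offset. A minimal sketch, with the hypothetical helper name quality_scores and an offset of 33:

```python
def quality_scores(qual, phred=33):
    """Convert a quality string to integer scores by subtracting the offset."""
    return [ord(char) - phred for char in qual]

scores = quality_scores("II5!")
```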

pyfastx.Fastx

class pyfastx.Fastx(file_name, format='auto', uppercase=False)

New in pyfastx 0.8.0. A Python binding of kseq.h that provides a simple API for iterating over sequences in a FASTA or FASTQ file

Parameters:
  • file_name (str) – input fasta or fastq file path

  • format (str) – the input file format, can be “fasta” or “fastq”, default: “auto”, automatically detect the format of sequence file

  • uppercase (bool) – always output uppercase sequence, only works for FASTA files, default: False

Returns:

Fastx object
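
How format='auto' decides between the two formats is not described in this reference. One common approach, sketched below, is to look at the first non-blank line: FASTA headers start with '>' and FASTQ headers with '@'. The function detect_format_sketch is hypothetical and may differ from what pyfastx does.

```python
import io

def detect_format_sketch(handle):
    """Guess 'fasta' or 'fastq' from the first non-blank line of a text stream."""
    for line in handle:
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("@"):
            return "fastq"
        break  # first real line is neither kind of header
    raise ValueError("input does not look like FASTA or FASTQ")

fasta_guess = detect_format_sketch(io.StringIO(">seq1\nACGT\n"))
fastq_guess = detect_format_sketch(io.StringIO("@read1\nACGT\n+\nIIII\n"))
```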

pyfastx.FastaKeys

class pyfastx.FastaKeys

FastaKeys is a read-only, list-like object containing the names of all sequences

sort(by='id', reverse=False)

Sort keys by sequence id, name or length for iteration

New in pyfastx 0.5.0

Parameters:
  • by (str) – order by id, name, or length, default is id

  • reverse (bool) – used to flag descending sorts, default is False

Returns:

FastaKeys object itself

filter(*filters)

Filter keys by sequence name and length for iteration

Parameters:

filters (list) – filters generated by comparisons such as ids > 500 or ids % ‘seq1’, where ids is an Identifier object

Returns:

FastaKeys object itself

reset()

Clear all filters and sort order

Returns:

FastaKeys object itself
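
Because sort, filter and reset each return the keys object itself, calls can be chained. The class below is a hypothetical pure-Python stand-in that mimics this pattern over (id, name, length) records. Its filters are plain callables, whereas pyfastx builds filters from comparisons such as ids > 500.

```python
class KeysSketch:
    """Hypothetical stand-in for a keys object over (id, name, length) records."""

    def __init__(self, records):
        self._records = list(records)
        self.reset()

    def sort(self, by="id", reverse=False):
        column = {"id": 0, "name": 1, "length": 2}[by]
        self._view.sort(key=lambda record: record[column], reverse=reverse)
        return self  # returning self allows chained calls

    def filter(self, *filters):
        # Keep only the records that satisfy every filter.
        self._view = [r for r in self._view if all(f(r) for f in filters)]
        return self

    def reset(self):
        # Drop all filters and restore the original order.
        self._view = list(self._records)
        return self

    def __iter__(self):
        return (record[1] for record in self._view)

keys = KeysSketch([(1, "seq1", 700), (2, "seq2", 300), (3, "seq3", 900)])
long_first = list(keys.filter(lambda r: r[2] > 500).sort("length", reverse=True))
restored = list(keys.reset())
```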

pyfastx.FastqKeys

class pyfastx.FastqKeys

New in pyfastx 0.8.0. FastqKeys is a read-only, list-like object containing the names of all reads