Skip to content

Mean QScore calculation can be much quicker. #336

@rhpvorderman

Description

@rhpvorderman

Hi,

I had a look at the QScore calculation as defined in the code

def mean_qscore_from_qstring(qstring):

I see you use numpy to leverage the power of C to do multiple calculations. I wrote a fastq filter and I also went down this route. However after some time I realized that there are only 94 distinct phred values in the FASTQ format. So basically you are recalculating the same 94 values over and over again (and in practice, a lot less than 94).

I found that it was much quicker to create a lookup table:

import math
QSCORE_LOOKUP = [10 ** -((i - 33)/10) for i in range(128)]

def mean_qscore_from_qstring(qstring):
    """
    Convert qstring into a mean qscore
    """
    length = len(qstring)
    if len == 0: return 0.0
    qscore_values = qstring.encode('ASCII')  # This returns a bytes object which when iterated over returns integers
    sum_error = sum(QSCORE_LOOKUP[q] for q in qscore_values)
    return -10 * math.log10(sum_error / length)

In C this is even quicker of course. You can simply create a const double QSCORE_LOOKUP[128] = {...} in a generated header file or use a constexpr in C++ (I believe, I do not use C++ myself). I have a generated header file here: https://github.com/LUMC/fastq-filter/blob/develop/src/fastq_filter/score_to_error_rate.h. Feel free to use. Keep in mind that I did not do the substracting of the 33 offset in the C lookup as q-33 in C compiles to an instruction that only takes one clocktick anyway and there is no notable speed difference. So in C I do QSCORE_LOOKUP[q-33] .

If you want to use the very fast C implementation in python. I already made it in C in my fastq-filter. So a from fastq_filter import qualmean suffices.

No need to recalculate 94 distinct phred values all over again is what I am saying. Hope this helps.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions