Examples¶

Example 1: Basics¶

The source file can be downloaded here: example1_basics.py

This example illustrates the basic operations of the StrainPycon package. First we draw random strain barcodes and frequencies, and then we generate two measurement vectors: one without noise and one with small additive Gaussian noise. The reconstruction, i.e., “MAP estimate”, is computed for both vectors. We also compute the misfit values, i.e., negative log-likelihoods, when the number of strains in the reconstruction varies.

We start by importing the necessary modules and setting the random seed for reproducibility:

import strainpycon
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)

We choose the number of SNP sites and the number of strains for our synthetic measurements:

m = 24 # number of SNP sites
n = 4 # number of strains

The binary strain barcode matrix has shape (n, m). We draw it from uniform distribution. The frequency vector has shape (n,) and it has positive entries that sum up to one. StrainPycon provides a utility function for drawing uniformly random frequency vectors. For illustration purposes, we sort the vector in descending order, because the reconstruction method always returns the vector in descending order. The noiseless measurement is obtained by simply taking a weighted sum of the binary barcodes, which can be written as a vector-matrix multiplication. For the noisy measurement, we add independent Gaussian random variables with standard deviation 0.01:

strains = np.random.randint(0, 2, (n, m), dtype=int) # shape (n,m)
freq = strainpycon.psimplex.rand_simplex(n, sort=-1) # shape (n,)
meas1 = np.dot(freq, strains) # shape (m,)
meas2 = meas1 + 0.01 * np.random.normal(size=meas1.shape)

Now we are ready to test the strain identification functions in StrainPycon package. Most of the useful functions are methods of StrainRecon class, so first we create an object of that class:

S = strainpycon.StrainRecon()

The MAP estimates are computed as follows. Here, the correct number of strains is assumed to be known:

(mat1, vec1) = S.compute(meas1, n) # without noise
(mat2, vec2) = S.compute(meas2, n) # with noise

Finally, we examine the misfits, i.e., negative log-likelihoods, when n in the reconstruction varies between 1 and 7:

nrange = range(1,8)
misfits1 = S.misfits(meas1, nrange) # without noise
misfits2 = S.misfits(meas2, nrange) # with noise

The matrices, frequency vectors, and misfits can be visualized as follows:

# Show the original barcode matrix and the reconsructions.
fig, ax = plt.subplots(nrows=3)
plt.sca(ax[0])
plt.imshow(strains)
plt.title('Ground truth')
plt.sca(ax[1])
plt.imshow(mat1)
plt.title('Reconstruction from noiseless data')
plt.sca(ax[2])
plt.imshow(mat2)
plt.title('Reconstruction from noisy data')
for idx in range(3):
    plt.sca(ax[idx])
    plt.xlabel('SNP site')
    plt.ylabel('Strain')
plt.tight_layout()
plt.show()

# Display frequencies.
print('Strain frequencies:')
print('Ground truth:\n{}'.format(freq))
print('Reconstruction from noiseless data:\n{}'.format(vec1))
print('Reconstruction from noisy data:\n{}'.format(vec2))

# Plot misfits.
fig = plt.figure()
plt.semilogy(nrange, misfits1, label='Noiseless data')
plt.semilogy(nrange, misfits2, label='Noisy data')
plt.legend()
plt.title('Misfit')
plt.xlabel('Number of strains')
plt.ylabel('Reconstruction misfit')
plt.show()

Example 2: Uncertainty quantification¶

The source file can be downloaded here: example2_uncertainty.py

In this example, we consider experimental data and visualize the posterior statistics for the barcode matrix. Essentially, we reproduce the matrix part of the image shown as Figure 3 in https://doi.org/10.1088/1361-6420/aad7cd . See the corresponding section in the text for details about the measurement data.

We import the modules as above and provide the data. The number of SNP sites is 16 and there are three strains:

import strainpycon
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)

strains = np.array([[0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
                    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1]])

n = np.size(strains, 0)

meas = np.array([0.305732484076433,
                 0.144654088050314,
                 1.000000000000000,
                 0.872727272727273,
                 0.161073825503356,
                 0.178571428571429,
                 1.000000000000000,
                 1.000000000000000,
                 0.131034482758621,
                 0.838509316770186,
                 0.283132530120482,
                 0.200000000000000,
                 0.679802955665025,
                                 0,
                                 0,
                 1.000000000000000])

Next, we compute the reconstruction with uncertainty information. We assume the knowledge of the number of strains, and the standard deviation of noise is set to 0.1 as in the article:

S = strainpycon.StrainRecon()
(mat, vec, meanmat, meanvec, devmat, devvec) = S.compute(meas, n, gamma=0.1, uncertainty=True)

Lastly, we create the figure:

fig, ax = plt.subplots(nrows=4)
plt.sca(ax[0])
plt.imshow(strains, vmin=0, vmax=1)
plt.title('Ground truth')
plt.sca(ax[1])
plt.imshow(mat, vmin=0, vmax=1)
plt.title('Reconstruction (MAP estimate)')
plt.sca(ax[2])
plt.imshow(meanmat, vmin=0, vmax=1)
plt.title('Posterior mean')
plt.sca(ax[3])
plt.imshow(devmat, vmin=0, vmax=1)
plt.title('Posterior standard deviation')
for idx in range(4):
    plt.sca(ax[idx])
    plt.xlabel('SNP site')
    plt.ylabel('Strain')
plt.tight_layout()
cbar_ax = fig.add_axes([0.85, 0.1, 0.05, 0.8])
plt.colorbar(cax=cbar_ax)
plt.show()

Example 3: Multiple categories¶

The source file can be downloaded here: example3_multicat.py

Here we look at the case where the SNP sites are not binary but can have more than two (here, four) possible values:

import strainpycon
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)

m = 24
n = 3
cats = 4

For convenience, the package provides a function that we can use to generate synthetic data. The last argument (here, 0.1) is the standard deviation of the Gaussian noise:

S = strainpycon.StrainRecon()
(meas, strains, freq) = S.random_data(m, n, cats, 0.1)

The shape of meas is now (cats, m) instead of (m,). However, we can use the compute method as before:

(mat, vec) = S.compute(meas, n)

Visualization:

fig, ax = plt.subplots(nrows=2)

plt.sca(ax[0])
plt.imshow(strains)
plt.title('Ground truth')
plt.xlabel('SNP site')
plt.ylabel('Strain')

plt.sca(ax[1])
plt.imshow(mat)
plt.title('Reconstruction')
plt.xlabel('SNP site')
plt.ylabel('Strain')

plt.tight_layout()
plt.show()