%matplotlib inline

Analyze Xenium Breast Cancer data

This tutorial shows how to apply Bering to 10x Xenium Ductal Carcinoma In Situ (DCIS) data.

Xenium DCIS [Janesick et al., 2022] data was derived from: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast. The data covers 313 unique genes. More details can be found in the bioRxiv preprint: https://www.biorxiv.org/content/10.1101/2022.10.06.510405v2

Import packages & data

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

import Bering as br
# load data
df_spots_all = br.datasets.xenium_dcis_janesick()
df_spots_seg = df_spots_all[df_spots_all['labels'] != 'background'] # foreground nodes
df_spots_unseg = df_spots_all[df_spots_all['labels'] == 'background'] # background nodes
img = None
channels = None

df_spots_seg.head()
Downloading dataset `xenium_dcis_janesick` from `https://figshare.com/ndownloader/files/41409108` as `None.tsv`
x y z features segmented labels
transcript_id
281616710639100 2100.0261 2350.3684 19.269732 LUM 74245 Stromal
281616710639104 2100.2126 2178.9346 18.475246 HAVCR2 74924 Stromal
281616710639107 2100.3599 2261.9116 20.490784 RUNX1 74265 Stromal
281616710639109 2100.3918 2155.2341 20.006990 TCIM 74911 Stromal
281616710639110 2100.5117 2353.7695 20.749512 CXCR4 74244 B_Cells
df_spots_unseg.head() # visualize unsegmented data
x y z features segmented labels
transcript_id
281616710639136 2101.5410 2207.5933 21.504648 LUM -1 background
281616710639187 2104.3992 2216.3367 23.524862 LUM -1 background
281616710639217 2105.7534 2207.2124 23.446508 LUM -1 background
281616710639245 2107.0217 2203.3608 17.956509 DAPK3 -1 background
281616710639283 2108.5515 2226.5203 22.951225 NegControlCodeword_0505 -1 background
# visualize the spots
x, y = df_spots_seg['x'].values, df_spots_seg['y'].values
cell_types = df_spots_seg['labels'].values

fig, ax = plt.subplots(figsize = (6, 6))
for idx, cell_type in enumerate(np.unique(cell_types)):
    
    xc = x[np.where(cell_types == cell_type)[0]]
    yc = y[np.where(cell_types == cell_type)[0]]

    ax.scatter(xc, yc, s = 0.03, label = cell_type, color = np.random.rand(3,))

xb, yb = df_spots_unseg['x'].values, df_spots_unseg['y'].values
ax.scatter(xb, yb, color = '#DCDCDC', alpha = 0.2, s = 0.015, label = 'background')

h, l = ax.get_legend_handles_labels()
plt.legend(h, l, loc = 'upper right', fontsize = 8, markerscale = 15)
<matplotlib.legend.Legend at 0x2b9d068869d0>
../../_images/f23836c99f7b93dd022c1d1177e5f4d381b5c2cbd3efdd68b4ca6222641f9c79.png

Create Bering object

# image-dependent segmentation
bg = br.BrGraph(df_spots_seg, df_spots_unseg, img, channels)
bg
<Bering.objects.bering.Bering_Graph at 0x2b9d069a42b0>
bg.segmented.head() # summary of cells
cx cy cz dx dy dz d labels
segmented
0 2100.7908 2347.0320 19.633743 2.8401 6.3408 1.886088 6.3408 Stromal
1 2104.6345 2175.2102 19.264338 14.1880 19.3207 8.264509 19.3207 Stromal
2 2106.9426 2265.7950 18.958155 20.9680 32.9023 6.240995 32.9023 Stromal
3 2102.4075 2155.4941 20.838385 7.8511 15.8475 6.636699 15.8475 Stromal
4 2103.3418 2354.0996 19.103582 13.3095 8.6716 4.977818 13.3095 B_Cells

Create training data

# Build graphs for GCN training purpose
br.graphs.BuildWindowGraphs(
    bg, 
    n_cells_perClass = 12, 
    window_width = 15.0, 
    window_height = 15.0, 
    n_neighbors = 30, 
)
br.graphs.CreateData(
    bg, 
    batch_size = 16, 
    training_ratio = 0.8, 
)

Training

br.train.Training(bg)
Training node classifier:  98%|█████████████████████████████████████████████████████████████████████████████████████████████████  | 49/50 [01:44<00:01,  1.68s/it]
../../_images/620b74d72362125ea6db73297d97ba180c7eb99fe529901bb3aefe42b4e31a34.png
Training node classifier: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:52<00:00,  2.25s/it]
Training edge classifier:  98%|█████████████████████████████████████████████████████████████████████████████████████████████████  | 49/50 [01:41<00:01,  1.95s/it]
../../_images/83548f7dd458b9fe700e299dd10cc64344715956a79f60e28d005e31beade8ea.png
Training edge classifier: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:46<00:00,  2.14s/it]
# save the trained model
import pickle
with open('xenium_dcis_janesick.pkl', 'wb') as f:
    pickle.dump(bg, f)

Visualizing model

# randomly select a cell
random_cell = cells = random.sample(bg.segmented.index.values.tolist(), 1)[0]

# plot node classification
df_window_raw, df_window_pred, predictions = br.pl.Plot_Classification(
    bg, 
    cell_name = random_cell,
    n_neighbors = 30, 
    zoomout_scale = 8,
)
../../_images/155946f102d007907e1391636754b1f075f67022e450e6ac461be68fde4b123b.png
# plot cell segmentation
br.pl.Plot_Segmentation(
    bg, 
    cell_name = random_cell,
    df_window_raw = df_window_raw, 
    df_window_pred = df_window_pred,
    predictions = predictions,
    n_neighbors = 30, 
    zoomout_scale = 8,
    use_image = True,
    pos_thresh = 0.6,
    resolution = 0.05,
    num_edges_perSpot = 100,
    min_prob_nodeclf = 0.3,
    n_iters = 20,
)
../../_images/fc94c063c690899d950551477647580ddab841d7388819664792b46c650eda67.png

After the model is trained, we can use the trained model to predict the cell types and segment all spots on the whole slice. After node classification and cell segmentation is completed, we generate single cell matrix in the end.

Node classification

Conduct node classification on the whole slice.

br.tl.node_classification(
    bg, bg.spots_all.copy(), 
    n_neighbors = 30, 
)
bg.spots_all.to_csv('spots_all.txt', sep = '\t')

Cell segmentation

br.tl.cell_segmentation(bg)

Get single cells

df_results, adata_ensembl, adata_segmented = br.tl.cell_annotation(bg)
df_results.to_csv('results.txt', sep = '\t')
adata_ensembl.write('results_cells_ensembled.h5ad')
adata_segmented.write('results_cells_segmented.h5ad')

print(f'Ensembled anndata: {adata_ensembl.shape}')
print(f'Segmented anndata: {adata_segmented.shape}')
... storing 'predicted_labels' as categorical
Ensembled anndata: (764, 540)
Segmented anndata: (6986, 540)

Single cell data analysis

import scanpy as sc
sc.settings.set_figure_params(dpi=80)

# run standard analysis on the ensembled anndata
adata_ensembl = br.tl.cell_analyze(adata_ensembl, min_counts = 5)

# first check the data quality of segmented cells
sc.pl.umap(adata_ensembl, color = ['n_counts','n_genes'])
2023-08-22 15:33:47.837666: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 15:33:48.076345: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
../../_images/ccbd7dee04f3e073dbfb0f2a96ed083c956b14b0931569613505a99fa11c97e8.png
# plot leiden clustering and predicted labels from Bering
fig, axes = plt.subplots(1, 2, figsize = (15, 5))
axes[0] = sc.pl.umap(adata_ensembl, color = ['leiden'], ax = axes[0], show = False)
axes[1] = sc.pl.umap(adata_ensembl, color = ['predicted_labels'], ax = axes[1], show = False)

plt.subplots_adjust(wspace = 0.5)
plt.tight_layout()
plt.show()
../../_images/de6d4394073a999aa6fab564948e04e3c2352b68d29e970da81b4a01cd97876e.png