%matplotlib inline

Analyze Xenium Breast Cancer data

This tutorial shows how to apply Bering to 10x Xenium Ductal Carcinoma In Situ (DCIS) data.

Xenium DCIS [Janesick et al., 2022] data was derived from: https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast. The data covers 313 unique genes. More details can be found in the bioRxiv preprint: https://www.biorxiv.org/content/10.1101/2022.10.06.510405v2

Import packages & data

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

import Bering as br

# load data
df_spots_all = br.datasets.xenium_dcis_janesick()
df_spots_seg = df_spots_all[df_spots_all['labels'] != 'background'] # foreground nodes
df_spots_unseg = df_spots_all[df_spots_all['labels'] == 'background'] # background nodes
img = None
channels = None

df_spots_seg.head()

Downloading dataset `xenium_dcis_janesick` from `https://figshare.com/ndownloader/files/41409108` as `None.tsv`

	x	y	z	features	segmented	labels
transcript_id
281616710639100	2100.0261	2350.3684	19.269732	LUM	74245	Stromal
281616710639104	2100.2126	2178.9346	18.475246	HAVCR2	74924	Stromal
281616710639107	2100.3599	2261.9116	20.490784	RUNX1	74265	Stromal
281616710639109	2100.3918	2155.2341	20.006990	TCIM	74911	Stromal
281616710639110	2100.5117	2353.7695	20.749512	CXCR4	74244	B_Cells

df_spots_unseg.head() # visualize unsegmented data

	x	y	z	features	segmented	labels
transcript_id
281616710639136	2101.5410	2207.5933	21.504648	LUM	-1	background
281616710639187	2104.3992	2216.3367	23.524862	LUM	-1	background
281616710639217	2105.7534	2207.2124	23.446508	LUM	-1	background
281616710639245	2107.0217	2203.3608	17.956509	DAPK3	-1	background
281616710639283	2108.5515	2226.5203	22.951225	NegControlCodeword_0505	-1	background

# visualize the spots
x, y = df_spots_seg['x'].values, df_spots_seg['y'].values
cell_types = df_spots_seg['labels'].values

fig, ax = plt.subplots(figsize = (6, 6))
for idx, cell_type in enumerate(np.unique(cell_types)):
    
    xc = x[np.where(cell_types == cell_type)[0]]
    yc = y[np.where(cell_types == cell_type)[0]]

    ax.scatter(xc, yc, s = 0.03, label = cell_type, color = np.random.rand(3,))

xb, yb = df_spots_unseg['x'].values, df_spots_unseg['y'].values
ax.scatter(xb, yb, color = '#DCDCDC', alpha = 0.2, s = 0.015, label = 'background')

h, l = ax.get_legend_handles_labels()
plt.legend(h, l, loc = 'upper right', fontsize = 8, markerscale = 15)

<matplotlib.legend.Legend at 0x2b9d068869d0>

../../_images/f23836c99f7b93dd022c1d1177e5f4d381b5c2cbd3efdd68b4ca6222641f9c79.png

Create Bering object

# image-dependent segmentation
bg = br.BrGraph(df_spots_seg, df_spots_unseg, img, channels)
bg

<Bering.objects.bering.Bering_Graph at 0x2b9d069a42b0>

bg.segmented.head() # summary of cells

	cx	cy	cz	dx	dy	dz	d	labels
segmented
0	2100.7908	2347.0320	19.633743	2.8401	6.3408	1.886088	6.3408	Stromal
1	2104.6345	2175.2102	19.264338	14.1880	19.3207	8.264509	19.3207	Stromal
2	2106.9426	2265.7950	18.958155	20.9680	32.9023	6.240995	32.9023	Stromal
3	2102.4075	2155.4941	20.838385	7.8511	15.8475	6.636699	15.8475	Stromal
4	2103.3418	2354.0996	19.103582	13.3095	8.6716	4.977818	13.3095	B_Cells

Create training data

# Build graphs for GCN training purpose
br.graphs.BuildWindowGraphs(
    bg, 
    n_cells_perClass = 12, 
    window_width = 15.0, 
    window_height = 15.0, 
    n_neighbors = 30, 
)

br.graphs.CreateData(
    bg, 
    batch_size = 16, 
    training_ratio = 0.8, 
)

Training

br.train.Training(bg)

Training node classifier:  98%|█████████████████████████████████████████████████████████████████████████████████████████████████  | 49/50 [01:44<00:01,  1.68s/it]

../../_images/620b74d72362125ea6db73297d97ba180c7eb99fe529901bb3aefe42b4e31a34.png

Training node classifier: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:52<00:00,  2.25s/it]
Training edge classifier:  98%|█████████████████████████████████████████████████████████████████████████████████████████████████  | 49/50 [01:41<00:01,  1.95s/it]

../../_images/83548f7dd458b9fe700e299dd10cc64344715956a79f60e28d005e31beade8ea.png

Training edge classifier: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:46<00:00,  2.14s/it]

# save the trained model
import pickle
with open('xenium_dcis_janesick.pkl', 'wb') as f:
    pickle.dump(bg, f)

Visualizing model

# randomly select a cell
random_cell = cells = random.sample(bg.segmented.index.values.tolist(), 1)[0]

# plot node classification
df_window_raw, df_window_pred, predictions = br.pl.Plot_Classification(
    bg, 
    cell_name = random_cell,
    n_neighbors = 30, 
    zoomout_scale = 8,
)

../../_images/155946f102d007907e1391636754b1f075f67022e450e6ac461be68fde4b123b.png

# plot cell segmentation
br.pl.Plot_Segmentation(
    bg, 
    cell_name = random_cell,
    df_window_raw = df_window_raw, 
    df_window_pred = df_window_pred,
    predictions = predictions,
    n_neighbors = 30, 
    zoomout_scale = 8,
    use_image = True,
    pos_thresh = 0.6,
    resolution = 0.05,
    num_edges_perSpot = 100,
    min_prob_nodeclf = 0.3,
    n_iters = 20,
)

../../_images/fc94c063c690899d950551477647580ddab841d7388819664792b46c650eda67.png

After the model is trained, we can use the trained model to predict the cell types and segment all spots on the whole slice. After node classification and cell segmentation is completed, we generate single cell matrix in the end.

Node classification

Conduct node classification on the whole slice.

br.tl.node_classification(
    bg, bg.spots_all.copy(), 
    n_neighbors = 30, 
)
bg.spots_all.to_csv('spots_all.txt', sep = '\t')

Cell segmentation

br.tl.cell_segmentation(bg)

Get single cells

df_results, adata_ensembl, adata_segmented = br.tl.cell_annotation(bg)
df_results.to_csv('results.txt', sep = '\t')
adata_ensembl.write('results_cells_ensembled.h5ad')
adata_segmented.write('results_cells_segmented.h5ad')

print(f'Ensembled anndata: {adata_ensembl.shape}')
print(f'Segmented anndata: {adata_segmented.shape}')

... storing 'predicted_labels' as categorical

Ensembled anndata: (764, 540)
Segmented anndata: (6986, 540)

Single cell data analysis

import scanpy as sc
sc.settings.set_figure_params(dpi=80)

# run standard analysis on the ensembled anndata
adata_ensembl = br.tl.cell_analyze(adata_ensembl, min_counts = 5)

# first check the data quality of segmented cells
sc.pl.umap(adata_ensembl, color = ['n_counts','n_genes'])

2023-08-22 15:33:47.837666: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 15:33:48.076345: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.

../../_images/ccbd7dee04f3e073dbfb0f2a96ed083c956b14b0931569613505a99fa11c97e8.png

# plot leiden clustering and predicted labels from Bering
fig, axes = plt.subplots(1, 2, figsize = (15, 5))
axes[0] = sc.pl.umap(adata_ensembl, color = ['leiden'], ax = axes[0], show = False)
axes[1] = sc.pl.umap(adata_ensembl, color = ['predicted_labels'], ax = axes[1], show = False)

plt.subplots_adjust(wspace = 0.5)
plt.tight_layout()
plt.show()

../../_images/de6d4394073a999aa6fab564948e04e3c2352b68d29e970da81b4a01cd97876e.png