Basic usage¶

[1]:

import gnt
import pandas as pd

Data¶

First, we load data from Najm et al. “Orthologous CRISPR–Cas9 enzymes for combinatorial genetic screens.” These data should be in the format: guide 1, guide 2, gene 1, gene 2, followed by different conditions that were screened.

[2]:

lfcs = pd.read_csv('https://raw.githubusercontent.com/PeterDeWeirdt/bigpapi/master/data/processed/bigpapi_lfcs.csv')
lfcs

[2]:

	U6 Sequence	H1 Sequence	U6 gene	H1 gene	Day 21_786O	Day 21_A375	Day 21_A549	Day 21_HT29	Day 21_Meljuso	Day 21_OVCAR8
0	AAAGTGGAACTCAGGACATG	AAAAAAAGAGTCGAATGTTTT	HPRT intron	6T	0.421135	0.250043	0.725424	0.635972	0.127104	0.245427
1	AAAGTGGAACTCAGGACATG	AAAGAGTCCACTCTGCACTTG	HPRT intron	UBC	0.040784	0.125369	0.343278	0.524569	-0.175984	0.371689
2	AAAGTGGAACTCAGGACATG	AACAGCTCCGTGTACTGAGGC	HPRT intron	CD81	0.711486	0.857567	1.513217	0.970841	0.630948	0.675330
3	AAAGTGGAACTCAGGACATG	AAGACGAAATTGAAGACGAAG	HPRT intron	CD81	0.451992	0.588394	1.283543	0.771713	0.409791	0.643640
4	AAAGTGGAACTCAGGACATG	AAGCGTACTGCTCATCATCGT	HPRT intron	HSP90AA1	0.477678	-0.652709	0.442170	0.021827	0.187209	-0.120412
...	...	...	...	...	...	...	...	...	...	...
9179	TTCTGACTACAACATCCAGA	TTGCTTTCATTTAATGCTACA	UBB	PARP2	-0.228266	-0.371034	-0.014272	0.570256	-0.437522	-0.717872
9180	TTCTGACTACAACATCCAGA	TTGGGACGAGTCCTGTGAGAA	UBB	IMPDH1	-0.178963	-0.024237	-0.323317	0.630812	-0.421810	0.197531
9181	TTCTGACTACAACATCCAGA	TTTAGGAATTGCTGTTGGGAC	UBB	HPRT intron	-0.266031	-0.429865	-0.145153	0.147415	-0.454209	-0.266580
9182	TTCTGACTACAACATCCAGA	TTTCCATCACTTGGTTGAATA	UBB	BCL2A1	-0.295739	-0.221819	-0.173578	0.231540	-0.430689	-0.377972
9183	TTCTGACTACAACATCCAGA	TTTGCCTGTGAGGTGGCCGCC	UBB	CD81	-0.001195	0.419874	0.815933	0.678013	0.211364	-0.333385

9184 rows × 10 columns

Calculating residuals¶

From the log fold changes, we calculate residuals at the guide level. We reason that interactors for a given “anchor” guide deviates from the expected range of LFCs of its “target” pairs. We fit a linear model between the median LFC of targets paired with controls and the average LFC of constructs with both an anchor and target guide. Negative residuals from this line indicate a synthetic lethal relationship, whereas positive residuals represent a buffering interaction.

[3]:

guide_residuals, model_info = gnt.get_guide_residuals(lfcs, ['CD81', 'HPRT intron'])
guide_residuals.sort_values('residual_z')

[3]:

	anchor_guide	target_guide	anchor_gene	target_gene	condition	lfc	lfc_target	residual	residual_z
3822	AAGCACCAGATCATGCACCG	TCGTGGGCACAAGGTCCTACA	MAP2K2	MAP2K1	Day 21_HT29	-3.339721	-1.462993	-1.855250	-7.024817
76126	GGGTAGTTCTCTCTTCATGGA	TCATCAACAGAGCCCGCCAA	WEE1	WEE1	Day 21_HT29	-5.676895	-1.874498	-2.627443	-6.821507
70479	GGCGATCTGGGGGAGGGGTGC	GCTGTCAGGAGTATTCTGAC	CD81	CHEK1	Day 21_HT29	-5.731891	-4.358044	-1.587411	-6.419596
2106	AACAGCTCCGTGTACTGAGGC	TGGTATTGGAATAACTCACA	CD81	CHEK1	Day 21_HT29	-5.921955	-4.476059	-1.468627	-6.151146
49800	GAAAGTATCTCGTTACTGGAA	TGGTATTGGAATAACTCACA	BRCA1	CHEK1	Day 21_HT29	-5.573397	-4.476059	-1.366310	-6.052855
...	...	...	...	...	...	...	...	...	...
68740	GGATCACGCCTCCAGCCAGC	GATCACGCCTCCAGCCAGCTG	CD81	CD81	Day 21_HT29	2.950638	0.978768	1.750542	6.790943
57533	GATCACGCCTCCAGCCAGCTG	GGATCACGCCTCCAGCCAGC	CD81	CD81	Day 21_786O	3.943046	0.835260	2.536494	8.249168
68452	GGATCACGCCTCCAGCCAGC	GATCACGCCTCCAGCCAGCTG	CD81	CD81	Day 21_786O	3.943046	1.040996	2.460461	8.347438
68932	GGATCACGCCTCCAGCCAGC	GATCACGCCTCCAGCCAGCTG	CD81	CD81	Day 21_OVCAR8	4.011292	1.315501	2.334258	8.510013
58013	GATCACGCCTCCAGCCAGCTG	GGATCACGCCTCCAGCCAGC	CD81	CD81	Day 21_OVCAR8	4.011292	1.005960	2.393177	8.561875

110208 rows × 9 columns

Model info¶

We can also look at the fit of the linear model for each guide, by considering its R2. A low R2 can represent a phenotypically dominant guide.

[4]:

model_info.sort_values('R2')

[4]:

	model	R2	f_pvalue	const	beta	anchor_guide	anchor_gene	condition
855	linear	0.148726	1.237419e-04	-4.285076	0.358985	GTTGTGCGCGTGCTCGAAGG	EEF2	Day 21_HT29
1064	linear	0.157745	6.744733e-05	-4.926558	0.501217	TGGTGTGCAAGGCGGGCATCA	EEF2	Day 21_A549
1059	linear	0.161153	5.042642e-05	-4.805119	0.403164	TGGTATTGGAATAACTCACA	CHEK1	Day 21_HT29
1062	linear	0.246355	3.133113e-07	-4.415317	0.388307	TGGTGTGCAAGGCGGGCATCA	EEF2	Day 21_786O
1017	linear	0.277469	3.529957e-08	-3.829045	0.540500	TGCCAACCTCCGACAAAGGT	EEF2	Day 21_HT29
...	...	...	...	...	...	...	...	...
406	linear	0.991998	2.324465e-100	0.294490	1.003528	CCTGCACTCGGAGAAGAACG	AKT1	Day 21_Meljuso
347	linear	0.992036	2.092961e-99	-0.003417	1.037822	CCAGGAATCCCAGTGCCTGC	CD81	Day 21_OVCAR8
403	linear	0.992262	4.794280e-101	0.101981	0.998951	CCTGCACTCGGAGAAGAACG	AKT1	Day 21_A375
430	linear	0.992594	6.112498e-102	-0.017872	1.011639	CGCCAACAACGCCAAGGCTG	CD81	Day 21_Meljuso
745	linear	0.993189	1.191829e-103	-0.013714	0.995770	GGGAAGGACAGGTCTAACAG	HPRT intron	Day 21_A375

1152 rows × 8 columns

Combining scores at the gene level¶

We then combine the residual for a gene pair by taking the z-score

\((\bar x - \mu)/(\sigma / \sqrt{n})\)

Where \(\bar x\), \(\mu\), \(\sigma\) are the sample mean, population mean, and population standard deviation of residuals, and \(n\) is the number of guide pairs.

[5]:

gene_scores = gnt.get_gene_residuals(guide_residuals, 'residual_z')
gene_scores.sort_values('pair_z_score')

[5]:

	condition	gene_a	gene_b	lfc	base_lfc_a	base_lfc_b	guide_pairs	pair_z_score_gene_a_anchor	pair_z_score_gene_b_anchor	pair_z_score
1556	Day 21_HT29	MAPK1	MAPK3	-1.754885	-0.272907	0.279881	18	-12.525889	-12.505035	-17.699536
2426	Day 21_OVCAR8	MAPK1	MAPK3	-0.616311	0.171391	0.511401	18	-12.116484	-11.518992	-16.712805
1991	Day 21_Meljuso	MAPK1	MAPK3	-1.194016	-0.318689	0.575274	18	-9.259513	-12.882693	-15.656904
1121	Day 21_A549	MAPK1	MAPK3	-1.291028	0.565824	0.293042	18	-10.960474	-9.671142	-14.588755
1999	Day 21_Meljuso	BCL2L1	MCL1	-2.643591	-1.173048	-0.004209	18	-8.671315	-11.529244	-14.283952
...	...	...	...	...	...	...	...	...	...	...
462	Day 21_A375	BCL2L1	BCL2L1	-0.544783	0.092347	0.092347	9	10.978989	10.978989	15.526634
2015	Day 21_Meljuso	MCL1	MCL1	-0.561327	-0.004209	-0.004209	9	11.730678	11.730678	16.589684
1494	Day 21_HT29	MAP2K1	MAP2K1	-1.443300	-0.864259	-0.864259	8	12.365253	12.365253	17.487108
1767	Day 21_Meljuso	BCL2L1	BCL2L1	-2.519740	-1.173048	-1.173048	9	12.901715	12.901715	18.245780
275	Day 21_786O	MCL1	MCL1	0.054351	0.031233	0.031233	9	13.223069	13.223069	18.700243

2610 rows × 10 columns

Other models: spline, fixed slope and quadratic¶

There are options to implement different models for calculating residuals at the guide level, including spline, fixed slope and quadratic

[6]:

spline_residuals, spline_model_info = gnt.get_guide_residuals(lfcs, ['CD81', 'HPRT intron'], model='spline')

[7]:

spline_residuals.sort_values('residual_z')

[7]:

	anchor_guide	target_guide	anchor_gene	target_gene	condition	lfc	lfc_target	residual	residual_z
3822	AAGCACCAGATCATGCACCG	TCGTGGGCACAAGGTCCTACA	MAP2K2	MAP2K1	Day 21_HT29	-3.339721	-1.462993	-1.863573	-7.072779
76126	GGGTAGTTCTCTCTTCATGGA	TCATCAACAGAGCCCGCCAA	WEE1	WEE1	Day 21_HT29	-5.676895	-1.874498	-2.429127	-6.594250
3726	AAGCACCAGATCATGCACCG	TCGTGGGCACAAGGTCCTACA	MAP2K2	MAP2K1	Day 21_A549	-2.394606	-0.333970	-2.073076	-5.994032
70479	GGCGATCTGGGGGAGGGGTGC	GCTGTCAGGAGTATTCTGAC	CD81	CHEK1	Day 21_HT29	-5.731891	-4.358044	-1.456496	-5.989739
6639	ACACCACCTGAAGTGACTCG	CACAGGTCTTTCCTTATGGGA	CHEK1	CHEK1	Day 21_HT29	-3.539920	-0.914817	-1.715034	-5.936140
...	...	...	...	...	...	...	...	...	...
68740	GGATCACGCCTCCAGCCAGC	GATCACGCCTCCAGCCAGCTG	CD81	CD81	Day 21_HT29	2.950638	0.978768	1.714901	6.907205
68932	GGATCACGCCTCCAGCCAGC	GATCACGCCTCCAGCCAGCTG	CD81	CD81	Day 21_OVCAR8	4.011292	1.315501	2.082427	7.943309
57533	GATCACGCCTCCAGCCAGCTG	GGATCACGCCTCCAGCCAGC	CD81	CD81	Day 21_786O	3.943046	0.835260	2.403812	8.161444
68452	GGATCACGCCTCCAGCCAGC	GATCACGCCTCCAGCCAGCTG	CD81	CD81	Day 21_786O	3.943046	1.040996	2.412874	8.223310
58013	GATCACGCCTCCAGCCAGCTG	GGATCACGCCTCCAGCCAGC	CD81	CD81	Day 21_OVCAR8	4.011292	1.005960	2.331811	8.407050

110208 rows × 9 columns

[8]:

spline_gene_scores = gnt.get_gene_residuals(spline_residuals, 'residual_z')
spline_gene_scores.sort_values('pair_z_score')

[8]:

	condition	gene_a	gene_b	lfc	base_lfc_a	base_lfc_b	guide_pairs	pair_z_score_gene_a_anchor	pair_z_score_gene_b_anchor	pair_z_score
1556	Day 21_HT29	MAPK1	MAPK3	-1.754885	-0.272907	0.279881	18	-12.806409	-11.354713	-17.084493
2426	Day 21_OVCAR8	MAPK1	MAPK3	-0.616311	0.171391	0.511401	18	-12.012608	-11.788136	-16.829668
1991	Day 21_Meljuso	MAPK1	MAPK3	-1.194016	-0.318689	0.575274	18	-9.445925	-12.361523	-15.420194
1121	Day 21_A549	MAPK1	MAPK3	-1.291028	0.565824	0.293042	18	-10.916840	-9.657316	-14.548126
1739	Day 21_HT29	WEE1	WEE1	-2.240272	-0.363547	-0.363547	8	-9.741398	-9.741398	-13.776417
...	...	...	...	...	...	...	...	...	...	...
2450	Day 21_OVCAR8	MCL1	MCL1	0.318145	0.345607	0.345607	9	11.242772	11.242772	15.899681
462	Day 21_A375	BCL2L1	BCL2L1	-0.544783	0.092347	0.092347	9	12.121925	12.121925	17.142990
2015	Day 21_Meljuso	MCL1	MCL1	-0.561327	-0.004209	-0.004209	9	12.724680	12.724680	17.995415
275	Day 21_786O	MCL1	MCL1	0.054351	0.031233	0.031233	9	13.243036	13.243036	18.728482
1767	Day 21_Meljuso	BCL2L1	BCL2L1	-2.519740	-1.173048	-1.173048	9	14.375425	14.375425	20.329920

2610 rows × 10 columns

[9]:

merged_predictions = spline_gene_scores.merge(gene_scores, how = 'inner', on = ['condition' ,'gene_a', 'gene_b'], suffixes=['_spline', '_linear'])

[10]:

merged_predictions.plot.scatter('pair_z_score_spline', 'pair_z_score_linear', alpha=0.4)

[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fd9bd68e370>

The spline fit does not give as much weight to essential gene EEF2, suggesting it can avoid this source of bias

[12]:

merged_predictions['difference'] = merged_predictions['pair_z_score_linear'] - merged_predictions['pair_z_score_spline']
merged_predictions.sort_values('difference').tail(5)

[12]:

	condition	gene_a	gene_b	lfc_spline	base_lfc_a_spline	base_lfc_b_spline	guide_pairs_spline	pair_z_score_gene_a_anchor_spline	pair_z_score_gene_b_anchor_spline	pair_z_score_spline	lfc_linear	base_lfc_a_linear	base_lfc_b_linear	guide_pairs_linear	pair_z_score_gene_a_anchor_linear	pair_z_score_gene_b_anchor_linear	pair_z_score_linear	difference
989	Day 21_A549	EEF2	EEF2	-4.524111	-2.250681	-2.250681	9	2.729341	2.729341	3.859871	-4.524111	-2.250681	-2.250681	9	4.711166	4.711166	6.662595	2.802724
554	Day 21_A375	EEF2	EEF2	-4.689237	-1.987910	-1.987910	9	-1.620841	-1.620841	-2.292216	-4.689237	-1.987910	-1.987910	9	0.819324	0.819324	1.158700	3.450915
2294	Day 21_OVCAR8	EEF2	EEF2	-4.712816	-2.330618	-2.330618	9	1.902640	1.902640	2.690739	-4.712816	-2.330618	-2.330618	9	4.855800	4.855800	6.867139	4.176399
1859	Day 21_Meljuso	EEF2	EEF2	-4.994169	-2.360364	-2.360364	9	0.870522	0.870522	1.231104	-4.994169	-2.360364	-2.360364	9	4.250344	4.250344	6.010894	4.779790
27	Day 21_786O	BCL2L1	BCL2L1	-4.341688	-2.433808	-2.433808	9	1.640050	1.640050	2.319381	-4.341688	-2.433808	-2.433808	9	5.684257	5.684257	8.038754	5.719373

[ ]: