Analysis of Kinase-based Viral Drug Connectivity Mapping¶

Intro¶

Kinases are pretty exciting to me, not only because protein phosphorylation plays a role in regulating most aspects of cell life, but also because abnormal phosphorylation is often the cause or consequence of disease. Accordingly, they've become an important group of drug targets that I wanted to take a look at.

In an attempt to unearth some computational biology skills, review some recent Pandas exposure, and get back into the spirit of biological research and discovery, I wanted to explore the relationship between kinase differential expression and viral infection. To do so, I ran a general query on the Gene Expression Atlas for all experiments that had to deal with viruses and for genes somehow related to kinases. Following this, the ultimate goal was to use connectivity mapping to find possible drug candidates correlated with a reversion in differential expression.

Some Background + Motivation¶

Recently, one of five 2016 Breakthrough Prizes in the Life Sciences was awarded to Professor Stephen Elledge, who among other incredible things is particularly known for his work in eludicating the human DNA damage response. Excitingly, this can basically be summarized as a protein kinase cascade, and so one could imagine directing their drug discovery search in regards to diseases that heavily involve DNA damage towards kinase inhibitors or activators. Cancers are probably what come to mind the most, and I had also done some Pandas notebook exploration into that, but many cells also undergo virus infection-associated DDR.

Picking the right virus infection targets¶

Another motivating idea is that going after viruses is difficult, particularly because of their rapid mutation cycle frequently rendering drugs obsolete. While we typically think of targeting foreign proteins such as external envelope proteins, another path to take might involve looking elsewhere from the virus entirely, and instead at host targets. While this sounds kind of like targeting your own cells with a therapeutic meant to kill viruses, viral infection might cause the upregulation of certain genes in host cells that allow them to propogate. If we can disrupt these then, the virus has a much harder time of getting by, and can't quite rely on its typical mutation frequency to escape dependency as fast. The idea is that drugs that disrupt host pathways that the virus relies on can't be as easily overcome by the virus through mutation because that would require the virus to evolve an entirely different functionality, which is a much lower probability event then changing a few coat protein residues.

More motivation¶

Along with talking to Dr. Andrew Taube at DESRES and learning about their group's work in MD with kinases, phosphatases, and viruses, and having my own probably farfetched ideas about one day being able to engineer a sort of hyper DDR and immunity against the introduction of viral DNA, I was interested in seeing what I could find through a short exploration with publically available datasets and Pandas tools.

I had also just taken MCB60: Cellular Biology and Molecular Medicine, which involved a lab component exploring DDR in yeast (slides viewable here). Three months of intermittent web lab work is kind of naturally inconclusive, so I was eager to see if I could get some more definite results computationally.

While in the short time putting this together I could only get as far as generating possibly connected drugs that would work towards reversing host cell expression after virus infection, and there's still a ton of analysis to be done with more time, the exercise was pretty fun and interesting. The later parts of the notebook could also definitely be cleaned up.

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

Load data of all virus experiments with differential expression of kinase-associated genes¶

df_diff_exp = pd.read_table('differentialResults-kinaseAndVirus.tsv')
df_diff_exp_up = df_diff_exp.loc[df_diff_exp['log2foldchange'] >= 0]
df_diff_exp_down = df_diff_exp.loc[df_diff_exp['log2foldchange'] < 0]

View all experiments¶

df_diff_exp['Comparison'].drop_duplicates().head()

0       'PR8 influenza virus' vs 'none' in 'wild type'
1    'Newcastle disease virus' at '10 hour' vs 'non...
2    'Newcastle disease virus' at '12 hour' vs 'non...
3    'Newcastle disease virus' at '10 hour' vs 'non...
4    'Newcastle disease virus' at '14 hour' vs 'non...
Name: Comparison, dtype: object

Isolating one experiment¶

As a first workthrough example, we'll look at the 'PR8 influenza virus' vs 'none' in 'wild type' experiment and assign the data to the df_diff_exp_0 dataframe.

df_diff_exp_0 = df_diff_exp.loc[df_diff_exp['Comparison']==df_diff_exp['Comparison'][0]]
df_diff_exp_0.head()

Separating genes into up and down regulation dataframes¶

df_diff_exp_0_up = df_diff_exp_0.loc[df_diff_exp_0['log2foldchange'] > 0]
df_diff_exp_0_down = df_diff_exp_0.loc[df_diff_exp_0['log2foldchange'] < 0]

df_diff_exp_0_up['Experiment Accession'] = df_diff_exp_0['Experiment Accession']
df_diff_exp_0_down['Experiment Accession'] = df_diff_exp_0['Experiment Accession']

Connecting to Biomart¶

Connects to a pretty versatile server for annotating and querying for various components (CMAP tags, Ensembl gene names, etc.)

Install module, define function to retrieve and query database

from biomart import BiomartServer

bm_server = BiomartServer('http://www.ensembl.org/biomart')
# bm_server.show_datasets()

# Use the GRCh38.p7 human ensemble dataset
hs_ensembl = bm_server.datasets['hsapiens_gene_ensembl']

Defining a get_probeset function¶

Returns dataframe with Ensembl Gene ID and Affy HG U133A probeset from the Biomart Ensembl server

Requires a dataframe with a Gene column name specifying the Ensembl code, and the Biomart Ensembl server previously specified.

def get_probeset(df, server):
    response = server.search({
            'filters': {
                'with_affy_hg_u133a': True,
                'ensembl_gene_id': df['Gene'].tolist()
            },
            'attributes': [
                'ensembl_gene_id', 'affy_hg_u133a'
            ]
        })
    try:
        df_probeset = pd.DataFrame([line.decode('utf-8').split('\t') for line in response.iter_lines()],
                                   columns=['Gene ID', 'Affy HG U133A probeset'])
        df_probeset['Experiment Accession'] = df['Experiment Accession']
        return df_probeset
    except:
        # Inserts placeholder gene (involved in red-green color blindness) if gene set is empty
        df_empty_probeset = pd.DataFrame({'Gene ID': ['ENSG00000102076'], 'Affy HG U133A probeset': ['221327_s_at']})
        df_empty_probeset['Experiment Accession'] = df['Experiment Accession']
        print('Empty dataset encountered. 0 probes returned')
        return df_empty_probeset

Getting probe sets for first experiment¶

df_probeset_0_up = get_probeset(df_diff_exp_0_up, hs_ensembl)
df_probeset_0_down = get_probeset(df_diff_exp_0_down, hs_ensembl)

Convert to .grp files for CMap querying¶

The Broad Institute Connectivity Map generates gene expression signatures associated with perturbations by certain small molecule drugs.

To query the service, the project builds signatures based on lists of HG-U133A probe sets associated with the up- and down-regulated genes stored in .grp files.

Function to do all of the above¶

Generates .grp files ready for CMap querying

*Note: make sure to have a directory named whatever you specify the dirname parameter in the same path level as the notebook directory

def save_cmap_tags(df, server, filename, dirname, lim=0):
    df_probeset = get_probeset(df, server)
    if lim > 0:
        df_probeset = df_probeset[:lim]
    numtags = df_probeset.shape[0]
    if numtags > 200:
        print('Recommended to limit max number of tags to 200')
    df_probeset['Affy HG U133A probeset'].drop_duplicates().to_csv(path=str(dirname)+'/'+str(filename)+'.grp', sep='\t', index=None)

save_cmap_tags(df_diff_exp_0_up, hs_ensembl, 'probesets_exp_0_up', 'probesets')
save_cmap_tags(df_diff_exp_0_down, hs_ensembl, 'probesets_exp_0_down', 'probesets')

Querying CMap¶

The above function save_cmap_tags will save a .grp files to the notebook directory. These can then be used to build a query at CMap

Function to generate CMap query tags from Gene Expression Atlas upload¶

i.e. doing almost all of the above in one method for all experiments

Because automation is good.

Possibly helper functions are noted below. (Example follows afterward).

*Note: make sure to have a directory named whatever you specify the dirname parameter in the same path level as the notebook directory

def save_cmap_tags_from_atlas(df, reg, server, dirname, lim=0):
    if reg == 0:
        diff_id = '-down'
    elif reg == 1:
        diff_id = '-up'
    df.groupby('Experiment Accession') \
        .apply(lambda df: save_cmap_tags(split_diff_exp(df, reg), server, str(df['Experiment Accession'] \
                                                                              .iloc[0])+diff_id, dirname))

Function to isolate experiment by Experiment Accession¶

def get_single_diff_exp(df, access_id):
    df_single_diff_exp = df.loc[df['Experiment Accession'] == access_id]
    return df_single_diff_exp

get_single_diff_exp(df_diff_exp, 'E-GEOD-13637').head(2)

Function to isolate data into up- and down-regulated gene sets¶

Parameter reg should be an integer 0 or 1

0 denotes downregulated genes
1 denotes upregulated genes

def split_diff_exp(df, reg):
    if reg == 0:
        return df.loc[df['log2foldchange'] < 0]
    elif reg == 1:
        return df.loc[df['log2foldchange'] > 0]

# Example building on above. Note the return of all positive log2 fold changes
split_diff_exp(get_single_diff_exp(df_diff_exp, 'E-GEOD-13637'), 1).sort_values('log2foldchange', ascending=True).head()

Saving CMap tags¶

# Hacky way to get both down- and up-regulated tag probe sets
for i in range(0, 2):
    save_cmap_tags_from_atlas(df=df_diff_exp, reg=i, server=hs_ensembl, dirname='probesets')

Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned

The result:¶

*Note: last 2 files are artifacts of the example workthrough.

Analysis of connectivity with CMAP¶

We now have .grp probe set files that can be used to generate a signature and query for CMAP. After doing so, we can download the resulting table to identify perturbations that have antagonistic gene expression effects denoted by negative enrichment scores (which could be good for identifying possible drugs).

Example Analysis with E-GEOD-8717¶

Querying CMap with .grp files¶

Query complete!¶

Loading the connectivity mapping dataframe¶

df_E_GEOD_8717 = pd.read_excel('permutedResults-E-GEOD-8717.xls')
df_E_GEOD_8717.head()

Identifying possible connected drug candidates¶

def find_candidates(df):
    df_cand = df.loc[df['enrichment'] < 0].sort_values(['rank', 'enrichment'], ascending=True)
    return df_cand

find_candidates(df_E_GEOD_8717).head()

Loading all CMap dataframes¶

df_E_GEOD_13637 = pd.read_excel('permutedResults-E-GEOD-13637.xls')
df_E_GEOD_17156 = pd.read_excel('permutedResults-E-GEOD-17156.xls')
df_E_GEOD_17400 = pd.read_excel('permutedResults-E-GEOD-17400.xls')
df_E_GEOD_18791 = pd.read_excel('permutedResults-E-GEOD-18791.xls')
df_E_GEOD_18816 = pd.read_excel('permutedResults-E-GEOD-18816.xls')
df_E_GEOD_19665 = pd.read_excel('permutedResults-E-GEOD-19665.xls')
df_E_GEOD_22522 = pd.read_excel('permutedResults-E-GEOD-22522.xls')
df_E_GEOD_31193 = pd.read_excel('permutedResults-E-GEOD-31193.xls')
df_E_GEOD_37571 = pd.read_excel('permutedResults-E-GEOD-37571.xls')
df_E_GEOD_40844 = pd.read_excel('permutedResults-E-GEOD-40844.xls')
df_E_MEXP_1274 = pd.read_excel('permutedResults-E-MEXP-1274.xls')
df_E_MEXP_3582 = pd.read_excel('permutedResults-E-MEXP-3582.xls')

find_candidates(df_E_GEOD_13637)[:100]

	Gene	Organism	Experiment Accession	Comparison	log2foldchange	pValue	tStatistics
0	ENSG00000169245	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	9.9	4.069023e-10	64.458303
11	ENSG00000271503	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	8.6	4.331436e-10	62.713264
26	ENSG00000134326	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	7.6	4.069023e-10	69.999440
35	ENSG00000271503	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	7.4	4.346290e-08	31.419356
44	ENSG00000138646	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	7.2	4.069023e-10	65.716153

	Gene	Organism	Experiment Accession	Comparison	log2foldchange	pValue	tStatistics
0	ENSG00000169245	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	9.9	4.069023e-10	64.458303
11	ENSG00000271503	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	8.6	4.331436e-10	62.713264

	Gene	Organism	Experiment Accession	Comparison	log2foldchange	pValue	tStatistics
999	ENSG00000137752	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	2.7	7.423240e-07	20.756294
908	ENSG00000204264	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	2.7	7.830226e-06	15.050237
998	ENSG00000121858	homo sapiens	E-GEOD-13637	'H5N1 influenza virus' vs 'none' in 'wild type'	2.7	3.086127e-04	8.774171
984	ENSG00000047410	homo sapiens	E-GEOD-13637	'pEGZ-transfected (empty vector)' vs 'wild typ...	2.7	1.316403e-04	12.403622
983	ENSG00000023445	homo sapiens	E-GEOD-13637	'PR8 influenza virus' vs 'none' in 'wild type'	2.7	5.217151e-06	16.040361

	rank	cmap name	mean	n	enrichment	specificity	percent non-null
0	1	ouabain	0.777	4	0.979	0.0051	100
1	2	digoxin	0.778	4	0.978	0.0047	100
2	3	emetine	0.749	4	0.955	0.007	100
3	4	anisomycin	0.748	4	0.947	0.0412	100
4	5	cicloheximide	0.652	4	0.941	0.0226	100

	rank	cmap name	mean	n	enrichment	p	specificity	percent non-null
24	25	naloxone	-0.379	6	-0.680	0.00288	0.0155	66
37	38	talampicillin	-0.301	4	-0.748	0.00802	0.0214	50
38	39	phensuximide	-0.433	4	-0.748	0.00808	0.0191	75
44	45	NU-1025	-0.612	2	-0.926	0.01131	0.0107	100
45	46	ganciclovir	-0.275	4	-0.721	0.01229	0.0113	50

	rank	cmap name	mean	n	enrichment	p	specificity	percent non-null
1	2	H-7	-0.713	4	-0.944	0	0.0909	100
10	11	neostigmine bromide	-0.574	4	-0.831	0.00147	0.0115	100
11	12	sulfamethizole	-0.728	4	-0.821	0.00193	0	100
13	14	bambuterol	-0.530	4	-0.795	0.00348	0	75
14	15	mexiletine	-0.302	6	-0.652	0.00481	0	50
15	16	decamethonium bromide	-0.318	4	-0.775	0.00527	0.0059	50
17	18	lisinopril	-0.657	3	-0.854	0.00619	0.0216	100
22	23	bufexamac	-0.299	4	-0.739	0.00905	0.0177	50
23	24	metolazone	-0.492	5	-0.670	0.00907	0	80
24	25	diltiazem	-0.398	5	-0.670	0.00907	0.0313	60
25	26	carbenoxolone	-0.397	4	-0.736	0.00975	0.0126	75
28	29	papaverine	-0.470	4	-0.729	0.01104	0.0126	75
29	30	pivmecillinam	-0.267	4	-0.728	0.01118	0.0221	50
30	31	epitiostanol	-0.404	4	-0.726	0.01154	0.037	75
34	35	alprenolol	-0.417	4	-0.713	0.01373	0.01	75
35	36	metaraminol	-0.504	4	-0.702	0.01643	0.0112	75
37	38	lymecycline	-0.469	4	-0.694	0.01846	0.0352	75
38	39	dilazep	-0.458	5	-0.627	0.01868	0.0345	80
42	43	ethotoin	-0.291	6	-0.573	0.02161	0.1414	66
43	44	riluzole	-0.415	5	-0.616	0.02289	0.019	80
44	45	alfaxalone	-0.487	3	-0.776	0.02299	0.0319	66
48	49	raubasine	-0.317	4	-0.670	0.02638	0	50
50	51	canavanine	-0.351	3	-0.756	0.02981	0.0412	66
51	52	2,6-dimethylpiperidine	-0.297	5	-0.597	0.03038	0.0108	60
52	53	paroxetine	-0.319	4	-0.660	0.03044	0.0398	50
54	55	oxprenolol	-0.497	4	-0.656	0.03195	0.0396	75
55	56	sitosterol	-0.532	4	-0.656	0.03213	0.1071	75
56	57	labetalol	-0.534	4	-0.655	0.03251	0.036	75
58	59	eldeline	-0.305	6	-0.543	0.03613	0.0756	50
67	68	nicotinic acid	-0.259	4	-0.637	0.04156	0.0599	50
...	...	...	...	...	...	...	...	...
153	154	acetazolamide	-0.234	4	-0.534	0.13698	0.0939	50
154	155	phenylpropanolamine	-0.230	4	-0.533	0.13724	0.0488	50
156	157	chlormezanone	-0.154	4	-0.532	0.13985	0.1035	50
157	158	xylazine	-0.186	4	-0.531	0.14011	0.1421	50
158	159	ipratropium bromide	-0.442	3	-0.600	0.14141	0.2935	66
159	160	ondansetron	-0.310	4	-0.530	0.14223	0.2481	50
160	161	aminoglutethimide	-0.248	3	-0.599	0.14262	0.1369	66
163	164	chrysin	-0.486	3	-0.595	0.14796	0.3788	66
164	165	citalopram	-0.321	4	-0.525	0.14886	0.1736	50
165	166	cefsulodin	-0.336	4	-0.523	0.15156	0.1264	50
166	167	doxorubicin	-0.463	3	-0.592	0.15239	0.5263	66
168	169	flucytosine	-0.261	4	-0.522	0.15324	0.2023	50
169	170	proglumide	-0.325	5	-0.472	0.15358	0.1312	60
170	171	procyclidine	-0.346	4	-0.521	0.15481	0.1437	50
172	173	indapamide	-0.259	6	-0.431	0.15742	0.1364	50
173	174	bephenium hydroxynaphthoate	-0.396	5	-0.468	0.15889	0.2542	60
174	175	hydroflumethiazide	-0.379	5	-0.467	0.16041	0.1804	60
175	176	tetramisole	-0.161	4	-0.516	0.16145	0.1136	50
176	177	mecamylamine	-0.444	3	-0.585	0.16293	0.1694	66
177	178	BCB000038	-0.361	4	-0.514	0.16408	0.2011	50
178	179	trioxysalen	-0.342	4	-0.513	0.16509	0.4873	50
179	180	fluticasone	-0.112	4	-0.513	0.16541	0.2675	50
181	182	mefenamic acid	-0.274	5	-0.463	0.16762	0.0692	60
182	183	alsterpaullone	-0.287	3	-0.582	0.16771	0.4929	66
184	185	propidium iodide	-0.245	4	-0.511	0.16867	0.0861	50
185	186	scoulerine	-0.211	4	-0.510	0.17003	0.1597	50
188	189	ioxaglic acid	-0.454	3	-0.578	0.17388	0.1526	66
191	192	cinoxacin	-0.335	4	-0.501	0.18353	0.2512	50
192	193	exisulind	-0.267	2	-0.696	0.18466	0.1322	50
194	195	trimethoprim	-0.335	5	-0.453	0.18756	0.1859	60