Analysis of Kinase-based Viral Drug Connectivity Mapping

Intro

Kinases are pretty exciting to me, not only because protein phosphorylation plays a role in regulating most aspects of cell life, but also because abnormal phosphorylation is often the cause or consequence of disease. Accordingly, they've become an important group of drug targets that I wanted to take a look at.

In an attempt to unearth some computational biology skills, review some recent Pandas exposure, and get back into the spirit of biological research and discovery, I wanted to explore the relationship between kinase differential expression and viral infection. To do so, I ran a general query on the Gene Expression Atlas for all experiments that had to deal with viruses and for genes somehow related to kinases. Following this, the ultimate goal was to use connectivity mapping to find possible drug candidates correlated with a reversion in differential expression.

Some Background + Motivation

Recently, one of five 2016 Breakthrough Prizes in the Life Sciences was awarded to Professor Stephen Elledge, who among other incredible things is particularly known for his work in eludicating the human DNA damage response. Excitingly, this can basically be summarized as a protein kinase cascade, and so one could imagine directing their drug discovery search in regards to diseases that heavily involve DNA damage towards kinase inhibitors or activators. Cancers are probably what come to mind the most, and I had also done some Pandas notebook exploration into that, but many cells also undergo virus infection-associated DDR.

Picking the right virus infection targets

Another motivating idea is that going after viruses is difficult, particularly because of their rapid mutation cycle frequently rendering drugs obsolete. While we typically think of targeting foreign proteins such as external envelope proteins, another path to take might involve looking elsewhere from the virus entirely, and instead at host targets. While this sounds kind of like targeting your own cells with a therapeutic meant to kill viruses, viral infection might cause the upregulation of certain genes in host cells that allow them to propogate. If we can disrupt these then, the virus has a much harder time of getting by, and can't quite rely on its typical mutation frequency to escape dependency as fast. The idea is that drugs that disrupt host pathways that the virus relies on can't be as easily overcome by the virus through mutation because that would require the virus to evolve an entirely different functionality, which is a much lower probability event then changing a few coat protein residues.

More motivation

Along with talking to Dr. Andrew Taube at DESRES and learning about their group's work in MD with kinases, phosphatases, and viruses, and having my own probably farfetched ideas about one day being able to engineer a sort of hyper DDR and immunity against the introduction of viral DNA, I was interested in seeing what I could find through a short exploration with publically available datasets and Pandas tools.

I had also just taken MCB60: Cellular Biology and Molecular Medicine, which involved a lab component exploring DDR in yeast (slides viewable here). Three months of intermittent web lab work is kind of naturally inconclusive, so I was eager to see if I could get some more definite results computationally.

While in the short time putting this together I could only get as far as generating possibly connected drugs that would work towards reversing host cell expression after virus infection, and there's still a ton of analysis to be done with more time, the exercise was pretty fun and interesting. The later parts of the notebook could also definitely be cleaned up.

In [290]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

Load data of all virus experiments with differential expression of kinase-associated genes

In [75]:
df_diff_exp = pd.read_table('differentialResults-kinaseAndVirus.tsv')
df_diff_exp_up = df_diff_exp.loc[df_diff_exp['log2foldchange'] >= 0]
df_diff_exp_down = df_diff_exp.loc[df_diff_exp['log2foldchange'] < 0]

View all experiments

In [76]:
df_diff_exp['Comparison'].drop_duplicates().head()
Out[76]:
0       'PR8 influenza virus' vs 'none' in 'wild type'
1    'Newcastle disease virus' at '10 hour' vs 'non...
2    'Newcastle disease virus' at '12 hour' vs 'non...
3    'Newcastle disease virus' at '10 hour' vs 'non...
4    'Newcastle disease virus' at '14 hour' vs 'non...
Name: Comparison, dtype: object

Isolating one experiment

As a first workthrough example, we'll look at the 'PR8 influenza virus' vs 'none' in 'wild type' experiment and assign the data to the df_diff_exp_0 dataframe.

In [157]:
df_diff_exp_0 = df_diff_exp.loc[df_diff_exp['Comparison']==df_diff_exp['Comparison'][0]]
df_diff_exp_0.head()
Out[157]:
Gene Organism Experiment Accession Comparison log2foldchange pValue tStatistics
0 ENSG00000169245 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 9.9 4.069023e-10 64.458303
11 ENSG00000271503 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 8.6 4.331436e-10 62.713264
26 ENSG00000134326 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 7.6 4.069023e-10 69.999440
35 ENSG00000271503 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 7.4 4.346290e-08 31.419356
44 ENSG00000138646 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 7.2 4.069023e-10 65.716153

Separating genes into up and down regulation dataframes

In [177]:
df_diff_exp_0_up = df_diff_exp_0.loc[df_diff_exp_0['log2foldchange'] > 0]
df_diff_exp_0_down = df_diff_exp_0.loc[df_diff_exp_0['log2foldchange'] < 0]

df_diff_exp_0_up['Experiment Accession'] = df_diff_exp_0['Experiment Accession']
df_diff_exp_0_down['Experiment Accession'] = df_diff_exp_0['Experiment Accession']

Connecting to Biomart

Connects to a pretty versatile server for annotating and querying for various components (CMAP tags, Ensembl gene names, etc.)

Install module, define function to retrieve and query database

In [79]:
from biomart import BiomartServer
In [80]:
bm_server = BiomartServer('http://www.ensembl.org/biomart')
# bm_server.show_datasets()

# Use the GRCh38.p7 human ensemble dataset
hs_ensembl = bm_server.datasets['hsapiens_gene_ensembl']

Defining a get_probeset function

Returns dataframe with Ensembl Gene ID and Affy HG U133A probeset from the Biomart Ensembl server

  • Requires a dataframe with a Gene column name specifying the Ensembl code, and the Biomart Ensembl server previously specified.
In [273]:
def get_probeset(df, server):
    response = server.search({
            'filters': {
                'with_affy_hg_u133a': True,
                'ensembl_gene_id': df['Gene'].tolist()
            },
            'attributes': [
                'ensembl_gene_id', 'affy_hg_u133a'
            ]
        })
    try:
        df_probeset = pd.DataFrame([line.decode('utf-8').split('\t') for line in response.iter_lines()],
                                   columns=['Gene ID', 'Affy HG U133A probeset'])
        df_probeset['Experiment Accession'] = df['Experiment Accession']
        return df_probeset
    except:
        # Inserts placeholder gene (involved in red-green color blindness) if gene set is empty
        df_empty_probeset = pd.DataFrame({'Gene ID': ['ENSG00000102076'], 'Affy HG U133A probeset': ['221327_s_at']})
        df_empty_probeset['Experiment Accession'] = df['Experiment Accession']
        print('Empty dataset encountered. 0 probes returned')
        return df_empty_probeset

Getting probe sets for first experiment

In [295]:
df_probeset_0_up = get_probeset(df_diff_exp_0_up, hs_ensembl)
df_probeset_0_down = get_probeset(df_diff_exp_0_down, hs_ensembl)

Convert to .grp files for CMap querying

The Broad Institute Connectivity Map generates gene expression signatures associated with perturbations by certain small molecule drugs.

To query the service, the project builds signatures based on lists of HG-U133A probe sets associated with the up- and down-regulated genes stored in .grp files.

Function to do all of the above

Generates .grp files ready for CMap querying

*Note: make sure to have a directory named whatever you specify the dirname parameter in the same path level as the notebook directory

In [204]:
def save_cmap_tags(df, server, filename, dirname, lim=0):
    df_probeset = get_probeset(df, server)
    if lim > 0:
        df_probeset = df_probeset[:lim]
    numtags = df_probeset.shape[0]
    if numtags > 200:
        print('Recommended to limit max number of tags to 200')
    df_probeset['Affy HG U133A probeset'].drop_duplicates().to_csv(path=str(dirname)+'/'+str(filename)+'.grp', sep='\t', index=None)
In [205]:
save_cmap_tags(df_diff_exp_0_up, hs_ensembl, 'probesets_exp_0_up', 'probesets')
save_cmap_tags(df_diff_exp_0_down, hs_ensembl, 'probesets_exp_0_down', 'probesets')

Querying CMap

The above function save_cmap_tags will save a .grp files to the notebook directory. These can then be used to build a query at CMap

Function to generate CMap query tags from Gene Expression Atlas upload

i.e. doing almost all of the above in one method for all experiments

Because automation is good.


Possibly helper functions are noted below. (Example follows afterward).

*Note: make sure to have a directory named whatever you specify the dirname parameter in the same path level as the notebook directory

In [268]:
def save_cmap_tags_from_atlas(df, reg, server, dirname, lim=0):
    if reg == 0:
        diff_id = '-down'
    elif reg == 1:
        diff_id = '-up'
    df.groupby('Experiment Accession') \
        .apply(lambda df: save_cmap_tags(split_diff_exp(df, reg), server, str(df['Experiment Accession'] \
                                                                              .iloc[0])+diff_id, dirname)) 

Function to isolate experiment by Experiment Accession

In [186]:
def get_single_diff_exp(df, access_id):
    df_single_diff_exp = df.loc[df['Experiment Accession'] == access_id]
    return df_single_diff_exp
In [191]:
get_single_diff_exp(df_diff_exp, 'E-GEOD-13637').head(2)
Out[191]:
Gene Organism Experiment Accession Comparison log2foldchange pValue tStatistics
0 ENSG00000169245 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 9.9 4.069023e-10 64.458303
11 ENSG00000271503 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 8.6 4.331436e-10 62.713264

Function to isolate data into up- and down-regulated gene sets

Parameter reg should be an integer 0 or 1

  • 0 denotes downregulated genes
  • 1 denotes upregulated genes
In [192]:
def split_diff_exp(df, reg):
    if reg == 0:
        return df.loc[df['log2foldchange'] < 0]
    elif reg == 1:
        return df.loc[df['log2foldchange'] > 0]
In [197]:
# Example building on above. Note the return of all positive log2 fold changes
split_diff_exp(get_single_diff_exp(df_diff_exp, 'E-GEOD-13637'), 1).sort_values('log2foldchange', ascending=True).head()
Out[197]:
Gene Organism Experiment Accession Comparison log2foldchange pValue tStatistics
999 ENSG00000137752 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 2.7 7.423240e-07 20.756294
908 ENSG00000204264 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 2.7 7.830226e-06 15.050237
998 ENSG00000121858 homo sapiens E-GEOD-13637 'H5N1 influenza virus' vs 'none' in 'wild type' 2.7 3.086127e-04 8.774171
984 ENSG00000047410 homo sapiens E-GEOD-13637 'pEGZ-transfected (empty vector)' vs 'wild typ... 2.7 1.316403e-04 12.403622
983 ENSG00000023445 homo sapiens E-GEOD-13637 'PR8 influenza virus' vs 'none' in 'wild type' 2.7 5.217151e-06 16.040361

Saving CMap tags

In [274]:
# Hacky way to get both down- and up-regulated tag probe sets
for i in range(0, 2):
    save_cmap_tags_from_atlas(df=df_diff_exp, reg=i, server=hs_ensembl, dirname='probesets')
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned
Empty dataset encountered. 0 probes returned

The result:

*Note: last 2 files are artifacts of the example workthrough.

Analysis of connectivity with CMAP

We now have .grp probe set files that can be used to generate a signature and query for CMAP. After doing so, we can download the resulting table to identify perturbations that have antagonistic gene expression effects denoted by negative enrichment scores (which could be good for identifying possible drugs).

Example Analysis with E-GEOD-8717

Querying CMap with .grp files

Query complete!

Loading the connectivity mapping dataframe

In [266]:
df_E_GEOD_8717 = pd.read_excel('permutedResults-E-GEOD-8717.xls')
df_E_GEOD_8717.head()
Out[266]:
rank cmap name mean n enrichment p specificity percent non-null
0 1 ouabain 0.777 4 0.979 0 0.0051 100
1 2 digoxin 0.778 4 0.978 0 0.0047 100
2 3 emetine 0.749 4 0.955 0 0.007 100
3 4 anisomycin 0.748 4 0.947 0 0.0412 100
4 5 cicloheximide 0.652 4 0.941 0 0.0226 100

Identifying possible connected drug candidates

In [282]:
def find_candidates(df):
    df_cand = df.loc[df['enrichment'] < 0].sort_values(['rank', 'enrichment'], ascending=True)
    return df_cand

find_candidates(df_E_GEOD_8717).head()
Out[282]:
rank cmap name mean n enrichment p specificity percent non-null
24 25 naloxone -0.379 6 -0.680 0.00288 0.0155 66
37 38 talampicillin -0.301 4 -0.748 0.00802 0.0214 50
38 39 phensuximide -0.433 4 -0.748 0.00808 0.0191 75
44 45 NU-1025 -0.612 2 -0.926 0.01131 0.0107 100
45 46 ganciclovir -0.275 4 -0.721 0.01229 0.0113 50

Loading all CMap dataframes

In [277]:
df_E_GEOD_13637 = pd.read_excel('permutedResults-E-GEOD-13637.xls')
df_E_GEOD_17156 = pd.read_excel('permutedResults-E-GEOD-17156.xls')
df_E_GEOD_17400 = pd.read_excel('permutedResults-E-GEOD-17400.xls')
df_E_GEOD_18791 = pd.read_excel('permutedResults-E-GEOD-18791.xls')
df_E_GEOD_18816 = pd.read_excel('permutedResults-E-GEOD-18816.xls')
df_E_GEOD_19665 = pd.read_excel('permutedResults-E-GEOD-19665.xls')
df_E_GEOD_22522 = pd.read_excel('permutedResults-E-GEOD-22522.xls')
df_E_GEOD_31193 = pd.read_excel('permutedResults-E-GEOD-31193.xls')
df_E_GEOD_37571 = pd.read_excel('permutedResults-E-GEOD-37571.xls')
df_E_GEOD_40844 = pd.read_excel('permutedResults-E-GEOD-40844.xls')
df_E_MEXP_1274 = pd.read_excel('permutedResults-E-MEXP-1274.xls')
df_E_MEXP_3582 = pd.read_excel('permutedResults-E-MEXP-3582.xls')
In [298]:
find_candidates(df_E_GEOD_13637)[:100]
Out[298]:
rank cmap name mean n enrichment p specificity percent non-null
1 2 H-7 -0.713 4 -0.944 0 0.0909 100
10 11 neostigmine bromide -0.574 4 -0.831 0.00147 0.0115 100
11 12 sulfamethizole -0.728 4 -0.821 0.00193 0 100
13 14 bambuterol -0.530 4 -0.795 0.00348 0 75
14 15 mexiletine -0.302 6 -0.652 0.00481 0 50
15 16 decamethonium bromide -0.318 4 -0.775 0.00527 0.0059 50
17 18 lisinopril -0.657 3 -0.854 0.00619 0.0216 100
22 23 bufexamac -0.299 4 -0.739 0.00905 0.0177 50
23 24 metolazone -0.492 5 -0.670 0.00907 0 80
24 25 diltiazem -0.398 5 -0.670 0.00907 0.0313 60
25 26 carbenoxolone -0.397 4 -0.736 0.00975 0.0126 75
28 29 papaverine -0.470 4 -0.729 0.01104 0.0126 75
29 30 pivmecillinam -0.267 4 -0.728 0.01118 0.0221 50
30 31 epitiostanol -0.404 4 -0.726 0.01154 0.037 75
34 35 alprenolol -0.417 4 -0.713 0.01373 0.01 75
35 36 metaraminol -0.504 4 -0.702 0.01643 0.0112 75
37 38 lymecycline -0.469 4 -0.694 0.01846 0.0352 75
38 39 dilazep -0.458 5 -0.627 0.01868 0.0345 80
42 43 ethotoin -0.291 6 -0.573 0.02161 0.1414 66
43 44 riluzole -0.415 5 -0.616 0.02289 0.019 80
44 45 alfaxalone -0.487 3 -0.776 0.02299 0.0319 66
48 49 raubasine -0.317 4 -0.670 0.02638 0 50
50 51 canavanine -0.351 3 -0.756 0.02981 0.0412 66
51 52 2,6-dimethylpiperidine -0.297 5 -0.597 0.03038 0.0108 60
52 53 paroxetine -0.319 4 -0.660 0.03044 0.0398 50
54 55 oxprenolol -0.497 4 -0.656 0.03195 0.0396 75
55 56 sitosterol -0.532 4 -0.656 0.03213 0.1071 75
56 57 labetalol -0.534 4 -0.655 0.03251 0.036 75
58 59 eldeline -0.305 6 -0.543 0.03613 0.0756 50
67 68 nicotinic acid -0.259 4 -0.637 0.04156 0.0599 50
... ... ... ... ... ... ... ... ...
153 154 acetazolamide -0.234 4 -0.534 0.13698 0.0939 50
154 155 phenylpropanolamine -0.230 4 -0.533 0.13724 0.0488 50
156 157 chlormezanone -0.154 4 -0.532 0.13985 0.1035 50
157 158 xylazine -0.186 4 -0.531 0.14011 0.1421 50
158 159 ipratropium bromide -0.442 3 -0.600 0.14141 0.2935 66
159 160 ondansetron -0.310 4 -0.530 0.14223 0.2481 50
160 161 aminoglutethimide -0.248 3 -0.599 0.14262 0.1369 66
163 164 chrysin -0.486 3 -0.595 0.14796 0.3788 66
164 165 citalopram -0.321 4 -0.525 0.14886 0.1736 50
165 166 cefsulodin -0.336 4 -0.523 0.15156 0.1264 50
166 167 doxorubicin -0.463 3 -0.592 0.15239 0.5263 66
168 169 flucytosine -0.261 4 -0.522 0.15324 0.2023 50
169 170 proglumide -0.325 5 -0.472 0.15358 0.1312 60
170 171 procyclidine -0.346 4 -0.521 0.15481 0.1437 50
172 173 indapamide -0.259 6 -0.431 0.15742 0.1364 50
173 174 bephenium hydroxynaphthoate -0.396 5 -0.468 0.15889 0.2542 60
174 175 hydroflumethiazide -0.379 5 -0.467 0.16041 0.1804 60
175 176 tetramisole -0.161 4 -0.516 0.16145 0.1136 50
176 177 mecamylamine -0.444 3 -0.585 0.16293 0.1694 66
177 178 BCB000038 -0.361 4 -0.514 0.16408 0.2011 50
178 179 trioxysalen -0.342 4 -0.513 0.16509 0.4873 50
179 180 fluticasone -0.112 4 -0.513 0.16541 0.2675 50
181 182 mefenamic acid -0.274 5 -0.463 0.16762 0.0692 60
182 183 alsterpaullone -0.287 3 -0.582 0.16771 0.4929 66
184 185 propidium iodide -0.245 4 -0.511 0.16867 0.0861 50
185 186 scoulerine -0.211 4 -0.510 0.17003 0.1597 50
188 189 ioxaglic acid -0.454 3 -0.578 0.17388 0.1526 66
191 192 cinoxacin -0.335 4 -0.501 0.18353 0.2512 50
192 193 exisulind -0.267 2 -0.696 0.18466 0.1322 50
194 195 trimethoprim -0.335 5 -0.453 0.18756 0.1859 60

100 rows × 8 columns

Future Analysis To Go Here

In [ ]: