Ramith's Space – https://ramith.fyi/

An inspiring open science journey to remember 💙

https://ramith.fyi/an-open-science-journey/ · Tue, 13 Feb 2024 04:29:00 GMT

Aya (the model and the dataset) will be released six hours from now. In the meantime, I thought I'd write down my thoughts on this journey and the things I learned while collaborating with so many people across the world who shared a common set of values and goals. I've been part of this effort since June 2023, and time flew by so fast that I had almost forgotten how I joined in the first place. So this is a reflection for me to look back on this day, from sometime in the future.


We all know that data is a cornerstone of AI advancements. But apart from English, most of the world's languages have negligible representation on the Internet. Can we change this with the power of community collaboration? That is what Aya was all about: a massive global effort to uplift under-resourced languages in today's natural language processing landscape.

Before elaborating on how I joined this project and its specifics, I want to think out loud about open science efforts. Over the course of this project, I saw,

  • people of various ages and levels of expertise gathering to build datasets for under-resourced languages
  • like-minded individuals coming together for a common goal and inspiring more people to work on these research questions
  • people who are new to research leading this effort, paying careful attention to very subtle details, and encapsulating the work in research papers [1,2].

Watching all of this unfold on Discord was so inspiring. And finally seeing a multilingual large language model (LLM) proficient in 101 languages – and, more importantly, seeing Aya-101 [1] handle my native language, Sinhala (සිංහල), too – was such a joy 🎉❤️.


I asked Aya these two questions in Sinhala: 1) වෙහෙසකර අභියෝගයක් සාර්ථකව ජය ගත් පසු කුමක්ද කරන්න ඕනේ? ("What should you do after successfully overcoming an exhausting challenge?") 2) ලංකාවෙ නිදහස් දිනය කවද්ද? ("When is Sri Lanka's Independence Day?")
While there is a long way to go for Sinhala LLMs, I'm honestly surprised that it follows Sinhala instructions quite nicely!


The Start

If I recall correctly, I joined the C4AI (Cohere For AI$^\diamond$) Discord server during my final year of undergrad. I saw so many wonderful initiatives by this community to help students and researchers who wanted to learn about AI and ultimately contribute to research. I didn't follow these initiatives closely due to other time commitments, but every once in a while I would check out what the community was up to.


Filling out a Google Form (May 2023)

On May 8th, I was getting ready for a morning run and saw this message on Discord.


Whenever I fill out something like this, I always think about whether it's something I can actually do given my other commitments$^\ddagger$ 😅

I knew I didn't have much time to dedicate, but the ask seemed reasonable – it sounded like knowing the language was enough – so I filled it out. Words like "multilingual" and "underrepresented languages" made the Google form very tempting.


An Email ✉️

June 2023 was a great month; I was finishing up experiments for my ICML 2023 workshop paper and writing it up. And then I received this email..

It was sort of unexpected.. it felt serious, but then again it seemed like something I could do in my free time (a weekly meeting and spreading the word about the project sounded good and doable 🤷🏻‍♂️).

So I replied that I'd love to help out and represent my native language!$^\dagger$

Journey as a Language Ambassador

I attended the first meeting, where I was introduced to Project Aya. The presentation was incredible... I thought to myself, "a great research group with a clear vision of what they want to achieve in 2023..". There was so much positivity and hope within this community, and I loved that energy!

Somewhat of a rough start for Sinhala 🤕

Now it was my turn to contribute to my language. I remember going to the Aya annotation platform$^\star$ and filling out my details (such as the languages I speak).

I was surprised to see Sinhala prompts and completions 😃, but I soon realized they were machine translations that, more often than not, didn't make much sense 🤕 (see the left side of the picture below for some context).

So we gradually started to refine those. Soon, with the help of many Sinhala contributors, we had a massive pool of prompts and completions that we kept refining.

Realizing that we need to speed up

From June to the end of September we gathered a lot of contributions, but we soon realized our pace was too slow 😬. So we visualized$^\star$ our goal and worked towards it, and we spread the message through many more contributors (kudos to Jalina, Chamod, Nawoda, and Chanuka).

(a discord message from the Aya Server) 

So we accelerated our pace! 🇱🇰

Over the remaining three months, all the Sinhala contributors helped immensely, and we even surpassed our original goals!

The pictures below capture some of the initiatives we took to gather more contributors..


📑 Papers accepted to workshops, Sampling and Optimization in Discrete Space (SODS) ፨ and Differentiable Almost Everything (DiffAE) 〆 at ICML 2023 🎉

https://ramith.fyi/papers-accepted-to-icml-workshops/ · Sat, 24 Jun 2023 07:07:44 GMT
Paper TL;DR : We introduce a differentiable approach to search for phylogenetic trees. We optimize the tree and ancestral sequences to reduce the total evolutionary steps (parsimony cost).


Check out our work at ICML 2023 workshops - Sampling and Optimization in Discrete Space (SODS) ፨  (Saturday, July 29) and Differentiable Almost Everything (DiffAE) 〆 (Friday, July 28)


Update 2023.09.19: Eric J. Ma has written a very detailed article on our paper's key contribution: making the trees and sequences differentiable. You can read it here. It does a great job of explaining our method.



Lion Optimizer (optimizer discovered through evolutionary search)

https://ramith.fyi/lion-optimizer-optimizer-discovered-through-evolutionary-search/ · Sun, 14 May 2023 18:54:13 GMT

I recently discussed the Lion optimizer at the Journal Club of the Wadduwage Lab. It's an optimizer discovered through regularized (aging) evolution, and the authors deal with the vast search space through clever tricks. You can find my slides and the video below (the video's audio quality is a bit low, though).
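For reference, here is a minimal NumPy sketch of the Lion update rule as I understand it from the paper (function and variable names are my own, not the authors' code):

import numpy as np

def lion_update(w, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # The update direction is only the *sign* of an interpolated momentum;
    # this sign-based update is what the evolutionary search discovered.
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    w = w - lr * (update + weight_decay * w)        # step with decoupled weight decay
    m = beta2 * m + (1.0 - beta2) * grad            # momentum is updated after the step
    return w, m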

ESM-2 (evolutionary-scale prediction of atomic level protein structure with a language model)

https://ramith.fyi/esm-2-evolutionary-scale-prediction-of-atomic-level-protein-structure-with-a-language-model/ · Fri, 10 Feb 2023 22:42:19 GMT
✏️
I write these paper 'summaries' to understand each paper clearly by summarizing it and synthesizing the related literature. I hope they might be helpful to some readers too. If you have any feedback, please write to me at hello<at>ramith.fyi or comment below.

Highlights of the ESM-2 Paper

💡
• Trains protein language models of up to 15B parameters$^\bigstar$

• Infers structure directly from the primary sequence using an LLM

• The LM leverages the evolutionary patterns captured during pretraining to produce atomic-level predictions

• An order of magnitude faster (up to 60x) at high-resolution structure prediction

• Presents the ESM Metagenomic Atlas (structural characterization of more than 617 million$^\ddagger$ metagenomic proteins$^\dagger$)


1. Introduction

1.1 - Structure and Function are Hidden in Sequences

The biological properties of a protein influence which position(s) in its sequence can undergo mutations. Based on these constraints, we can identify evolutionary patterns such as coevolution and conservation of amino acids, and from these patterns we can infer properties of a protein's function and structure.$^\star$

Usually we rely on aligning sequences before we can draw conclusions about function and structure. This intermediate representation, known as a multiple sequence alignment (MSA), is expensive to build because we have to 1) search for related sequences first$^\star$, and 2) align them.

What if we can get rid of this intermediate representation? That's one aspect this paper accomplishes.

1.2 - Large language models (LLMs)

Historically, LMs were pretrained with objectives such as predicting the next word in a sentence. But Devlin et al. [BERT] showed that masking some words in the input and trying to predict them (the "masked language modeling" objective, MLM) is a better pretraining strategy$^\star$.

1.3 - Contributions

Inspired by this widely adopted strategy, the authors hypothesise that learning to fill in missing amino acids forces the model to learn information valuable enough to infer structure. They therefore scale protein language models from 8 million up to 15 billion parameters. Doing so reveals the following:

  • Atomic-level structure prediction directly from sequence
  • A strong correlation between perplexity and structure-prediction accuracy
  • A 60x speed improvement at inference
  • No need to search for related sequences

Because of this one-to-two-order-of-magnitude speedup and the fact that no MSA is needed, they expand structure prediction to metagenomic proteins, a far larger and more diverse set. In summary, they:

  • Predict structures for all sequences (over 617M) in MGnify90$^\dagger$
    • Out of 617M proteins, 225M structures have high confidence.
      • Of the high-confidence structures, 76.8% are disjoint from the UniRef90 dataset by at least 90% sequence identity.
      • 12.6% have no experimental ground truth.

2. Method

2.1 - How does structure emerge from an LM trained on sequences?

The ESM-2 language model is trained on ~65 million unique sequences$^\bigstar$. With the MLM objective, we ask the model to predict missing pieces (amino acids) of a sequence using the neighbouring amino-acid context, so the model has to learn the inter-dependencies between amino acids. Previous work [1] and [2] showed$^\dagger$ that transformer models trained with MLM on protein sequences develop attention patterns that correspond to the residue–residue contact map.
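As a rough illustration of what the MLM objective looks like on protein sequences (a simplified sketch, not the actual ESM-2 training code, which additionally follows BERT's 80/10/10 masking scheme):

import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
tok = {a: i for i, a in enumerate(AA)}
MASK_ID = len(AA)  # extra index reserved for the mask token

def mask_sequence(seq, mask_prob=0.15):
    # Randomly hide ~15% of residues; the model must recover them from the surrounding context.
    ids = torch.tensor([tok[a] for a in seq])
    targets = ids.clone()
    mask = torch.rand(len(ids)) < mask_prob
    ids[mask] = MASK_ID
    targets[~mask] = -100  # cross-entropy ignores these; loss is computed only on masked positions
    return ids, targets

ids, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# loss = F.cross_entropy(model(ids).logits, targets, ignore_index=-100)  # "model" is hypothetical here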

After training the LM, the authors compute the contact map from the attention patterns using the approach in [2], which fits a logistic regression to identify contacts, as follows.

(Source: Rao, Roshan, et al. "Transformer protein language models are unsupervised structure learners." bioRxiv (2020))
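A schematic of that recipe (not the authors' implementation – the real one also applies average-product correction and trains on a small set of proteins) could look like this, where each residue pair's features are its symmetrized attention weights from every layer and head:

import numpy as np
from sklearn.linear_model import LogisticRegression

def contact_features(attn):
    # attn: (layers, heads, L, L) attention maps for one protein.
    # Symmetrize each map and flatten to one feature vector of size layers*heads per residue pair.
    sym = attn + attn.transpose(0, 1, 3, 2)
    return sym.reshape(sym.shape[0] * sym.shape[1], -1).T   # (L*L, layers*heads)

# Toy example: random "attention" for a length-50 protein and random contact labels.
rng = np.random.default_rng(0)
attn = rng.random((12, 8, 50, 50))
labels = rng.integers(0, 2, size=50 * 50)   # in practice: true residue-residue contacts

clf = LogisticRegression(max_iter=1000).fit(contact_features(attn), labels)
contact_prob = clf.predict_proba(contact_features(attn))[:, 1].reshape(50, 50)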

2.2 - Ok, what about atomic level structure? (Enter ESMFold)

While the contact map is extracted from the attention maps, to obtain the spatial coordinates of the atoms the authors use an equivariant transformer – the structure module introduced in AlphaFold. This module makes it possible to project out the atoms' spatial coordinates directly from the internal language-model representation. The full architecture is referred to in the paper as ESMFold.

Steps in ESMFold (sketched in code below)

  1. Process the sequence through ESM-2.
  2. Pass the representation learnt by ESM-2 to a series of folding blocks; each block sequentially updates a sequence representation and a pairwise representation.
  3. Pass the output to the structure module.
  4. Repeat with 3 steps of recycling (view code).
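In very rough pseudocode (a sketch of the control flow only – the module names here are placeholders, not the real ESMFold API):

def esmfold_sketch(sequence, esm2, folding_blocks, structure_module, n_recycles=3):
    # 1. Language model produces per-residue and pairwise representations.
    seq_repr, pair_repr = esm2(sequence)
    coords = None
    for _ in range(n_recycles + 1):            # initial pass + 3 recycling iterations
        s, z = seq_repr, pair_repr
        for block in folding_blocks:           # 2. each block updates both representations
            s, z = block(s, z, prev_structure=coords)
        coords = structure_module(s, z)        # 3. equivariant structure module -> 3D coordinates
    return coords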

(ESMFold architecture; image credit: Lin et al.)

Training: To train the structure model to output spatial coordinates, they use experimentally determined structures from the PDB (~25K clusters covering a total of ~325K structures), augmented with 12M structures predicted$^\ddagger$ with AlphaFold2.

Evaluation: 194 CAMEO proteins and 51 CASP14 proteins

This language-model-based approach vastly simplifies the usual SOTA structure-prediction pipeline by eliminating the need for the following$^\dagger$:

  • External evolutionary databases
  • Multiple sequence alignments (MSAs)
  • Templates


3. Results

3.1 - How well does it predict structures ?

As mentioned before, they evaluate performance on CAMEO and CASP14 proteins and check how well the structure was predicted using the TM-Score.

When predicting structure from single sequences alone, ESMFold achieves very good performance compared to AlphaFold and RoseTTAFold.

(Figure 2B from the ESM-2 paper)

3.2 - How important is the language model in the pipeline ?  

The key question is: how important is the representation learnt by the LM for the task of structure prediction? To quantify this we need a couple of metrics.

First, we need to characterize how well the language model (ESM-2) understands sequences; this is where perplexity comes in. We already have the TM-score to determine how well a predicted structure matches the ground truth.
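As a quick reminder, perplexity here is just the exponential of the average masked-token loss, so lower means the LM models the sequence better: $\text{perplexity} = \exp\big(-\frac{1}{|M|}\sum_{i \in M} \log p_\theta(x_i \mid x_{\setminus M})\big)$, where $M$ is the set of masked positions.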

Thus, the graph on the right of Fig. 2B shows that,

  • Predictions with high ESMFold TM-scores tend to come from sequences with low perplexity (numerically, the Pearson correlation coefficient is -0.55 on CAMEO and -0.67 on CASP14)

How can we achieve better perplexity?

Okay, now we know that having better language model representation (lower perplexity) leads to better structure prediction.
So how can we achieve better language model representation ? 🤔  Is scaling all you need ?

To answer this question, the authors explore the effect of scaling and look at what happens to the following:

  • Precision @ L
  • Change in perplexity
(Figure 1D of the paper)

They plot how the long-range precision @ L changes as we move from a smaller model (x-axis) to a larger one (y-axis). From the points above the diagonal, it seems that scaling does help achieve better long-range precision @ L (some proteins show improvement).

So is scaling 'the' answer ?

It seems it's not that simple. While P@L certainly increases with scale for some proteins, looking at how many evolutionarily related sequences were available tells another story: LMs do not perform well when there is little training data relevant to the query. That seems intuitive – more studying, more results – but it also suggests the gains lean more on memorization than on truly understanding the subject.



3.3 - Ok, what about prediction speed ?

  • A protein with 384 residues on a single NVIDIA V100 GPU => 14.2 seconds$^\star$
  • For shorter sequences, the speedup is ~60x

(ESMFold performs better for shorter sequences. However, since the pairwise representations have a complexity of $\mathcal{O}(n^3)$, performance gets worse for large sequence lengths. Other methods also need to search for and construct the MSA, which can take an additional >10 minutes.)

3.4 - Comparison with other protein language models


4. Conclusion

It's remarkable that scaling protein language models results in learning the structure hidden across databases of sequences, so that we no longer need to depend on the MSA.

Is it because the model has learnt to extract the signal we previously obtained through MSAs? And what can we say about sequences that had few evolutionarily related sequences in the training data – why does the model still struggle to reach decent performance on them? These would be very interesting directions to analyze.


Thanks for reading – I hope you found it useful. If you have any suggestions or comments, please share them below.

References

  1. Lin, Zeming, et al. "Evolutionary-scale prediction of atomic level protein structure with a language model." bioRxiv (2022): 2022-07.
  2. https://twitter.com/Eric_Wallace_/status/1592929060539469824

[Invited Talk] - Northeast Symposium on Biomedical Optics

https://ramith.fyi/invited-talk/ · Fri, 11 Nov 2022 19:30:47 GMT
NESBO 2022 | OCT Research
Northeast symposium on biomedical optics 2022

Abstract

Computational imaging performs intelligent measurements with a "brain" made of programmed optics such as metasurfaces. These programmable optics – though packed with billions of linear operations in a cubic millimeter – often perform poorly due to fabrication constraints. Here we propose a physics-informed quantization-aware training framework that accounts for these constraints and achieves robust designs. We discuss two types of metasurfaces, a learnable Fourier filter and a diffractive deep neural network, for applications such as phase imaging and phase-object classification while accounting for the aforementioned fabrication constraints.

Bio

Ramith is a Joint Post-Bac Fellow affiliated with the Wadduwage Lab and the So Lab in the Division of Science at Harvard University. He completed his B.Sc. degree in Electronic Engineering at the University of Moratuwa, Sri Lanka. As a Post-Bac Fellow at the Wadduwage Lab, he is currently working with Dr. Dushan Wadduwage on making learnable optical systems robust and realizable by factoring in practical considerations such as fabrication constraints. His research interests include using machine learning for scientific discovery, with a focus on the robustness, interpretability, and equitability of machine learning algorithms.

[Invited talk] - Nano-SymBioSys workshop at UiT, The Arctic University of Norway 🇳🇴

https://ramith.fyi/gave-talk-at-the-nano-symbiosys-workshop-at-uit-the-arctic-university-of-norway/ · Mon, 26 Sep 2022 07:13:12 GMT

Shared some aspects of the work @khpiyumantha & I are doing at Wadduwage Lab, during the Nano-SymBioSys workshop held at UiT, The Arctic University of Norway (@UiTNorgesarktis)

Workshop Day 1


📄 Paper accepted to IEEE Signal Processing Magazine

https://ramith.fyi/paper-accepted-to-ieee-signal-processing-magazine/ · Tue, 09 Aug 2022 01:09:00 GMT

Selected to Princeton Pathways to Graduate School Program

https://ramith.fyi/selected-to-princeton-pathways-to-graduate-school-program/ · Mon, 08 Aug 2022 18:38:00 GMT

Link - https://engineering.princeton.edu/graduate-studies/academic-pathways/prospective-graduate-students

"Princeton Engineering’s annual program Pathways to Graduate School for Rising College Seniors invites high-achieving students in science, engineering and math for a series of interactive workshops aimed at breaking down barriers and boosting success in applying for doctoral programs."

"Princeton Engineering brings together people from across academic disciplines, from industry, non-profits and government, and from all nations and backgrounds in a collaborative culture to achieve breakthroughs of benefit to humanity. Thus the Pathways program especially seeks candidates with strong potential to contribute – through their future research, teaching, and service – to the diversity and excellence of our academic community and STEM fields as a whole. Women and other historically underrepresented groups in STEM disciplines are encouraged to apply."
Princeton Engineering - Ramith Hettiarachchi
Started work as a Post Baccalaureate Fellow at Harvard

https://ramith.fyi/started-as-a-post-baccalaureate-fellow-at-harvard/ · Fri, 01 Jul 2022 02:50:00 GMT

📄 Internship project research paper accepted to ICCAR 2022

https://ramith.fyi/my-internship-project-research-paper-accepted-to-i/ · Wed, 13 Apr 2022 04:09:00 GMT

Introduction to Adam Optimizer & advancements leading to it

https://ramith.fyi/intro-to-optimizers/ · Wed, 09 Mar 2022 19:36:00 GMT

This is a presentation I gave at one of the Journal Club meetings of the computational imaging group at Harvard (wadduwagelab). It covers how (full-batch) gradient descent [1] evolved into the Adam optimizer [2] by tackling the optimization challenges along the way.
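For completeness, here is a minimal NumPy sketch of the Adam update rule from [2] (variable names are mine):

import numpy as np

def adam_update(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum (m) plus a per-parameter adaptive scale (v), both bias-corrected
    # because they start at zero; t is the 1-based step count.
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v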

You can download the slides below. 👇

Adam-Review-2.pdf

Access Harvard FASRC through vscode without entering password & two step code every time

https://ramith.fyi/harvard-fasrc-single-sign-on/ · Tue, 08 Mar 2022 11:10:50 GMT
I usually get annoyed at having to enter my password and the two-step verification code every time I connect to the server via vscode.

Worse, I need to go through the same process when:

  • Opening a new folder in vscode
  • The connection gets disrupted

Luckily there's a handy solution documented in the FASRC Docs. I found that doc page through Stack Overflow$^\star$.

The trick is to authenticate only once, then send our extra SSH sessions (vscode) through that connection.

Step 1 - Modify the ~/.ssh/config file
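The original post shows the config as a screenshot; a minimal example along the lines of the FASRC docs (the host alias harvard, the login hostname, and the username are placeholders you would adapt) is:

Host harvard
    HostName login.rc.fas.harvard.edu
    User your_username
    ControlMaster auto
    ControlPath ~/.ssh/%r@%h:%p.conn
    ControlPersist yes

ControlMaster, ControlPath, and ControlPersist are the OpenSSH options that let later sessions reuse the first authenticated connection.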

Step 2 - Open a background ssh connection

ssh -CX -o ServerAliveInterval=30 -fN harvard
🎉 Once you've done step #2 one time, vscode won't ask for credentials again because we have already set up a background ssh connection.

I encourage you to read the original documentation for more comprehensive detail about these two steps.

Update 2023.01.22: For some reason this doesn't work if I initiate the first connection from the terminal. However, if I connect from vscode first, then it works.

How to setup a JAX/Tensorflow 1.15 environment in the FASRC Cluster

https://ramith.fyi/how-to-setup-a-tensorflow-1-15-environment-in-the-fasrc-cluster/ · Fri, 25 Feb 2022 02:29:19 GMT

Update 2023.07.17 - Due to a cluster update, some of the packages listed here no longer exist. @cschesch kindly shared the process that worked for him in the comments below. The general process is the same; feel free to read on to learn about FASRC modules.


Note : This guide is only for setting up TF in the FASRC Cluster. I followed the official documentation listed in the references. Skip to that section if you want to learn more.

Background info

I had a lot of trouble trying to set up JAX and older TensorFlow versions on the FASRC cluster. What I later realized was that, since there are lots of diverse projects being done in FAS, the cluster supports many modules that can be loaded with a single command. 😆❤️

Ok, now let's proceed with installing tensorflow 1.15.

Identify which CUDA and cuDNN versions are required by the tensorflow version you need to install. (in our specific case, we need CUDA 10.0 and cuDNN 7.4)

Build from source | TensorFlow

So now we know that tensorflow_gpu-1.15 needs CUDA 10.0 and cuDNN 7.4

1. Identify FASRC Modules to load

In FAS-RC we can load additional runtime libraries (cublas, cufftw, …). To see what's available, you can run the command module-query cuda. After that we can identify that we need,

  • cuda/10.0.130-fasrc01
  • cudnn/7.4.1.5_cuda10.0-fasrc01

Identify which versions are available

[ramith@xxxxxxx ~]$ module-query cuda

-----------------------------------------------------------------------------------------------------------------------------
  cuDNN
-----------------------------------------------------------------------------------------------------------------------------
    Description:
      The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep
      neural networks.

    Versions:
      HeLmod CentOS 7
            cudnn/5.1_cuda8.0-fasrc01............... x86-64 binary built against cuda 8.0
            cudnn/6.0_cuda7.5-fasrc01............... x86-64 binary built against cuda 7.5
            cudnn/6.0_cuda8.0-fasrc01............... x86-64 binary built against cuda 8.0
            cudnn/7.0.5_cuda8.0-fasrc01............. x86-64 binary built against cuda 8.0
            cudnn/7.0.5_cuda9.1-fasrc01............. x86-64 binary built against cuda 9.1
            cudnn/7.0_cuda9.0-fasrc01............... x86-64 binary built against cuda 9.0
            cudnn/7.1_cuda9.0-fasrc01............... x86-64 binary built against cuda 9.0
            cudnn/7.3.1.20_cuda10.0-fasrc01......... x86-64 binary built against cuda 10
            cudnn/7.4.1.5_cuda10.0-fasrc01.......... x86-64 binary built against cuda 10
            cudnn/7.4.1.5_cuda9.0-fasrc01........... x86-64 binary built against cuda 9.0
            cudnn/7.4.1.5_cuda9.2-fasrc01........... x86-64 binary built against cuda 9.2
            cudnn/7.6.5.32_cuda10.0-fasrc01......... x86-64 binary built against cuda 10.0
            cudnn/7.6.5.32_cuda10.1-fasrc01......... x86-64 binary built against cuda 10.1
            cudnn/7.6.5.32_cuda10.2-fasrc01......... x86-64 binary built against cuda 10.2
            cudnn/8.0.4.30_cuda11.0-fasrc01......... x86-64 binary built against cuda 11.0.3
            cudnn/8.0.4.30_cuda11.1-fasrc01......... x86-64 binary built against cuda 11.1
            cudnn/8.1.0.77_cuda11.2-fasrc01......... x86-64 binary built against cuda 11.2


    To find detailed information about a module, search the full name.

      module-query cudnn/8.1.0.77_cuda11.2-fasrc01

    You may need to specify the build "flavor" to get a single record

      module-query cudnn/8.1.0.77_cuda11.2-fasrc01 --flavor 'HeLmod CentOS 7'
      

    

-----------------------------------------------------------------------------------------------------------------------------
  CUDA
-----------------------------------------------------------------------------------------------------------------------------
    Description:
      Module that activates the CUDA libraries

    Versions:
      HeLmod CentOS 7
            cuda/7.5.18-fasrc01..................... install cuda toolkit and samples
            cuda/8.0.61-fasrc01..................... install cuda toolkit and samples
            cuda/9.0-fasrc02........................ install cuda toolkit and samples
            cuda/9.1.85-fasrc01..................... install cuda toolkit and samples
            cuda/9.2.88-fasrc01..................... install cuda toolkit and samples
            cuda/10.0.130-fasrc01................... install cuda toolkit and samples
            cuda/10.1.243-fasrc01................... install cuda toolkit and samples
            cuda/10.2.89-fasrc01.................... install cuda toolkit and samples
            cuda/11.0.3-fasrc01..................... install cuda toolkit and samples
            cuda/11.1.0-fasrc01..................... install cuda toolkit and samples
            cuda/11.4.2-fasrc01..................... install cuda toolkit and samples
      Easy Build
            CUDA/9.2.88.............................
            CUDA/10.0.130...........................


    To find detailed information about a module, search the full name.

      module-query CUDA/10.0.130

    You may need to specify the build "flavor" to get a single record

      module-query CUDA/10.0.130 --flavor 'Easy Build'
      
     

Load the selected CUDA and cuDNN version

module load cuda/10.0.130-fasrc01 cudnn/7.4.1.5_cuda10.0-fasrc01

2. Create Environment

conda create -n tf1.15_cuda10.0.130 python=3.6 numpy six wheel

3. Activate the conda environment & Install Tensorflow

source activate tf1.15_cuda10.0.130

pip install --upgrade tensorflow-gpu==1.15

4. Check if tensorflow uses GPU 👀

(tf1.15_cuda10.0.130) [ramith@xxxxxx ~]$ python
Python 3.6.13 |Anaconda, Inc.| (default, Jun  4 2021, 14:25:59) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
True

5. Add new environment to Jupyter Lab (so that we can select it)

conda install -c anaconda ipykernel -y
python -m ipykernel install --user --name=fyp_env

6. Working in JupyterLab ?

At first, even though TensorFlow used the GPU when run from the terminal, it didn't work in Jupyter 😬.

Ok, found the solution! So here's the thing: before you start the JupyterLab instance, you can actually specify which modules to load!

(When creating the Jupyter instance, you can include these modules!! 😃)
(Working!)

7. JAX ?

Initially I had lots of issues like the following,

  • Unimplemented: DNN library is not found.
  • Couldn't invoke ptxas --version

The issue was that I couldn't get cuDNN to work. I tried various things – editing PATH variables, etc. 😆 – but nothing seemed to work. Ultimately I got it working by loading cudnn/8.1.0.77_cuda11.2-fasrc01 when creating the Jupyter environment, which was pretty straightforward!! 😃
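As a quick sanity check (assuming JAX is already installed in the conda environment), you can confirm that JAX actually sees the GPU:

import jax

print(jax.devices())           # should list a GPU device rather than only the CPU
print(jax.numpy.ones(3) * 2)   # runs a tiny computation on the default device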

Important ❗️

Every time you connect to the cluster, you will need to load the additional CUDA and cuDNN modules as shown below, or specify the modules when you create the notebook (as shown above).

[ramith@xxxxxxx ~]$ module load cuda/10.0.130-fasrc01 cudnn/7.4.1.5_cuda10.0-fasrc01
[ramith@xxxxxxx ~]$ source activate tf1.15_cuda10.0.130

References

Deep Residual Learning for Image Recognition

https://ramith.fyi/deep-residual-learning-for-image-recognition/ · Wed, 26 Jan 2022 21:25:38 GMT

Highlights of ResNet Paper

  • Presents a residual learning framework to train very deep networks more easily.
  • Trains networks 8x deeper than VGG nets.
  • 3.57% error on the ImageNet test set and a 28% relative improvement on the COCO object detection dataset.
  • Topped the leaderboards in ILSVRC & COCO 2015 (ImageNet classification, detection, and localization; COCO detection & segmentation).

Introduction

From the ImageNet classification results in 2014–2015, it was evident that deeper networks help learn richer levels of features. As shown in the figure below, VGG-Net and GoogLeNet reduced the top-5 error rate further by going deeper.

So if we just stack more and more layers, does that help? It turns out it doesn't. One reason is the vanishing gradient problem, which was studied by Sepp Hochreiter in 1991 [14] and discussed over the years [1], [8]. This problem makes it difficult for a network to start converging at all. It has been addressed through various initialization methods and through batch normalization [16].

Even when a deeper network does start converging, researchers have found a degradation of accuracy: as the depth of a model increases, accuracy saturates and then degrades rapidly. As is evident from the learning curves, this is not due to overfitting$^{\star}$.



The authors of the ResNet paper argue that, even if we increase the depth, there should theoretically be a solution that gives at least the same accuracy: the shallow network plus extra layers that compute the identity transform. The problem seems to be that optimizers cannot reach that solution. So can we use a trick to get there more easily? That's what the authors hypothesize.

Methodology - Deep Residual Learning

Fitting a residual mapping

Let's say we need to approximate a function $\mathcal{H}$ with some set of layers of a neural network. The authors propose that, rather than learning $\mathcal{H}$ directly, we let those few layers approximate a residual function $\mathcal{H}(\mathrm{x})-\mathrm{x}$. 🤔

Let's denote this new function by $\mathcal{F}$. We have now rewritten the original function we need to approximate as $\mathcal{H}(\mathrm{x})=\mathcal{F}(\mathrm{x})+\mathrm{x}$.

Ok, so what benefit does this give?  🤷🏻

With this reformulation, $\mathcal{H}$ is split into the sum of a function $\mathcal{F}$ and the input. In the degradation problem we saw earlier, the issue was that learning the identity function was hard$^{\star}$. With residual learning, however, it should be easy for the optimizer to drive the weights of the layers so that $\mathcal{F}$ becomes a zero mapping; we are then left with $\mathcal{H}(\mathrm{x})=\mathcal{F}(\mathrm{x})+\mathrm{x}=\mathrm{x}$, the identity mapping.
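A minimal PyTorch sketch of this idea (a basic residual block with an identity skip; the projection/downsampling shortcuts from the paper are omitted):

import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # y = F(x) + x, where F is two 3x3 conv layers; if F is driven to zero, the block is the identity.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # the skip connection adds the input back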


ImageNet classification with deep convolutional neural networks (AlexNet)

https://ramith.fyi/imagenet-classification-with-deep-convolutional-neural-networks/ · Wed, 19 Jan 2022 18:26:26 GMT

The AlexNet paper by Krizhevsky et al. [1] was published in 2012. It is a highly influential paper in computer vision, showing that deep networks, together with efficient utilization of GPUs for training, can help build better models. The authors achieved top-1 and top-5 test-set error rates of 37.5% and 17.0%, while the best in ILSVRC 2010 was 47.1% and 28.2%.

Introduction

Before 2008, computer vision researchers mostly evaluated their methods on datasets with tens of thousands of images. Thanks to the ImageNet dataset [2], published in 2009, researchers got the opportunity to move from small-scale datasets such as MNIST and CIFAR-10/100 to a much larger dataset with over 15 million labeled images from more than 22,000 categories.

The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [3] which started in 2010 focuses on a subset of ImageNet which has 1.2 million training images, 50,000 validation images, and 150,000 testing images.

Architecture

Key characteristics of the AlexNet architecture : 8 learned layers (5 Conv, 3 Fully-connected)

1. ReLU nonlinearity

Inspired by Nair and Hinton's work [4], the authors utilize the ReLU non-linearity and find that it makes the training process several times faster.

2. Multiple GPU Training

The authors employ a cross-GPU parallelization approach to train AlexNet, where communication between GPUs happens only in the layers that require it. They utilize 2x GTX 580 GPUs.

Just to get an idea on how GPUs now and then (2012) compare with each other. Source : GadgetVersus

3. Local Response Normalization

Krizhevsky et al. found that their local normalization scheme helps achieve better generalization. For this procedure, they sum over $n$ adjacent kernel maps at the same spatial location, as shown in the equation below. This normalization is applied in certain layers of AlexNet.
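Reconstructing the equation (from the AlexNet paper; $a^i_{x,y}$ is the activity of kernel $i$ at position $(x,y)$ and $N$ is the total number of kernels in the layer): $b^i_{x,y} = a^i_{x,y} \Big/ \Big(k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} \big(a^j_{x,y}\big)^2\Big)^{\beta}$, with the constants $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$ used in the paper.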

Local response normalization reduced the top-1 and top-5 error rates by 1.4% and 1.2%.

4. Overlapping Pooling

The overlapping pooling scheme reduced the top-1 and top-5 error rates by 0.4% and 0.3%. Furthermore, the authors observed that models with overlapping pooling are slightly more difficult to overfit.

Reducing Overfitting

The authors utilize 1) data augmentation and 2) dropout to reduce overfitting. Data augmentation is done in two ways.

1) Generating image translations and horizontal reflections

2) Altering RGB channel intensities in training images

The authors cite their earlier work on dropout [6] and mention that it reduced overfitting substantially while roughly doubling the number of training iterations needed to converge.
