Hi, I'm Arnav!

Résumé
arnavshah73@gmail.com
ashah108@jhu.edu
I am a second-year Master's student at Johns Hopkins University studying Computer Science.

I completed my undergraduate degree in Information Technology at Veermata Jijabai Technological Institute (VJTI), India.

I am currently advised by Prof. David Yarowsky and Prof. Paola Garcia as part of the Center for Language and Speech Processing at Johns Hopkins University, where I work on NLP and speech recognition for low-resource languages.

In my free time, I like to play the piano 🎹, listen to classical music, and watch horror/thriller movies (or re-watch/re-read Harry Potter ⚡ for the umpteenth time).

Work Experience


Johns Hopkins Bloomberg School of Public Health


Full-Stack Software Engineering Intern, Oct 2023 - Present
  • Collaborated with the Washington State Department of Licensing to develop an application assisting teen drivers in their path to mastering driving skills.
  • The application has been rolled out to over 2000 teen drivers in the state of Washington.
  • Worked on the front-end (interactive training and testing modules), back-end (API design, database design, user authentication) and application deployment.
Abbott


Software Engineering Intern, Heart Failure, Jan 2024 - Present

Data Science Intern, Global Data Science and Analytics, June 2023 - Dec 2023
  • Implemented a data analysis pipeline to identify proxy measurements for assessing the efficacy of Deep Brain Stimulation in Parkinson’s and Essential Tremor patients to reduce costs for post-market trials.
  • Developed interpretable and predictive machine learning models to forecast heart failure stages by leveraging both insurance claims and EHR data.
Center for Language and Speech Processing (CLSP), Johns Hopkins University


Research Assistant, Nov 2022 - Present
  • Working on ASR systems for over 700 low-resource languages (to be integrated into a language-learning app).
  • Building an open-source Universal Voice Activity Detection system trained on a variety of large-scale datasets.
Citicorp Services India Pvt. Ltd.


Full-Stack Software Engineering Intern, Equities Technology Division, May 2021 - July 2021
  • Built a Client Simulator that mimics client-side trading applications to test Citi's Order Management System.
  • Implemented the FIX protocol stack for communication with external APIs and interfaces.
  • Built the JUnit Test Suite for various FIX message scenarios.
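
For readers unfamiliar with FIX, the sketch below shows the tag=value wire format that such a protocol stack and its tests deal with. It is a minimal Python illustration, not the code written during the internship: the function names, the FIX.4.2 version string, and the sample fields are all assumptions chosen for the example. It relies only on the standard framing rules: fields are separated by the SOH byte, BodyLength (tag 9) counts the bytes between the BodyLength field and the CheckSum field, and CheckSum (tag 10) is the byte sum of everything before it, modulo 256, written as three digits.

```python
SOH = "\x01"  # FIX field delimiter


def build_fix_message(msg_type, fields, begin_string="FIX.4.2"):
    """Assemble a FIX message from (tag, value) pairs (hypothetical helper)."""
    body = f"35={msg_type}{SOH}" + "".join(f"{tag}={value}{SOH}" for tag, value in fields)
    # BodyLength counts every byte after the 9=...<SOH> field up to the CheckSum field.
    header = f"8={begin_string}{SOH}9={len(body.encode('ascii'))}{SOH}"
    checksum = sum((header + body).encode("ascii")) % 256
    return f"{header}{body}10={checksum:03d}{SOH}"


def parse_fix_message(raw):
    """Split a raw message into a list of (tag, value) pairs."""
    pairs = [field.split("=", 1) for field in raw.split(SOH) if field]
    return [(int(tag), value) for tag, value in pairs]


def is_checksum_valid(raw):
    """Recompute the checksum over everything before the 10= field."""
    head, sep, tail = raw.rpartition(f"{SOH}10=")
    if not sep:
        return False
    expected = sum((head + SOH).encode("ascii")) % 256
    return tail.rstrip(SOH) == f"{expected:03d}"
```

A test scenario in this style would build a message, confirm that it validates, then corrupt one field and confirm that validation fails.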


Projects


    Vis-Lock: Graphical Password Authentication


    A novel graphical password algorithm designed to resist brute-force attacks while remaining easy to remember.
    During registration, the user selects a sequence of images (and, within each image, a sequence of pixels/cells), which is encrypted and stored in the database. During log-in, the user selects their own images from a sequence of random images. Once the correct sequence is entered, the user is authenticated.
    Additionally, we use image captioning with named entity recognition (used to mask the keywords) to provide hints to the user in case the password is forgotten.
    This project was part of the Smart India Hackathon (SIH) 2022.
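
The registration and log-in flow described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the project's actual scheme: it stores a salted PBKDF2 hash of the selection instead of an encrypted copy, and the function names, the serialization format, and the iteration count are all hypothetical.

```python
import hashlib
import hmac
import os


def encode_selection(selection):
    """Serialize an ordered list of (image_id, [cell indices]) pairs.

    The 'image:cell,cell|image:cell' format is an assumption for this sketch.
    """
    parts = (f"{image_id}:{','.join(str(c) for c in cells)}" for image_id, cells in selection)
    return "|".join(parts).encode("utf-8")


def register(selection, iterations=100_000):
    """Return the (salt, digest) pair that would be stored for the user."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", encode_selection(selection), salt, iterations)
    return salt, digest


def verify(selection, salt, digest, iterations=100_000):
    """Check a log-in attempt against the stored digest in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", encode_selection(selection), salt, iterations)
    return hmac.compare_digest(candidate, digest)
```

Because the serialization preserves order, choosing the right images or cells in the wrong sequence fails verification.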

    Search Engine for COVID-19 literature


    A Semantic Search Engine for research papers on COVID-19 in various categories such as symptoms, influential factors, similar diseases and viruses, etc. This is a solution to the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle.
    The BioBERT model is used to generate sentence embeddings to encode semantics into the query and data.
    Faiss, an open-source library for efficient similarity search and clustering of dense vectors maintained by Facebook Research, is used to efficiently compute the similarity scores of thousands of documents with respect to the query.
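
The retrieval step can be illustrated with a small, dependency-free Python sketch. It assumes the embeddings already exist: the three-dimensional vectors in the usage note are toy values standing in for BioBERT sentence embeddings, and the brute-force loop stands in for the search that Faiss performs far more efficiently at scale. The function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k(query, documents, k=2):
    """Rank documents by cosine similarity to the query (exhaustive search).

    `documents` maps a document id to its embedding vector.
    """
    q = normalize(query)
    scored = []
    for doc_id, embedding in documents.items():
        score = sum(a * b for a, b in zip(q, normalize(embedding)))
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]
```

For example, with `documents = {"symptoms": [1, 0, 0], "transmission": [0, 1, 0], "mixed": [1, 1, 0]}`, the query `[0.9, 0.1, 0]` ranks `"symptoms"` first and `"mixed"` second.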

    Meeting Summarization: leveraging extractive summarization techniques for abstractive summarization


    There exist a large number of meetings in the form of university lectures, business conferences, academic talks, interviews, etc. Each contains an enormous amount of text in the form of conversations between multiple participants with varying levels of importance or influence. It is important not only to summarize the transcripts but also to take into account which participant says what and what role each is assigned.
    The proposed model builds on the HMNet model by Microsoft Research. We use a transformer-based architecture along with a clustering module to rank the top 'n' sentences and make the summary attend to them in the decoder.
    A simple attentive gating module is also introduced to suppress unimportant turn information.
    Universal sentence embeddings are used to represent the text, and role vectors are used to represent each speaker uniquely.

    OCR using AWS Textract


    A platform, hosted on Amazon Web Services (AWS), that reads and processes any type of document, accurately extracting text, handwriting, tables and other data without any manual effort.
    It uses Textract, an Optical Character Recognition (OCR) service offered by Amazon that can extract structured and unstructured information from images, PDFs, etc.
    Amazon S3 and DynamoDB were used to store all the forms and associated data.

    Remaining Useful Life (RUL) prediction for degrading ball bearings


    This is a solution to the IEEE PHM 2012 Prognostic Challenge. It focuses on estimating the Remaining Useful Life (RUL) of ball bearings, a critical problem for industrial machines that strongly affects the availability, security and cost-effectiveness of mechanical systems.
    We use a two-stage data-driven process for predicting the RUL using a single statistic derived from 14 time-domain features of the observed vibration signals. The degradation process is modeled as a Wiener process and the prior parameters of the underlying linear state-space model of degradation are continuously updated.
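
To make the Wiener-process idea concrete, here is a minimal Python sketch under simplifying assumptions. It treats the health statistic as a linear-drift Wiener process, estimates drift and diffusion by maximum likelihood from the observed increments, and uses the standard result that the expected first-passage time to a failure threshold is the remaining distance divided by the drift. It does not reproduce the project's state-space model or its continuous updating of prior parameters, and the function names and the threshold in the usage note are hypothetical.

```python
def estimate_drift_diffusion(times, values):
    """Maximum-likelihood drift and diffusion (sigma^2) from observed increments."""
    dts = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    dxs = [x1 - x0 for x0, x1 in zip(values, values[1:])]
    drift = sum(dxs) / sum(dts)
    diffusion = sum((dx - drift * dt) ** 2 / dt for dx, dt in zip(dxs, dts)) / len(dxs)
    return drift, diffusion


def mean_rul(current_value, threshold, drift):
    """Expected time until the degradation signal first reaches the threshold."""
    if drift <= 0:
        raise ValueError("a positive drift is required for a finite expected RUL")
    return max(threshold - current_value, 0.0) / drift
```

For a signal that rises by 0.5 per time unit and currently sits at 2.0, a threshold of 5.0 gives an expected RUL of 6.0 time units. Re-running the estimate as each new observation arrives is a crude stand-in for the continuous parameter updating described above.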

    Student Forum


    An online platform for students, teachers, professionals and enthusiasts to share educational resources and articles and to ask and answer questions, so that they can learn and develop skills together. Users can chat with each other and create subforums and communities for more collaborative discussion.

    Sentence Classification using a 1D CNN architecture


    Implementation of a 1D Convolutional Neural Network for Sentence Classification based on the paper by Yoon Kim (2014).
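
The core operation of that architecture, a filter sliding over windows of consecutive word vectors followed by max-over-time pooling, can be sketched in plain Python. This is an illustrative forward pass for a single filter with hand-picked weights, not the project's implementation: there is no training, dropout, or softmax layer, and the function names are hypothetical.

```python
def conv_feature_map(sentence, filter_weights, bias=0.0):
    """Apply one filter to every window of h consecutive word vectors.

    `sentence` is a list of word vectors; `filter_weights` is a list of h
    weight vectors of the same dimension, so h is the window size in words.
    """
    h = len(filter_weights)
    feature_map = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        activation = bias + sum(
            w * x
            for weight_row, word_vector in zip(filter_weights, window)
            for w, x in zip(weight_row, word_vector)
        )
        feature_map.append(max(0.0, activation))  # ReLU non-linearity
    return feature_map


def max_over_time(feature_map):
    """Keep the strongest response of the filter anywhere in the sentence."""
    return max(feature_map)
```

In the full model, many filters with several window sizes each contribute one pooled value, and the resulting vector feeds a fully connected classification layer.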



                Last updated Apr 15, 2024