Table of Contents

Presentations

RAGs to Richer Answers: Using ChatGPT to Query Documents & Limit Hallucinations, Bethesda Data Science Meetup, November 2023

Demo App: Ask Questions of Previous Bethesda Data Science Speakers with ChatGPT! (Note: you need to input your own OpenAI API key to use it.)
As the adoption of Large Language Models (LLMs) like ChatGPT has grown over the past year, there’s been increasing interest in using these technologies to query existing documents and datasets. However, a notable challenge with ChatGPT is its tendency to hallucinate (aka “make stuff up”), leading to reliability issues. Furthermore, training a custom chatbot from scratch is impractical for all but the largest tech companies. This has brought Retrieval-Augmented Generation (RAG) to the forefront as a solution to both issues. In this presentation, I provide an overview of RAG, explain how it operates, and discuss the essentials for creating your own RAG system.
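The core RAG loop (retrieve relevant documents, then ground the LLM's answer in them) can be sketched in a few lines. This is a minimal illustration, not the talk's actual implementation: it uses a toy bag-of-words retriever instead of real embeddings, and the names `retrieve` and `build_prompt` are my own placeholders.

```python
import math
from collections import Counter

# Toy corpus standing in for a collection of speaker-talk documents.
DOCS = [
    "Categorical embeddings represent categories as dense numeric vectors.",
    "Permutation importance measures how much shuffling a feature hurts accuracy.",
    "Retrieval-Augmented Generation grounds LLM answers in retrieved documents.",
]

def vectorize(text):
    """Bag-of-words term counts (a stand-in for real embedding vectors)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    q = vectorize(question)
    ranked = sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Constrain the model to answer only from the retrieved context."""
    context = "\n".join(context_docs)
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}")

question = "How does retrieval-augmented generation work?"
prompt = build_prompt(question, retrieve(question, DOCS))
```

In a real system, the bag-of-words vectors would be replaced by LLM embedding vectors and `prompt` would be sent to the chat model; the grounding step is what limits hallucination, since the model is instructed to answer only from retrieved text.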


NLP in Finance: Beyond Predicting Alpha, Data Science Salon, February 2022

As NLP has exploded within the world of Data Science and Machine Learning, it is now everywhere in finance as well. Many of us have heard about using NLP to try to outperform the market and predict stock prices. However, NLP is far more versatile than that and has many other uses throughout the finance world. In this talk, I explore a number of NLP methodologies, explaining how they are being used in finance to help professionals do their jobs more efficiently and effectively.


Categorical Embeddings: New Ways to Simplify Complex Data, rstudio::global(2021), January 2021

When building a predictive model in R, many of the functions (such as lm(), glm(), randomForest(), xgboost, or neural networks in keras) require that all input variables be numeric. If your data has categorical variables, you may be forced to choose between ignoring some of your data and creating too many new columns via one-hot encoding.

Categorical embeddings are a relatively new method, utilizing methods popularized in Natural Language Processing that help models solve this problem and can help you understand more about the categories themselves.

While there are a number of online tutorials on how to use Keras (usually in Python) to create these embeddings, this talk uses step_embed() from the embed package, an extension of the recipes package, to create the embeddings.
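The idea behind a categorical embedding can be shown without any ML framework: each category maps to a short dense vector instead of its own one-hot column. A minimal sketch (in Python rather than the talk's R, with random vectors standing in for the values a model such as Keras or embed::step_embed() would actually learn):

```python
import random

# Five categories would need 5 one-hot columns; an embedding uses far fewer.
categories = ["red", "green", "blue", "violet", "amber"]
embedding_dim = 2

random.seed(0)
# Illustrative only: in practice these vectors are learned during training.
embedding_table = {c: [random.gauss(0, 1) for _ in range(embedding_dim)]
                   for c in categories}

def encode(values):
    """Replace each categorical value with its dense embedding vector."""
    return [embedding_table[v] for v in values]

features = encode(["red", "blue", "red"])
# Each row is now embedding_dim numeric features instead of 5 one-hot columns.
```

Because the vectors are learned jointly with the model, categories that behave similarly end up with similar vectors, which is what lets you "understand more about the categories themselves."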


Machine Learning Interpretability: How to Understand what your ML Model is Doing, Data Science Salon, January 2021

When building predictive machine learning models, many data scientists feel a need to choose between a traditional regression model and a complex model that performs better but is more of an inscrutable black box. In this talk, I will show a number of methods that let us understand what drives the complex models without having to sacrifice accuracy.
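One widely used model-agnostic technique in this family is permutation importance: shuffle one feature's values and measure how much the model's accuracy drops. A minimal sketch with a toy stand-in "model" (the names and data here are illustrative, not from the talk):

```python
import random

random.seed(42)

# Toy dataset: the label depends on feature 0 and ignores feature 1.
X = [[random.uniform(0, 1), random.uniform(0, 1)] for _ in range(200)]
y = [row[0] > 0.5 for row in X]

def model(row):
    """Stand-in black-box model that happens to use only feature 0."""
    return row[0] > 0.5

def accuracy(X, y):
    return sum(model(r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature):
    """Drop in accuracy when one feature's values are shuffled."""
    base = accuracy(X, y)
    col = [row[feature] for row in X]
    random.shuffle(col)
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, col)]
    return base - accuracy(X_perm, y)

imp0 = permutation_importance(X, y, 0)  # large: the model relies on feature 0
imp1 = permutation_importance(X, y, 1)  # zero: feature 1 is ignored
```

Because it only needs predictions, the same procedure works unchanged on any black-box model, which is what makes it useful when you don't want to sacrifice accuracy for interpretability.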

Publications

Stocks move on surprises: Using sentiment information for active portfolio management, Risk & Reward, vol. Q3 2022, Invesco, 14 Oct. 2022, pp. 21-25


Q&A With Alan Feder // 5 Questions for a Data Scientist, Matt Stabile Blog, January 8, 2021