Rusheb local build #1

rusheb · 2023-01-25T14:09:09Z

Hi Michael,

Good news! I managed to get everything building on my Mac. I made a few updates in the process.

The major thing I have done is add Poetry for package management. Here is my motivation for this:

Quicker onboarding - new devs can clone the package and install dependencies with a single command (poetry install)
Helps keep everyone's environments the same, making builds deterministic
Helps keep projects sandboxed, avoiding dependency conflicts

However, I'm not sure about your preferences for package management or what the best practice is in PyTorch projects. If you have a standard approach then just let me know and I will change it over! Also -- maybe this is all overkill since this is just a pet project.

The main challenge was including the PyTorch dependency, since this is platform dependent. I think this S/O question describes it better than I could, and this comment seems like a viable way we could change the dependency based on platform.
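For context, platform-specific dependencies are normally selected with environment markers such as `sys_platform` and `platform_machine`, which take their values from the running interpreter. The sketch below only prints those values and maps them to a label; `pick_torch_variant` and its labels are hypothetical and are not part of this PR or of Poetry.

```python
import platform
import sys


def marker_environment() -> dict[str, str]:
    """Return the interpreter values behind a few environment markers."""
    return {
        "sys_platform": sys.platform,            # e.g. "darwin", "linux", "win32"
        "platform_machine": platform.machine(),  # e.g. "arm64", "x86_64"
        "platform_system": platform.system(),    # e.g. "Darwin", "Linux", "Windows"
    }


def pick_torch_variant(env: dict[str, str]) -> str:
    """Hypothetical mapping from marker values to a label for a PyTorch build."""
    if env["sys_platform"] == "darwin":
        return "macos"
    if env["sys_platform"] == "linux":
        return "linux"
    return "other"


env = marker_environment()
print(env)
print(pick_torch_variant(env))
```

Markers can tell operating systems and CPU architectures apart, but none of them describes the GPU or the installed CUDA version, so different hardware on the same OS cannot be separated this way.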

The solution I have used seems like it might be brittle for a couple of reasons:

The need to copy the version every time you update PyTorch
I'm not sure how we would support different hardware on the same OS, if this is a requirement

Other approaches I considered:

Venv and requirements.txt (cons: relies on global PyTorch, not deterministic, no defined python version, fairly low-level)
Conda (cons: slow and bloated, need to repeat all dependencies for each environment)
Pipenv (cons: doesn't support environment-based markers)

Hope it isn't too bold of me to suggest this! Due to timezones I thought it would be best to write everything down and let you review it async, but maybe it would have been better to have a discussion first, since I am probably missing a lot of context. Also happy to scrap this for now and manage my dependencies locally.

Let me know your thoughts! Happy to chat about this on a call today or tomorrow.

Apart from that, I made a few other, minor changes:

Add a basic README
Handle case when cuda isn't available
Update docstring and variable names in process_urls function
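For reference, handling a missing CUDA device usually comes down to a check like the one below. This is a minimal sketch and not the code in this PR; `pick_device` is a hypothetical name, and falling back to CPU when PyTorch itself is not installed is an assumption made so the snippet runs anywhere.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: treat a missing PyTorch install as CPU-only.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can then be passed wherever a device is expected, for example `model.to(device)`.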

mivanit · 2023-01-25T17:43:08Z

Hi Rusheb,

Thanks for doing all this!

I haven't used poetry extensively before, but it's been on my todo list for a while so this is probably a good time to learn it -- usually I've just stuck with pip and a requirements.txt file.

PyTorch dep

As for handling PyTorch as a dependency, I am not sure what best practices are, but generally I rely on a global version of PyTorch, since it's a very large package that additionally relies on a global installation of CUDA and there is no practical way to handle that in a python package manager. I've found that many other projects do the same, since having an installation of PyTorch is almost a given for any ML project.

Jax + transformers issue

Since we're on the topic of package managers, a current problem I was facing when trying to run classify_tabs.py (which is mostly copied from a different older project of mine) is that the current version of transformers by huggingface complains that I'm lacking a working installation of jax despite me using the PyTorch variant of transformers. Anything you can tell me about this issue would be helpful.
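A possible first step in narrowing this down is to list which backends the current environment can actually see. The sketch below uses only the standard library; `backend_report` is a hypothetical helper and the package list is an assumption about what might be relevant here. It does not fix the error, it only reports what is installed.

```python
import importlib.util
from importlib import metadata


def backend_report(packages=("transformers", "torch", "jax", "flax", "tensorflow")):
    """Map each package name to its installed version, or None if it is not importable."""
    report = {}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not importable from this environment
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this exact name.
            report[name] = "unknown"
    return report


for name, version in backend_report().items():
    print(f"{name}: {version}")
```

If `jax` shows up as installed but broken, or as missing while `transformers` still tries to use it, that narrows down where to look next.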

formatting and type errors

You mentioned this in your email -- generally I use the black formatter and mypy for type checking, I just haven't gotten around to running that or setting up CI checks for it. This is a small enough project that using up actions minutes probably isn't worth it. We can maybe set up a bash script to run the formatters and type checking.
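A rough sketch of such a script, written in Python rather than bash so it matches the rest of the repo. The function names and the default path are hypothetical; the only external commands used are `black` and `mypy`, each called with the paths to check.

```python
import shutil
import subprocess


def build_commands(paths=(".",)):
    """Return the formatter and type-checker commands for the given paths."""
    targets = list(paths)
    return [
        ["black", *targets],  # reformats files in place
        ["mypy", *targets],   # reports type errors
    ]


def run_checks(paths=(".",)) -> int:
    """Run each tool in turn and return the number of commands that failed."""
    failures = 0
    for cmd in build_commands(paths):
        if shutil.which(cmd[0]) is None:
            print(f"skipping {cmd[0]}: not installed")
            continue
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    return failures
```

Calling `run_checks()` from the repo root would rewrite files through black. Passing `--check` to black instead makes it report without changing anything, which is probably the better choice if this ever moves into CI.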

next steps

Overall, I think my next steps are as follows:

1. gather some example data to create prompts
2. resolve jax/transformers issues and get the basic sequence generation in classify_tabs.py working
3. set up a pipeline:
   unsorted bookmarks ==> gpt prompts ==> gen completions ==> extract tags from completions, store tagged as json ==> export to a note taking system(?)
4. finetune GPT2 on data, compare classification accuracy to prompted version
5. use prompted GPT3 via openai API
6. finetune GPT3 via API, if we get some $ to burn?

Actual browser integration can probably wait, and is not the interesting portion of this project for you to work on from an ML perspective. Steps 1 and 2 can be done concurrently -- I will work on step 1, why don't you try to get basic generation working?

I am also unsure about where the pipeline should end -- personally, I make extensive use of Dendron as a PKM, so a relatively easy and useful thing for me would be "export sorted links and metadata to some markdown files". I am not sure if this is a useful thing in general.
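To make the pipeline and the markdown export a bit more concrete, here is a minimal sketch of the stages with the model call stubbed out. Everything in it is an assumption for illustration: the prompt format, the comma-separated tag convention, the function names, and `fake_generate`, which stands in for GPT-2 or an API call.

```python
import json
from typing import Callable


def make_prompt(url: str, title: str) -> str:
    """Build a prompt asking for tags for one bookmark (hypothetical format)."""
    return f"Bookmark: {title} ({url})\nTags:"


def extract_tags(completion: str) -> list[str]:
    """Treat the first line of the completion as a comma-separated tag list."""
    text = completion.strip()
    first_line = text.splitlines()[0] if text else ""
    return [t.strip().lower() for t in first_line.split(",") if t.strip()]


def tag_bookmarks(bookmarks, generate: Callable[[str], str]):
    """Run each bookmark through prompt -> completion -> tags."""
    tagged = []
    for bm in bookmarks:
        completion = generate(make_prompt(bm["url"], bm["title"]))
        tagged.append({**bm, "tags": extract_tags(completion)})
    return tagged


def to_markdown(tagged) -> str:
    """Group links by tag and render one markdown section per tag."""
    by_tag = {}
    for bm in tagged:
        for tag in bm["tags"] or ["untagged"]:
            by_tag.setdefault(tag, []).append(bm)
    lines = []
    for tag in sorted(by_tag):
        lines.append(f"## {tag}")
        lines.extend(f"- [{bm['title']}]({bm['url']})" for bm in by_tag[tag])
        lines.append("")
    return "\n".join(lines)


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model; always returns the same two tags."""
    return " ml, python\nignored"


bookmarks = [{"url": "https://example.com/a", "title": "Example A"}]
tagged = tag_bookmarks(bookmarks, fake_generate)
print(json.dumps(tagged, indent=2))  # the "store tagged as json" stage
print(to_markdown(tagged))           # the "export to markdown" stage
```

Keeping `generate` as a plain callable means the prompted GPT-2, fine-tuned GPT-2, and API-based variants could be swapped in without touching the rest of the pipeline.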

rusheb · 2023-01-26T06:50:28Z

Thanks Michael.

As discussed offline, I will close this PR and raise a few new ones:

Steps to enable my local build, including
- adding requirements.txt
- cuda conditional
Documentation
Autoformatting with black
Type checking with MyPy

Following this I will start trying to reproduce #2 and then go on to look at the sequence generation piece.

I will raise a separate issue where we can begin to discuss the pipeline.

rusheb added 3 commits January 25, 2023 10:54

Run bookmark_utils.py

d9d4ab8

Run classify_tabs.py

ddd8e7f

Run preprocess_urls

4d6d2a3

rusheb requested a review from mivanit January 25, 2023 14:09

rusheb closed this Jan 26, 2023

rusheb deleted the rusheb-build branch January 26, 2023 06:50
