Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rusheb local build #1

Closed
wants to merge 3 commits into from
Closed

Rusheb local build #1

wants to merge 3 commits into from

Conversation

rusheb
Copy link
Collaborator

@rusheb rusheb commented Jan 25, 2023

Hi Michael,

Good news! I managed to get everything building on my mac. I made a few updates in the process.

The major thing I have done is added Poetry for package management. Here is my motivation for this:

  • quicker onboarding - new devs can clone the package and install dependencies with a single command (poetry install).
  • helps keep everyone's environments the same, making builds deterministic
  • Helps keep projects sandboxed, avoiding dependency conflics

However I'm not sure about your preferences for package managment or what the best practice is in PyTorch projects. If you have a standard approach then just let me know and I will change it over! Also -- maybe this is all overkill since this is just a pet project.

The main challenge was including the PyTorch dependency, since this platform dependent. I think this S/O question describes it better than I could, and this comment seems like a viable way we could change the dependency based on platform.

The solution I have used seems like it might be brittle for a couple of reasons:

  • The need copy the version every time you update PyTorch
  • I'm not sure how we would support different hardware on the same OS, if this is a requirement

Other approaches I considered:

  • Venv and requirements.text (cons: relies on global PyTorch, not deterministic, no defined python version, fairly low-level)
  • Conda (cons: slow and bloated, need to repeat all dependencies for each environment)
  • Pipenv (cons: doesn't support environment-based markers)

Hope it isn't too bold of me to suggest this! Due to timezones I thought it would be best to write everything down and let you review it async, but maybe would be better to have had a discussion since I am probably missing a lot of context. Also happy to scrap this for now and manage my dependencies locally.

Let me know your thoughts! Happy to chat about this on a call today or tomorrow.


Apart from that, I made a few other, minor changes:

  • Add a basic README
  • Handle case when cuda isn't available
  • Update docstring and variable names in process_urls function

@rusheb rusheb requested a review from mivanit January 25, 2023 14:09
@mivanit
Copy link
Owner

mivanit commented Jan 25, 2023

Hi Rusheb,

Thanks for doing all this!

I haven't used poetry extensively before, but its been on my todo list for a while so this is probably a good time to learn it -- usually I've just stuck with pip and a requirements.txt file.

PyTorch dep

As for handling PyTorch as a dependency, I am not sure what best practices are, but generally I rely on a global version of PyTorch, since it's a very large package that additionally relies on a global installation of CUDA and there is no practical way to handle that in a python package manager. I've found that many other projects do the same, since having an installation of PyTorch is almost a given for any ML project.

Jax + transformers issue

Since we're on the topic of package managers, a current problem I was facing when trying to run classify_tabs.py (which is mostly copied from a different older project of mine) is that the current version of transformers by huggingface complains that I'm lacking a working installation of jax despite me using the PyTorch variant of transformers. Anything you can tell me about this issue would be helpful.

formatting and type errors

You mentioned this in your email -- generally I use the black formatter and mypy for type checking, I just haven't gotten around to running that or setting up CI checks for it. This is a small enough project to where using up actions minutes probably isn't worth it. We can maybe set up a bash script to run the formatters and type checking.

next steps

Overall, I think my next steps are as follows:

  1. gather some example data to create prompts
  2. resolve jax/transformers issues and get the basic sequence generation in classify_tabs.py working
  3. set up a pipeline:
    unsorted bookmarks ==> gpt prompts ==> gen completions ==> extract tags from completions, store tagged as json ==> export to a note taking system(?)
  4. finetine GPT2 on data, compare classification accuracy to prompted version
  5. use prompted GPT3 via openai API
  6. finetune GPT3 via API, if we get some $ to burn?

actual browser integration can probably wait, and is not the interesting portion of this project for you to work on from an ML perspective. Steps 1 and 2 can be done concurrently -- I will work on step 1, why dont you try to get basic generation working?

I am also unsure about where the pipeline should end -- personally, I make extensive use of Dendron as a PKM, so a relatively easy and useful thing for me would be "export sorted links and metadata to some markdown files". I am not sure if this is a useful thing in general.

@rusheb
Copy link
Collaborator Author

rusheb commented Jan 26, 2023

Thanks Michael.

As discussed offline, I will close this PR and raise a few new ones:

  1. Steps to enable my local build, including
    • adding requirements.txt
    • cuda conditional
  2. Documentation
  3. Autoformatting with black
  4. Type checking with MyPy

Following this I will start trying to reproduce #2 and then go on to look at the sequence generation piece.

I will raise a seprate issue where we can begin to discuss the pipeline.

@rusheb rusheb closed this Jan 26, 2023
@rusheb rusheb deleted the rusheb-build branch January 26, 2023 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants