Replies: 21 comments
-
@fat-tire Thanks for trying this out! I tried this basic local-chat example with mistral after doing …, and it runs fine. That script also has instructions on how to set …; I haven't yet tried …
-
Okay, let me have a play with that example to see if I can make anything improve. I tried several Mixtral 8x7B local models, including GGUF and bpw-quantized versions. Also, I don't run it on a Mac-- it's running oobabooga's API (which I think is now OpenAI-compatible by default-- at one point there were two APIs, a native one and an OpenAI one, but it's one API now). It's not running in the same container, but it is running on the same machine and it's accessed via a …
-
If the script works with some models but not others, it's an indication that the langroid "pipes" are fine, and the problem lies in the LLM setup, e.g. the chat-prompt formatting could be an issue.
-
The code looks for "local/", not just "local", so this shouldn't have an effect. Also, if your model is listening at …
-
Yeah, the only difference that I had considered is maybe there is something wrong with the template formatting such that the prompt wasn't being delivered properly. The weird thing is that this works fine:
It's only with the introduction of the agent that it responds as if I hadn't asked it anything at all, with a totally random response. So I thought maybe somewhere along the line something wasn't parsed right-- I just don't know if that's by langroid or on the server. It's weird because my testing worked fine with a regular 7B model. I thought maybe there was a difference in the fine-tuning that has to do with an unexpected or different templating/formatting, so that the prompt gets lost somewhere. Note that with the regular ooba API docs I am able to specify a couple of things like …

Ah, okay-- let me try changing the "http://" to "local/". I'm not sure if it will make a difference, but who knows... back in a few.
-
Sorry, if I was unclear-- I was referencing this bit:

```python
if chat_model.startswith("litellm") or chat_model.startswith("local"):
    local_model = True
```

not here:

```python
elif self.config.chat_model.startswith("local/"):
```

Anyway, let me give this a shot with local/192.168.etcetc
-
Ah yes, those need to be changed to have "/" at the end, in the next PR. So just set …

The onus is generally on whichever library is creating a chat endpoint for the LLM to automatically insert the requisite dialog-turn delimiters between system, assistant, user, etc. I would assume ooba is doing it, but maybe they haven't done it well with this model. Langroid itself has a general …
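To make the "dialog-turn delimiters" point concrete, here's a rough sketch (not langroid's or ooba's actual code) of how an OpenAI-style message list gets flattened into a single prompt string. A ChatML-tuned model expects the `<|im_start|>`/`<|im_end|>` markers, while a Mistral-instruct model expects `[INST]...[/INST]` wrapping instead, so a server applying the wrong template can effectively lose the prompt:

```python
from typing import Dict, List

def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Flatten OpenAI-style messages into a single ChatML prompt string."""
    turns = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    # Leave the final assistant turn open so the model continues from here
    turns.append("<|im_start|>assistant\n")
    return "\n".join(turns)

print(to_chatml([
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Is New York in America?"},
]))
```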
-
Okay, update! As a test, I'm using this model: https://huggingface.co/TheBloke/Starling-LM-alpha-8x7B-MoE-GGUF -- it's based on Mistral's MoE model. Here's my simplified llmconfig, which uses the "local/#.#.#.#:5000/v2" formulation as you recommended, and is assigned this time to …

```python
my_llm_config = MyLLMConfig(
    chat_context_length=2048,  # adjust based on model
    api_key=api_key,
    litellm=False,  # use litellm api?
    max_output_tokens=2048,
    min_output_tokens=64,
    chat_model=llm_url,
    timeout=60,
    seed=random.randint(0, 9999999),
    cache_config=RedisCacheConfig(fake=True),  # get rid of annoying warning
)
```

So now the agent responds correctly with:

```python
agent = ChatAgent(agent_config)
response = agent.llm_response("Is New York in America?")
```

It responds correctly that yes, New York is a state. Unfortunately, when I try the two-agent chat (adding the numbers together), I'm getting some weird timeout issues, but it does appear to work eventually, and I see some communication between agents now. It still isn't following the prompts perfectly. But at least it sees them! 😄 Thanks for the help! A couple quick thoughts/suggestions: …
Again, I just want to stress how absolutely cool and fun this project is-- I can easily see a future of pre-written agents and tasks that you can download and snap together to do all kinds of cool tasks. A modular node-based graphical system a la Blender or invokeai or comfyui to follow? heh.
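For anyone following along, the glue between `my_llm_config` and `agent_config` above is roughly the following-- a sketch assuming langroid's `ChatAgentConfig` field names, not the exact code I ran:

```python
from langroid.agent.chat_agent import ChatAgent, ChatAgentConfig

# Assumed wiring: the LLM config shown above plugs into the agent config's `llm` field.
agent_config = ChatAgentConfig(llm=my_llm_config)
agent = ChatAgent(agent_config)
print(agent.llm_response("Is New York in America?"))
```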
-
Just to chime in, I too would like to express how fun it feels using this project to tinker with agents. I'd also appreciate a better explanation of …
-
Cool 👍
Yeah, when connecting to the text-generation-webui server I guess you need to specify the ChatCompletionRequestParams like instruction_template and mode (which is usually "chat" or "instruct"). I'm not sure if langroid would need to somehow set that to make sure it's in the right mode, but I haven't looked too carefully at the API.
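Something like this is what I have in mind-- a hedged sketch of passing those extra fields alongside the standard OpenAI ones (the address is hypothetical, and exact field handling may vary by text-generation-webui version):

```python
import requests

# Hypothetical LAN address; ooba's OpenAI-compatible API listens on port 5000
resp = requests.post(
    "http://192.168.1.10:5000/v1/chat/completions",
    headers={"Authorization": "Bearer sk-111111111111111111111111111111111111111111111111"},
    json={
        "messages": [{"role": "user", "content": "Is New York in America?"}],
        "max_tokens": 256,
        # Extra params text-generation-webui understands (its ChatCompletionRequestParams):
        "mode": "instruct",                 # "chat" or "instruct"
        "instruction_template": "Mistral",  # template name as configured in ooba
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```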
-
@fat-tire @tozimaru Thank you for all the feedback. I will take all of this into account, rationalize some of the local-model setups, and write an updated doc page on that. Meanwhile, I will point to a couple of places that may be helpful, specifically for multi-agent task workflow design: …

This arg globally overrides the …
-
Langroid doesn't have these; it simply assumes the endpoint is OpenAI-compatible and that the chat-formatting is handled by the endpoint.
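In other words, langroid just needs the server to speak the standard OpenAI chat-completions protocol-- conceptually the same as doing this directly (hypothetical local address, shown with the plain openai client purely for illustration):

```python
from openai import OpenAI

# Hypothetical local OpenAI-compatible server (e.g. ooba's API on port 5000)
client = OpenAI(base_url="http://192.168.1.10:5000/v1", api_key="sk-not-checked")

resp = client.chat.completions.create(
    model="local-model",  # many local servers ignore this field
    messages=[{"role": "user", "content": "Is New York in America?"}],
)
print(resp.choices[0].message.content)
```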
-
Great information, thank you! I'll be looking forward to the updated docs! Maybe a page with some kind of flowchart or lifecycle or whatever you call it showing how the "hot potato" gets passed from one agent to another in a task workflow-- like when an agent passes to another agent, who does the agent think it's talking to? (E.g. in the two-agent example, the Student agent thinks it's talking to the User, who then actually passes its output to the Adder agent instead, who replies as a proxy for the User-- the Student is unaware that the Adder exists at all.) And explain things like under what circumstances a "DO-NOT-KNOW" is sent, how it's handled, etc. Oh, and how "DONE" is a trigger word, which, I've discovered, will insta-end the task if it's said accidentally (Agent: "So, to summarize your instructions, I will say "DONE" when I'm finished.").

Re the OobaBooga endpoint configuration-- I guess that will have to either be pre-set on the command line when starting the server, or maybe via the OobaBooga API, completely separate from langroid.

Unrelated question-- does a Task always run synchronously? Could a "delegate" agent theoretically fire off a bunch of agents to do various things simultaneously, then either wait for them to report back, or, if ten of them were attempting different methods to achieve a single goal, maybe wait only for the first one that succeeds to return, then abort the other 9 and continue along? (I know this would put a big load on the LLM, so you probably wouldn't want to do it on your PC, and there'd be notions of LLM "thread safety" on Tasks that would have to be considered, but anyhoo-- just curious if this is a thing.)

Again, thank you so much for the pioneering effort here! All this stuff-- these concepts, terms, and workflows-- will one day be obvious, clear, standardized, and easily accessible to everyone, so it's really fun to see it develop. Terrific stuff.
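To sketch the pattern I mean in plain asyncio (nothing langroid-specific-- `attempt` is just a hypothetical stand-in for an agent/task run):

```python
import asyncio

async def attempt(name: str, delay: float) -> str:
    """Hypothetical stand-in for one agent/task trying a method to reach the goal."""
    await asyncio.sleep(delay)  # pretend this is the agent doing its work
    return f"{name} succeeded"

async def first_success() -> str:
    # Fire off several attempts concurrently, keep the first one that finishes,
    # and cancel the rest.
    tasks = [asyncio.create_task(attempt(f"agent-{i}", 0.1 * (i + 1))) for i in range(10)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    return next(iter(done)).result()

print(asyncio.run(first_success()))
```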
-
Ah yes, @nilspalumbo is working on async task spawning, glad to see interest in that.
Thank you for the interest! I'm thinking of putting down a definitive "Laws of Langroid" doc, stay tuned -- it will address what is a step, what is a valid response, when is a task done, what is the result of a task, when is a responder eligible to respond, etc. All of these are in the code, but there's a real need to bring them out conceptually, and also show diagrammatically how each step evolves.
-
In case you didn't see it, there are logs generated by every task run, lightly documented here: …
-
Yes, I did look at the logs, thank you-- the .log file was blank, and the .tsv file looked similar to the regular colored output as far as I could tell. The formatting of the .tsv was especially nice though, but a deep-dive explanation of the fields would be great. I didn't mention it, but I have been running everything in a (regular, non-Colab) Jupyter notebook-- and it works nicely, including the real-time streaming responses, the color output, etc. Thanks!
-
Nice to know it shows nicely in notebooks... I generally avoid notebooks so haven't extensively tested on them.
-
@fat-tire I realized I could migrate the issue into Discussions, so I moved it here instead of closing it. It's nice to have it here since there is a bunch of great feedback. Thank you for taking the time to write it all down.
-
My pleasure. Let me know if I can be helpful in reviewing docs or whatever. Happy to help if/when I'm able. Cheers!
-
Quick update-- unlike the Mixtral models I tried previously, this fine-tune of the Mixtral 8x7B Mixture-of-Experts model supports the ChatML/OpenAI instruct template and system prompts. It's still not following the prompt 100% as I'd hoped yet, but the GGUF-quantized versions are available.
-
Thanks for this update. I will see if I can run it on my M1 Max Pro 64GB with ollama.
-
Hey there!
So I'm playing with the example scripts from the docs, specifically the two-agent collaboration example, and have run into a problem with Mixtral-instruct-v1-based models using the Oobabooga text-generation-webui server.
The problem is that when the agents are set up, for whatever reason, the prompt doesn't seem to make it to the LLM.
Here's how I set up the LLM: `llm_url` is set to an http link to the /v1 endpoint at port 5000, and `api_key` is set to "sk-111111111111111111111111111111111111111111111111", which is how Ooba likes it. Then, per the example, I did this: …

At this point, the following works fine: …
RESPONSE: Yes, New York is a state in the United States of America.
Great. So let's try it with multiple messages: …
RESPONSE: Yes, New York is a state in the United States of America.
However, setting it up with an agent, like this: …

…results in a very long ramble on random topics (how to use Python, some long paragraph in French, etc.) that is completely unrelated to the prompt and appears to be what happens when no prompt makes it to the LLM. It's processing a blank prompt, I suspect, and just spewing randomness.
Similarly, trying it with a Task: …
This also results in total garbage out.
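(The Task setup was roughly of this shape-- a hedged reconstruction loosely following the docs' two-agent example, not my exact code, and parameter names may differ by langroid version; `my_llm_config` is the config described above:)

```python
from langroid.agent.chat_agent import ChatAgent, ChatAgentConfig
from langroid.agent.task import Task

# Two agents sharing the same local-LLM config; the Student delegates addition to the Adder.
student = ChatAgent(ChatAgentConfig(llm=my_llm_config, name="Student"))
adder = ChatAgent(ChatAgentConfig(llm=my_llm_config, name="Adder"))

student_task = Task(
    student,
    system_message="Ask me to add pairs of numbers, one pair at a time; say DONE when finished.",
    llm_delegate=True,
    single_round=False,
)
adder_task = Task(
    adder,
    system_message="You add the pair of numbers sent to you and reply with just the sum.",
    single_round=True,
)
student_task.add_sub_task(adder_task)
student_task.run("Add these numbers: (3, 4), (10, 20)")
```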
Again, using a non-MoE Mistral appears to work (although it didn't quite follow the prompts very well, which is why I was hoping Mixtral would work better), but Mixtral doesn't seem to receive the prompt through an agent. With Mixtral alone it's prompt in, garbage out.
Anyone else experiencing this?
Without examining the code in too much detail, I wonder why the prompt would make it to the LLM directly but not via an agent? Does this maybe have something to do with the instruction-template setting or something?
I tried playing with various settings in the MyLLMConfig, some of which you can see above, but nothing seemed to work. Also tried changing instruction templates on Oobabooga itself, but no dice. I also tried moving the prompts from system_message to user_message, from the task to the agent... but it wouldn't "take".
Any thoughts? Why would using an agent "block" the prompt? 🤔
Using langroid v0.1.157 w/litellm FWIW.
Thanks - this looks like a fun and interesting project!