add PAG support #7944

yiyixuxu · 2024-05-14T10:24:06Z

You can create a PAG pipeline using the AutoPipeline API. For example, to use PAG with StableDiffusionXLPipeline, you just have to call

          AutoPipelineForText2Image.from_pretrained(...., enable_pag=True)

if you want to apply PAG only to specific layers

         AutoPipelineForText2Image.from_pretrained(...., enable_pag=True, pag_applied_layers=["down.block_2", "up.block_1.attentions_0"])

from_pipe also works,

e.g. if you already have a sdxl img2img pipeline and want to switch to text2img but with PAG

        AutoPipelineForText2Image.from_pipe(pipe_img2img, enable_pag=True)

testing script for sd-xl

from diffusers import AutoPipelineForText2Image
import torch
from diffusers.utils import make_image_grid

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
# test1:
# base_cfg
generator = torch.Generator(device='cuda').manual_seed(1)
output_base_cfg = pipe(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=7,
        generator=generator,
    ).images[0]

# base_uncond
generator = torch.Generator(device='cuda').manual_seed(1)
output_base_uncond = pipe(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=0,
        generator=generator,
    ).images[0]

# test2: 
# pag_cfg
pipe_pag = AutoPipelineForText2Image.from_pipe(pipe, enable_pag=True, pag_applied_layers=['mid'])

generator = torch.Generator(device='cuda').manual_seed(1)

output_pag_cfg = pipe_pag(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=7,
        generator=generator,
        pag_scale=3.0,
    ).images[0]

# pag_uncond
generator = torch.Generator(device='cuda').manual_seed(1)
output_pag_uncond = pipe_pag(
        "an insect robot preparing a delicious meal, anime style",
        num_inference_steps=25,
        guidance_scale=0,
        generator=generator,
        pag_scale=3.0,
    ).images[0]

make_image_grid(
    [output_base_cfg, output_base_uncond, output_pag_cfg, output_pag_uncond],
    rows=2,
    cols=2,
).save("yiyi_test_6_out.png")

first row is base (guidance_scale=7.0, then guidance_scale=0)
second row is PAG (guidance_scale=7.0, then guidance_scale=0)

note that when pag_scale=0, PAG is disabled and the PAG pipeline works the same as its base SDXL pipeline; the following testing script should give the same results as the one above

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    enable_pag=True,
    torch_dtype=torch.float16
).to("cuda")


pag_scales =  [0.0, 3.0]
guidance_scales = [0.0, 7.0]

grid = []
for pag_scale in pag_scales:
    for guidance_scale in guidance_scales:
        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="an insect robot preparing a delicious meal, anime style",
            num_inference_steps=25,
            guidance_scale=guidance_scale,
            generator=generator,
            pag_scale=pag_scale,
        ).images
        grid.append(images[0])

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(guidance_scales)).save("yiyi_test_5_out.png")
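
The pag_scale=0 note above can be illustrated with a small sketch. It assumes the guidance combination rule as I understand it from the PAG paper (CFG term plus a PAG term that pushes away from the perturbed prediction); it is not copied from this PR's code, and `combine_guidance` is a hypothetical helper name. Plain lists of floats stand in for the noise predictions.

```python
def combine_guidance(uncond, text, perturbed, guidance_scale, pag_scale):
    """Combine three noise predictions element-wise (assumed rule, see note above)."""
    return [
        u + guidance_scale * (t - u) + pag_scale * (t - p)
        for u, t, p in zip(uncond, text, perturbed)
    ]


# Toy numbers standing in for the three noise predictions.
uncond = [0.0, 1.0]
text = [1.0, 3.0]
perturbed = [0.5, 2.0]

# Plain classifier-free guidance, for comparison.
cfg_only = [u + 7.0 * (t - u) for u, t in zip(uncond, text)]

# With pag_scale=0 the PAG term vanishes and only the CFG term remains.
with_pag_off = combine_guidance(uncond, text, perturbed, guidance_scale=7.0, pag_scale=0.0)
with_pag_on = combine_guidance(uncond, text, perturbed, guidance_scale=7.0, pag_scale=3.0)
```

Under this assumed rule, `with_pag_off` equals `cfg_only`, which is why a PAG pipeline called with pag_scale=0 is expected to behave like its base pipeline.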

works with ip-adapter now thanks to @sunovivid

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    enable_pag=True,
    torch_dtype=torch.float16
).to("cuda")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.bin")

pag_scales = [0.0, 3.0]
ip_adapter_scales = [0.0, 0.6]

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
grid = []
for pag_scale in pag_scales:
    for ip_adapter_scale in ip_adapter_scales:
        pipeline.set_ip_adapter_scale(ip_adapter_scale)
        generator = torch.Generator(device="cpu").manual_seed(0)
        images = pipeline(
            prompt="a polar bear sitting in a chair drinking a milkshake",
            ip_adapter_image=image,
            negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
            num_inference_steps=25,
            guidance_scale=3.0,
            generator=generator,
            pag_scale=pag_scale,
        ).images
        grid.append(images[0])

# save the grid
from diffusers.utils import make_image_grid
make_image_grid(grid, rows=len(pag_scales), cols=len(ip_adapter_scales)).save("yiyi_test_4_out.png")

HuggingFaceDocBuilderDev · 2024-05-14T10:29:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yiyixuxu · 2024-05-14T10:30:15Z

@asomoza can you test it out?
I tried to make it work with ip-adapter but I don't think it works - do you know if PAG works with ip-adapter?
what other pipelines should I add this to for testing?

yiyixuxu · 2024-05-14T19:38:02Z

cc @HyoungwonCho for awareness
also question: does PAG work with IP-adapter?

src/diffusers/pipelines/pag_utils.py

asomoza · 2024-05-15T06:59:35Z

I've been doing some tests and I like it a lot.

[image comparison: no PAG | PAG CFG]

I think it makes the robot more coherent and it fixes some of the wrong details, but it makes it less "humanoid" and loses a bit of the cinematic look.

I'm still deciding whether I'd prefer a layer or block naming like the one used with the loras and ip_adapter, or whether pag_applied_layers and pag_applied_layers_index is better. I'll give some examples to evaluate this.

So let's say I want to test it with what I normally use for the pose in the loras, which is all the layers in the down block 2. With the current system I need to do this:

pag_applied_layers_index = ["d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"]

the equivalent could be this:

pag_applied_layers = {"down": ["block_2"]}

or for example the last attention block which is what we can associate to the composition with IP Adapters:

pag_applied_layers_index = ["d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23"]

for this, an equivalent could be:

pag_applied_layers = {"down": {"block_2": "attentions_1"}}

[image comparison: down_block_2 | down_block_2_attentions_1]

I don't know if going as granular as each of the layers could bring a benefit, even someone like me that likes full control won't go as far as to try to control an image with 70 different layers on top of everything else.
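
To make the correspondence between the two notations concrete, here is a minimal sketch that expands a block-level selection into the index names used above. The layout table (how many transformer layers each attention module holds) is an assumption about the SDXL down blocks, and `expand_to_indices` is a hypothetical helper, not something from this PR.

```python
# Assumed layout: transformer layers per attention module in the SDXL down blocks.
SDXL_DOWN_LAYOUT = {
    "block_1": [2, 2],    # attentions_0, attentions_1
    "block_2": [10, 10],  # attentions_0, attentions_1
}


def expand_to_indices(layout, prefix, block, attention=None):
    """Return index names such as 'd4' covered by a block or by one attention module."""
    indices = []
    counter = 0  # running index over every transformer layer, in order
    for block_name, attentions in layout.items():
        for attn_idx, num_layers in enumerate(attentions):
            for _ in range(num_layers):
                selected = block_name == block and (
                    attention is None or attention == f"attentions_{attn_idx}"
                )
                if selected:
                    indices.append(f"{prefix}{counter}")
                counter += 1
    return indices


whole_block = expand_to_indices(SDXL_DOWN_LAYOUT, "d", "block_2")
last_attention = expand_to_indices(SDXL_DOWN_LAYOUT, "d", "block_2", "attentions_1")
```

With this assumed layout, `whole_block` reproduces the d4 to d23 list and `last_attention` reproduces the d14 to d23 list from the examples above.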

As an example, as an advanced user, I want to use PAG to make the image better but without the robot losing its humanoid form and the cinematic look.

Doing some quick tests, I found that for this particular image, this works really well:

pipeline.enable_pag(
    pag_scale=3.0,
    pag_applied_layers=None,
    pag_applied_layers_index=[
        "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20", "d21", "d22", "d23", "u0", "u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9",
    ],
)

which in the lora format would be like this:

pag_applied_layers = {"down": "block_2", "up": {"block_1": "attentions_0"}}

Hope this example is somewhat clear. We can also see that it matters a lot: the image is a lot better with this.

I'll do tests with the other use cases later, especially with the upscaler.

HyoungwonCho · 2024-05-15T13:26:40Z

@yiyixuxu @asomoza Hello, I was impressed by the various experiments you conducted using PAG!
We are also discussing the use of PAG in various tasks, as well as layer/scale selection.

Since the guidance framework of PAG itself is simple, it seems quite possible to use it in conjunction with other modules like the IP-Adapter you mentioned. However, we have not yet implemented and experimented with it directly, so we have not confirmed whether there is a significant performance improvement when used together. If possible, we will conduct additional experiments in the future.

Thank you for your interest in our research.

KKIEEK · 2024-05-15T19:36:19Z

Thank you for the great work!
However, I encountered the following issue when using StableDiffusionXLControlNetPipeline with CFG and PAG:

  File ".../.env/lib/python3.11/site-packages/diffusers/models/controlnet.py", line 798, in forward
    sample = sample + controlnet_cond
             ~~~~~~~^~~~~~~~~~~~~~~~~
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0

I solved it by adding a new parameter do_perturbed_attention_guidance and appending the following lines in the prepare_image method.

        if do_classifier_free_guidance and do_perturbed_attention_guidance and not guess_mode:
            image = torch.cat([image] * 3)
        elif do_classifier_free_guidance and not guess_mode:
            image = torch.cat([image] * 2)
        elif do_perturbed_attention_guidance and not guess_mode:
            image = torch.cat([image] * 2)
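
The branching above boils down to counting how many guidance branches are stacked along the batch dimension. A minimal sketch of that counting, with no tensors involved; `control_image_repeats` is a hypothetical helper name used only for illustration:

```python
def control_image_repeats(do_classifier_free_guidance,
                          do_perturbed_attention_guidance,
                          guess_mode=False):
    """Number of copies of the control image needed to match the latent batch."""
    if guess_mode:
        # Mirrors the snippet above: in guess mode the image is not duplicated.
        return 1
    repeats = 1  # the text-conditioned branch is always present
    if do_classifier_free_guidance:
        repeats += 1  # unconditional branch
    if do_perturbed_attention_guidance:
        repeats += 1  # perturbed-attention branch
    return repeats


# CFG + PAG stacks three branches, which is the "3 vs 2" mismatch in the traceback.
cfg_and_pag = control_image_repeats(True, True)
cfg_only = control_image_repeats(True, False)
pag_only = control_image_repeats(False, True)
```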

yiyixuxu · 2024-05-15T23:27:10Z

@KKIEEK
thanks! I added your change:)

jorgemcgomes · 2024-05-16T10:43:18Z

Just leaving a brief report of my findings with PAG and Diffusers (I already had it integrated in my pipelines before this PR):

It generally works very very well when properly tuned. Almost looks like a significant model upgrade.
I'm using it with models derived from SD2.1.
Implemented it successfully in text-to-image, image-to-image, controlnet, unclip, and inpainting pipelines.
I get the best results with values around guidance_scale=7 and pag_scale=3
The layers to which it is applied make a huge difference on the output. It's the difference between garbage and excellent. Adding or removing a single layer can make it or break it.
For example, for SD2.1, I found that with just [m0] the effect was too subtle, [d4, d5, m0] was overcooked, [d5, m0] seems to work best; adding any up layers typically screws up the results [d5, m0, u0].
The applied layers will obviously change in different model architectures. And I imagine that the "optimal" layers might even change with fine-tunes. I couldn't replicate the optimal parameters described in the paper (for SD1.5), with SD2.1 (which has the same unet architecture).
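
Since a single layer can change the outcome this much, one way to search for a good set is to enumerate small combinations and generate one image per combination. A minimal sketch of the enumeration only; `candidate_layer_sets` is a hypothetical helper, and each resulting list would be passed to whatever layer-selection argument the pipeline exposes.

```python
from itertools import combinations


def candidate_layer_sets(layers, max_size):
    """Yield every non-empty subset of `layers` with at most `max_size` entries."""
    for size in range(1, max_size + 1):
        for subset in combinations(layers, size):
            yield list(subset)


# The layer names from the SD2.1 findings above.
candidates = list(candidate_layer_sets(["d4", "d5", "m0", "u0"], max_size=3))
```

The list includes the sets mentioned above, such as ["m0"], ["d5", "m0"], and ["d4", "d5", "m0"], so they can be compared side by side with a fixed seed.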

yiyixuxu · 2024-05-20T17:22:40Z

@jorgemcgomes thanks!

sunovivid · 2024-05-24T02:59:30Z

Hello. I'm an author of PAG. Thank you for your insightful opinions and cool implementation. Is there anything currently in progress? We are excited to see that PAG is gaining popularity within the community and being utilized in various workflows. Especially in ComfyUI, PAG nodes are used in diverse workflows.

(Some workflows using PAG in ComfyUI:
https://www.reddit.com/r/StableDiffusion/comments/1c68qao/perturbedattention_guidance_really_helps_with/
https://civitai.com/models/141592/pixelwave
https://civitai.com/models/413564/cjs-super-simple-high-detail-cosxl-and-pag-workflow
https://www.reddit.com/r/StableDiffusion/comments/1c4cb3l/improve_stable_diffusion_prompt_following_image/
https://www.reddit.com/r/StableDiffusion/comments/1ck69az/make_it_good_options_in_stable_diffusion/
https://stable-diffusion-art.com/perturbed-attention-guidance/)

However, in Diffusers, it seems somewhat challenging to try creative combinations as the pipelines are separated.
( a collection of PAG pipelines with Diffusers: https://x.com/multimodalart/status/1788844183760847106 )

Therefore, the MixIn approach taken in this PR appears to be a very effective solution. However, it seems a bit awkward to call enable_pag every time to adjust the pag scale. Ideally, it would be more natural to set the pag_scale when calling the pipeline after enable_pag (similar to setting ip_adapter_image=image after calling load_ip_adapter). So, I'm exploring a better design for this.

Additionally, since there are many users who want compatibility with IP-adapter, now I have time and would like to work on making it compatible with IPAdapter. I'm curious if there's any related progress about component design or IP-adapter compatibility.

Thank you!

yiyixuxu · 2024-05-28T22:18:57Z

@sunovivid thanks for the message!
this is not the finalized design just something we can use to test out compatibility of PAG - we will iterate on the final design

for IP-adapter, it will be super cool if we can make it work! I'm not aware of any related progress so would really appreciate it if you are able to find time to work on this! maybe we can just pick one of the pipelines from this PR (with the mixin) and make it work with ip_adapter_image input?

sunovivid · 2024-06-02T18:24:28Z

@yiyixuxu Hi! I made a working version of PAG + IP-adapter. Can you check the PR?

yiyixuxu · 2024-06-03T21:06:57Z

@sunovivid we will merge in and work on a new design for PAG once you upload the new change for ip-adapter :)

for pag_applied_layers:

I think we should use the lora format, let me know what you think @sunovivid: see @asomoza's comments and experiments here add PAG support #7944 (comment); you can also find more about the scale dict we support in ip-adapter and lora here and here
is pag_applied_layers something we would want to change a lot for different generations? i.e. can we make it a pipeline config/attribute instead of a call argument? I think we will have to make pag_scale a call argument

sunovivid · 2024-06-04T08:41:35Z

Hi @yiyixuxu,

Thank you for the feedback!

I might have misunderstood something. Should I upload the new changes for the ip-adapter in this PR? How can I upload the changes? Should I attach files or use another approach?

for pag_applied_layers:

Completely agree! For user convenience, the overall code should consistently follow the conventions used in the Diffusers codebase.
I believe once the best choice for pag_applied_layers is determined per model through experiments (like the great example you provided in @asomoza's comment), it likely won't need frequent changes. Users will likely follow the recommended approach for each model. I also agree that pag_scale should be a call argument.
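
The split discussed here (layers fixed per pipeline, scale chosen per call) could look roughly like the toy below. This is only a sketch of the API shape under discussion, not the final design: `ToyPagPipeline` is a hypothetical stand-in that records settings and runs no model.

```python
class ToyPagPipeline:
    """Stand-in object that only records settings; it runs no model."""

    def __init__(self, pag_applied_layers):
        # Chosen once per model, so stored as pipeline configuration.
        self.pag_applied_layers = list(pag_applied_layers)

    def __call__(self, prompt, guidance_scale=7.0, pag_scale=0.0):
        # pag_scale varies per generation, so it is a call argument.
        return {
            "prompt": prompt,
            "guidance_scale": guidance_scale,
            "pag_scale": pag_scale,
            "pag_applied_layers": self.pag_applied_layers,
            "pag_enabled": pag_scale > 0,
        }


pipe = ToyPagPipeline(pag_applied_layers=["mid"])
without_pag = pipe("an insect robot", pag_scale=0.0)
with_pag = pipe("an insect robot", pag_scale=3.0)
```

Both calls reuse the same layer configuration; only the per-call scale decides whether PAG is active.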

* fix compatability issue between PAG and IP-adapter * fix compatibility issue between PAG and IP-adapter plus

yiyixuxu added 4 commits May 13, 2024 08:40

first draft

a6a0429

refactor

3605df9

update

f94376c

up

54c3fd6

yiyixuxu and others added 7 commits May 14, 2024 12:32

style

f571430

style

91d0a5b

update

01585ab

inpaint + controlnet

03bdbcd

Merge branch 'pag' of github.com:huggingface/diffusers into pag

b662207

style

1fb2c33

up

219f4b9

yiyixuxu commented May 14, 2024

src/diffusers/pipelines/pag_utils.py (outdated)

Update src/diffusers/pipelines/pag_utils.py

5641cb4

yiyixuxu added the contributions-welcome and help wanted labels May 15, 2024

sunovivid mentioned this pull request May 15, 2024

support IPAdapter? sunovivid/Perturbed-Attention-Guidance#4

Open

fix controlnet

8950e80

sunovivid mentioned this pull request Jun 2, 2024

fix compatability issue between PAG and IP-adapter #8379

Merged

sunovivid and others added 6 commits June 4, 2024 21:36

fix compatability issue between PAG and IP-adapter (#8379)

4cc0b8b

* fix compatability issue between PAG and IP-adapter * fix compatibility issue between PAG and IP-adapter plus

up

5cbf226

refactor ip-adapter

58804a0

style

7bc9229

Merge branch 'main' into pag

e09e079

style

1fa54df
