output control + forms numbering system #11

Closed
abubelinha opened this issue Oct 7, 2022 · 9 comments

@abubelinha

I am beginning with this nice tool so probably these are pretty basic questions.

  1. The docs recommend using the browser() object, but it is not clear to me how to capture the output of some commands:

showforms – show all of the forms on the current page. When called from Python, this function returns a list of the forms.

  • But when I do forms = browser.showforms(), it prints output, yet type(forms) is NoneType.
  • Also, I don't know how to stop twill from producing any output. I'd prefer to run all commands silently and capture and print only the output that I really need. Is there an example script showing how to do this?
  2. How are elements numbered? showforms() returns something like this:
Form #2
## __Name__________________ __Type___ __ID________ __Value__________________
 1 login-submit             submit    login-button None

But there is only one form on this page, so I don't understand why it is numbered "2".
Maybe there is a mistake and twill is starting its numbering at 2 instead of 1.

  3. More confusion with numbering after calling showforms() on the next page:
Form #3
## __Name__________________ __Type___ __ID________ __Value__________________
 1 csrfToken                hidden                 ajBoGtmro9Lji44ks5ItKfrmpI1MyyP7
 2 email                    text      email
 3 password                 password  password
 4 login                    submit    login        login

After looking at numbers, I understand I should fill in this Form #3 like this:

fv("3","email","my_email")
fv("3","password","my_password")
showforms()

But that raises an error:

    raise TwillAssertionError("no matching forms!")
twill.errors.TwillAssertionError: no matching forms!

If I use fv("2", ...) instead, then it works correctly, although the target form number is 3:

Form #3
## __Name__________________ __Type___ __ID________ __Value__________________
 1 csrfToken                hidden                 ajBoGtmro9Lji44ks5ItKfrmpI1MyyP7
 2 email                    text      email        my_email
 3 password                 password  password     my_password
 4 login                    submit    login        login

There seems to be some kind of confusion here.

  4. Submitting the form. I assume I should keep using number 2 in order to log in:
    submit("2")
    I would expect that to submit the form and log me in, so that the browser enters the site.
    But nothing happens. I still see the login page.

Thanks in advance for any help you can provide.

@Cito
Member

Cito commented Oct 8, 2022

Hi @abubelinha, thanks for the feedback.

The sentence "When called from Python, this function returns a list of the forms." refers to the command function, not to the browser method. The explanation in the docs about the browser object is outdated and misleading. The command functions now return values, so you don't need the browser object for that. And if you do use the browser object, its methods do not return values, but you can get the same data through properties such as forms, without anything being printed.

I will fix that in #12 (by making them behave the same and updating the docs).

Also note that you don't need to use twill by writing Python scripts, you can write Twill scripts instead which are even more readable and easier to write, and suffice in many cases. I'll try to emphasize this point a bit more in the docs, and give some more examples.

You also correctly observed that the form numbers printed by showforms are one too high.

I will fix that via #13.

Does this cover all your questions?

Again, unfortunately, the documentation is very old and does not always match the behavior of the latest version. If you find more bugs, let me know. I will fix these issues and create a new release then.

@abubelinha
Author

abubelinha commented Oct 8, 2022

Thanks for the prompt answers.
As for the Python usage, actually I do need to use it because I want to feed data into a website, and all my data-processing logic is already in Python.

I still have an important question though. I can't see the main page after login.
Perhaps there is some kind of JavaScript redirection which twill cannot handle by itself?
(It doesn't look like it, since the page URL stays the same after I log in with the Chrome browser.)

I don't know a word about cookies, but maybe I am supposed to somehow collect and propagate them?
I called show_cookies() several times before and after the login submit(), and it shows 2 cookies in the cookie jar:

  • JSESSIONID keeps its original value
  • CSRFtoken changes each time, but I guess this should not happen

Maybe I need to put the original CSRFtoken cookie value into the hidden csrfToken field of the login form above?
But I don't know how to collect that value using the available cookie functions.
I used save_cookies("twill_cookies.txt"), but the file structure looks very cryptic to me.

@Cito
Member

Cito commented Oct 8, 2022

I can't see the main page after login.

Hard to say without knowing your website. An important point is that Twill is not a full-fledged web browser of the kind that Selenium and similar tools drive. It can only automate and test simple websites that do not rely on JavaScript. However, cookies, hidden fields and simple redirects should work. I have also tested a website that uses CSRF tokens without doing anything special to take care of that (such as messing with cookies or tokens). The CSRF token should change every time you request a form; that is OK.

@abubelinha
Author

abubelinha commented Oct 8, 2022

Thanks @Cito
Here you can see the login interface.

I still think the CSRF token has to be passed into the login form.
I say this after looking into an old Python script which interacted with previous versions of this website using the Python requests module.
Its login function looked like this:

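import requests

def __login(URL, USER, PWD):
	session = requests.Session()
	session.post(URL)  # first request only collects the CSRFtoken cookie
	data = {"email": USER, "password": PWD, "csrfToken": session.cookies['CSRFtoken'], "login": "login"}
	session.post(URL, data=data)  # assumed: the final POST that submits the credentials
	return session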

That works. But once logged in, internal forms are hard for me to handle with requests, so I am looking for other Python alternatives like twill.

The command functions actually now do return values, you don't need the browser object for that.

I am not able to do what you described there. For example, show_cookies() prints its output but does not return it:

returned_cookies = show_cookies()
print(type(returned_cookies), returned_cookies)

output:

==> at http://localhost:8080/iptest/

There are 2 cookie(s) in the cookie jar.

        1. <Cookie CSRFtoken=V9QsjBuC5OV2Ghena6Y33sUU3RYp0iy3 for .localhost:8080/iptest%22%22>
        2. <Cookie JSESSIONID=EDCA967D4BB3B3C1C2695902FF295225 for localhost:8080/iptest/>

<class 'NoneType'> None

How can I get those cookies into a Python dictionary, or at least into a string which I can parse and split?

Thanks a lot for your help with this!

@Cito
Member

Cito commented Oct 8, 2022

How can I get those cookies into a Python dictionary, or at least into a string which I can parse and split?

I just noticed that a value is only returned for forms, history and links but, strangely, not for cookies. I will add this to #12 so that it gets fixed as well. With the current version, you can get the cookies as a dict like this:

from twill.commands import *
go("https://ipt.gbif-uat.org/login.do")
cookies = {cookie.name: cookie.value
           for cookie in browser._session.cookies}
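
If only the printed listing is at hand (for example, copied from a log), it can also be parsed as plain text. The following is a minimal sketch, not part of twill: the helper name parse_cookie_listing is hypothetical, and the regular expression is an assumption based solely on the listing format shown earlier in this thread.

```python
import re

# Pattern assumed from the listing format shown in this thread:
#   <Cookie NAME=VALUE for DOMAIN/PATH>
COOKIE_LINE = re.compile(
    r"<Cookie (?P<name>[^=\s]+)=(?P<value>\S*) for (?P<scope>\S+)>")


def parse_cookie_listing(text):
    """Return {name: value} for every '<Cookie ...>' entry found in text."""
    return {m.group("name"): m.group("value")
            for m in COOKIE_LINE.finditer(text)}


listing = """
There are 2 cookie(s) in the cookie jar.

        1. <Cookie CSRFtoken=V9QsjBuC5OV2Ghena6Y33sUU3RYp0iy3 for .localhost:8080/iptest%22%22>
        2. <Cookie JSESSIONID=EDCA967D4BB3B3C1C2695902FF295225 for localhost:8080/iptest/>
"""
cookies = parse_cookie_listing(listing)
print(cookies["CSRFtoken"])
```

Reading the cookies directly from the session, as above, is the more robust route; the text parser would break on cookie values that contain whitespace.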

@Cito
Member

Cito commented Oct 8, 2022

I still think the CSRF token has to be passed into the login form. I say this because I have an old Python script which interacted with this website using the Python requests module.

The Python requests module does not send hidden fields automatically; it is not a stateful browser like the one in Twill (although requests also supports redirection and can keep cookies if you use a session, which is what Twill does under the hood).

So the twill browser should be able to handle any "normal" CSRF mechanism out of the box.
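
To illustrate what "sending hidden fields automatically" means, here is a minimal sketch using only the standard library. It is not how Twill is implemented: the class HiddenFieldCollector, the function build_login_payload and the sample HTML are all hypothetical, with the markup modelled on the form listing earlier in this thread. For brevity it ignores form boundaries and collects hidden inputs from the whole page.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def build_login_payload(html, email, password):
    """Merge the page's hidden fields with the values typed by the user."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    payload = dict(collector.hidden)
    payload.update({"email": email, "password": password, "login": "login"})
    return payload


# Hypothetical markup, modelled on the form listing shown above.
SAMPLE_HTML = """
<form action="login.do" method="post">
  <input type="hidden" name="csrfToken" value="ajBoGtmro9Lji44ks5ItKfrmpI1MyyP7">
  <input type="text" name="email" id="email">
  <input type="password" name="password" id="password">
  <input type="submit" name="login" id="login" value="login">
</form>
"""
payload = build_login_payload(SAMPLE_HTML, "test@example.org", "123456")
print(payload)
```

With plain requests, a step like this has to be written by hand before posting; a stateful browser performs the equivalent when the form is submitted.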

Note however that your login page has two forms. You need to fill the values for the second form (displayed as #3 in the current version because of the mentioned bug), which has the email and password fields and the hidden CSRF token field, like this:

fv(2, 'email', 'test@example.org')
fv(2, 'password', '123456')
submit()

This should work without further ado. I think the reason why it does not work is that your site sends a strange cookie path (two quotes instead of an actual path). I guess it only works in some browsers because they silently "correct" the path.

So it should work if you correct the path like this:

go("https://ipt.gbif-uat.org/login.do")

for cookie in browser._session.cookies:
    if cookie.name == 'CSRFtoken':
        cookie.path = '/'
        break

fv(2, 'email', 'test@example.org')
fv(2, 'password', '123456')
fv(2, 'login', 'login')
submit()

# this finds the error message for the wrong password
find("combination does not exist") # should be ok

# to confirm that "find" works, this should raise an error
find("some garbage")

The "cookie patching" would not be necessary if the website had sent a proper cookie path, so if you fix this on the server, it should work without that.

@abubelinha
Author

abubelinha commented Oct 9, 2022

Thanks a lot for your detailed answers.

  • I'm so grateful for the good workaround to get the cookies!
  • Thanks for the warning about the two login steps. I had noticed that and was doing more or less what you describe, with subtle differences: I was passing control numbers as strings, fv("2",...), and also passing the form number when submitting, submit("1"), as I saw in this docs section. Maybe that's worth clarifying too.
  • So the key was the wrong cookie path. Thanks a lot! I probably would never have noticed that.

If I open the Network tab of Chrome's developer tools and select login.do to see its Cookies tab, the Request Cookies section shows / as the Path for CSRFtoken, whereas the Response Cookies section shows "" as the Path for CSRFtoken.
But the login succeeds anyway in the Chrome browser. Is this what you mean by 'works in some browsers because they silently "correct" the path (...) so if you fix this on the server, it should work without that'?

Regarding fixing it on the server: that site is a test installation of this Java webapp (running on Apache Tomcat/7.0.76, as it tells you after login).
Might this wrong CSRFtoken cookie path be a common situation with apps served by Java/Tomcat, one that most browsers correct by themselves? (In that case, might it be worth having twill do the same?)
Or do you think it is simply a bug in the webapp (so we should tell its developers about it)? There are some 'cookie' and 'CSRF' related issues (mostly closed), but I am not sure any of them is related to what you discovered.
If you think I should open a new issue there, I hope you don't mind me tagging you (or feel free to open it yourself, since you would explain it much better; I hardly understand what cookies are).

Thank you so much again!

@Cito
Member

Cito commented Oct 9, 2022

I was passing control numbers as strings

I think you can do both. You can also pass field or form names instead of numbers (if they are named).

and also passing the form number when submitting submit("1")

That's optional; if you leave it out, it uses the form of the last formvalue (fv) command.

Might this wrong CSRFtoken cookie path be a common situation with Java-Tomcat served apps

I guess it's the ipt web app or its configuration. The cookie path is set in its CsrfLoginInterceptor class, and something probably is not done right there. It also catches and ignores all Exceptions when setting the path, which does not look clean to me.

ipt issue #1652 could be related to this.

But the login succeeds anyway using Chrome browser.

Yes, I guess because Twill (or rather the RequestsCookieJar which is used under the hood) is more nitpicky about the path.

Requests issue #6245 also looks related to this.

You can tag me, but currently I do not have the time to look deeper into these issues.

The crucial issue here is that cookies can have a domain and a path attribute which specify for which domains and URL paths they shall be valid and sent to the server. If the client (the browser or Twill) thinks the path does not match, it does not send the cookie. The behavior if the server sends an invalid path (as ipt is doing) is undefined. Chrome seems to send the cookie anyway in this case, but the RequestsCookieJar does not. Maybe the RequestsCookieJar should be more sloppy as well.
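
The path rule can be sketched in a few lines. This is a simplified rendering of the path-match algorithm from RFC 6265 (section 5.1.4), not the actual code of RequestsCookieJar or Chrome; the function name path_matches is hypothetical, and the request path /iptest/login.do is an assumption derived from the local URL and the escaped cookie path /iptest%22%22 shown in the listing earlier in this thread.

```python
def path_matches(cookie_path, request_path):
    """Simplified RFC 6265 (section 5.1.4) path-match check."""
    if cookie_path == request_path:
        return True
    if request_path.startswith(cookie_path):
        # A prefix only counts when it ends at a "/" boundary.
        rest = request_path[len(cookie_path):]
        return cookie_path.endswith("/") or rest.startswith("/")
    return False


request_path = "/iptest/login.do"  # assumed request path for the local test site
print(path_matches("/", request_path))              # True
print(path_matches("/iptest/", request_path))       # True
print(path_matches("/iptest%22%22", request_path))  # False
```

Under this rule, a cookie whose path ends in two escaped quotes never matches the login URL, so a strict client does not send it back, while patching the path to / makes it match every request on the host.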

@abubelinha
Author

Thanks a lot, @Cito.
This should be more than enough!
