Upgrade of Minio appears to have lost all data #5237
Comments
The data is not lost here; instead you don't have enough quorum to read it. I think the issue may be that the servers are not connecting properly at startup. |
How did you upgrade? Did you attempt a rolling upgrade? A rolling upgrade is not possible, since Minio requires all servers in a distributed setup to be of the same version. From the logs it seems like there were errors related to JWT, and it also looks like there is a problem with NTP; the nodes' clocks are perhaps skewed. To make sure that things are fine |
The upgrade was done cold, with all the Minio servers down. The servers were all NTP-synchronised, though 3 of the hosts (which sit on the same physical host) rebooted and may have been slightly out of sync for a short period, which was the start of the issue. Now 15 out of 16 of the hosts are up and are within 30ms of each other. Here is the current NTP variance.
All servers are confirmed to be running.
Heal cannot be used since the server is not initialized
I have tried another restart with the servers definitely in time sync. I still see the JWT token string error and also
I'm still stuck. Any more ideas? |
Would it be possible to access these nodes remotely? |
Definitely. I could do a screenshare. |
I'll hang around in an appear.in room and should be available for the next 2-3 hours (today) |
After a lot of testing on a 16-node setup on packet.net, I am not able to reproduce the problem with either older releases or newer ones.
The only change here is that I am running on Ubuntu 16.04; perhaps you are on a different operating system. |
I even tried with 10 different buckets, each of them with a policy.json in place; I am not able to see the JWT issue that you are having, nor is the healing failing. There is certainly something odd about the setup we are encountering on your end. |
After looking further into the token issue, which is perhaps what I think is happening in your setup, there is only one reason it could have happened. The actual verification is done by our JWT library:

```
if c.VerifyIssuedAt(now, false) == false {
	vErr.Inner = fmt.Errorf("Token used before issued")
	vErr.Errors |= ValidationErrorIssuedAt
}
```

which basically calls the following function:

```
func verifyIat(iat int64, now int64, required bool) bool {
	if iat == 0 {
		return !required
	}
	return now >= iat
}
```

On the issuing side the token is generated as:

```
utcNow := UTCNow()
token := jwtgo.NewWithClaims(jwtgo.SigningMethodHS512, jwtgo.StandardClaims{
	ExpiresAt: utcNow.Add(expiry).Unix(),
	IssuedAt:  utcNow.Unix(), // Unix epoch time
	Subject:   accessKey,
})
```

Similarly, on the JWT library end it validates using:

```
// Validates time based claims "exp, iat, nbf".
// There is no accounting for clock skew.
// As well, if any of the above claims are not in the token, it will still
// be considered a valid claim.
func (c StandardClaims) Valid() error {
	vErr := new(ValidationError)
	now := TimeFunc().Unix() // Unix epoch time
```

By this logic the two epochs can only differ if the remote node is running behind the server which generated the token; that is why the message says the token was used before it was issued. I still strongly think there is some date problem on these machines. Previous releases like 2017-06-13 had a different logging model which perhaps somehow worked, but the newer release is exposing a bug in the setup. Attaching a compiled Go program (unix.zip) to run on your systems to look at the unix timestamp. Can you run this on your end so we can confirm the unix timestamps report the right values? Thanks. Let me know if you are up for another remote session; we can schedule it again at your convenience. |
All machines reported an identical epoch time. I'll run a proper iperf later, though I notice the bandwidth from a single test host is smaller than I'd expect (the host has a 2x10Gb bonded interface). There may be a subtle network misconfiguration. |
We should perhaps sit on this again, PST time, if you have time to see the issue through. There is something which doesn't add up in the entire configuration.
This number is expected; if you ran two iperfs it would show you a higher number, i.e. combining each. |
This has also been run from the virtual machine and not from the underlying host. I'll take a look later at home, since I have a lot to do today. |
Happy to sit on this again later. |
I copied all the data onto a single machine and pointed the newest Minio directly at the directories. There was not a single error and the data came online straight away (though it seems to think I have 2.8TB free out of 5.8TB, which is definitely not right). Does this point to the network issue you mentioned? |
Yes, I would expect that; you could even try it on a different data partition, such as a temporary path such as
This would be something related to the
Yes, it must be a network problem; there is some sort of packet delay which causes this. But I am not really sure at this point what might be causing it. In my testing I am not able to reproduce it anyway. We can take a look again if you have time. Also, please reach us on https://slack.minio.io, where we can perhaps have a quicker discussion on this. Thanks. |
To continue this discussion further, would it be possible for you to join our Slack channel? Let me know when you wish to continue. |
I also see this error sometimes when I test 4 VirtualBox VMs using Vagrant on my local machine. This PR, dgrijalva/jwt-go#139, wants to add an option to configure a tolerated skew time between server and client, citing this RFC draft (as seen in the PR discussion): http://self-issued.info/docs/draft-ietf-oauth-json-web-token.html#expDef
I wasn't able to test, but I guess the following diff can fix the problem:

```
diff --git a/cmd/jwt.go b/cmd/jwt.go
index e8da488d..413fef2a 100644
--- a/cmd/jwt.go
+++ b/cmd/jwt.go
@@ -47,6 +47,16 @@ var (
 	errNoAuthToken = errors.New("JWT token missing")
 )

+type relaxedJWTClaims struct {
+	*jwtgo.StandardClaims
+	// Tolerated clock skew time for IssuedAt time, in seconds
+	leeway int64
+}
+
+func (c *relaxedJWTClaims) VerifyIssuedAt(cmp int64, req bool) bool {
+	return c.StandardClaims.VerifyIssuedAt(cmp-c.leeway, req)
+}
+
 func authenticateJWT(accessKey, secretKey string, expiry time.Duration) (string, error) {
 	passedCredential, err := auth.CreateCredentials(accessKey, secretKey)
 	if err != nil {
@@ -97,7 +107,7 @@ func isAuthTokenValid(tokenString string) bool {
 	if tokenString == "" {
 		return false
 	}
-	var claims jwtgo.StandardClaims
+	var claims = relaxedJWTClaims{leeway: 60}
 	jwtToken, err := jwtgo.ParseWithClaims(tokenString, &claims, keyFuncCallback)
 	if err != nil {
 		errorIf(err, "Unable to parse JWT token string")
```
|
That looks like an interesting fix. @vadmeste can you test and confirm whether it works? |
@vadmeste I looked at this further. IssuedAt never does an exact match; the check passes if the current time is greater than or equal to the issued time. So the leeway fix is not needed IMO and might not be the reason why this issue occurred.

```
func verifyIat(iat int64, now int64, required bool) bool {
	if iat == 0 {
		return !required
	}
	return now >= iat
}
```
|
This is for VerifyExpiresAt, not VerifyIssuedAt; issued-at should be verified as above @vadmeste |
Yeah, I am still doing a little investigation. I was able to reproduce it with my Vagrant setup. |
Running two VMs on my machine (vagrant+virtualbox) shows the problem. I use the following code, which prints the time with high precision plus the epoch time:

```
package main

import (
	"fmt"
	"time"
)

func main() {
	for i := 0; i < 10; i++ {
		t := time.Now()
		fmt.Print(t, t.Unix())
		fmt.Println("")
		time.Sleep(50 * time.Millisecond)
	}
}
```

I have two terminals logged into VM1 and VM2. If I run the above code in VM1 and then manually and quickly in VM2, I can see that sometimes VM2 shows an earlier time. VM1: VM2: It seems that time in VMs can skew sometimes. |
Is there some hint in the code differences between the two releases above, since I do not get the problem with the older version but do with the latest? Incidentally, after setting up the latest version of Minio and mirroring all my data into it, I no longer see any issues on the new cluster. So it might be that some time-related artefact gets saved to the filesystem and causes problems even for machines which are in sync after the problem happened (all our machines were very much in sync while experiencing the problem). |
With the older version the server was generating the token, so the window for this to occur was smaller. Now that we don't transmit access keys over the network, the client generates a token and gets it validated by the server, so the issue is more visible. (Client and server here are both Minio servers talking to each other.) The timing issue of virtual machines is outside of this problem anyway. JWT considers this unexpected, since according to them a token issued and then used later within its expiry should be on an entity for which time has progressed monotonically. This expectation is problematic when precision rounding happens because only whole seconds are used for the epoch: a 50ms difference can cause the top-level seconds to vary by around a second due to rounding. I will read up more and see if we can use nanoseconds instead for better precision here. The JWT spec is not explicit about this. |
jpadilla/pyjwt#190 explains this frustration in more detail. It looks like the JWT RFC has gone back and forth on this: https://tools.ietf.org/id/draft-jones-json-web-token-04.html (draft, expired in 2011);
https://tools.ietf.org/html/rfc7519#page-10 (proposed) doesn't mention any leeway regarding iat, so it is missing from the official proposal. Looking at the PyJWT implementation, a certain leeway is provided; perhaps we should bring that in as well. So I guess @vadmeste's proposal is okay in this scenario; implementing our own claims wrapping over StandardClaims would work fine. |
Let me add a new implementation, leewayClaims, which adds leeway for expiry and iat together. |
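The leeway idea boils down to shifting the comparison by a tolerance window. A minimal, dependency-free sketch (the name `verifyIatWithLeeway` and the 60-second window are illustrative assumptions, not the actual implementation):

```go
package main

import "fmt"

// verifyIatWithLeeway mirrors the shape of jwt-go's verifyIat but tolerates
// up to `leeway` seconds of clock skew between issuer and validator.
// Illustrative sketch, not the library's actual code.
func verifyIatWithLeeway(iat, now, leeway int64, required bool) bool {
	if iat == 0 {
		return !required // claim absent: valid unless required
	}
	return now+leeway >= iat
}

func main() {
	// Validator appears 1 second behind the issuer after epoch truncation.
	fmt.Println(verifyIatWithLeeway(1001, 1000, 0, true))  // false: rejected without leeway
	fmt.Println(verifyIatWithLeeway(1001, 1000, 60, true)) // true: 60s leeway absorbs the skew
}
```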
Remove the requirement for IssuedAt claims from JWT for now, since we do not currently have a way to provide a leeway window for validating the claims. Expiry does the same checks as IssuedAt, with an expiry window. We do not need it right now since we have a clock skew check in our RPC layer to handle this correctly. rpc-common.go:

```
func isRequestTimeAllowed(requestTime time.Time) bool {
	// Check whether request time is within acceptable skew time.
	utcNow := UTCNow()
	return !(requestTime.Sub(utcNow) > rpcSkewTimeAllowed ||
		utcNow.Sub(requestTime) > rpcSkewTimeAllowed)
}
```

Once the upstream PR dgrijalva/jwt-go#139 is merged, we can bring in support for leeway. Fixes minio#5237
Expected Behavior
Expected data to be available after upgrade
Current Behavior
All servers show the following errors
All my data is unavailable. I still see the fragments of the files in the data directories but am unable to access anything at all. This is a big problem.
Possible Solution
Steps to Reproduce (for bugs)
Server not initialized, please try again
Context
An upgrade. On startup I was told my version was 5 months old, so I decided to upgrade, and lost data.
Your Environment
Version of Minio used (`minio version`):
Previous version:
Version: 2017-06-13T19:01:01Z
Release-Tag: RELEASE.2017-06-13T19-01-01Z
Commit-ID: 353f2d3
New version:
Version: 2017-11-22T19:55:46Z
Release-Tag: RELEASE.2017-11-22T19-55-46Z
Commit-ID: d1a6c32
Operating System and version (`uname -a`):
Linux data2 3.11.0-17-generic #31~precise1-Ubuntu SMP Tue Feb 4 21:25:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux