After migration from .NET7 to .NET8 SqlException 0x80131904 started to appear randomly when connecting to Azure SQL databases #2400
Comments
Please remove the label "sqlite" - it was added by mistake, but I can no longer change it to "ms sqlclient" or similar. |
@JiriZidek Can you capture some EventSource logs? Also, can you confirm that the issue does not happen after reverting to .NET 7? |
I recommend you contact Azure support and have the networking team look into what is going on during those periods of failures. |
I am afraid there are no "periods". The problem appears randomly, but always after restarting the container. So from this point of view it is reproducible. |
Yes - we tested the old version, which differs only in being .NET7 with all the NuGet packages from December (EF, SqlClient, image aspnet:7.0), and yes, when restarted it does not yield this error. |
I realized one important point - I run the very same code as an Azure Service, using a Linux plan and .NET8 - and in this case there is no such problem with SqlClient. |
This confirms that your best option would be contacting Azure support. |
But how does this fit with .NET7 working and .NET8 not, in the same place, the same AKS cluster? I can run two pods, one old and one new, in parallel to decide whether it is likely a network problem. I'll let you know. I'd like to avoid being ping-ponged back from Azure support. |
Just a note to confirm we're seeing the exact same issue in almost an identical setup:
Exception message:
Note that the error occurs sporadically (less than 0.5% of the calls), and we have been seeing the exceptions since the day the .NET upgrade was deployed. |
So positively confirmed. AKS, same K8s namespace, one pod with .NET8 and one with .NET7. So for me this is NOT an Azure problem. It is a problem of .NET8 or a problem of Debian 12 (which is the base of the image). But since I have observed the same bad behavior on Ubuntu-based images, I bet on .NET8 or Microsoft.Data.SqlClient. I guess there is something wrong with the TLS handshake. |
I'm having the same issue, there are no concurrent connections to the DB and no async operations happening in parallel on the same context.
|
We experienced the same SqlException (0x80131904), however in a slightly different environment, and I believe I understand the source of this exception, at least in our environment.
TLDR
This exception occurs when there is a proxy between the client and the SQL Server AG Listener, and the proxy accepts TCP connections for both the primary and secondary SQL Server instances. To determine which host is primary, SqlClient opens a Socket to every IP returned by the DNS lookup; the first socket to open is considered the primary and the other connections are disposed. (See ParallelConnectAsync and ParallelConnectHelper in SNITcpHandle.cs.) Under normal circumstances, without the proxy, the secondary instance is not listening on <listener-ip>:1433 and therefore the client cannot open a socket to it. But when a proxy sits between the client and server, the client can open the socket and complete the TCP handshake with the proxy for both the primary and secondary SQL Server instances. If the socket to the secondary opens first, this exception is thrown as soon as any data is sent.
Environment
pod.yaml
service.yaml
Description
The first issue is that Istio will create a Virtual IP (VIP) for the ServiceEntry, e.g. 240.240.240.10, and any DNS query performed within the pod for the listener name will resolve to this VIP. In addition to the VIP, the Istio proxy sidecar will do DNS resolution on that hostname, put both the primary and secondary IPs into a pool, and load-balance across all IPs in the pool. This means SqlClient will sometimes get sent to the secondary instance.
Disable the VIP and Set the Service resolution to NONE
With the VIP disabled, the pod's DNS queries will now resolve to the 2 SQL Listener IPs, and with the ServiceEntry resolution set to NONE, connections will be routed/forwarded to the IP requested. This, however, does not solve the root of the issue. Istio Proxy is listening and accepting connections on 1433. SqlClient attempts to open 2 connections to the 2 IPs, and because Istio is listening, both sockets will succeed. If the primary is opened first, all is good. If the secondary is opened first, SqlClient will perform the TCP handshake and then the TLS / TDS handshake, which is where the exception is thrown.
Bypass Istio
This will cause any traffic on port 1433 to completely bypass Istio, and traffic will go directly to SQL Server without issue.
Possible Fix
In SqlClient, delay the decision about which IP/host is considered primary until after the TLS / TDS handshake has been attempted, instead of deciding on Socket.Connect(), and perform the handshake in parallel for all IPs. If a socket is opened to the secondary listener IP, it will fail at the handshake; otherwise its Socket.Connect() will time out. In the meantime the primary handshake can proceed, and once it succeeds the rest of the connections can be discarded. |
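The difference between the current behavior and the proposed fix can be illustrated with a small local simulation. This is only a sketch in Python, not SqlClient code: the function names, the 5-byte HELLO/READY exchange standing in for the TLS / TDS handshake, and the artificial delays are all assumptions made for illustration. Both listeners accept TCP connections (as the proxy does), but only the "primary" answers the handshake.

```python
import asyncio

async def start_server(answers_handshake):
    """Local listener. It always accepts TCP (like the proxy); only the
    simulated primary replies to the handshake."""
    async def handle(reader, writer):
        try:
            await reader.readexactly(5)          # wait for b"HELLO"
            if answers_handshake:
                writer.write(b"READY")
                await writer.drain()
        except (asyncio.IncompleteReadError, ConnectionError):
            pass
        finally:
            writer.close()                       # secondary: close without reply
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    return server, server.sockets[0].getsockname()[1]

async def tcp_connect(name, port, delay):
    await asyncio.sleep(delay)                   # assumed network latency
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    return name, reader, writer

async def handshake(reader, writer):
    """Stand-in for the TLS / TDS handshake (hypothetical protocol)."""
    writer.write(b"HELLO")
    await writer.drain()
    try:
        reply = await reader.readexactly(5)
    except asyncio.IncompleteReadError as exc:
        raise ConnectionError("peer closed during handshake") from exc
    if reply != b"READY":
        raise ConnectionError("handshake rejected")

async def first_connect_wins(endpoints):
    """Behavior described above: the first socket to open is used."""
    tasks = [asyncio.create_task(tcp_connect(*e)) for e in endpoints]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    name, reader, writer = next(iter(done)).result()
    try:
        await handshake(reader, writer)          # raises if this is the secondary
        return name
    finally:
        writer.close()

async def connect_and_handshake(name, port, delay):
    name, reader, writer = await tcp_connect(name, port, delay)
    try:
        await handshake(reader, writer)
    finally:
        writer.close()                           # the sketch only reports the winner
    return name

async def first_handshake_wins(endpoints):
    """Proposed fix: the first endpoint to finish the handshake is used."""
    tasks = [asyncio.create_task(connect_and_handshake(*e)) for e in endpoints]
    try:
        for next_done in asyncio.as_completed(tasks):
            try:
                return await next_done
            except ConnectionError:
                continue                         # e.g. the secondary; keep waiting
        raise ConnectionError("no endpoint completed the handshake")
    finally:
        for task in tasks:
            task.cancel()

async def demo():
    primary, p_port = await start_server(True)
    secondary, s_port = await start_server(False)
    # Assumption: the secondary's TCP connect happens to complete first.
    endpoints = [("primary", p_port, 0.2), ("secondary", s_port, 0.0)]
    try:
        try:
            old = await first_connect_wins(endpoints)
        except ConnectionError as exc:
            old = f"failed: {exc}"
        new = await first_handshake_wins(endpoints)
    finally:
        primary.close()
        secondary.close()
    return old, new

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Under the assumed delays, the first strategy should pick the secondary and fail at the handshake, while the second should skip the failed attempt and settle on the primary. A real implementation would also have to bound the wait with the connection timeout and hand the winning connection back to the caller instead of closing it.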
👍 In my recent experiments this makes the app almost unusable with Azure SQL. I planned to use this in a POC project. |
The only reason I mentioned Pooling=False is that it is useful if anyone attempts to re-create the issue and wants to reproduce it reliably. It is unlikely you would want to set Pooling=False in a real-world application. |
I am facing the same issue as well.
|
Ignore my comment from before, the issue I am facing is not related to this issue. |
After migration from .NET7 to .NET8 SqlException 0x80131904 started to appear randomly when connecting to Azure SQL databases
The error happens in two flavors:
We observed that this error usually happens after a longer (>minutes) period of inactivity (Kestrel web app not being hit). After the first occurrence, the successive DB operations succeed. We use SqlClient in our code indirectly through EF.
Version information
Microsoft.Data.SqlClient version: 5.2.0 (but same observed for 5.1.5) - Microsoft.Data.SqlClient.dll 5.20.324059.2 dated 28-FEB-2024 size 911480 bytes
Target framework: .NET 8 - 8.0.3 (but same observed for 8.0.2, 8.0.1 & 8.0.0)
Operating system: Linux docker image - same result on mcr.microsoft.com/dotnet/aspnet:8.0 and mcr.microsoft.com/dotnet/aspnet:8.0-jammy - hosted in AKS, similar behavior for backend service using image runtime:8.0
Relevant code
Original code - in .NET7 - now randomly failing .NET8:
Improved code - no better, same problems:
Examples of failing places (they are in fact random, in various programs):
Connection string
Connection string looks similar to
Server=tcp:SOME-sql.database.windows.net,1433;Initial Catalog=MyDB;Persist Security Info=False;User ID=some-admin;Password=123456789;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;
TrustServerCertificate=True: with no effect.
MultipleActiveResultSets=True: with no effect.
Full stack traces
We hoped the problem would be solved by 5.2.0, but that does not appear to be the case.