Comments
(especially since we've asked maintenance to be scheduled at night and not during the day)
One question: Does this mean that whoever's daytime is your nighttime would then see that lag instead?
@ofer2 Thank you for the update!
However, as an SRE who works at extremely large scale, I'm puzzled by your comment about Google. They know how to manage capacity and deploy safely, and I'm shocked that they cycled hosts in a way that impacted customers. If I did that, I'd be writing a detailed post-mortem that identified the process gaps and described the solutions we would implement to prevent it from happening again... and the deployments would have been stopped as soon as customer impact was observed, likely within minutes.
I understand NDAs and that you probably can't discuss details, but perhaps you can answer a yes/no question: do you know how/why capacity buffers were exceeded during the upgrade, and why the upgrade continued even with customer impact?
Really annoying when you're fully recharging an important portal that's under attack and, due to lag, it suddenly gets destroyed. (Happened to me more than once.)
Thanks for the update; I appreciate any information about this topic at this point. And I think that's the most annoying thing about all this: we've been experiencing issues for three years. You 'have been working on this for god knows how long', but we have been informed exactly three times.
There was a tentative fix (for at least the recharge part), and usually when there are tentative fixes, we never hear about them again, as they never work. (The blank/incomplete map loaded at startup, for example.)
In any case, when Niantic is completely silent and just ignores every topic and question, we rightly have the feeling that it's just the way it is. And I still can't believe that after three years of issues, and one topic of 'data collected from players', there is no insight into any root cause (or multiple) and no impending solution or fix. But I don't work at Niantic; I just deal with the diminished player counts, and I'm just not enjoying Ingress anymore, and I know a lot of other people feel the same way. (But that's something I assume you already know and see in the data/monitoring.)
Highly appreciate the update. While the lag is annoying, I think the community is more annoyed at the lack of communication (not just about this issue), but community outreach is not part of your role, or something you should have to be doing.
I honestly do not understand how a developer whose primary job is, well, developing things rather than taking care of community, can have THAT MUCH more effective communication and better responsiveness than, you know, @NianticThia . Just what is happening at Niantic??? I can't even imagine.
Still, though. I barely used that button in OG Ingress, and even in Prime before the move to cheaper compute, this issue didn't occur nearly as much as it does now.
Not that it matters, but I remain convinced this is a capacity issue on some level. The lag problem is at least as old as C.O.R.E., which celebrates its second birthday in three months.
As much as I appreciate the update, I remain highly skeptical solving this problem is actually on the calendar with any kind of priority.
Thanks for the update; these sorts of communications from the teams are really appreciated, even if it's still "we're working on it." It really sucks to hear nothing but radio silence on all fronts when there are many known issues.
@ofer2 Very much like the update. Could we get a monthly update, or even one every two months? Even if nothing's been done, it reassures us that you care.
You have always been really good at providing insight into what's going on and it's appreciated.
@Kevinsky86 I suspect it's some combination of a capacity issue and a design/architecture issue.
I do know that there are things that used to work well for me at some time in the past that are constantly problematic right now. The sequence of events I encounter most frequently is:
Single-reso capture a portal
Try to deploy a mod
More often than not this fails with an error about the portal being neutral, but then succeeds on the second try. I used to be able to do this pretty reliably.
No. Ingress usage isn't constant. That is to say, at that time there are fewer players playing, so we don't need as much capacity running to support healthy gameplay.
So, it's a little complicated. The specific thing that happened was a GKE node upgrade. We can disable auto-upgrades altogether, but the GKE control plane will auto-upgrade no matter what. This means that if the control plane upgrades too far past the node pools, the system will stop working. This is why we keep auto-upgrade on. The more specific issue is that there is supposed to be a setting which causes the auto-upgrades to happen at a specific time and not in the middle of the day. Not sure what went wrong here.
As to the issue of the lag itself, I can give a little more insight into some of the complexity here. Prime uses the same client-to-server communication layer as the other Niantic games. This layer is primarily worked on by a shared team at Niantic so that each game doesn't need to duplicate all that effort. That communication layer consists of two parts: a client-side component and a server-side component. We initially thought that the client-side component, which handles things like "what does the client do if it doesn't receive a response in X seconds" and "how many requests should the client be allowed to make at once so that it doesn't spam the servers", was the core issue. Sounds like it matches a lot of the symptoms, right? Turns out that some of our fixes (and when I say "our", I mean Ingress coordinating with a central team, as we don't generally work on this shared code) didn't fix it. So, we then thought that it might be the server receiving the request and processing it, only for the client to disconnect before the server could send the response back. After that, we thought it might be that the client is retrying like 10 times and the server is doing a lot of work figuring out that they're duplicate requests.
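The two client-side behaviors mentioned here (a response timeout and a cap on concurrent requests) can be sketched roughly as below. This is purely illustrative; the class, names, and numbers are assumptions, not Niantic's actual shared layer:

```python
import threading
import time

class ClientRequestPolicy:
    """Toy sketch of a client-side communication layer: a response
    timeout plus a cap on in-flight requests. Illustrative only."""

    def __init__(self, timeout_s=5.0, max_in_flight=4):
        self.timeout_s = timeout_s
        self.slots = threading.Semaphore(max_in_flight)

    def send(self, do_request):
        # Cap concurrency so the client doesn't spam the servers:
        # block here if max_in_flight requests are already outstanding.
        with self.slots:
            start = time.monotonic()
            result = do_request()
            # Simplification: a real client would enforce the timeout
            # while waiting, then decide whether and when to retry.
            if time.monotonic() - start > self.timeout_s:
                raise TimeoutError("no response in time; caller may retry")
            return result
```

The interesting failure modes live exactly in these two knobs: a timeout that is too aggressive triggers retries that the server may still be processing, which is the duplicate-request scenario described above.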
It is important to us to fix this, so we're still working on it, but hopefully you can at least see how involved/complicated it is to actually figure out. We know that many of you don't care how complicated it is and just want it fixed, and that's fair, but I figured some of you might be interested in understanding a little more.
I get around 600-700 "RetriesError" messages on average when I play normally, trying to deploy resonators. This has been ongoing for over a year. When deploying mods, I sometimes wait 30-60 seconds before the mod deploys. Recharging has been much smoother lately.
I understand that it's complex, but I fail to see how that explains why it is now taking more than two years to fix. There have been so many people willing to help, run debug builds, beta test, etc.
The biggest issue is that communication is nonexistent. We get a topic to post our 'logs' in, but we never get any feedback, and the topic dies out. We get a tentative fix that didn't work, and it's never mentioned anywhere again. Even this topic will be one year old in November, and there was nothing noteworthy until you appeared (thanks again; I can't stress enough how much of a boost it is to morale to have someone take the time to write something of a reply).
I'm eagerly awaiting the fix, but you (Niantic) have to see for yourselves that it's all a little too late, as most players have lost faith that it will ever be fixed and have moved on to other games.
@ofer2 Thank you for the clarifications.
I actually do grok how complicated things like this can be... and I'm also aware that a single problem perceived by users is actually multiple complex issues under the hood, and sometimes compound issues. I'd say that I don't envy you the task of solving this but I sort of do... this is exactly the sort of problem that I love the most because it involves a lot of deep-diving and complex troubleshooting, and it's incredibly rewarding to finally solve the problem.
As long as the subscriptions get paid, NIA doesn't have latency; there's no problem. Players have the problems. A scanner that doesn't work, latency, bugs, fakes, and multi-accounting are not problems, because NIA has no problem getting the subscriptions paid.
When I read this, I really wonder how the 'old' Ingress was able to perform at all 😂
In the old days it worked on phones with 256/512 MB of memory, at anomalies with many thousands battling in a single mobile cell or very few of them.
Not that there weren't update errors or even occasional app crashes under those heavy-load conditions, but they were comparably rare, especially compared to nowadays, when it's just normal gameplay with very few agents at once, even in big cities.
Mobile networks, servers, and especially mobile phones have massively increased their speeds, available resources, and sensor precision, not just by percentages but by remarkable factors.
Guys, it's really a shame what you roll out as end-user software, and what a stale software concept now hides on the server end, massively slowing down Ingress gameplay and updates, especially in battle situations.
I would really love to see big anomalies take place again, but I understand it's completely impossible with the current setup and system design in general.
That's really sad...
Hrm, running Redacted at a large anomaly, you had plenty of lag too. Just keep pressing buttons and hope something happened.
This is fantastic information. I personally enjoy learning more about how things work in Ingress, even if I don't understand the high-level code behind it.
I'm more familiar with embedded code than anything else, e.g. robots, 3D printers, etc.
I reckon there's actually a sizable number of us players who would just love to hear more. Perhaps even a podcast with other devs, too.
Turns out that some of our fixes (and when I say "our", I mean Ingress coordinating with a central team, as we don't generally work on this shared code) didn't fix it. So, we then thought that it might be the server receiving the request and processing it, only for the client to disconnect before the server could send the response back. After that, we thought it might be that the client is retrying like 10 times and the server is doing a lot of work figuring out that they're duplicate requests.
Shouldn't the client requests be tagged with some sort of session ID + request counter + attempt counter, so that when resends happen, the server can tell "Oh, I already did that request; this is just a retry" without any real calculation?
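A toy sketch of the deduplication scheme this question describes, under the assumption that the server can cache responses keyed by session and request ID (none of this reflects Niantic's actual protocol):

```python
class DedupServer:
    """Idempotent request handling: the first response for each
    (session_id, request_id) pair is cached, so retries are answered
    from the cache without redoing the work. Illustrative only."""

    def __init__(self):
        self._responses = {}  # (session_id, request_id) -> cached response

    def handle(self, session_id, request_id, attempt, do_work):
        # `attempt` would only matter for logging/metrics here.
        key = (session_id, request_id)
        if key in self._responses:
            # Retry of a request we already processed: replay the answer.
            return self._responses[key]
        response = do_work()
        self._responses[key] = response
        return response
```

The catch in practice is usually cache lifetime and the race where a retry arrives while the first attempt is still being processed, which is closer to the scenarios the dev post describes.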
That's kinda how it works right now, though... years later, with much faster mobile phones, on much faster networks.
Thanks for somewhat of an answer. Hopefully someone can keep us updated at least somewhat going forward. Hearing nothing about a big problem in the game is crappy.
Force sync should fix some issues. It would remind the server to speak to the client, methinks.
It doesn't sound like they are any closer than a year ago to isolating the issue and finally solving it. It's been so long now that I don't really know what they need to do to properly fix it. But it's nice to get an update at least, even if it's not the best news we hoped for.
Non-technical update: we have a fix that we're trying out; we're hopeful that it'll work, but we have a couple of ideas to try if it doesn't. The fix should be live either today or tomorrow.
Technical update: we have a new theory for what is happening. Some context first: when we hear "latency is high", there isn't a smoking gun that we can work backwards from, because we can't reliably reproduce the issue. The issue primarily happens at scale and under load. Therefore, we use a different strategy, which is to look at the data. We have data which samples requests randomly and gives us some information about each request. We can filter these sampled requests by how long they took. This process essentially gives us a list of the "top 10 sources of latency." That's primarily what we've been looking into to try to make things faster (this is where we've looked into the retry stuff, and the other various ideas we've had). So, we've done a bunch of work here, reducing the latency of the things we've found.

After continuing to look, we've discovered that there's a hole in our data: our infrastructure is designed to scale up/down when the load changes. Google seems to be assigning requests to servers that have just been scaled up but aren't actually ready yet. Although the request isn't "delivered" to the machine, because it isn't ready, it is assigned to the machine and just waits. We're talking to Google about this, and we have a workaround that we're trying, but we aren't sure that it will fix it. Since this is a weird startup thing which only happens under a lot of load, it's hard to test internally. We've done our best, but the final test is what happens with real players. So, we're trying out the current fix, but if that doesn't work we have a few more ideas.
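The sampling-and-ranking approach described above can be sketched as follows. The field names (`endpoint`, `latency_ms`) and threshold are assumptions for illustration, not Niantic's actual telemetry schema:

```python
from collections import defaultdict

def top_latency_sources(samples, threshold_ms=1000, n=10):
    """Rank endpoints by how much slow-request time they account for,
    given randomly sampled requests. Illustrative sketch only."""
    totals = defaultdict(float)
    for s in samples:
        # Only count requests slower than the threshold.
        if s["latency_ms"] >= threshold_ms:
            totals[s["endpoint"]] += s["latency_ms"]
    # Biggest contributors first; take the top n.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The "hole in the data" the post describes is exactly what this kind of analysis misses: time a request spends assigned to an unready machine never shows up as any endpoint's processing time.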
All of that being said, we know it's a long journey fixing this, so thanks for bearing with us.
@ofer2 When will you start dealing with the problem of multi-accounts and those who use GPS spoofing?
Excellent update @ofer2. This is the kind of progress update we need 🎉
@ofer2 Thank you for that.
Cold start issues can be a real source of frustration, and I deal with them regularly for the services I manage and their dependencies. I don't know much about your internal architecture but intuitively it's easy for me to imagine that cold starts are a part of the problem. If so, I'd expect that you might also see increased lag during code deployments. Assuming you have internal metrics about request latency it should be pretty easy to plot those with scale-ups and code-deployments as vertical lines on the graph. You might consider reworking your health check mechanisms so that new servers aren't put into service until they've completed their startup sequence and are ready to serve traffic at scale.
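A minimal sketch of that health-check rework, assuming a simple poll-based load balancer check (the class and status codes are illustrative, not anyone's actual setup):

```python
import threading

class Server:
    """Readiness gating: the health endpoint reports 'not ready' until
    startup work has finished, so a load balancer polling it never
    routes traffic to a cold server. Illustrative sketch only."""

    def __init__(self):
        self._ready = threading.Event()

    def startup(self):
        self._warm_caches()   # stand-in for real initialization work
        self._ready.set()     # only now advertise readiness

    def _warm_caches(self):
        pass                  # e.g. prime connection pools, JIT, caches

    def health_check(self):
        # The load balancer polls this; serve 503 until startup completes.
        return 200 if self._ready.is_set() else 503
```

In a Kubernetes setting this corresponds to a readiness probe that is distinct from the liveness probe: the pod exists and is scheduled, but receives no traffic until the probe passes.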