Comments
The Niantic Ingress team will likely have a meeting Monday morning to triage what went so catastrophically wrong.
It is great that players are organising themselves to create massive link stars. That said, it should never prevent other players around the world from playing.
This is something like the third link star in the last few weeks, and each time, on the new low-cost servers, the game has been unplayable for everyone else.
Reminds me of the Christchurch Shonin anomaly, back on the high-cost servers before Prime, when both factions had to run the anomaly without intel because Japanese players overwhelmed the server. Japan was running its Shonin anomaly at the same time as Christchurch.
We need some word of what is going on!
But the question is: was Niantic aware there was going to be a linkstar attempt in Japan over the weekend? Since it seems nobody from Niantic has even posted that they are aware of the server issues, apparently not.
It's the weekend, and I think they were chilling without a clue that something was wrong or running slowly. They would probably know, and post something, if they played the game or read the forum. :trollface:
Would have been nice to see a response from anyone at Niantic this weekend acknowledging that they are aware of the issue and are working on making the system stable so agents can log in and play, or, well... anything.
Nothing posted on the Community Forums or on their Telegram channel.
Communication is a huge thing, @ace @NianticBrian. Things like this worry us (the player base you are so desperately trying to grow). I'm just glad that this time, when someone decided to basically cripple the game for the rest of the world (not the agents' fault; the code is at fault), I wasn't several hours from home trying to Explore a new-to-me area and finish up an Onyx badge. I don't blame the agents who did the starburst; it's part of the game. What isn't part of the game is that it kills almost everyone else's ability to log in and play. Ingress needs to take a look at its code and figure out a way to resolve such issues, or (I really hope not) limit inbound links. But I'm scared to promote that, because the limit will go in and the underlying problem will likely never actually get fixed.
If you are planning to release a subscription for Ingress, are you planning to add monitoring for game issues and to increase your in-game communication when things happen that break the game? At this point, despite people stating that they do monitor the game, I'm hesitant to believe it, as this weekend has shown us yet again: game playability breaks, and zero communication from Niantic / Ingress.
Hi everyone, I'm really sorry about this event. We have been looking into the incident and into ways to make sure this doesn't happen again.

For context, we currently scale servers up and down manually when we expect periods of high usage; e.g. we typically do this for First Saturday. That said, we do want players to be able to surprise not only the other faction but also us with things like this. So we are looking at a few possible mitigations for the future:

1. Always scaling up servers on the weekend.
2. Optimizing the routing servers to handle more traffic overall.
3. Potentially making the servers scale up and down automatically.
4. Separating out the computationally expensive components so that linking doesn't bring down the game for everyone else.

We will be looking into these options to see what we can incorporate into development quickly and what may be left for later because it is more complicated to implement.

Lastly, this one was my fault in particular: I had an issue with alert notifications. The event occurred at 1:18 AM our time, and the notification on my phone was incorrectly configured, so it did not wake me up. I'm really sorry about this. I have reconfigured the notification and tested it three times to make sure it actually wakes me if an event does occur. Other people, who weren't supposed to be the ones handling issues, woke up before me and fixed it, which is why there was such a delay.
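To make option 3 a bit more concrete, here is a minimal sketch of the kind of control loop involved. This is purely illustrative: `get_requests_per_second` and `set_instance_count` are hypothetical stand-ins for whatever monitoring and orchestration hooks the real infrastructure exposes, and every number is invented.

```python
# Illustrative autoscaling control loop (option 3). All names and
# numbers below are hypothetical, not the actual Ingress stack.
import math
import time

MIN_INSTANCES = 2        # floor: quiet periods stay cheap but alive
MAX_INSTANCES = 50       # ceiling: doubles as a budget cap
RPS_PER_INSTANCE = 500   # rough capacity of one game server

def desired_instances(current_rps: float) -> int:
    """Enough instances for the current load, clamped to [MIN, MAX]."""
    target = math.ceil(current_rps / RPS_PER_INSTANCE)
    return max(MIN_INSTANCES, min(MAX_INSTANCES, target))

def autoscale_loop(get_requests_per_second, set_instance_count, interval=60):
    """Poll load every `interval` seconds and resize the server pool."""
    while True:
        set_instance_count(desired_instances(get_requests_per_second()))
        time.sleep(interval)
```

The budget question comes down to `MAX_INSTANCES`: the ceiling caps spend, while the floor plus scale-down keeps quiet weekends cheap.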
First of all I would like, as always, to thank you for clarifying things for us in terrific detail.
Second, I want to separately and wholeheartedly thank you for admitting your fault. You are possibly the one person who can just do this without too much corporate speak. Thank you, thank you, thank you.
You see, this linkstar happened at the most inconvenient time for you to react to it: in the middle of the night on a weekend. Not only that, but Ingress agents generally prefer the times most inconvenient for their opponents to run huge, unexpected operations, even if that's the night before a weekday. So you never know when someone will decide to make another 8k linkstar, and it may not be on a weekend. I would prefer you implement the automatic scaling, so you can sleep calmly when a bunch of agents goes off to do some crazy stuff.
Lastly, once again: dude, you're just a treasure for this forum and this community. I love how honest and open you are. Keep it up!
Agreed, thanks again @ofer2 for letting us know what's going on - once again you've stepped up to the plate and taken on the community manager role when we needed it.
Would it help if the players tipped you off about linkstars in advance? I might be speaking out of turn here, but I think the community would be happy to give you whatever warning you wanted if it meant the servers stayed up, especially if it could be done securely.... and even more so if it saved you getting woken up in the middle of the night at weekends!
Thank you for the info, effort to fix potential future issues, and all the time you put into stuff.
Are there any Intel pics of the link stars?
Thank you for communicating and sharing the insights 👏
Communication, this is what players need from Niantic.
Thank you for this VERY insightful post. Infrastructure is hard; we appreciate the effort.
Make sure it's option 3, it's a complete no-brainer. Being able to automatically scale down could greatly reduce costs (profit/sustainability here), and scaling up avoids outages and increases player engagement. 🙂
@ofer2 Much appreciated; thanks for letting us know what happened.
While auto-scaling should probably be the preferred solution (so you can sleep), I get that Ingress doesn't always have the budget for that.
Alternatively, or in the meantime, would having Vanguards or some other contact method be able to advise the Niantic team when a high stress period is likely? Being surprised by what we do is great, but being surprised by what we do to the server is probably non-optimal.
I also, like the others, want to thank you for your reply. I hope a solution is implemented soon.
Do appreciate the frankness, but auto-scaling and optimization/load-hardening are just good practices that should be followed no matter what. It may not be "fun", but since breaking the game isn't fun anyway, perhaps there should be (soft) caps on things that are essentially a DoS. If the game can't currently handle it, agents shouldn't be able to do it either.
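To make "soft cap" concrete: something like a token bucket per portal would let bursts through but slow down a sustained flood instead of hard-blocking it. A rough sketch, with every number and name invented for illustration:

```python
# Hypothetical soft cap on inbound link creation for one portal: a token
# bucket that refills slowly, so short bursts are allowed but a sustained
# flood gets delayed rather than taking the whole game down.
import time

class PortalLinkThrottle:
    def __init__(self, burst: int = 200, refill_per_sec: float = 2.0):
        self.capacity = burst          # links allowed in a short burst
        self.tokens = float(burst)
        self.refill = refill_per_sec   # sustained links/second afterwards
        self.last = time.monotonic()

    def try_link(self) -> bool:
        """Return True if a new inbound link may be created right now."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the link is delayed, not denied outright
```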
What is the argument in favor of allowing unlimited inbound links if you know it will disrupt gameplay for agents globally?
Ofer, thank you for the information and apology. Really classy of you to take personal responsibility. I hope you can find a technical solution which lets you sleep! So impressed with your honesty and integrity. Good on you.
Check the sitrep category on the forum
I've asked on the Slack. We can definitely help to give the Niantic support team a heads up for planned ops like these Linkstars. Opsec would be maintained as well.
Thanks for the insightful response. Got a link where I can buy you a beer? Or even a 6 pack?
I remember being on call at a previous job. It's not always fun, and you do miss calls sometimes. I got automatic calls from our phone system that would keep calling until you picked up and accepted the call (yes, you actually had to accept it; really fun at 3 AM).
Thanks for admitting fault, and for the apology. You are human! That goes a long way with many of us. Communication is key.
Set a limit on the max links to a portal, say 1,000, but invent a new mod that enlarges the limit for one hour, to be sold in the store for $xx.xx.
Whenever that mod is deployed, scale up the cloud resources.
@ofer2 Given that Ingress's cloud resources and budget are limited, and given the surprising nature of agent operations, would you consider building a dynamic auto-scaling system instead of scaling manually every time? Isn't it the industry standard to let the machine decide when and by how much to scale? If the budget is limited, I believe you can just lower the scaling ceiling.
Thank you very much, @ofer2 and staff!!
I am glad to hear the staff's enthusiasm, and that they quickly decided to also focus on load balancing within a limited budget.
It's not that we run these operations because we want you and the other engineers to be woken up in the middle of the night (^^;
In an unrivalled global game, it is very sad to run into system resource limits like these. However, rather than clearing them with sheer volume, I hope you will be able to get through with ingenuity.
Hi @ofer2,
First, I would like to thank you so much for your honest response, describing the issue in detail and mapping out your efforts to prevent situations like this from happening again. Truly, very much appreciated!
A big thanks also for admitting the fault and apologizing; that goes a long way. We're all just human, and this is exactly the type of communication we would love from Niantic. I am so happy to see it more frequently these days.
Now to the second point: as the organizer of the mega linkstar on 19th September, the world-record Operation Cepheus, I'm curious:
1. Do you know exactly why we couldn't cross the threshold of 8,399 incoming links, at least not permanently? Is it something intentionally programmed into the game, or some random error?
2. Do you have any plans, once the performance issues are fixed of course, to increase this limit so that operations following ours can break our record? Records are made to be broken, and they inspire agents to do even more amazing things...
Thank you so much again for all your work!
Yeah, my own thought is that we may use the Vanguards, as suggested by @Perringaiden, to give us a heads-up until we implement a longer-term solution. Still discussing this with the team, though.
The complicated bit is that our infrastructure isn't set up for auto-scaling, and a few things would need to change to accommodate it. So it may take time to implement this approach. Therefore, your idea of using the Vanguards to give us a heads-up for scaling might be the way to go in the meantime. IMO we should have a scaling solution, because it lets players take advantage of dynamic situations.
Ingress has always had an element of trying to do crazy, insane things. I think that's part of what people like so much. So we shouldn't discourage it, but figure out ways to make it safer. In this case, that means creating a ridiculous number of links shouldn't break the game for other people. It is possible to do this; it just takes some time to implement. That is why it is one of the options listed above.
Thanks! As I said above, I think that our best approach would be to have you all help us until we get a better solution in place, but we're still discussing it as a team.
As far as standards go, it depends on the environment. E.g., IIRC Dropbox has a setup similar to ours: fixed capacity, with warnings that let them scale up when they cross a threshold. In our case, our infrastructure isn't set up to scale automatically, so we'd have to invest some time in rejiggering things to make that possible. Like you, I think this is probably our best approach, but that's just my opinion. We'll see what the team thinks.
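That fixed-capacity-with-warnings pattern is simple to sketch. Again illustrative only: the `page` hook and the thresholds below are made up.

```python
# Illustrative fixed-capacity-with-warnings check: run at fixed capacity,
# page a human when utilization crosses a threshold, and leave the actual
# scale-up as a manual step.
WARN_AT = 0.70   # page early enough that someone can react
CRIT_AT = 0.90   # almost out of headroom

def check_capacity(current_load: float, capacity: float, page) -> None:
    """`page` stands in for whatever alerting hook wakes someone up."""
    utilization = current_load / capacity
    if utilization >= CRIT_AT:
        page(f"CRITICAL: {utilization:.0%} of capacity, scale up now")
    elif utilization >= WARN_AT:
        page(f"WARNING: {utilization:.0%} of capacity, consider scaling up")
```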
I took a brief look at the code, and there don't appear to be any hard limits imposed by the code itself. It's possible I missed one, but I don't think that's the case. There may be implicit limits (e.g. a "storage bucket" can only hold X bytes, and trying to write more than that will error). As for future plans: the RWP server for links should be more scalable than the current one, so if/when we move over to it, you theoretically should be able to go higher (assuming the bottleneck is in the processing and not the storage), but we'd have to test and see at that point. As for plans to increase the limit in classic: not AFAIK.
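As a toy illustration of how an implicit limit like that turns into a hard-looking ceiling (both numbers below are invented, not the real values):

```python
# Toy example of an implicit limit: if a portal's serialized state must
# fit in a fixed-size record, the byte budget implies a link ceiling that
# nobody set on purpose. Both numbers are invented for the example.
RECORD_LIMIT_BYTES = 1_048_576   # hypothetical 1 MiB cap per portal record
BYTES_PER_LINK = 120             # hypothetical size of one serialized link

max_links = RECORD_LIMIT_BYTES // BYTES_PER_LINK
print(max_links)  # 8738 under these made-up numbers
```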
Probably something you've already thought of, @ofer2, but would it be possible to code a linkstar early-warning system triggered by a set of criteria like: over 8,400 keys exist for a portal AND that portal has over 500 inbound links AND over 100 links have been added in the last 10 minutes?
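Something like that rule would only be a few lines. A sketch, with hypothetical field names standing in for whatever the backend actually tracks:

```python
# Sketch of the suggested linkstar early-warning rule. The PortalStats
# fields are hypothetical stand-ins for real backend metrics.
from dataclasses import dataclass

@dataclass
class PortalStats:
    key_count: int           # keys known to exist for this portal
    inbound_links: int       # current inbound link count
    links_last_10_min: int   # links added in the trailing 10 minutes

def linkstar_warning(p: PortalStats) -> bool:
    """True when all three suggested criteria fire at once."""
    return (p.key_count > 8400
            and p.inbound_links > 500
            and p.links_last_10_min > 100)
```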
Twice this evening: right now, and two hours ago. Last time the lag lasted 10-15 minutes. What's going on? :)
So the game is down again ;(
We had a deploy this morning. Deploys generally cause outages for the game; sorry about that. The second event was a change to the caching servers that was not supposed to cause any issues, but that turned out not to be the case. The game should be up and working for everyone now, though. The total downtime was around 30 minutes.
Thank you for the explanation! What a breath of fresh air.