Weekly Update #76

The discussion is about a problem that has no conceivable source … I think it’s a bit premature to say it is unsolvable if we can’t even pin down what the source of these jitters is.

We know that the client has all the information it needs to perfectly calculate the position of a player that has not changed his course.
The most rudimentary extrapolation method is very simple:
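A minimal sketch of what I mean, with made-up field names:

```python
def extrapolate(snapshot, time_now):
    # Dead reckoning: assume the ship kept its last known velocity.
    dt = time_now - snapshot.time                # seconds since the snapshot
    return snapshot.position + snapshot.velocity * dt
```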

If the information provided by the snapshot is correct and accurate enough and the pilot did nothing in the meantime, it should be a near-perfect match.
Even if the pilot did maneuver, that maneuver shouldn’t influence his vector any differently whether he’s already moving 100 m/s or 100 km/s in the absolute frame.

It could be precision. The faster one moves, the bigger the velocity value gets. If there isn’t enough precision for velocity, it could produce floating-point errors.
From my test calculations, the usual floats should be more than precise enough though. Theoretically.
The direction also plays an important role … although if the positional snapping is mostly back and forth in the velocity direction it should be fine.
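To put rough numbers on that (my own quick check, assuming numpy):

```python
import numpy as np

# The gap to the next representable value (one ULP) grows with magnitude:
for v in [100.0, 100_000.0, 100_000_000.0]:
    print(v, np.spacing(np.float32(v)), np.spacing(np.float64(v)))
# float32: ~7.6e-06 at 100, ~0.0078 at 1e5, 8.0 at 1e8
# float64: still ~1.5e-08 even at 1e8
```

So 32-bit floats get coarse at large values, while doubles stay comfortably fine at these scales.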

I think in other games it’s the actual velocity precision that is the problem. They didn’t design those games for high velocities, so they may have decided to use lower precision in order to save bandwidth. Just speculation though. We don’t know what causes stutter in other games …

One thing it could also be is the interpolation/extrapolation algorithm. There can be many situations it is not prepared for. What if the snapshot spacing changes? So it sometimes compares snapshots that are 20 ms apart and sometimes snapshots that are 30 ms apart.
Maybe floating-point errors are amplified through the algorithm?

Is there any place where information is lost? Are floating-point numbers reduced in precision before they are sent over the network?

Are velocity and direction “rasterized”? So that if I speed up my ship, it isn’t using all the available floating-point precision but “steps” from one allowed value to the next?
I have the feeling this is quite important.

Ninja’d a little … but eh.
I personally also seem to have issues with floating-point calculations in my current project … it’s also driving me nuts.

It kind of depends on the test he’s performing, but I’m tending to lean towards it being a projection calculation precision error.

If he flies along a parallel vector to the chase target at a constant velocity, then if all the calculations were done relative to each other, there wouldn’t be any difference between the calculation at one point in time and the next.

If they use absolute locations to perform a projection or derive relative values from the absolute locations, then those calculations will be numerically different every time, but should produce the same calculated result…unless you have a precision problem. This would also apply if they are using velocity to predict the next location.

This gets magnified if you end up using an angle calculation to project that ship onto your screen or predict its location in space in front of you. If that last significant digit keeps changing, then so does the position you end up placing the ship at on screen. If you then try to forward-predict based on current velocity, it gets magnified even more.

This is more critical in the chasing scenario where that angle is very small…essentially getting closer and closer to zero. That or the jitter is just plain more noticeable because the difference should be zero, but instead it just keeps dancing around because of that precision error. i.e. it’s probably jittering all the time, but you just don’t notice it until the position obviously shouldn’t change.

Kind of reminds me of the fun I had with Python when I first tried it. I’d take an integer, divide it evenly by another integer, and then multiply it by that same integer. Instead of getting the original integer, I’d get a float that was really close. :boggle:
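The classic demonstration (not necessarily the exact numbers I ran into back then):

```python
>>> (1 / 49) * 49
0.9999999999999999
```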

1 Like

Using relative positions/velocities as suggested by @JB47394 and seconded by @Crayfish is clearly the theoretically ‘correct’ solution here. I’m really surprised to discover you aren’t already doing so! I guess I’d assumed the ICP’s ‘bubble’ system had been transplanted to I:BS.

Having read about this problem over the past 4 (more?) weekly updates, I’m wondering whether, from a project management perspective, it wouldn’t be more efficient to step back and implement a proper relative-reference-frame solution before moving forward. Given how fundamental such a thing is, fixing it later, once more layers of systems have been built on top of the duct tape, would be daunting.

Anyhow, one way or another, if @INovaeFlavien is serious about brainstorming a solution, count me in.

Are you testing a connection between two machines? I tried using Clumsy recently and couldn’t get it to affect ARMA when running multiple clients locally. Clumsy’s author warns quite a bit about loopbacks, and I’m assuming that’s what is making things difficult for me.

It makes no difference whether velocities and positions are relative or absolute as long as you have the precision. Transferring something like (Island-A + position [0.1]) or (abs-position [0.00001]) has exactly the same effect. Also, it doesn’t matter if the ships look stationary relative to each other; you still have the forward velocity, which can be a few km/s, meaning the position is changing regardless of whether the ships appear stationary to each other. An island just creates an offset for the position.

Even if they went with something extreme like making every single ship its own island and transferred each ship’s position relative to every other ship, it solves nothing if the absolute position can be given at the same precision.

As for the current problem of jitter with no lag whatsoever, it’s starting to smell like a code error or some technical oversight.

1 Like

You would communicate relative vectors (position, speed, direction) over the network instead of absolute ones. That’s the difference. As @thelazyjaguar noted, it should not make a difference in the end result. Thanks for that link, by the way. Great read.
I like the island/relative idea but there’s no point in implementing it if you don’t even know what for.
Here’s more of that. In there is also a link to the first occurrence on the new forums of the discussion between JB and Flavien in the E:D thread:


But back then I forgot to think about whether and why there was a problem at all. I just assumed there was one!

We theorize that it would shift this problem to the islands, and it might, but without actually knowing what the cause of these jitters is, it doesn’t make a lot of sense to build a workaround for them. And @cybercritic might even be right that it wouldn’t change anything at all if you just translate it back to absolute at some point …


How would one go about debugging floating-point errors? Especially ones that might accumulate? You can determine when they get larger and smaller. But how do you use that information to track down the places in the code/calculation where errors appear and might accumulate (the most)?

I don’t know how complex the prediction algorithm is but an insight into that would be helpful. Or maybe at least some sample outputs from different inputs. Although that wouldn’t tell us more than before.

No it’s not. Besides, the code uses double precision everywhere. The randomness doesn’t come from the positions but from time itself, which is getting sped up to catch up with the latest snapshot, or slowed down when snapshots are missing.

How do you initialize “Time_Now”?

Also, you can’t extrapolate all the time, only when snapshots are missing. Otherwise you should interpolate in the past along the known path. If you extrapolate all the time, it’ll render at the wrong position and you’ll even see ships passing through walls during collisions.

3 Likes

[quote=“NavyFish, post:34, topic:5747”]
Using relative positions/velocities as suggested by @JB47394 and seconded by @Crayfish is clearly the theoretically ‘correct’ solution here.[/quote]

The physics engine works in absolute coordinates though so at some point you have to add back your relative + absolute values into something you can feed to the physics engine.
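To illustrate with made-up numbers (this only bites if any stage of the pipeline drops to single precision):

```python
import numpy as np

island = np.float32(100_000_000.0)   # big absolute island position
offset = np.float32(0.1)             # precise bubble-relative offset
absolute = island + offset           # translate back to absolute...
print(absolute - island)             # 0.0 -- the offset vanished
```

The relative value was exact; converting it back to a large absolute coordinate is what threw it away.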

No, it’s all localhost. Try using the “localhost ipv4 all” preset. Or filtering: “outbound and ip.DstAddr >= 127.0.0.1 and ip.DstAddr <= 127.255.255.255”

That’s what I use, and it works even in other games ( tested in Overwatch ).

2 Likes

Even for the time? Shouldn’t matter there though.
Precision should be good, but there will still be rounding errors that might accumulate depending on the calculations made. As @thelazyjaguar mentioned, sometimes higher-precision floats have rounding errors that lower ones don’t.

Yeah, maybe a bit too rudimentary. You would need to take the time at which you intend to apply/hand these vectors to the physics code. So instead of “Time_Now” it would maybe be “Time_NextLocalTic”, with some safety measures in there to check whether there’s still enough time to calculate all of it before the next physics tick occurs. If there isn’t enough, it needs to skip one tick.

Edit: After doing that, you can now interpolate between the point you have and the one from the “future”. But yeah, I forgot to handle what happens when the actual data comes in from the network. Having the game run in the past is much easier …

So you make the local physics engine run a little in the past instead of letting it try to “simulate FTL communications”. That’s alright. How far in the past? Does that vary? Is it fixed, like 50 ms / 5 ticks … etc.?
What happens if the code needs to extrapolate a certain tick and that tick is then received through the network? Probably it’s treated like interpolation, where it’s just “time between ticks”. But with interpolation the endpoint will not change, whereas with extrapolation it might. That needs to be considered.
By the sound of it, extrapolation shouldn’t happen in the localhost test … or does it, maybe?

Ingame time? Why is the speed of time varied? If there is no data, there is no data. What’s the point of making the data you have last longer?
If the client experiences big amounts of jitter, or the quality of the connection changes during the game, the client can choose to run further in the past to have more leeway.
But these changes to the distance from “Now” shouldn’t happen constantly. Maybe a few dozen times a match.
If the switch is made instantly, it will result in a “jolt”;
if it is done with a ramp, it may create distortions (also note that if floats are used for the time, it may introduce rounding errors). I don’t know what influence a changing speed of time might have on the interpolation and extrapolation algorithms … it might be the source of the problem.

Yup, that’s exactly the root of the problem. It has nothing to do with the precision of the data, or with networking itself. Variable processing times ( aka: a slight slowdown in when you process the snapshots ) can cause it. So this happens all the time.

Ideally, you’d specify a constant time offset, let’s say 50 ms in the past, so that you have enough snapshots to interpolate.

If the snapshots get delayed, at some point you’re not going to receive the next snapshot in time. That’s where you do extrapolation. But then, when you eventually receive your missing snapshot, you might be receiving the 2 or 3 next snapshots that got held up at the same time. At that point you’re too far behind the “latest snapshot” server time, so you need to catch up. If you do it instantly, you’d see the position getting teleported. So what I’ve been doing is to slowly interpolate the client time to match up with the server’s time ( minus the artificial offset of, say, 50 ms ). That works, but results in the stuttering problem.
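Schematically, the interpolation side looks like this (a simplified sketch, not the actual engine code):

```python
RENDER_DELAY = 0.050  # the artificial offset: render 50 ms in the past

def sample_position(snapshots, client_time):
    # snapshots: list sorted by time, each with .time, .pos, .vel
    t = client_time - RENDER_DELAY
    older = [s for s in snapshots if s.time <= t]
    newer = [s for s in snapshots if s.time > t]
    if older and newer:                            # normal case
        a, b = older[-1], newer[0]
        alpha = (t - a.time) / (b.time - a.time)
        return a.pos + (b.pos - a.pos) * alpha     # interpolate
    if older:                                      # next snapshot is late
        a = older[-1]
        return a.pos + a.vel * (t - a.time)        # extrapolate
    return None                                    # nothing usable yet
```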

The effect is essentially what happens with cars on a highway. When one car at the front slows down and then accelerates again, it creates a bottleneck kilometers behind.

7 Likes

This stuttering is basically the ship/server time slowing down and speeding up when the snapshots catch up again? I would have thought that when an extrapolation happens, the ship keeps its velocity/acceleration and just continues at the same (increasing/decreasing) speed. In my imagination, the only difference between the extrapolation and the snapshots that are catching up would be if the ship made any changes to its speed or direction during these missed/delayed snapshots.
So a visible correction would only occur then, and not with every missed/delayed snapshot.
If I were to receive a timestamped message exactly every minute to move a figure one inch forward, and at the 30th minute I don’t receive the message at the exact minute, I would assume that the message is delayed and continue to move the figure. Say after 20 seconds I receive the delayed message and check whether my anticipated move was correct or whether the sender made any changes in movement in the new message. I would then move the figure to its correct position, or it stays where I moved it because my anticipated move was correct. After 40 seconds the next message is received and I move the figure another inch.

How I understand it now is that I wait for the message and move the figure one inch. This happens again until the 30th minute: the message is delayed, so I don’t move the figure; after 20 seconds the delayed message arrives and I move the figure one inch; after 40 seconds the next message is received in time and I move the figure another inch. This would then result in a slow-down and speed-up in movement.

Could this stutter be resolved if the position is extrapolated every time a snapshot is delayed, and not just when a snapshot is missing? The delayed snapshot would only be used to check whether the extrapolated position was correct and, if not, to correct the ship’s position accordingly; roughly like the sketch below.
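Something like this, with made-up names and a fixed tick for simplicity:

```python
TICK = 0.050  # expected snapshot interval in seconds (hypothetical)

def on_tick(state, snapshot):
    # snapshot is None when it is delayed or missing this tick
    if snapshot is None:
        # Keep moving the figure/ship on its last known course.
        state.pos = state.pos + state.vel * TICK
    else:
        # Authoritative data arrived: adopt it. If the prediction was
        # right, this is a no-op; otherwise it is the correction.
        state.pos = snapshot.pos
        state.vel = snapshot.vel
```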

1 Like

After thinking about this for a while, I think I understand. The only thing I don’t see is how the client can fall behind. I can see how it can miss snapshots, but I don’t see how the client’s clock can be slower than the server’s.

If the client receives 3 snapshots, he can discard the ones that are behind his “playback” time (the 50 ms, for instance).

On top of that, extrapolation should generate local “snaps” when ones from the server are missing. These are never replaced, even if they have a timestamp ahead of the current “playback” time, because replacing them could create hard snapping. An incoming snap will only be used if no local one has been generated.
The client should then adjust his “playback” time to further avoid extrapolation. As said, this shouldn’t be too aggressive due to possible side effects.
You could also use extrapolation to lower bandwidth and server load by only sending every other snapshot to the client. The extrapolation would then fill in the gaps. (A sketch of these acceptance rules follows below.)
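A sketch of those acceptance rules (all names made up):

```python
def accept_snapshot(store, snap, time_playback):
    # Rule 1: too old to ever be played back -> discard.
    if snap.time < time_playback:
        return
    # Rule 2: a locally extrapolated snap already fills this slot;
    # keep it, to avoid hard position snapping.
    if snap.number in store.local_snaps:
        return
    store.snapshots[snap.number] = snap
```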

The client clock should never get out of sync with the server clock. It doesn’t need to be all that precise either; I think precision down to one ms is enough, as you probably wouldn’t want to run any algorithm more often than that. What it should not do is drift.

Brainstorming:
So we have the server that collects user info, combines it into snapshots, and numbers them. (And of course other stuff.)
At login the clocks are synchronized with the client … or not; I don’t know how clock synchronization works over the net. That time we call “now time”.

The server itself needs time to receive data, process it and pack it into snaps. So the server also needs a buffer/leeway to do that. Depending on server power and calculation load, this is set to something. We call that “truth time”. In order to do this, the server gives himself some time by declaring that the snapshot he is creating now is valid at some point in the future.
For instance: “time_truth” = “time_now” + 5 ms
As soon as he has calculated it, he sends it to the clients, possibly even before the truth time has arrived.
There’s also a packet interval. Let’s say it is: “snap_interval” = 10 ms.

The clients also need time to receive and process the snapshots. So in order to do that, they play back the snapshots from the past. We call this “playback time”.
For instance: “time_playback” = “time_now” − 50 ms.
The client receives snapshots and adds them to its bank of snapshots if they’re not older than its “time_playback” and not older than the next snapshot the playback is moving towards.

Example:
“time_now”= 1503575432068
Server Prepares snapshot Nr. 5 with “time_truth”=1503575432073
Server Sends packet at “time_now”=1503575432071

Client receives packet at “time_now”=1503575432090
Clients “time_playback”=1503575432040
Client has snapshot Nr.1 with “time_truth”=1503575432033
Client has “extrapolated” snapshot Nr.2 with “time_truth”=1503575432043
Client has snapshot Nr.3 with “time_truth”=1503575432053
Client has snapshot Nr.4 with “time_truth”=1503575432063
Client checks whether the truth time of snapshot Nr. 5 is older than that of the snapshot it is currently traveling to, Nr. 2.
(Edit: Actually just checking if 5 > 2 is doing the same thing.)
Client adds snapshot Nr. 5 with “time_truth”=1503575432073
Client continues playback and sends own snapshot versions to the server (shots fired, hits landed … etc)

Snapshots could be set up so they carry as much or as little information as needed, with the rest being assembled/extrapolated by the server/client.

I work with realtime computers by the way. :grin:

2 Likes

You’d be lucky to get 1ms accuracy. Between 5ms and 10ms is more likely.
With NTP 90% of clients have offsets less than 10ms. (ref)

You typoed:

1 Like

The Network Time Protocol (NTP) is considered state of the art.

It’s a pretty simple concept. Note the time (T0) and send a message to the server. The server sends its current time back (TS). The client receives the message and notes the time again (TC). Do some math to determine the difference between the clocks, assuming that the trip times client→server and server→client are the same.

clock difference = (TC - TS) - ((TC - T0) / 2)

Edit: While I suspect that the above is similar to NTP’s algorithm, I didn’t mean to suggest that’s the way they do it. It’s just what occurs to me as a way to do the synchronization.
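In code the idea looks like this (a sketch; query_server_time stands in for whatever request you’d actually make):

```python
import time

def estimate_clock_offset(query_server_time):
    # One sample of (client clock - server clock), NTP-style.
    t0 = time.time()             # client time when the request is sent
    ts = query_server_time()     # server's clock reading
    tc = time.time()             # client time when the reply arrives
    # Assume the request and the reply took equally long:
    return (tc - ts) - (tc - t0) / 2
```

In practice you’d take several samples and trust the ones with the smallest round trip.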

Thanks. I’ll give it another try - though I think those are the defaults :slight_smile:

1 Like

Okay, I think I understand the problem you are getting at now, but I’m still confused about some of the specifics.

Why are you changing time itself? The snapshot is either there at the time of rendering, or it isn’t. You extrapolate if it isn’t. You then receive the correct data after rendering, but instead of simply applying it, you do some interpolation corrections over time to bring it to the correct location without causing teleporting. Are you trying to vary the delta-T value in order to minimize extrapolated distance errors, in the hope of eventually getting the authoritative data and then smoothing the correct data back into the game? i.e. dD = v * dt: smaller change in time = smaller distance traveled = smaller corrective adjustments.

Are you interpolating the time at which the snapshots are played or are you interpolating the positional data between each snapshot?

If this is the case, wouldn’t you end up needing to interpolate all snapshots? Also, wouldn’t this be smoothed out within the “ideally” 50ms buffer?

Anyway, thanks for taking the time to answer Flavien. Just trying to understand the problem and I would love to be able to help you out.

3 Likes

I’m sorry, but I must keep this reply brief.

As Flavien said earlier, the problem - whatever its cause - is evident only at high absolute velocities, i.e. 5 km/s.

Therefore, reducing these velocities to near-zero - by transmitting ‘bubble’-relative velocities for nearby ships - will solve the problem.

This is a design challenge, no doubt, but I think you would agree that such a system would resolve this issue.

Likely you’ll need to decouple client-side physics and interpolation from the server (which is already the case anyhow). Instead of receiving absolute values, clients within a bubble will receive bubble-relative coordinates/velocities for everything else (or you could have multiple physics sub-spaces, whatever… depends on the flexibility required). The server still operates exclusively with absolute positions, but tailors each client’s network physics update to respect the relativity bubbles. This is analogous to a moving-origin technique, but it also considers the ship’s velocity ‘origin’ (i.e. the velocity of the local reference frame / bubble).

Thus the physical values used by the client engine will be near zero for nearby ships (because they represent relative velocities etc.), and so jitter or precision or interpolation/extrapolation artifacts will still happen, but they won’t be visible because they’ll be relegated back to the millimeter scale of error.
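A sketch of the idea with made-up names (the server keeps absolute state; only the per-client wire format changes):

```python
def encode_for_client(entity, bubble):
    # Send bubble-relative values, so nearby ships have near-zero
    # positions and velocities on the wire.
    return (entity.pos - bubble.origin_pos,
            entity.vel - bubble.origin_vel)

def decode_on_client(rel_pos, rel_vel, bubble):
    # The client can work in bubble space directly, or translate back
    # to absolute right before feeding its physics engine.
    return (bubble.origin_pos + rel_pos,
            bubble.origin_vel + rel_vel)
```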

Happy to discuss further, elsewhere if desired.

Navy

1 Like

Symptom.
So you suggest we should just stop looking for the cause?

I find it quite irresponsible to start a whole redesign of some major systems without justifying it with some proof of why it is needed. If it turns out there’s a limitation - one we currently can’t think of or find any source for - in how the internet or high-precision arithmetic works, then sure, a redesign of the basis can be done to hide that limitation. As a side effect it would also allow transmitting dfloats only for bubbles instead of for every single entity, which could reduce load on both network and computation.

Flavien already tried out such a system. JB implemented it in his warp prototype from the beginning. We discussed it. It’s cool and all. But we didn’t compare their pros and cons. We just claimed it would solve a symptom that appears in games, one that people, including me, simply assumed was some inherent limitation that isn’t solvable with traditional coordinate systems. This doesn’t appear to be the case.

[quote=“thelazyjaguar, post:48, topic:5747”]
Why are you changing time itself? The snapshot is either there at the time of rendering, or it isn’t. You extrapolate if it isn’t.[/quote]

Alright, try to imagine this scenario: 20 snapshots per second = you receive a snapshot every 50 ms. You introduce a delay of 100 ms to ensure you have enough interpolation data, i.e. you’re 2 snapshots behind for rendering. The latency to the server is 100 ms.

Everything is stable, you receive a bunch of snapshots every 50 ms, rendering is 100 ms behind, no problem. Then suddenly, the latency to the server changes. To a whole second. Yeah, it’s extreme, but it’s for the sake of the example, bear with me. And worse… it stays there, forever.

On the client, there’s a 900 ms gap during which you extrapolate. After that you start receiving snapshots from the server again. They’re obviously all outdated, so you reject them. In perfect conditions it’d just be a lag spike and latency would drop back to 100 ms, which means you’d receive 18 snapshots at once, and the latest one would end up being synchronized again. But that’s not what happens here. You’re now in hell: you’re extrapolating all the time, despite still receiving 20 snapshots per second from the server. Because the latency changed durably.

So there’s a need to re-synchronize the client time to be 100 ms behind the latest snapshot. If you change that time instantly, you see actors get teleported, which is no good. So instead you keep track of the “current” time on the interpolation curve and the “desired” time, which is 100 ms behind the latest snapshot. And you do a smooth interpolation of that time, aka speeding up or slowing down time itself. That way, when latency changes, interpolation adapts smoothly to the varying conditions. But you get stuttering…
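In pseudocode, that smoothing boils down to something like this (simplified, not the actual code):

```python
OFFSET = 0.100  # stay 100 ms behind the latest snapshot
BLEND  = 0.05   # how aggressively "current" chases "desired" per update

def advance_interpolation_time(state, latest_snapshot_time, dt):
    state.current_time += dt                  # normal flow of time
    desired = latest_snapshot_time - OFFSET   # where we ought to be
    # Ease towards the desired time instead of jumping: time itself
    # runs slightly faster or slower until the two converge.
    state.current_time += (desired - state.current_time) * BLEND
```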

TLDR


Let’s try to find out why.

So the client realizes that the last couple of snapshots he received are way behind his playback time.

I assume this desired time is taken once, before the time readjustment is started. If it’s not, and the timestamp of whatever happens to be the newest snapshot is used instead, we have a situation where you try to interpolate towards a target that constantly changes, making your playback time stutter.
If you interpolate between two points, you keep track of your progress from point A to point B. If point B changes during the interpolation, you will get a different result for the same progress you had before. That’s basic, but I just wanted to point it out.

So the client realises what’s going on and calculates a new target playback time based on the last few snapshots it got. The arrival time averages out at about “time_now” − 1000 ms. So it chooses “time_playback_new” = “time_now” − 1000 ms − 100 ms.
Now it starts to interpolate its “time_playback” from “time_playback_old” to “time_playback_new”.

What happens in this timeframe? Each cycle, the “time_playback” changes. So every calculation that takes that time into account and does stuff for more than one cycle is affected.

Position interpolation, for instance. Position interpolation probably uses “time_playback” for its “progress”. Something like this (the most basic interpolation, just an example; names made up):
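```python
# playback progress between the two snapshots, 0.0 .. 1.0
progress = (time_playback - snap_a.time) / (snap_b.time - snap_a.time)
position = snap_a.pos + (snap_b.pos - snap_a.pos) * progress
```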

When “time_playback” is being interpolated, it may move faster, slower, or even backwards, compared to normal operation where it moves exactly like time does: one second per second.
This depends on how large the lag spike is and how aggressively the “time_playback” interpolation is configured. I’ll keep it simple here and just assume linear interpolation, but you can imagine how other interpolation curves would influence it: it’s basically the same, except that the speed is faster in some sections of the interpolation than in others.

So in some situations positional interpolation is done faster: when the network connection gets better.
In some cases positional interpolation just stops: when the time interpolation moves backwards in time at exactly one second per second.
In some cases positional interpolation moves backwards: when the time interpolation moves backwards in time at more than one second per second.

So you may even see ships moving backwards. :grin: https://i.ytimg.com/vi/n5I4oqFNxvM/maxresdefault.jpg

If you constantly try to adjust “time_playback” to be exactly 100 ms behind the time you receive snapshots, you will probably see exactly what you are seeing, with ships stuttering around. Why?
Because packets will never arrive at exactly the same intervals. Even on localhost. Network drivers and the operating system aren’t set up for that kind of consistency.
Companies trying to use TCP/UDP for communication that depends on consistency (motor control, safety) have to use real-time network drivers run in a precisely crafted environment (read: not Windows, Linux or similar).
But you don’t need that consistency. That’s what the 100 ms buffer and the “time_truth” are there for.


I would suggest checking whether interpolation targets are moved while an interpolation is going on, or whether the “playback_time” interpolation simply happens too often and/or with extreme settings/speed.
The algorithm that decides whether it’s needed should be quite picky.

Both could result in stutters like you and Keith described.

Edit: Now, do these stutters increase with velocity? I think they do. The faster an entity gets, the more difference there is in its position between two successive snapshots, making the distance over which it interpolates larger. Snapshot frequency, on the other hand, stays the same. The larger the interpolation distance, the more it amplifies either of the two theoretical sources of the stutter.
So this theory lines up with observation.
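Rough numbers to illustrate (my own, made up): a 2 ms wobble in “time_playback” displaces a ship by roughly v · Δt, i.e. 0.2 m at 100 m/s but 10 m at 5 km/s. The same clock noise produces fifty times the visible jitter.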

It kind of acts like a regulatory circuit that hasn’t been configured correctly and is oversensitive.

@navyfish, the high velocity doesn’t matter if the precision is sufficient in the absolute method; also, he’s interpolating position, and the position changes at the same rate even in a relative scenario. As I said, [some relative island] + [offset] and [absolute value] are exactly the same thing if the precision in the absolute method is sufficient.

/edit
@inovaeflavien I still don’t understand why there is stutter; you should be constantly interpolating the positions, and even if you extrapolate at some point, you should interpolate from that point. In essence you should be smoothing all the time, and there won’t be any stutter, because even if you do get it wrong, you keep interpolating. The presence of stutter indicates to me that at some point you are reverting to some position with a hard reset. Don’t do that; always interpolate.
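In other words, something like this when applying authoritative data (a sketch, names made up):

```python
def apply_server_snapshot(state, snap, blend=0.1):
    # Never hard-reset: fold the server's position in gradually,
    # starting from wherever we currently are, even if that is an
    # extrapolated guess.
    state.pos = state.pos + (snap.pos - state.pos) * blend
    state.vel = snap.vel
```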

1 Like