Authors: Alice, Bob, Carol, Jan Beznazwy, Amir Houmansadr
Presenter: David Fifield
This is the talk for our IMC'20 paper How China Detects and Blocks Shadowsocks.
This is “How China Detects and Blocks Shadowsocks”, by GFW Report, Jan Beznazwy and Amir Houmansadr. I’m David Fifield and I’m presenting this work on behalf of the authors, most of whom are anonymous. I have experience researching in this field and the authors have acquainted me thoroughly with this work.
The grand summary of this research is that the Great Firewall of China detects and blocks Shadowsocks using a combination of passive traffic analysis and active probing. And let’s talk about what those terms mean.
Shadowsocks is an encrypted proxy protocol, and it’s designed to be difficult to detect. It’s really popular in China as a means of censorship circumvention, a way of getting around the Great Firewall. And the Great Firewall, for its part, as part of its general mission of information control, tries to find and block all types of proxy servers, Shadowsocks included. And in fact, since about May 2019, there have been anecdotal reports of people’s Shadowsocks servers being blocked from China, sometimes during politically sensitive times, but without a good explanation. This research helps provide an explanation for how this has been happening.
Now in Shadowsocks, the connection between the client and the server is encrypted, and furthermore it’s encrypted in a way that reveals only ciphertext to an observer. So unlike TLS, for example, which has plaintext framing bytes, there’s nothing like that in Shadowsocks. If you flatten out a Shadowsocks stream, it looks like just a sequence of uniformly random bytes, and that’s by design. This quality means that it’s not possible to, for example, write a simple regular expression that will match all Shadowsocks traffic; you have to work a little harder than that. Now if you’re thinking that this randomness, this lack of a fingerprint, is itself a kind of fingerprint, you’re absolutely right. And in fact this research shows that the Great Firewall uses the entropy and the length of packets in a TCP stream as part of its first step in classifying Shadowsocks traffic.
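To make the "randomness as a fingerprint" idea concrete, here is a minimal sketch of one way to measure how random a payload looks: the empirical Shannon entropy in bits per byte. The talk does not say which entropy measure the Great Firewall uses, so the function name and the choice of measure are assumptions for illustration only.

```python
import math
import os
from collections import Counter


def shannon_entropy(payload: bytes) -> float:
    """Return the empirical Shannon entropy of payload, in bits per byte."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)  # how often each byte value occurs
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Plaintext protocols reuse a small set of byte values; ciphertext does not.
    print(shannon_entropy(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"))
    print(shannon_entropy(os.urandom(400)))
```

A payload that uses all 256 byte values equally often scores 8 bits per byte, and a payload that repeats a single byte scores 0. Short random payloads score below 8 simply because they cannot contain every byte value.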
Now what do I mean by active probing? This research shows that the Great Firewall discovers Shadowsocks servers in a two-step process: the first step is passive and the second step is active. In the first step, it looks for possible or potential Shadowsocks connections; and in the second step, it connects to the servers involved in those connections from its own IP addresses as if it were a Shadowsocks client and watches how the server responds. You can think of step 1 as guess and step 2 as confirm.
Now you can understand this process of active probing as a way of increasing precision or reducing cost in network classification. If you were to write a purely passive classifier for Shadowsocks, it may yield unacceptably high false positives. On the other hand, if you were to try to actively probe every single connection that passes through the firewall, that may be more probes than you can manage to send. So you can think of step one as being a sort of pre-filter for step two.
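The guess-then-confirm structure can be sketched as a two-stage pipeline. This is only an illustration of the idea of a cheap pre-filter feeding an expensive confirmation step; the names `two_step_detect`, `looks_suspicious` and `confirm_by_probe` are hypothetical, and nothing here describes the firewall’s actual implementation.

```python
from typing import Callable, Iterable, Set, Tuple

# A flow is (server_ip, server_port, first_payload).
Flow = Tuple[str, int, bytes]


def two_step_detect(
    flows: Iterable[Flow],
    looks_suspicious: Callable[[bytes], bool],
    confirm_by_probe: Callable[[str, int], bool],
) -> Set[Tuple[str, int]]:
    """Return the servers that pass both the passive and the active step."""
    # Step 1 (passive, cheap): keep only servers seen in suspicious flows.
    suspects = {
        (ip, port) for ip, port, payload in flows if looks_suspicious(payload)
    }
    # Step 2 (active, expensive): probe each suspect once and keep confirmations.
    return {server for server in suspects if confirm_by_probe(*server)}
```

Because the probe function is only called for servers that survive step 1, the number of probes is bounded by the number of suspects rather than by the number of connections.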
Now this is certainly not the first time that active probing has been documented in use in China against censorship circumvention protocols. There is research going back all the way to 2011 showing it being used against Tor and against various VPN protocols. But the detection of Shadowsocks now reaches new heights of sophistication.
How do we know all this? Well, the authors investigated it in the way you might expect. They ran an experiment: they set up their own Shadowsocks servers outside of China, they set up their own Shadowsocks clients inside China, and then they connected to their own servers through the Firewall and watched for what else connected to those same servers. They also set up some control servers and never connected to them, just to be able to distinguish the connection-triggered active probes from random internet scanning. And they ran this experiment for about four months. Now there are many, many implementations of Shadowsocks out there. For this experiment, the authors chose two of the most popular, which are called Shadowsocks-libev and Outline. These are two independent implementations of the same protocol.
The main observations of the four month server experiment are that active probers send a variety of probe types, some of them look like replay attacks and some of them do not. The ones that are replays may be stored and replayed after a surprisingly long delay. The ones that are not replays have a peculiar distribution of packet lengths. And active probes come from apparently thousands of different source IP addresses.
Let’s talk about the replay-based probes. First, these are copies of the authors’ own legitimate connections from their authenticated Shadowsocks clients. And specifically, they’re copies of the first data packets in an authenticated Shadowsocks connection. Sometimes the replay is identical, sometimes it has certain bytes changed, one or two or maybe a dozen bytes changed, but usually at fixed positions.
So what could be the intention behind sending replay probes? Well, potentially it’s exploiting a vulnerability in the Shadowsocks protocol. See, the protocol doesn’t specify what should happen when a server gets a replay of a previous properly authenticated client connection. Now if an implementation doesn’t do any sort of replay filtering, any prevention of replay attacks, what’s likely to happen is that it will do the exact same proxy request that it did earlier for the authenticated client, and send back to the active prober a big blob of ciphertext. Now the active prober won’t be able to decrypt that blob, because it doesn’t know the password for that Shadowsocks server. But the fact that it received a large amount of ciphertext back is a giveaway that the server is in fact Shadowsocks. And even in implementations that try to filter out or prevent replays, there are certain edge conditions, in how connections are closed, for example, that can be characteristic of Shadowsocks. And the fact that certain bytes are sometimes changed in these replay-based probes may be an attempt to evade implementations that have a replay filter.
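As a rough illustration of what a replay filter does, here is a minimal sketch that remembers the per-connection nonce (the salt or IV at the start of a connection) and rejects any connection that reuses one. The class name, the bounded-memory eviction policy and the default capacity are assumptions for illustration; this is not the code of Shadowsocks-libev or Outline.

```python
from collections import OrderedDict


class ReplayFilter:
    """Remember recently seen connection nonces and reject repeats."""

    def __init__(self, capacity: int = 100_000) -> None:
        self.capacity = capacity
        self._seen = OrderedDict()  # nonce -> None, oldest entry first

    def check_and_remember(self, nonce: bytes) -> bool:
        """Return True if nonce is new, False if it looks like a replay."""
        if nonce in self._seen:
            return False
        self._seen[nonce] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # forget the oldest nonce
        return True
```

A filter with bounded memory like this one eventually forgets old nonces, so a replay sent after a long enough delay would no longer be recognized. A filter that matches nonces exactly also says nothing about how the server should behave once it has rejected a connection, which is where the edge conditions mentioned above come in.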
Replay-based probes are convenient for analysis because it’s easy to match the active probe with the legitimate connection that it is a replay of. It makes it possible to, for example, measure how long the delay is between when a legitimate connection is sent and when replays based on that connection are sent. So take a look at this graph; this is a CDF. Because a probe may be replayed more than once, the darker line here only considers the first replay, and the paler line considers all replays. And as you can see, for first replays anyway, around 25 percent of replay probes come within one second, so almost immediately; but there is a surprisingly long tail, and some replay probes are sent after a delay of minutes, hours, even days.
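The delay measurement can be sketched as follows: match each probe payload to the legitimate connection it copies, subtract the timestamps, and read values off the empirical CDF. The function names and the input layout are hypothetical, and this sketch only matches byte-identical replays; probes with a few changed bytes would need a fuzzier match.

```python
def replay_delays(sent_at, probes):
    """Compute replay delays in seconds.

    sent_at: dict mapping a legitimate payload to the time it was sent.
    probes:  list of (time, payload) pairs received by the server.
    Returns (first_replay_delays, all_replay_delays), both sorted.
    """
    first = {}
    every = []
    for t, payload in sorted(probes, key=lambda p: p[0]):
        if payload not in sent_at:
            continue  # not an identical replay of anything we sent
        delay = t - sent_at[payload]
        every.append(delay)
        first.setdefault(payload, delay)  # keep only the earliest replay
    return sorted(first.values()), sorted(every)


def cdf_at(delays, x):
    """Fraction of delays that are less than or equal to x."""
    return sum(d <= x for d in delays) / len(delays) if delays else 0.0
```

With the real measurements, `cdf_at(first_replay_delays, 1.0)` would give the fraction of first replays that arrive within one second.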
Now the non-replay probes: these had a payload that was to all appearances random, but didn’t match any prior legitimate connection. And you’ll notice there’s a very strange distribution of packet lengths: looking at the ones of length below 50, you’ll see that they’re roughly uniformly distributed in what I’ll call triplets centered on lengths 8, 12, 16, 22, 33, 41, and 49. So the triplet at 8, for example, represents a length of 7, a length of 8, and a length of 9, all being about equally likely to be sent. Besides those (notice the different scales here), the great majority of the non-replay probes had length exactly 221 bytes. This is an interesting and thought-provoking distribution of packet lengths.
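To spell out the triplets, this small sketch expands each center into the three lengths it stands for. The centers and the 221-byte length are the ones named in the talk; the constant and function names are made up for illustration.

```python
TRIPLET_CENTERS = (8, 12, 16, 22, 33, 41, 49)  # centers named in the talk
DOMINANT_LENGTH = 221  # length of the great majority of non-replay probes


def short_probe_lengths(centers=TRIPLET_CENTERS):
    """Expand each triplet center c into the lengths c - 1, c and c + 1."""
    return sorted(n for c in centers for n in (c - 1, c, c + 1))


if __name__ == "__main__":
    print(short_probe_lengths())
```

That gives 21 distinct short lengths, from 7 up to 50, plus the single long length of 221.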
The authors think they have at least a partial explanation for why active probers send probes of these lengths. You see, when you send random unauthenticated data to a Shadowsocks server, the server may react differently depending on how much data you send it. So if you send too little data, the server is going to wait to receive the rest of the data that it’s expecting, and eventually time out. But if you send beyond that threshold, the server will attempt to authenticate the data that it’s received, be unable to authenticate it, and close the connection.
Now I won’t get too far into the details here, but you can configure Shadowsocks with a variety of different ciphers and initialization vectors of different lengths, and things like that.
But you’ll notice in this table that many of those triplets straddle what I’ll call byte thresholds, between where the server times out and where it closes the connection with a RST or otherwise. So looking at the first row here, if you send a server so configured a packet of seven bytes or eight bytes, it’s going to time out, but if you send it nine bytes, you’ll get an immediate RST.
So that’s a distinguishable difference in how the server reacts. This analysis doesn’t fully explain the triplet distribution, because, for example, the triplet at 32, 33, 34, and the one at 40, 41, 42, don’t match up with any byte thresholds and neither does the 221.
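The threshold behavior can be modeled in a few lines. This is a simplified sketch: it collapses every kind of connection close into "RST", and the only threshold value taken from the talk is 8 (seven or eight bytes time out, nine bytes get a RST). The function names are hypothetical.

```python
def reaction(probe_len: int, threshold: int) -> str:
    """Simplified server reaction to a random probe of probe_len bytes."""
    # At or below the threshold the server keeps waiting for more data;
    # above it, authentication fails and the connection is closed.
    return "TIMEOUT" if probe_len <= threshold else "RST"


def straddles(center: int, threshold: int) -> bool:
    """True if the triplet (center - 1, center, center + 1) sees both reactions."""
    lengths = (center - 1, center, center + 1)
    return len({reaction(n, threshold) for n in lengths}) == 2


if __name__ == "__main__":
    for n in (7, 8, 9):
        print(n, reaction(n, threshold=8))
```

A prober that sends all three lengths of a triplet and sees the reaction change partway through has learned where the threshold is, which in turn hints at how the server is configured.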
Alright, moving on to the origin of the probers. Over those four months, the authors’ Shadowsocks servers received over 50,000 active probes, and those came from over 12,000 different IP addresses, which all geolocate to China. So a consequence of this observation is that it’s not possible to simply enumerate all the active prober IP addresses and ban them from your server. It also isn’t surprising, because prior research studying active probing has also found large numbers of IP addresses being used to send active probes.
Now comparing the 12,000 IP addresses in this work, with previously compiled lists of prober IP addresses, there is not much overlap although there is some; however this is not really that surprising, because past research has found that there is a lot of churn in the IP addresses used for active probing over time.
Now despite the fact that there seem to be these thousands and thousands of different active probers, it’s likely that they are all centrally managed by a small number of processes; and the evidence for that comes from a TCP-layer side channel, namely the TCP timestamp. So the TCP timestamp is a 32-bit counter that increases at a fixed rate, and it’s attached to every outgoing TCP segment. Different computers will generally not have synchronized TCP timestamp sequences, because the counter is usually relative to when the computer was last rebooted and the counter was reset to zero or initialized to a random value. So this graph shows the TCP timestamp sequences over time of a few thousand active prober IP addresses in one sub-experiment. And you can see that even though they come from many different IP addresses, they fall into a small number of distinct TCP timestamp sequences, and these sequences increase at typical rates, so 250 Hz or 1,000 Hz. That 1,000 Hz line goes through a cluster of about 20 data points that are very closely spaced, but within that space they’re much more like 1,000 Hz than 250 Hz. So this TCP timestamp observation is consistent with prior work, as are most of the other network-layer fingerprints that you might think to look at.
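The timestamp analysis can be sketched like this: estimate the tick rate from two observations, then project every observation back to time zero along a line of that slope, so that probes driven by the same counter land on the same offset. The function names, the tolerance and the greedy grouping are assumptions for illustration, not the authors’ analysis code.

```python
WRAP = 2 ** 32  # TCP timestamps are 32-bit counters and wrap around


def tsval_rate(t0, ts0, t1, ts1):
    """Estimate ticks per second from two (time, TSval) observations."""
    return ((ts1 - ts0) % WRAP) / (t1 - t0)


def sequence_offset(t, tsval, rate_hz):
    """Project an observation back to t = 0 along a line of slope rate_hz."""
    return (tsval - round(t * rate_hz)) % WRAP


def group_by_sequence(observations, rate_hz, tolerance=1000):
    """Group (ip, time, tsval) observations whose offsets agree within tolerance."""
    groups = {}  # representative offset -> set of IP addresses
    for ip, t, tsval in observations:
        offset = sequence_offset(t, tsval, rate_hz)
        for key in groups:
            distance = min((offset - key) % WRAP, (key - offset) % WRAP)
            if distance <= tolerance:
                groups[key].add(ip)
                break
        else:
            groups[offset] = {ip}  # start a new sequence
    return list(groups.values())
```

If thousands of source addresses collapse into a handful of groups, that suggests a handful of counters, and so a handful of processes, behind them.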
The exception is TCP source port numbers. Prior work has found a roughly uniform distribution of source port numbers, whereas in this work the authors found a marked bias towards the default ephemeral port range used by Linux.
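One simple way to quantify that bias is to count how many probe source ports fall inside the Linux default ephemeral range, which is 32768 to 60999 (the default value of `net.ipv4.ip_local_port_range`). The function name is hypothetical, and the talk does not say how the authors measured the bias.

```python
LINUX_EPHEMERAL_LOW = 32768   # default lower bound on Linux
LINUX_EPHEMERAL_HIGH = 60999  # default upper bound on Linux


def fraction_in_linux_range(ports):
    """Fraction of source ports inside the Linux default ephemeral range."""
    ports = list(ports)
    if not ports:
        return 0.0
    inside = sum(LINUX_EPHEMERAL_LOW <= p <= LINUX_EPHEMERAL_HIGH for p in ports)
    return inside / len(ports)
```

For comparison, ports drawn uniformly from 1 to 65535 would fall in this range about 43 percent of the time, so a fraction well above that indicates a bias.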
So it’s clear that active probing of Shadowsocks is a phenomenon that happens. What features is the Great Firewall looking for? The authors investigated this, aided by the fact that:
So the authors designed an experiment to establish a TCP connection and then send one TCP packet with a configurable entropy and a configurable payload size. And from this graph we can see, although there isn’t a really sharp distinguishing threshold, that high-entropy packets are more likely to be replayed than low-entropy packets, and the length of the packets matters as well.
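One way to build a trigger payload with a chosen length and a chosen approximate entropy is to draw each byte uniformly from an alphabet of 2^k values, which gives close to k bits per byte for long payloads. The talk does not describe the authors’ generator, so this function and its parameters are a sketch under that assumption.

```python
import random


def make_payload(length, bits_per_byte, rng=None):
    """Return length bytes drawn uniformly from 2**bits_per_byte byte values."""
    if not 0 <= bits_per_byte <= 8:
        raise ValueError("bits_per_byte must be between 0 and 8")
    if rng is None:
        rng = random.Random()
    alphabet_size = 2 ** bits_per_byte  # 1 value at 0 bits, 256 values at 8 bits
    return bytes(rng.randrange(alphabet_size) for _ in range(length))
```

Sending `make_payload(n, k)` as the first data packet of a fresh TCP connection, for many values of `n` and `k`, and then recording which payloads come back as replays, is the shape of the experiment described above.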
So here we have another CDF: the gray line in the back is the authors’ own trigger connections, and they tested packet lengths between 1 and 1,000 bytes, uniformly distributed. Now you can see the non-replay probes there with the expected peak at 221. The replay probes only occur in the interval of about 160 to 700 bytes in length. Packets outside that interval were almost never replayed, and even within that interval, certain lengths are more likely to be replayed than others.
So you’ll notice the replay line has a sort of chunky stair-step pattern, and there’s actually some structure to that: between lengths of about 160 and 384, packets were more likely to be replayed if they had a length whose remainder was 9 when divided by 16. And in the interval of about 264 to 700, they were more likely to be replayed if they had a length whose remainder was 2 when divided by 16. And in the area where those two intervals overlap, there was a mix of remainders 2 and 9. The authors don’t have an explanation for this phenomenon; it’s just an intriguing feature of the packet length distribution.
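The remainder pattern can be written down as a small predicate. The interval boundaries are the approximate ones given in the talk, and the function names are made up for illustration; this describes which lengths were observed to be favoured, not a rule the firewall is known to apply.

```python
def favoured_remainders(length):
    """Remainders mod 16 that were favoured for replay at this payload length."""
    remainders = set()
    if 160 <= length <= 384:  # approximate interval from the talk
        remainders.add(9)
    if 264 <= length <= 700:  # approximate interval from the talk
        remainders.add(2)
    return remainders


def more_likely_replayed(length):
    """True if length has one of the favoured remainders for its interval."""
    return length % 16 in favoured_remainders(length)
```

For example, 169 is 16 × 10 + 9 and lies in the first interval, so it is favoured, while 162 has remainder 2 but lies below 264, so it is not.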
Taking active probing of Shadowsocks as a given, what can be done to mitigate it? Well, because we know that the detection process is a two-step process, it is sufficient to disrupt either of those two steps. So you can either evade the passive traffic analysis, or you can evade the active probing component.
Evading the passive traffic analysis means changing the features that the Great Firewall is looking for: so entropy and packet lengths. Changing entropy in Shadowsocks is not easy without kind of fundamentally changing how the protocol works; but with packet lengths, you have a little bit of leeway. And, for example, newer versions of Outline will coalesce consecutive packets: maybe something that would be sent as two packets could be sent as one packet instead, as a way of disguising the characteristic packet length distribution that the Firewall may be looking for.
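The effect of coalescing on packet lengths can be illustrated with a toy function that merges the first few application writes into a single one before they reach the socket. This is a generic sketch with a hypothetical name, not Outline’s code, and it ignores what TCP itself may do to segment boundaries.

```python
def coalesce_first(writes, count=2):
    """Merge the first count writes into one; leave the rest unchanged."""
    writes = list(writes)
    head = b"".join(writes[:count])
    return ([head] if head else []) + writes[count:]


if __name__ == "__main__":
    before = [b"h" * 34, b"d" * 300, b"x" * 120]
    after = coalesce_first(before)
    print([len(w) for w in before], [len(w) for w in after])
```

An observer who looks only at the first data packet of the connection now sees one longer payload instead of a short one followed by a second packet, which changes the length feature that the passive step relies on.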
Another interesting observation is with a tool called Brdgrd (Bridge Guard). This is software that you can install on a Shadowsocks server, and it causes clients to send smaller-than-usual packets when they’re in the early stages of their connection. It does this by rewriting the server’s TCP window size. Although there are some drawbacks and caveats to using Brdgrd with Shadowsocks, it’s clear that in this experiment, while Brdgrd was active, the incidence of active probing was notably diminished, although not quite to zero.
The other thing you can do to avoid detection is change the way that you respond to active probes. So I showed you this table earlier, and it was a little bit of a lie, because that table described the behavior of some older versions of Shadowsocks. Some newer versions of Shadowsocks, partially as a result of this research, try to disguise the distinction between timing out a connection and terminating the connection. So the reactions in newer versions of Shadowsocks look more like this. Now I don’t want to get into the details, but AEAD is the newer, currently recommended version of the Shadowsocks protocol. And you can see that in this version, in these two implementations at least, the server always times out, no matter the length of the unauthenticated probe. In this older, deprecated stream version of the protocol, for compatibility reasons, it’s not possible to completely eliminate that distinction, but they have done it as far as possible.
In summary, the Great Firewall of China detects Shadowsocks servers using a combination of passive traffic analysis and active probing. Probing is triggered by the first packet in a data connection, and it’s more likely when packets have high entropy or have certain payload lengths. There are many different types of active probe: some are replays, some are not. Probes come from many IP addresses, but they show signs of being centrally managed. And it’s possible to mitigate the effects of active probing of Shadowsocks by disrupting either of the two steps in the classification process.
Thank you for your attention. If you have questions or comments, it’s best to get in touch with the authors directly. Source code and data for this research are available at the URL you see.