Solving the Mystery of My SSL/TLS Issue

The Backstory

I've recently joined a new team working on a platform that will allow our data scientists to easily run models that may be rather CPU intensive across a cluster of beefy machines without really having to manage how the work is distributed. Tools like Mesos and Kubernetes have started to make this extremely easy to do: you just set up the cluster and then submit jobs along with a specification of how much CPU and memory should be allocated for the task. With data science becoming an important part of everyday computing, being able to quickly allocate a bunch of resources is important, especially when working with deep learning, where there really is no limit to how much CPU you can meaningfully use (you can always add more layers and units to the model, but that obviously increases the amount of computation required).

One of the things that will be driving this platform and making it so easy to use is containers. If you aren't familiar with containers, there are many resources out there that will get you up to speed, but the really short version is that a container is just a process that has been isolated and has limited access to the resources of the machine that it is running on. Using this, you can effectively simulate a virtual machine (from the perspective of the process) without the overhead of running an actual virtual machine that would have additional resource requirements (due to things like a guest OS).

One thing that we need to ensure is that the containers we run are secure. We already get a lot of security from the fact that a container is by nature somewhat isolated from everything around it, but we need to make sure that the image that defines the container doesn't contain malware or try to execute inappropriate commands. We have identified a couple of different options for doing this, but the two main tools we plan to start with are called Anchore and Falco. Anchore is an image scanning tool that will look for signatures of known vulnerabilities, and Falco is a process monitoring tool that allows you to define rules that will flag suspicious activity (trying to read files that normally shouldn't be accessed, for example).

The Problem

Both tools are easy to install and use because they can be run inside containers. This is convenient: we can use the same infrastructure that already runs our containers to run the image scanning and process monitoring tools, and upgrading the tools is easy. I ran into a problem with Anchore, however, and describing that problem and the solution I came up with is the purpose of this post.

One of the things that Anchore does when it starts up is gather data from remote sources in order to update an internal vulnerabilities database. As part of this update, it made a simple HTTPS call to https://ancho.re, but the call was failing and the policy engine would fail to start. Looking in the logs, I saw this error message:


    anchore-engine_1  | 2018-02-26 19:16:32,615 ERROR policy_engine_bootstrap - Preflight checks failed with error: ("bad handshake: SysCallError(104, 'ECONNRESET')",). Aborting service startup
    anchore-engine_1  | Traceback (most recent call last):
    anchore-engine_1  |   File "/usr/bin/anchore-engine", line 128, in startup_service
    anchore-engine_1  |     raise Exception("process exited: " + str(rc))
    anchore-engine_1  | Exception: process exited: 1

This didn't make a lot of sense to me at first, but then I fired up Wireshark to see what was happening on the network, and I saw this:

wireshark-rst-packet

This capture clearly shows the start of the handshake that happens when you are making an HTTPS request. Normally you send a client hello and the server will respond with a server hello and then you exchange certificates and eventually the channel is secure. Obviously it didn't get that far. Instead, the response to my hello was a quick response (only 30 milliseconds later) with the RST flag set on the packet. In layman's terms, it is like someone just hung up after I said "hello". Kind of rude, actually.

I tried the request using curl on my Linux host, and it responded OK (200). Trying the same URL inside the container with curl, however, resulted in a response of Connection reset by peer (which is what you see from curl when a packet with the RST flag set is received). This was kind of odd to me because by default, Docker containers (which this was) use a bridged network. I tried again using host networking (where the container shares the host's network stack directly) but got the same reset result. The folks at Anchore suggested I try this on a different network, so I tried it at home, and the response came back OK (200).

I then tried something different back on the original network. Instead of using the Anchore image, I ran a simple ubuntu image and tried the curl again. This time it responded with an OK. Obviously it had something to do with the combination of the image and the network because:

  1. It worked on my home network
  2. It worked on my Linux host outside the container
  3. It worked inside a container using a different base image

Firing up Wireshark again, I listened to the packets that were sent from my ubuntu image.

wireshark-tls-packet

This looks a lot better, with a nice Server Hello. But wait a second. The Client/Server hello here is using TLSv1.2. The Anchore image was sending the handshake over SSL:

wireshark-rst-packet-1

Sure enough, this was the difference: whenever the handshake was done using SSL, it failed. Ultimately we determined that there must be some firewall or other network component that was blocking these SSL packets, most likely because of the POODLE vulnerability, which effectively broke SSL 3.0 (a later variant also affected some TLS implementations that did not check padding correctly) by letting attackers recover data from within the encrypted channel. Therefore, we would need to force clients to use TLS 1.2.
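A quick way to see which protocol version a client actually negotiates, without opening Wireshark, is to ask Python's standard `ssl` module. The following is only a sketch, assuming Python 3.7 or newer; the helper names `make_tls12_context` and `negotiated_version` are made up for this example and are not part of Anchore.

```python
import socket
import ssl


def make_tls12_context():
    """Build a client context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def negotiated_version(host, port=443, timeout=5.0):
    """Connect to host:port and return the protocol name that was negotiated."""
    ctx = make_tls12_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # wrap_socket performs the handshake before returning.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()
```

Calling `negotiated_version("ancho.re")` from inside each container would report the protocol each one ends up using, or raise an error if the handshake is cut off the way it was here.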

Debugging the Problem

To debug the problem further, I hooked up my IDE to the Python code running inside the Anchore container. Fortunately with PyCharm, this turns out to be not too hard to do. As explained in this article, it is relatively easy to configure a docker-compose deployment so that you could simply start debugging the Python scripts in the container as if they were running outside (one issue I did find is that the IDE wouldn't actually rebuild the images for me if I made a change, so I had to rebuild the images manually before starting the debug session).

Unfortunately, the debugger breakpoints don't work when the Python script forks off child processes with a library called Twisted (there is a preference that indicates that breakpoints should still work in child processes, but they didn't with the Twisted processes), but luckily the issue still occurred if I made requests from the main script. I started debugging down into the guts of some Python libraries for doing HTTP/HTTPS calls, namely requests and urllib3. These libraries allow you to override the SSL version by various means, but for some reason, it wasn't working for me. I checked that the ssl_version field was set correctly in the data objects that were passed around, and eventually I hit the native OpenSSL library calls that actually do all the handshaking. Even though I was explicitly telling the library to handshake with TLSv1.2, it was still using SSL for the requests.
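For readers who haven't seen it, the usual way to pin the protocol version in requests is a transport adapter that hands urllib3 a custom SSL context. This is a minimal sketch of that general technique, assuming Python 3.7+ and reasonably recent requests and urllib3; it is not the Anchore code, and `Tls12Adapter` and `make_tls12_session` are names invented for the example. How the context is honored has varied between requests releases, so it is worth verifying against the installed versions.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter


class Tls12Adapter(HTTPAdapter):
    """Transport adapter whose connection pools require TLS 1.2 or newer."""

    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.minimum_version = ssl.TLSVersion.TLSv1_2
        # urllib3's PoolManager passes ssl_context on to its HTTPS pools.
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)


def make_tls12_session():
    """Return a requests.Session that uses Tls12Adapter for every https:// URL."""
    session = requests.Session()
    session.mount("https://", Tls12Adapter())
    return session
```

As described above, pinning the version at this level made no difference inside the Anchore image, so treat this as the general technique rather than a fix for this particular problem.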

Taking a Different Approach With Containers

While thinking over the problem, I came up with an elegant solution that would require minimal changes to the Anchore code base. Since the problem seemed to be in the OpenSSL library that the image was using, and I really didn't feel like trying to debug that, it seemed a different approach was needed, and then it hit me that I was able to request the URL from a container with a different base image. If I can do that, why not have the Anchore container just call the other container and have it do the actual request? The response is some simple JSON, so it would be easy to return that back to the Anchore container. Essentially, I'm using the proxy pattern.

Using Python's Flask package, I created a really simple web server that listens for requests and forwards them to the target URL it is given (as mentioned in the TODO section below, I need to address making this secure). I made it so that it could also pass along headers that were encoded in JSON. The proxy therefore has a simple API for making requests:

http://proxy:5000/get?target=https://ancho.re/...&headers={"xxx": "yyy", ...}

I've trimmed out the details of the URL and not URL-encoded this example for clarity. As you can see, the request to the proxy container is just over regular HTTP, so it doesn't require certificates to be created. I'm only going to expose the proxy's 5000 port to the Anchore container, and with some additional checks on the URLs that can be requested, this shouldn't require an actual HTTPS connection.
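In a real call both parameters do need to be URL-encoded, otherwise any `&` or `?` in the target would be read as part of the proxy's own query string. A minimal sketch of how a caller might build the URL, using only the standard library; `build_proxy_url` is a hypothetical helper, and the proxy address is the one from the example above.

```python
import json
from urllib.parse import urlencode

# Address of the proxy container as used in this post.
PROXY_BASE = "http://proxy:5000/get"


def build_proxy_url(target, headers=None):
    """Return the proxy URL that asks the proxy to fetch `target`."""
    params = {"target": target}
    if headers:
        # Headers travel as a JSON object in a single query parameter.
        params["headers"] = json.dumps(headers)
    # urlencode percent-encodes each value, so the target URL survives intact.
    return PROXY_BASE + "?" + urlencode(params)
```

The result can be passed straight to `requests.get`, since the proxy itself is reached over plain HTTP.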

The Python code required for this solution is pretty minimal:


    import json
    from urllib.parse import unquote

    import requests
    from flask import Flask, Response, request

    app = Flask(__name__)


    @app.route('/get')
    def proxy_get():
        # Check for the target first so a missing param yields a 400, not a crash.
        target_url = unquote(request.args.get('target') or '')
        if not target_url:
            return Response("ERROR: missing target param", status=400)

        try:
            # The headers param is optional; when present it must be JSON.
            headers = json.loads(unquote(request.args.get('headers') or '{}')) or {}
        except ValueError as e:
            return Response("ERROR: invalid headers param: " + str(e), status=400)

        # Make the real HTTPS request from this container and relay the body.
        response = requests.get(url=target_url, headers=headers)
        if response.status_code == 200:
            content_type = response.headers.get('Content-Type')
            return Response(response.content, content_type=content_type, status=200)
        return Response("ERROR: request returned status of {}".format(response.status_code), status=400)

TODO

There are a couple of things I need to do before I can really call this solution complete:

  1. A little bit of code cleanup, as I will need to add some additional code to secure this proxy.
  2. Security around the valid URLs that can be requested and who can actually call the proxy.
  3. Right now the Anchore code has been altered significantly to force using the proxy. I need to create a simple wrapper around the requests call that hides the proxy from the code making the request.
  4. Make a simple configuration for Anchore to enable/disable the proxy.
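For the second item, one likely starting point is an allowlist check on the target URL before the proxy makes any request. This is only a sketch of the idea, not something that exists in the proxy yet; `ALLOWED_HOSTS` and `is_allowed_target` are hypothetical names, and the single allowed host is an assumption based on the URL used in this post.

```python
from urllib.parse import urlsplit

# Assumption: the proxy only ever needs to reach this one host.
ALLOWED_HOSTS = {"ancho.re"}


def is_allowed_target(url):
    """Return True only for https URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    # Compare the parsed hostname, not a substring of the raw URL, so that
    # lookalike hosts and user@host tricks are rejected.
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
```

The proxy handler could call this right after reading the target parameter and return a 400 for anything that fails the check.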

Conclusion

This all took about a week to figure out, mostly because it didn't really make sense at first. The folks at Anchore were super helpful and pointed me in the right directions that eventually led me to discover the root of the problem and come up with a solution that really was relatively simple. I'm pretty happy that I didn't have to debug the OpenSSL library to resolve it.
