Container Isolation Gone Wrong (sysdig.com)
270 points by knoxa2511 on June 28, 2017 | 83 comments


I enjoy reading descriptions such as these that start with easily detectable problems and go into a level of debugging far beyond my skill level. Helps to illuminate some of those "unknown unknowns" that I didn't even ever consider before.


I had the same feeling, but this also reinforced my view that Docker and containerization in general (often used as a scapegoat to not have to do proper configuration management) 'for the masses' is more problematic than helpful. In most cases it doesn't solve anything but does add problems that can be hard to debug. The actual 'lack' of isolation wouldn't have happened with true virtualisation, and the method of debugging here is something most people that think they need containers won't have.

To me, debugging like this is something that should be far more important to people than slinging words like Docker and NodeJS around all day. (and then mostly on Discord, or to them, the older Slack, but not IRC because that is too hard for that crowd -- totally unfounded opinion/rant)


Docker didn't cause this problem, the point of the article is that Docker doesn't prevent all such problems. On the other hand it does solve a lot of packaging, dependency and environment parity problems that traditional virtualization is too heavy to accomplish.

I'm old enough to also be frustrated with buzzword driven development, and it's pretty annoying that so many believe Docker invented containerization, but don't throw the baby out with the bathwater. Containerization is an awesome tool and orthogonal to config management.


Traditional virtualization is "too heavy", now, for solved problems like packaging? How and why?


For the reasons mentioned in the article:

- slow (re)start times

- greater resource consumption

Granted, "too heavy" is relative, but starting a few hundred VMs on a single host (assuming commodity hardware) is not going to work very well.


Slow? What is slow. VMs start in 4 seconds. While a container might do it in 1 or less, 4 seconds isn't slow.

Resource consumption might be more, but it's not going to be dramatically more than a container. It's not like a container uses nothing, the management and resource constrainers take up resources too.

Containers simply solve nothing and aren't 'better' in general. Containerizing certain programs might be useful, but other than that they are being hyped by the 'shiny new thing' crowd more than it deserves. On top of that, the amount of people using it vs. the amount of people that actually need it is way more of an issue than a container vs. vm debate.


Did you try runV? It launches a Docker image into a micro VM in 100ms. github.com/hyperhq/runv


That works just fine. It's typical in a VM farm to have hosts with a hundred VMs.


>It's typical in a VM farm to have hosts with a hundred VMs.

Several things:

1. We may have a different definition of "commodity hardware", but you're missing the broader point.

2. The broader point is that VMs are significantly less resource-efficient.

3. 1 & 2 notwithstanding, you're conveniently ignoring the issue of (re)start time

4. It's fine to use VMs, but it's frankly bizarre to fight tooth-and-nail over the ridiculous notion that they should always be preferred over containers.


I am simply addressing the fact that it's perfectly fine and common to have hosts with a hundred VMs and it works flawlessly.

VMs are memory intensive because they duplicate the operating system. The starting point is around 500 MB per VM. That's the only meaningful difference in resources compared to containers.

I am not discussing that they have different starting and stopping time.
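
As a rough back-of-the-envelope check of the memory figure above, the sketch below multiplies the roughly 500 MB per-VM baseline by the hundred VMs per host mentioned in this subthread. Both numbers come from the comments; the function name is hypothetical, the 500 MB is treated as MiB for simplicity, and the result ignores page sharing, ballooning, and whatever overhead containers themselves carry.

```python
def guest_os_overhead_gib(vm_count: int, per_vm_mib: int = 500) -> float:
    """Memory spent on duplicated guest operating systems, in GiB.

    per_vm_mib is the ~500 MB starting point quoted in the thread,
    treated here as MiB (an assumption made for round arithmetic).
    """
    return vm_count * per_vm_mib / 1024


if __name__ == "__main__":
    # 100 VMs on one host, as in the "VM farm" example from the thread.
    print(f"{guest_os_overhead_gib(100):.1f} GiB of baseline guest OS memory")
```

Whether that amount counts as "dramatically more" depends on how much memory the host has, which is the part the thread disagrees on.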


CPU time is also more complicated to schedule in VMs.

For instance, guest operating systems generally assume they're running on physical hardware and use spinlocks for some small critical sections. Under a physical hardware assumption this can be the correct approach, because the thread holding the lock will leave the critical section soon and spinning performs better overall.

However, if the lock holder's vCPU is scheduled away by the hypervisor, the other thread may spin until that vCPU gets scheduled back in. This wastes cycles.

A single operating system that uses OS-level virtualization (i.e., containers) has a more complete view of the system and can better multiplex the existing resources. That said, OS-level virtualization is generally accepted to have less isolation than VM isolation and solving the VM problems with, for instance, spinlocks might be easier than solving the isolation problems with containers, which is near intractable given the size of the kernel.

Unikernels try to take this approach and have a lot of the benefits of containers. If you squint, what we're really looking at with unikernels is a microkernel that uses virtualization support in hardware for robust process isolation. What's interesting to me is the question that I never see asked, which is whether we should revisit the microkernel architectures instead of laying more crap on top of monolithic kernels. The problems with Mach in terms of IPC time have been largely mitigated/eliminated with the L4 family of microkernels.


>I am simply addressing the fact that it's perfectly fine and common to have hosts with a hundred VMs and it works flawlessly.

And that was never the point.


Yet that was your conclusion.


+1. Docker is not Docker-Swarm/Kubernetes.


Hey, there are plenty of solutions which combine the best of both worlds; for example https://www.vmware.com/products/vsphere/integrated-container...


How is that the best? You're still running a full kernel for each container, rather than sharing it.


But, so what? VMWare under the hood is sharing common pages between VMs, and a kernel that isn't doing anything isn't consuming any CPU, so why not?


Nope, common pages are no longer shared between VMs, because it was demonstrated that was a bad idea, security-wise:

> upcoming ESXi Update releases will no longer enable TPS between Virtual Machines by default

https://kb.vmware.com/selfservice/microsites/search.do?langu...


Guess what, if it is a bad idea for a VM, it must be exponentially worse for something less isolated.


Yes. Thankfully, the point is that using the regular container model you don't need memory page sharing, because there's only one kernel anyway, not a copy per each container.


Page cache and disk cache are quite shared between containers...


And they share a CPU, too. Please come back with an actual point (like a link describing an attack on encryption using that shared cache), as I don't have time to make one for you.


> Administrators may revert to the previous behavior if they so wish.

Sounds like a sane change to the defaults, but anyone who isn't securing against 3rd party code can turn it back on (to return to much more Docker-like security/performance).


>"I had the same feeling, but this also reinforced my view that Docker and containerization in general (often used as a scapegoat to not have to do proper configuration management) 'for the masses' is more problematic than helpful. In most cases it doesn't solve anything but does add problems that can be hard to debug."

The issue described in this post has nothing to do with "config management vs containers." It's odd that this article would have "reinforced" that view. How would configuration management have prevented a noisy neighbor?

From the summary:

"The core lesson of this story: just because you are using containers and you get the impression that your applications are perfectly virtualized and isolated, don’t assume the kernel is fully isolating every underlying resource at a container granularity."

and

"Luckily, the solution is there and rather simple: make sure to deeply monitor all your applications."

That's nothing to do with any "configuration management vs containers" argument and everything to do with proper metrics collection and monitoring, which should be part of every "operational readiness" checklist whether Docker is used or not.

Lastly, saying that Docker "In most cases it doesn't solve anything" is an absurd statement. Do you believe that virtualization doesn't solve any problems? If so, why do you imagine the Linux kernel supports it?


> just because you are using containers and you get the impression that your applications are perfectly virtualized and isolated

Anyone who believes in the first place this shouldn't be running production systems...


Your comment contains two sentence fragments, neither of which is coherent.


Seems perfectly straightforward. If you believe containers give you the level of isolation that VMs would, then you have fundamental misunderstandings of the technology which in a sane organisation would preclude you from operating important systems.


No, your comment is anything but straightforward. In fact it's grammatically incorrect to the point of being incoherent and incomprehensible. Maybe you should re-read what you wrote? It's bizarre to think that anyone would read that and think it was articulate.

Nowhere did I state or even remotely suggest that containers give you the level of isolation that VMs would. My comment was refuting the OP's suggestion that "configuration management" was relevant to the article. Maybe you should go back and re-read the thread.


sysdig is a monitoring system for docker that sells for $25 per month per host.

Part of the debugging method has to do with "let's show our product".


I had the exact same feeling.

Well written OP.


Ah, large directories and d_entries... the bane of any NAS operator. Having seen hundreds of OpenSolaris appliances being abused in similar ways, I can relate.

It doesn't seem like Kubernetes supports I/O resource limiting at this point [0][1].

In any case, after a problem like this is identified, a cluster admin can use pod affinity/anti-affinity to avoid both apps co-existing on the same node [2].

EDIT: For hypervisor-based container runtime, check Frakti (https://github.com/kubernetes/frakti)

0 - https://kubernetes.io/docs/concepts/configuration/manage-com...

1 - https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-con...

2 - http://blog.kubernetes.io/2017/03/advanced-scheduling-in-kub...


Regarding why block io limiting isn't implemented yet in Kube - it's really hard to make block io sharing work well without killing performance (it's easy for one workload to screw up another if it seeks at the wrong time, and ssds are fast enough that just having to check the io limits may severely limit your max throughput). If you read some of the proposals for io, the end goal is to make it easy to use multiple volumes per workload where possible, and have high level limits in place for other things like inodes, total writes, etc.


Isn't that what the BFQ scheduler was designed for?


Yes, although even the 4.12 code can have substantial overhead vs NOOP (phoronix benchmarks, as an example). It's not a complete slam dunk - turning it on for io indifferent workloads might make sense, but not necessarily on all boxes.

http://www.phoronix.com/scan.php?page=article&item=linux-412...


Phoronix tested BFQ in low latency mode on throughput. Results are obvious. There is a simple twiddle for that.

Of course they didn't test fairness or latency, that is too hard.


Yeah, I'd love to see comprehensive benchmarks of competing workloads on larger scale boxes. If workload isolation can be achieved with moderately low overhead (15%?), there would be a lot of interest in pushing a default setup with BFQ in Kubernetes once more stable kernel streams have it available.


Thanks for the background information. Would you have a link for those proposals?


https://github.com/kubernetes/community/pull/306

Has both the discussion and the current proposed path for io separation.


Nice problem solving.

I'd classify the primary root cause as a kernel bug. It's good to make use of otherwise unused memory for caches, but not to the extent that the caches grow so large they slow things down.

Secondarily, there's probably something wrong in a system where you have to constantly poll and attempt to access large numbers of files that don't exist. (But probably 100% of systems that do anything useful have at least some weird cruft like this somewhere in them at any given time, so I'm not judging.)


The author notes that future kernels did decide to introduce limits that would've prevented the slowdowns from happening, but the customer was running an outdated kernel.

That's what made the article disappointing for me. Do all this impressive in-kernel debugging just to find out that you should've upgraded your systems first. Sigh...


But the fix was to limit the cache based on process memory constraints. jmull's point is that if a cache is permitted to blow out so big that lookups are impacting performance, then that cache is not really serving its purpose in the first place.


Other fixes would have been possible. If the hash table could be resized when necessary there would never have been a performance problem.


Web servers do that, especially when spiders ask for urls that don't exist.


That's a good point, although I don't know why a spider that isn't specifically malicious (or I guess just badly malfunctioning) would request millions of different files that don't exist, which is what's needed to trigger this problem.


Millions would be unusual, but I could see it happening. An old domain that used to be an image host, for example.


Really enjoyed this, thanks! One lesson I learned from this, and correct me if I'm misinterpreting the root cause, is that more memory is not always better. This is, to me, a far more powerful lesson I gained from this article.


What I want to know is why the 'trasher' was looking directly for so many different files. Could it not parse some output of finding/listing existing files to look for its targets?


Containers != type 1 hypervisor with all-encompassing resource quotas, reservations and prioritization like VMware ESXi. The problem can be solved either by using a hypervisor that deploys only one container per VM (with suitable paravirt/dedupe), or by fixing the OS to operate with much finer-grained resource contention and allocation knobs for each and every limited resource. The latter is superior when only a single OS is needed, because it reduces the need for virtualization as a crutch for inadequacies of the OS.


https://www.vmware.com/products/vsphere/integrated-container... and I’ve developed a small docker backend for Xen previously (which is vaguely similar to vic).


Lately I've learned to panic whenever a job candidate starts dropping Docker and Kubernetes in the interview.


Maybe if they think it's a one-stop solution to hosting problems, yeah. But penalizing candidates just for familiarity with new technology?


The issue is more with people dropping buzzwords than with what they're actually using the tools for.


In my opinion, that's entirely the wrong mentality. If someone gave me a pile of raw buzzwords, or told me they used entirely the wrong tool because of buzzwords, then I'd penalize them, sure.

But say they made something wonderful, and it was cleaner and more efficient because of their use of Docker/Kubernetes, and they had taken the time to figure out the tradeoffs inherent to that approach. Is that worth penalizing, from your point of view?


Of course not. My issue is with leading with the tool, rather than the problem it's solving.


That's odd criteria.

Mind passing those resumes this way?


This was a great read, the conclusion of always monitoring, no matter what technologies one is using, should be obvious, but I've noticed that it really isn't, unfortunately.

Even I fall into the trap and sometimes I wish I knew about all this stuff but, alas, I prefer development.



What's the solution here? Can you limit d_entry table size per-process? Do you have to limit it globally? Is the answer to just not use containers?


OP here

The solution is very simple: as mentioned in the article, just use a newer kernel and always set memory limits for containers. The blog post is based on an older kernel (2.6.32) that quite a few people irresponsibly still use in containerized environments, mostly because EL6 is so popular among enterprises.

In newer kernels, allocations from object pools are now tied to the limits of the memory cgroups that requested them in userspace, if any, so you wouldn't run into this specific issue and you would just effectively have a container not being able to use more than X MB of dcache entries (although there are probably other minor ones, for example related to sharing global kernel mutexes and such).
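
Since the advice above hinges on a memory limit actually being set, here is a minimal sketch of how one might check that from inside a container. It assumes the usual cgroup mount points (`/sys/fs/cgroup/memory.max` for cgroup v2, `/sys/fs/cgroup/memory/memory.limit_in_bytes` for v1); those paths can differ per runtime, and the helper names and the "unlimited" threshold are illustrative choices, not anything taken from the article.

```python
from typing import Optional

# cgroup v1 reports "no limit" as a huge page-aligned number close to 2**63;
# treating anything at or above 2**60 as unlimited is an assumption made here.
V1_UNLIMITED_THRESHOLD = 1 << 60

# Common locations; the actual mount layout depends on the host and runtime.
CANDIDATE_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def parse_memory_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when the cgroup is unlimited."""
    value = raw.strip()
    if value == "max":  # cgroup v2 spelling of "no limit"
        return None
    limit = int(value)
    return None if limit >= V1_UNLIMITED_THRESHOLD else limit


if __name__ == "__main__":
    for path in CANDIDATE_FILES:
        try:
            with open(path) as handle:
                limit = parse_memory_limit(handle.read())
        except OSError:
            continue  # file absent on this system, try the next layout
        print(path, "->", "unlimited" if limit is None else f"{limit} bytes")
        break
    else:
        print("no cgroup memory limit file found at the assumed paths")
```

An "unlimited" result would suggest the container is in the situation the article describes, at least as far as memory accounting goes.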


I couldn't understand two things from the article:

1. If one of the two containers caused the issue, then why did you need both of the containers to produce the issue? Why was running just the offending one not enough?

My guess is that "worker" container requested those non-existent files from a volume mounted by the other container, is it right?

2. Kernel hash table implementation. The whole point of a hash table is that its size is O(N), where N is the number of elements it holds.

Capping the hash table size to some constant and putting all the excess elements into its linked lists makes it perform like a linked list divided by the constant, no surprise. So it sounds like there's a bug in the dentry hash table implementation -- it should either increase its size according to the element count, or stop accepting new/evict old entries.
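
To make the "linked list divided by the constant" point concrete, here is a small simulation of a chained hash table whose bucket count never grows. It is only a sketch: the bucket count, key names, and use of CRC32 as the hash are arbitrary stand-ins, not the kernel's actual dentry hash.

```python
import zlib
from collections import Counter


def chain_lengths(num_keys: int, num_buckets: int) -> Counter:
    """Hash num_keys unique names into a table whose bucket count is fixed.

    Returns a Counter mapping bucket index -> number of entries chained there.
    """
    chains = Counter()
    for i in range(num_keys):
        name = f"missing-file-{i}".encode()
        chains[zlib.crc32(name) % num_buckets] += 1
    return chains


def mean_chain_length(num_keys: int, num_buckets: int) -> float:
    """Average entries per bucket, i.e. the typical chain a lookup must walk."""
    return num_keys / num_buckets


if __name__ == "__main__":
    buckets = 1024  # illustrative constant, not the kernel's value
    for keys in (1_000, 100_000, 200_000):
        longest = max(chain_lengths(keys, buckets).values())
        print(f"{keys:>7} keys: mean chain {mean_chain_length(keys, buckets):.1f}, "
              f"longest chain {longest}")
```

With the bucket count held constant, the mean chain length grows linearly with the number of entries, so each lookup costs O(N / buckets) rather than O(1).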


> 1. If one of the two containers caused the issue, then why did you need both of the containers to produce the issue? Why was running just the offending one not enough?

Running just the offending one would have been clearly enough, since its effects would have caused the same increased latency for every other process in the system (including itself). However, using a second container to observe the performance degradation proves the point that one container is able to affect another one, which is sort of the gist of the article, since too many people think containers provide much more isolation than what in reality happens.

> My guess is that "worker" container requested those non-existent files from a volume mounted by the other container, is it right?

No, the containers didn't share any volume, the dentry cache is effectively a singleton within the kernel, so even if the set of volumes is not overlapping, all processes in the system will see a performance degradation, regardless of where the files being accessed reside.

> 2. Kernel hash table implementation. The whole point of a hash table is that its size is O(N), where N is the number of elements it holds.

Your speculation is correct; however, there are sound reasons for doing such a thing in the kernel (and not allowing the main array of the hash table to dynamically expand/shrink), so I wouldn't consider it a bug per se. I'll refer you to this excellent comment: https://news.ycombinator.com/item?id=14660954


Thank you. Very good article, thank you for writing it!


It's not irresponsible to use a perfectly fine OS.

What is irresponsible is for Docker to purposefully avoid mentioning that it has endless issues on these widely used OSes.

The 2.6.X is used in CentOS/RHEL 6, which is the standard in numerous enterprises.

It is not a 2.6 kernel by the way, redhat is backporting tons of stuff from the 3 and 4 branches.


> It's not irresponsible to use a perfectly fine OS.

The first problem with this statement is the idea that there's such a thing as a "perfectly fine OS". We don't even need to consider containers, the longer an OS has been in the wild, the longer its potential vulnerabilities have been found and exploited.

Windows XP is a perfectly fine OS; using it nowadays is irresponsible.

> What is irresponsible is for Docker to purposefully avoid mentioning that it has endless issues on these widely used OSes.

That responsibility doesn't and should never fall on the developers of an application. The extent of one's responsibility as a developer is to define the recommendations for its use. Anything beyond that is entirely on the user.

One would go insane if one had to wonder about every single operating system someone decided to use one's application on.

> It is not a 2.6 kernel by the way, redhat is backporting tons of stuff from the 3 and 4 branches.

"Backporting stuff" doesn't make it not the 2.6 Kernel, it very much is.


>We don't even need to consider containers, the longer an OS has been in the wild, the longer its potential vulnerabilities have been found and exploited.

I challenge you to find exploitable bugs in its kernel. Windows XP is not supported anymore, while RHEL 6 is.


> ES6 is so popular among enterprises

I had to re-read this a few times-- I think you meant EL6, right?


Updated, thanks! I am working with ElasticSearch (ES) more than EL these days and my muscle memory tricked me ;)


I ran into a similar issue with kernel memory caching behavior.

While it's nice to just say LOL upgrade you fool, most of us are stuck with the environment we're given.

You can adjust kernel level memory behavior; in particular, vfs_cache_pressure can be set very high to force the dentry cache to be reclaimed more aggressively.

https://www.kernel.org/doc/Documentation/sysctl/vm.txt
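
For anyone who wants to see where their system currently stands before tuning, here is a minimal read-only sketch. It assumes the standard procfs path for this sysctl; the helper names are made up for illustration, and the descriptions paraphrase the kernel's vm.txt documentation linked above.

```python
from pathlib import Path

# Standard procfs location for this sysctl on Linux.
VFS_CACHE_PRESSURE = Path("/proc/sys/vm/vfs_cache_pressure")


def parse_int_sysctl(raw: str) -> int:
    """Parse the text of a single-integer sysctl file."""
    return int(raw.strip())


def describe_pressure(value: int) -> str:
    """Summarise the documented meaning of a vfs_cache_pressure value."""
    if value == 0:
        return "never reclaim dentries/inodes under memory pressure"
    if value < 100:
        return "prefer to retain dentry and inode caches"
    if value == 100:
        return "default: reclaim at a 'fair' rate relative to page cache"
    return "prefer to reclaim dentries and inodes"


if __name__ == "__main__":
    try:
        current = parse_int_sysctl(VFS_CACHE_PRESSURE.read_text())
    except OSError:
        print("sysctl file not readable (not Linux, or restricted procfs)")
    else:
        print(current, "->", describe_pressure(current))
```

Changing the value requires root (for example via `sysctl -w vm.vfs_cache_pressure=<n>`). Since the knob only biases reclaim, it may have limited effect on a box that is under very little memory pressure to begin with.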


If your dcache is growing to the point that each bucket has many entries in it, you can also increase the number of buckets in the hash table using the dhash_entries kernel command-line option.

(The latency in this situation is caused not by the sheer number of entries, but by the fact that the hash table is undersized for the number of entries it gets).
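
A rough way to judge whether the table is undersized on a given machine is to compare the live dentry count with the number of hash buckets. The sketch below reads the first two fields of `/proc/sys/fs/dentry-state` (total and unused dentries); the bucket count is passed in by hand and the value used in the demo is purely a placeholder, since the real one is reported by the kernel at boot ("Dentry cache hash table entries: ..." in the kernel log).

```python
from typing import Tuple

DENTRY_STATE = "/proc/sys/fs/dentry-state"


def parse_dentry_state(raw: str) -> Tuple[int, int]:
    """Return (nr_dentry, nr_unused) from the contents of dentry-state.

    Only the first two fields are used; the meaning of the later fields
    has varied between kernel versions.
    """
    fields = raw.split()
    return int(fields[0]), int(fields[1])


def mean_entries_per_bucket(nr_dentry: int, hash_buckets: int) -> float:
    """Average chain length if entries were spread evenly over the buckets."""
    return nr_dentry / hash_buckets


if __name__ == "__main__":
    assumed_buckets = 262144  # placeholder; check your own boot log
    try:
        with open(DENTRY_STATE) as handle:
            total, unused = parse_dentry_state(handle.read())
    except OSError:
        print("dentry-state not readable on this system")
    else:
        ratio = mean_entries_per_bucket(total, assumed_buckets)
        print(f"{total} dentries ({unused} unused), ~{ratio:.2f} per bucket")
```

A ratio well above 1 would point in the direction this comment describes: long chains rather than sheer entry count.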


Out of curiosity, do you know why the kernel uses a fixed-size hash table as opposed to a dynamically-expanding one?


I'm not sure I can give a definitive answer, but there are a few difficulties in making it expand. One is that in the kernel, you can only reliably make large contiguous kmalloc() allocations at init time; another is that the dentry cache is highly optimised for parallel access (most lookups will proceed without taking locks under Read-Copy-Update).

In most cases memory pressure will tend to naturally limit the dentry cache size - the "perfect storm" here was almost zero memory pressure combined with a process doing a lot of negative lookups on an essentially endless list of unique filenames. For such an unusual situation, it's probably reasonable to ask the administrator to manually tune things, rather than building a more complex runtime-resizing hashtable that almost everyone won't need - especially since the failure mode is a graceful performance degradation.
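
For readers who want to picture the workload, here is a deliberately tiny sketch of "negative lookups on unique filenames": each `stat` of a name that does not exist can leave a negative dentry behind. The function name, file-name pattern, and counts are invented for illustration; this is not the customer's program, and running it with very large counts on a shared machine is not recommended.

```python
import os
import tempfile


def negative_lookups(directory: str, count: int) -> int:
    """Stat `count` distinct names that do not exist under `directory`.

    Returns how many lookups failed with "no such file", which should be
    all of them as long as nothing creates matching files meanwhile.
    """
    misses = 0
    for i in range(count):
        try:
            os.stat(os.path.join(directory, f"no-such-file-{i}"))
        except FileNotFoundError:
            misses += 1
    return misses


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(negative_lookups(scratch, 10_000), "negative lookups performed")
```

Because every name is unique, none of the cached negative entries is ever hit again, which is what lets the cache keep growing when nothing forces reclaim.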


It seems like you could use a dynamically scaling hash instead of a fixed size one. Or evict old entries to keep the number of elements reasonable.


There is little point to caching non-existent objects, the benefit disappears after a second at most.

The smart solution would be to expire these objects out of the cache reasonably rapidly.


You're being downvoted, and I suspect it's because you're not considering different workloads. There's an unfortunate amount of software that uses filesystem polling as a crude form of IPC. Check every x seconds to see if a file at a certain path exists. Fast expiring these dentry nodes would hurt that workload.
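
The polling pattern mentioned above typically looks something like the following sketch (the function name and default intervals are hypothetical). Note that it asks for the same path over and over, so a cached negative dentry is reused on every iteration, unlike the unique-filename case from the article.

```python
import os
import time


def wait_for_file(path: str, interval: float = 1.0, timeout: float = 10.0) -> bool:
    """Poll until `path` exists; return False if `timeout` seconds elapse first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    flag = os.path.join(os.getcwd(), "job-done.flag")  # illustrative name
    found = wait_for_file(flag, interval=0.5, timeout=2.0)
    print("flag file appeared" if found else "timed out waiting for flag file")
```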


I knew before I got into the meat of the article that it was going to be i/o contention. The first two sections talked about memory and cpu limits on the containers, but nothing about i/o rates. This was a Known Problem back in the 90s, when a variety of filesystems (DEC's AdvFS being one) were efforts to address the issues around dentries and inodes. See also http://www.starcomsoftware.com/proj/usenet/doc/c-news.pdf


I think they're actually talking about blowing out the dentry cache, but sure, IO contention is another shared resource in containers. Depending on what you're doing you might run into issues in various networking limits (somaxconn comes to mind, as does what's left of the route cache), blowing out the page cache, or something eating up your memory bandwidth (maybe via some unlucky NUMA).

I'm all for containers, but they don't solve the hard problems folks often ascribe to them; it really just shows that in most cases you don't need to solve the hard problems. Most of the time what containers are buying you is an easy deployment method that leverages some nice features in the OS to make believe you're on separate machines.


>"Depending on what you're doing you might run into issues in various networking limits (somaxconn comes to mind, as does what's left of the route cache)"

I'm curious what issue(s) you might be referring to here with the route cache? Could you elaborate?


Sure - The route cache was largely removed in (IIRC) 3.8, but there are still entries that get stored[0]. There's a limit to how many entries Linux will store, and like any LRU-esque data structure, rapidly cycling entries through it isn't going to do anything wonderful for your performance, never mind if you actually expected to use any of the cached data for a business performance 'feature'.

25g NIC is an awful lot of 60byte packets. I'm not saying this is going to be a common concern, just that like any other shared kernel resource cgroups and namespaces aren't going to help.

0: https://www.systutorials.com/docs/linux/man/8-ip-tcp_metrics...
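
To put a number on "an awful lot", here is the arithmetic as a short sketch. It takes the 25 Gbit/s and 60-byte figures from the comment at face value and, by default, ignores per-packet framing overhead on the wire; the function name and the `overhead_bytes` parameter are illustrative.

```python
def packets_per_second(link_bits_per_second: float,
                       packet_bytes: int,
                       overhead_bytes: int = 0) -> float:
    """Upper bound on packet rate for a link saturated with fixed-size packets.

    overhead_bytes is any extra per-packet cost on the wire (framing, gaps);
    it defaults to 0, which is a simplifying assumption.
    """
    bits_per_packet = (packet_bytes + overhead_bytes) * 8
    return link_bits_per_second / bits_per_packet


if __name__ == "__main__":
    rate = packets_per_second(25e9, 60)
    print(f"about {rate / 1e6:.1f} million packets per second")
```

Any realistic per-packet overhead lowers the figure, but it stays in the tens of millions of packets per second, each of which can touch shared kernel state.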


Sure, that makes sense and thank you for the link. I was curious about your comment:

>"25g NIC is an awful lot of 60byte packets."

Where are you getting that 60 number from? A minimum IPv4 header is 20 bytes and a minimum TCP header is 20 bytes. Also, how would a tiny TCP packet relate to the route cache? Tiny TCP packets are certainly a problem for the PPS that a NIC is capable of; I understand that. Cheers.


But it's not, strictly speaking, about i/o contention.


OP here

That's correct, I should have included a chart explicitly measuring the I/O activity done by the two containers, but I can assure you there was literally no I/O activity, a dozen open files per second is a very negligible throughput. The bottleneck was solely in the cache.


Yep, the old version of this was 'why is tar/find/rsync/etc hosing my server even when I've [io]niced it to hell'. Except now (as everything else) with containers!


Linux has all manner of oddities when it comes to IO, it seems.



