Closed
Description
I cannot reproduce this at the moment (we hit it and I had to purge the state once I realised what happened), but it appears that docker load
is not atomic. If you kill Docker in a certain window during a docker load
operation then you will end up with the metadata for the image being stored without the actual layer data being stored in /var/lib/docker/btrfs/imagedb/
.
Here's the debug logs, though they don't tell you much.
Mar 28 11:19:50 e169 systemd[1]: Starting Docker Application Container Engine...
Mar 28 11:19:50 e169 docker_service_helper.sh[14404]: Waiting for docker daemon to start
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.241650297Z" level=debug msg="docker group found. gid: 487"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.241701358Z" level=debug msg="Listener created for HTTP on unix (/var/run/docker.sock)"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.242028758Z" level=debug msg="libcontainerd: containerd connection state change: READY"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.242551475Z" level=debug msg="Using default logging driver json-file"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.242594286Z" level=debug msg="Golang's threads limit set to 57330"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.256946250Z" level=info msg="[graphdriver] using prior storage driver \"btrfs\""
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.256982462Z" level=debug msg="Using graph driver btrfs"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.257140636Z" level=debug msg="Failed to load layer sha256:3ff2154d3784e35e52ce07e3c5fbce0975f3aa0cd0465bcdb20294f2c8170dc0: failed to get diff id for sha256:3ff2154d3784e35e52ce07e3c5fbce0975f3aa0cd0465bcdb20294f2c8170dc0: invalid checksum digest format"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.257418656Z" level=debug msg="Failed to load layer sha256:c69569b15b637f9abe81be0dd8bb7f428d4c096d5f9c7c0ed79d515f20934e2f: failed to get diff id for sha256:c69569b15b637f9abe81be0dd8bb7f428d4c096d5f9c7c0ed79d515f20934e2f: invalid checksum digest format"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.257444251Z" level=debug msg="Max Concurrent Downloads: 3"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.257460428Z" level=debug msg="Max Concurrent Uploads: 5"
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.296573461Z" level=debug msg="Cleaning up old mountid : start."
Mar 28 11:19:50 e169 dockerd[14402]: time="2017-03-28T11:19:50.297060519Z" level=fatal msg="Error starting daemon: layer does not exist"
Mar 28 11:19:50 e169 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Mar 28 11:20:51 e169 docker_service_helper.sh[14404]: Docker is dead
Mar 28 11:20:51 e169 systemd[1]: docker.service: Control process exited, code=exited status=1
Mar 28 11:20:51 e169 systemd[1]: Failed to start Docker Application Container Engine.
Mar 28 11:20:51 e169 systemd[1]: docker.service: Unit entered failed state.
Mar 28 11:20:51 e169 systemd[1]: docker.service: Failed with result 'exit-code'.
Mar 28 11:56:05 e169 systemd[1]: Stopped Docker Application Container Engine.
This is the actual error that it dies on, which I believe is happening when we call image.NewImageStore
(during restore
) in NewDaemon
.
Error starting daemon: layer does not exist
Output of docker version
:
Client:
Version: 1.12.6
API version: 1.24
Go version: go1.6.1
Git commit: 78d1802
Built: Wed Feb 15 15:00:28 2017
OS/Arch: linux/amd64
Server:
Version: 1.12.6
API version: 1.24
Go version: go1.6.1
Git commit: 78d1802
Built: Wed Feb 15 15:00:28 2017
OS/Arch: linux/amd64
Output of docker info
:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 7
Server Version: 1.12.6
Storage Driver: btrfs
Build Version: Btrfs v3.18.2+20150430
Library Version: 101
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host null overlay bridge
Swarm: inactive
Runtimes: oci runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 4.4.52-1-default
Operating System: SUSE Container as a Service Platform 1.0
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.796 GiB
Name: e169
ID: 7N3T:HCTD:HQPH:T7A2:L2CS:VAW5:OKEJ:SSOA:2HEV:JQSZ:FHYQ:X4U5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
WARNING: No kernel memory limit support
Insecure Registries:
127.0.0.0/8
Metadata
Metadata
Assignees
Type
Projects
Relationships
Development
No branches or pull requests
Activity
thaJeztah commentedon Mar 28, 2017
ping @dmcgowan @tonistiigi
tonistiigi commentedon Mar 28, 2017
@cyphar Are you sure that image that caused the problem was created by
load
? That case should be protected against with reference counting. Also note that diffIDs are not in the directories in/var/lib/docker/image/btrfs/imagedb/
or/var/lib/docker/image/btrfs/layerdb/
. Let me know if you can reproduce and what version of docker was used for making the loaded tarball. Was there something running in the system that would try to remove images the same time load finished?mjura commentedon Mar 31, 2017
@tonistiigi : I have found a way to reproduce this issue.
I used this same environment as @cyphar
mjura commentedon Apr 11, 2017
Similar issues were reported:
cyphar commentedon Jul 3, 2017
I'm going to further investigate this this week.
vrothberg commentedon Jul 4, 2017
I make similar observations when using
pull
instead ofload
. I performed the following steps in a clean environment:Starting the daemon with
--debug
gives the following logs:I can do some detailed debugging tomorrow.
cyphar commentedon Jul 4, 2017
Alright so this is what the actual error looks like if you do it with a multi-layer image. If you only do it with an image like
alpine
it still fails but you don't get the debug messages about invalid digest.So the problem is that the file
/var/lib/docker/image/btrfs/layerdb/sha256/1118f33a0ee7a874a04318248a886b2bdaf44cba286644ab7ded870aefe64b62/diff
exists but it is empty -- even though Docker is atomically doing things withrenames
and so on. I noticed that if I callsync
before doing a forceful reboot the error doesn't happen -- which tells me thatbtrfs
is doing something very dodgy here.I'm going to see if I can create a reproducer outside of Docker that causes this same behaviour. It's a bit hard because I can't get the full
strace
output without usingsync
(that causes it to work).cyphar commentedon Jul 5, 2017
I believe this issue was actually fixed by c37bd10. You still have the problem that new Docker will die if you try to run it on a corrupted
imagedb
, but it appears to have been fixed in 1.13.0.thaJeztah commentedon Jul 6, 2017
Thanks @cyphar - I added it to the 1.13 milestone for easier discovery 👍
2 remaining items