Hungry Process Breaks Your “while read” bash Cycle

I am working on a build system that makes it easy to control several connected git repositories that form one project. The system is written in bash and relies on many rarely used git and bash features.

Often I have to iterate over a table, usually generated by git. For example, to see the changes between a commit and its parent, I run:

$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos
:160000 160000 236fc8025f106375944457007f5a7a803297e683 f5ede37ddbf9eccd55012f1ddda3ae37259ca800 M	repos/altai/altai-installer
:160000 160000 2706a907bf2d136dd1f737e6c6cb4ca8e420329c 10a1bf5d8f716f30af089f1558eefbdeb07f9b3b M	repos/altai/nova-networks-ext
:160000 160000 7aaafb9f29b60ef0a4cf938b653de23354308be2 ad3725e92b08ca40cf65fb9ed604ae3285fee271 M	repos/altai/python-openstackclient-base
:160000 160000 748da9c4c1d058f96dd40ba328fd100719f768f7 eb568c5ffb4543b676208c96de7af2c62e455329 M	repos/openstack/glance

This output is easily parsed with bash’s while read:

$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos | while read mode1 mode2 hash1 hash2 ignored path; do if [ "$mode2" == 160000 ]; then echo $path; fi; done
repos/altai/altai-installer
repos/altai/nova-networks-ext
repos/altai/python-openstackclient-base
repos/openstack/glance

The read command takes one line from stdin and assigns its whitespace-separated fields to the variables in order; the last variable receives the rest of the line. Because we redirected stdin for the whole while loop, read parses git’s output. When the input stream ends, read returns 1 and the loop stops. It is a simple and elegant bash solution.

The problem begins when you put more logic into the loop. I want to rebuild all changed projects on a remote machine, using ssh to run a command.
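To look at the field splitting on its own, here is a small sketch that needs no repository. The function names parse_diff_tree and sample_lines are made up for illustration, the hashes are shortened for readability, and the second sample line (a regular file called README) is invented so that the mode check has something to reject. I also pass -r to read, which keeps backslashes in paths literal.

```shell
# sample_lines is a hypothetical stand-in for "git diff-tree" output.
# Hashes are shortened; the README line is invented for the example.
sample_lines() {
    printf ':160000 160000 236fc80 f5ede37 M\trepos/altai/altai-installer\n'
    printf ':100644 100644 1111111 2222222 M\tREADME\n'
}

# parse_diff_tree is a hypothetical helper, not part of git.
# It reads diff-tree style lines from stdin and prints the paths
# whose new mode is 160000 (a submodule entry).
parse_diff_tree() {
    # read splits on whitespace (spaces and the tab before the path);
    # the last variable takes whatever is left of the line.
    while read -r mode1 mode2 hash1 hash2 status path; do
        if [ "$mode2" = 160000 ]; then
            printf '%s\n' "$path"
        fi
    done
    # read returned non-zero at end of input, which ended the loop.
}

changed=$(sample_lines | parse_diff_tree)
printf '%s\n' "$changed"
```

Note that mode1 keeps the leading colon (":160000"), which is why the loop in this post tests mode2.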

$ set -x
$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos | while read mode1 mode2 hash1 hash2 ignored path; do     if [ "$mode2" == 160000 ]; then ssh  builder@build-server echo have to rebuild $path; fi; done
+ read mode1 mode2 hash1 hash2 ignored path
+ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos
+ '[' 160000 == 160000 ']'
+ ssh builder@build-server echo have to rebuild repos/altai/altai-installer
have to rebuild repos/altai/altai-installer
+ read mode1 mode2 hash1 hash2 ignored path

Surprise! We see a message for the first line only. After the first iteration, read hits end of file and the loop stops. What’s the matter?

We have redirected stdin for the whole loop. The first command (read mode1 mode2 hash1 hash2 ignored path) consumes the first line and returns. The catch is that every subsequent command inherits the same stdin and can read as many bytes from it as it wants. Usually you can predict that no command inside the loop will read stdin, but it is better to redirect it from /dev/null for a firm guarantee:

$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos | while read mode1 mode2 hash1 hash2 ignored path; do if [ "$mode2" == 160000 ]; then ssh builder@build-server echo "have to rebuild $path" </dev/null; fi; done

So it was ssh that devoured our input. It read the input and buffered it for later use by the remote command. Actually, echo doesn’t need stdin, but ssh doesn’t know that and starves the remaining reads. The ingenious ssh has outdone the ingenuous bash!

There is another solution besides spelling out /dev/null: use the -n option of ssh, which redirects stdin from /dev/null for you:
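The effect can be reproduced without ssh or git. In the sketch below, hungry is a hypothetical stand-in for ssh: it drains whatever is left on stdin before printing its message. The three-line list replaces the git output. It is only a model of the behaviour described above, not of what ssh does internally.

```shell
# Three input lines stand in for the git output.
list() { printf 'one\ntwo\nthree\n'; }

# hungry is a hypothetical stand-in for ssh: it swallows everything
# left on stdin, then prints a message about its first argument.
hungry() { cat >/dev/null; printf 'have to rebuild %s\n' "$1"; }

# Broken: hungry inherits the loop's stdin and eats lines two and three,
# so the second read sees end of file.
broken=$(list | while read -r item; do hungry "$item"; done)

# Fixed: hungry gets /dev/null as stdin, so the pipe is left for read.
fixed=$(list | while read -r item; do hungry "$item" </dev/null; done)

printf 'broken:\n%s\n' "$broken"
printf 'fixed:\n%s\n' "$fixed"
```

The broken variant reports only the first item, the fixed one reports all three.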

$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos | while read mode1 mode2 hash1 hash2 ignored path; do if [ "$mode2" == 160000 ]; then ssh -n builder@build-server echo "have to rebuild $path"; fi; done
have to rebuild repos/altai/altai-installer
have to rebuild repos/altai/nova-networks-ext
have to rebuild repos/altai/python-openstackclient-base
have to rebuild repos/openstack/glance

But explicit redirection is more generic: it works for any command, not only ssh.

It is worth saying that the problem is not specific to ssh. The glibc library buffers its FILE objects by default, including stdin and stdout (stderr is the exception: it is unbuffered). So if you write a program that reads only one line and exits, it is quite possible that glibc has read far more bytes. Usually you don’t care, because you close the file and the next program reopens it from the beginning. The headache comes when several processes share one open file, as happened with the pipe in my example.

Finally, let’s try one more way to deal with hungry glibc. We will open a file in bash, as we would in C or Python, and make read scan it.
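A rough way to watch read-ahead on a shared pipe is to let two different readers take "one line" and then see what is left for the next process. This is a sketch under assumptions: I use head -n 1 as an example of a block-reading tool (it may not go through glibc's FILE layer at all, but it reads ahead in the same way), and the result for head is what I would expect on a typical Linux system rather than a guarantee.

```shell
lines() { printf 'first\nsecond\nthird\n'; }

# The shell's read takes input from a pipe one byte at a time, so it
# stops right after the first newline and the rest stays for cat.
after_read=$(lines | { read -r first; cat; })

# head -n 1 reads a whole block. On a pipe it cannot hand the extra
# bytes back, so cat usually finds nothing left.
after_head=$(lines | { head -n 1 >/dev/null; cat; })

printf 'left after read: [%s]\n' "$after_read"
printf 'left after head: [%s]\n' "$after_head"
```

This is also why read itself is safe to use in the loop: it never takes more than its own line from the pipe.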

$ # save the git output to a temporary file
$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos > git.list
$ # open the file
$ exec 3<git.list
$ while read mode1 mode2 hash1 hash2 ignored path <&3; do if [ "$mode2" == 160000 ]; then ssh builder@build-server echo "have to rebuild $path"; fi; done
have to rebuild repos/altai/altai-installer
have to rebuild repos/altai/nova-networks-ext
have to rebuild repos/altai/python-openstackclient-base
have to rebuild repos/openstack/glance
$ # close the file
$ exec 3<&-
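The same idea can be written so that descriptor 3 exists only for the duration of the loop, which removes the need for a separate exec to open and close it. This is a variation on the example above, not a replacement for it. The names hungry and rebuild_all are hypothetical, hungry again plays the part of ssh, and the temporary file made with mktemp plays the part of git.list.

```shell
# hungry is a hypothetical stand-in for ssh: it drains stdin,
# then prints a message about its first argument.
hungry() { cat >/dev/null; printf 'have to rebuild %s\n' "$1"; }

# A temporary file plays the part of git.list.
list_file=$(mktemp)
printf 'one\ntwo\nthree\n' >"$list_file"

# rebuild_all is a hypothetical helper. read takes its lines from
# descriptor 3, which is opened on the file for this loop only.
# hungry still drains stdin, but the list no longer lives there.
rebuild_all() {
    while read -r item <&3; do
        hungry "$item"
    done 3<"$1"
}

# Unrelated data on stdin gives hungry something to swallow.
result=$(printf 'unrelated\n' | rebuild_all "$list_file")
rm -f "$list_file"
printf '%s\n' "$result"
```

All three items are reported even though hungry consumed everything on stdin during the first iteration.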

Well, bash is pretty cool, isn’t it?
