GSoC Week 10

Addressing comments on my patch

I spent this week addressing the suggestions made by Johannes which, were extremely detailed and long. I am done with most of the suggestions so I will highlight some important ones and what I learned from them.

This week, I saw the ocean for I guess the first time in my life and it was a totally unique experience!

Peaceful Seas

Moving to the comments made by Johannes.

Eliminate find_unique_abbrev()

The shell version of summary calls git rev-parse twice for two jobs which are:

  1. Checking for a missing_src and a missing_dst i.e., the missing source and destination committish.

     test $mod_src = 160000 &&
     ! GIT_DIR="$name/.git" git rev-parse -q --verify $sha1_src^0 >/dev/null &&
     missing_src=t
     
     test $mod_dst = 160000 &&
     ! GIT_DIR="$name/.git" git rev-parse -q --verify $sha1_dst^0 >/dev/null &&
     missing_dst=t
    

    Quoting my mentor Kaartic:

    The above code assigns missing_src and missing_dst based on the
    exit status of rev-parse only when they correspond to valid
    submodules
    (the test $mod_src = 160000 and test $mod_dst = 160000
    checks).

  2. Finding abbreviated committishes (7 characters long) for outputting them later on in the summary output.

     sha1_abbr_src=$(GIT_DIR="$name/.git" git rev-parse --short $sha1_src 2>/dev/null ||
     			echo $sha1_src | cut -c1-7)
     sha1_abbr_dst=$(GIT_DIR="$name/.git" git rev-parse --short $sha1_dst 2>/dev/null ||
     			echo $sha1_dst | cut -c1-7)
    

    Quoting my mentor Kaartic:

    This gets the abbreviated hash using rev-parse without caring about
    whether they are submodules. It falls back to abbreviating
    sha1_{src,dst} if rev-parse fails. Note however that we don’t
    verify if the src or dst are valid submodules are not.

Since we try to be very frugal with spawning processes and allocating memory in C, aiming to short-circuit as many if-statements and loops as possible in a fashion which does not hamper the working of the program, we here aim to combine these two spawns of a child process into one. This will make our program more efficient and lighter but the job isn’t an easy one because of a (twisted?) nature of summary's code i.e., the cmd_summary() function in shell.

Originally, I had implemented the latter rev-parse in the form of find_unique_abbrev(), which at first sight seems good since we eliminated an extra child no? Sadly, that isn’t the case as this function is called by rev-parse itself and to find a --short=n version of the hash we give it (n is passed as argument). Due to the availability of a function which does this job (more on that below), there was the need to eliminate this function. Also, we aim to look for the abbreviation in the submodule and not in the superproject, thus adding another reason not to use the function.

Therefore find_unique_abbrev() isn’t the best option hence I modified the verify_submodule_committish() function to do a rev-parse with short=7 hence fulfilling the purpose of outputting a shortened hash and return the new short hash if it is verified else return NULL. But as you notice in case (2), even if rev-parse failed to give a hash (i.e., we got a NULL), we did a echo $sha1_src | cut -c1-7 meaning sha1_abbr_src never really stays empty.

To combat this, one way was to do this in generate_submodule_summary():

if (S_ISGITLINK(p->mod_src))
	src_abbrev = verify_submodule_committish(p->sm_path, oid_to_hex(&p->oid_src));
if (!S_ISGITLINK(p->mod_src) || !src_abbrev)
    src_abbrev = xstrndup(oid_to_hex(&p->oid_src), 7);
if (S_ISGITLINK(p->mod_src) && !(src_abbrev = verify_submodule_committish(p->sm_path,oid_to_hex(&p->oid_src))))
	src_abbrev  =  NULL;

So even here the hash remained NULL at the end. But this created another roadblock, which is that there will be a case when the commit will be non-existent or invalid (for example I pass a commit which isn’t actually fetchable), in that we want to warn the user that this hash isn’t really fetchable, but in our case that warning won’t happen since we treat this case differently by making the hash NULL at the end hence not eligible for a correct warning. The warning is not what we are aiming for but rather is my method to explain this; the final hurdle to solve this part was to solve test 14 of t7401 which involved a non-existent commit hence the warning.

So to counter this, a new solution is to do this:

char *src_committish = xstrndup(oid_to_hex(&p->oid_src), 7);
char *dst_committish = xstrndup(oid_to_hex(&p->oid_dst), 7);

At the start of print_submodule_summary(). What this does is sieve out the faulty cases but also not keep their hashes NULL by giving them the hash they were. The src_abbrev and dst_abbrev variables act as gatekeepers which will help the function differentiate between the right and faulty cases since for the faulty case the hash will be null anyway irrespective of the value of src_committish or dst_committish.

TL;DR: We create control variables to be able to differentiate between the right and faulty scenario while retaining the hashes. Earlier we were not retaining those hashes and trying to keep these control variables as hashes and gatekeepers at the same time which clearly isn’t possible. Therefore, we will have to make a compromise at some point or the other.

NOTE: My mentor Kaartic also had another method (which inspired my method) in which the control variables were actually two integers missing_src and missing_dst, thus eliminating the additional char* definitions I created at the end. I personally find his solution better since we will need a control
variable at the end anyway thus contradicting Dscho’s suggestion.
So why not keep it an integer and make things even more simpler? We are waiting for Christian’s analysis on this.

Remove the occurences of is_sm_git_dir

Johannes suggested that the prepare_submodule_repo_env() in verify_submodule_committish() will set the GIT_DIR to the .git/ of the submodule and since our working directory will be the superproject anyway, the whole point of checking if we are in a GIT_DIR or even checking if it is a bare-repository or not is totally redundant. Therefore, it was better to remove the variable and hence its occurences everywhere thereby saving tons of extra checks and saving one whole indentation level in generate_submodule_summary(). So, I removed the variable and things did get simplified.

Simplify the range computation

The range is actually the range of commits for which we will compute a git log --oneline. To be more explicit, in the output of running a git submodule summary:

$ git submodule summary 7c1532cab3125ae24f14ee9433b3c673f2964ef1
* bash 4f90705...f10c25c (4):
  > Git: add config for hiding prefix shown in diff
  > tools: add cron script to automatically build,install git(-next)
  > tools: add script to check conflicts in sync(thing) folders
  > Git: use the experimental commit graph feature

The range is 4f90705...f10c25c. Earlier, we had a range variable holding %s...%s where %s is the source hash and destination hash. The whole system to compute range was overly explicit:

if (S_ISGITLINK(p->mod_src) && S_ISGITLINK(p->mod_dst))
	range = xstrfmt("%s...%s", oid_to_hex(&p->oid_src),
			oid_to_hex(&p->oid_dst));
else if (S_ISGITLINK(p->mod_src))
	range = xstrdup(oid_to_hex(&p->oid_src));
else
	range = xstrdup(oid_to_hex(&p->oid_dst));

Just at a glance one can see that the last two conditionals can be combined into one. The more shorter route is to compute the range and store it directly into the args of the child process:

if (IS_GITLINK(p->mod_src) && IS_GITLINK(p->mod_dst))
	argv_array_pushf(&cp_rev_list.args, "%s...%s",
	oid_to_hex(&p->oid_src),
	oid_to_hex(&p->oid_dst));
else
	argv_array_push(&cp_rev_list.args,
	IS_GITLINK(p->mod_src) ?
	oid_to_hex(&p->oid_src) :
	oid_to_hex(&p->oid_dst));

And hence execute this directly using a capture_command() or a run_command() whatever pleases you. This leads to my solution.

Some other suggestions

Some other suggestions from Johannes were the following:

  1. Improve the for-status block since it had some repetitions. This was achieved by checking for the existence of the submodule just at the start of the if-statements and combining three checks in the first if-statement as follows:

     if  (info->for_status  &&  p->status  !=  'A' &&
     (sub = submodule_from_path(the_repository,
     					&null_oid,  p->sm_path)))
    

    Thereby freeing us of one extra indentation level as well.

  2. Use is_submodule_active() at the end of prepare_submodule_summary() instead of spawning a child process. This is a very good suggestions but it is failing for the case when there is not .gitmodules such as that in the option for-status because this function accesses the .gitmodules to check for the submodule.name.active configuration. I tried to make a makeshift (it took me long to find this word last night, it has a very good equivalent in Hindi and that is what I was aware of :P) version in which I spawned a process in case info->for_status was true else did a is_submodule_active(). This version is better than the way it was originally but it obviously can be made better!

  3. Compute the error messages in generate_submodule_summary() and pass them to print_submodule_summary() as a char *. Earlier we were computing them in the latter but doing this in the former seems more sensible since print_submodule_summary() should just print what is given to it. I fixed this and it works.

There is other minor stuff too such as refining the commit messages, improving the space and indentations, updating Stefan Beller’s email in the commit messages, etc.

Next steps

The next steps are to solve the remaining 2-3 issues and submit this patch as soon as possible. Some stuff requires guidance from Christian and Kaartic as well. Let’s keep going!

Over and out,
Shourya Shukla

Comments

Popular posts from this blog

The Final Report

GSoC Week 15

GSoC Week 7