Julian asked for a formal description of how 'svn patch' applies hunks. Below are my current plans. The current code does something quite different and will be changed to implement what is described below before the 1.7 release. Any feedback appreciated. Stefan

This note is a high-level description of how 'svn patch' applies hunks from a patch file to a file in the working copy.

First, the unidiff input is parsed by libsvn_diff and passed to the patch application code as a list of patches, each containing a list of hunks to be applied to the patch target. The patch application code in libsvn_client is shielded from the specifics of the unidiff file format. All it really cares about is:

1) The 'patch target', which is the file to be patched.
2) The 'original' and 'modified' texts of each hunk in the patch, and, for each hunk, the line offset at which the original text should be replaced with the modified text.
The 'original' text is what the lines described by the hunk looked like before they were edited by the person who created the patch. The 'modified' text is what the lines described by the hunk looked like after editing.
Both texts include surrounding lines of context obtained from the unidiff. Each hunk has an 'original' line offset, given in the hunk header. This line offset describes where the lines described by the hunk were located in the original, i.e. unpatched, file.
This is the preferred location for applying the hunk.

The goals of the hunk-application algorithm are:

1) Memory usage is bounded by the longest line occurring in the patch file or the patch target. Never even attempt to read an entire file into RAM.
2) File seeks should be kept to a minimum.
3) Hunks matching at or near their original offset in the target file are always preferred over hunks matching at the same location but with a greater offset relative to their original offset.
4) Every hunk in the patch gets a chance to be applied, independent of matching results for hunks which were matched before it or will be matched after it. Hunks are rejected only if they don't match anywhere in the file, or if they collide with another hunk which is a better match according to 3). This implies that:
4a) Hunks are not assumed to be listed in any particular order.
4b) The offsets listed in hunk headers are only used as a guide, not as an authoritative reference.
4c) It is possible for a hunk to match at multiple candidate locations, but it will only be applied at one of them.
5) Hunks are only applied if they do not overlap with any other hunk.
6) By default, a patch is either applied fully or not at all. Rejection of hunks is only allowed if the user has given permission. If the user has not given such permission, attempting to apply a patch with rejected hunks will raise an error and leave the working copy unmodified.

A patch may want to change more than one target file.
Each patch target is scanned line-by-line, and possible matches for each of the patch's hunks are determined by matching the 'original' hunk text to the patch target at the current line offset. This results in a list of candidate offsets, one such list for each hunk.

For each hunk, its list of candidate offsets is then sorted such that offsets closest to the hunk's original offset (as parsed from the hunk header in the patch file) come first. If two candidate offsets have the same distance from the original offset, the order they are listed in is undefined (depending on the sorting algorithm used, the result might be deterministic, but the hunk application algorithm does not rely on this).

Next, the list of hunks is sorted, such that hunks whose first candidate offset is closest to the hunk's original offset come first.

Next, the algorithm builds a list of line ranges occupied by hunks which were successfully applied. In each item of this list, the occupying hunk's line offset as well as the number of lines in the hunk's original text is stored, specifying the range of lines occupied by the hunk.
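The scan-and-sort steps described above could be sketched roughly as follows. This is a simplified illustration, not the actual libsvn_client code: it holds the target's lines in memory (which the real algorithm deliberately avoids), and the function names and hunk representation are made up.

```python
def find_candidates(target_lines, hunk):
    """Return every line offset where the hunk's original text matches."""
    orig = hunk["original"]  # context lines plus to-be-replaced lines
    return [i for i in range(len(target_lines) - len(orig) + 1)
            if target_lines[i:i + len(orig)] == orig]

def scan_and_sort(target_lines, hunks):
    # First sort: per hunk, candidates closest to the original offset first.
    for hunk in hunks:
        hunk["candidates"] = sorted(
            find_candidates(target_lines, hunk),
            key=lambda off: abs(off - hunk["offset"]))
    # Second sort: hunks whose best candidate is closest to their
    # original offset come first; hunks with no candidate sort last.
    hunks.sort(key=lambda h: abs(h["candidates"][0] - h["offset"])
                             if h["candidates"] else float("inf"))
    return hunks
```

Hunks without any candidate sort last; they will end up rejected in the placement step.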
Note that the minimum size of a line range depends on the amount of context provided in the unidiff. For standard unidiff files with 3 lines of leading and trailing context, the minimum length of the original text is 6 lines (e.g. consider a hunk adding a single line to the file).

Let C represent the current index into the lists of candidate offsets. Set C = 1 (indexing the first candidate offset). Iterate over the list of hunks.
If the hunk is already marked applied or rejected, ignore it. If the hunk's Cth candidate offset is not contained in any range described by the list of occupied lines, store its current candidate offset and the length of its original text in the list of occupied line ranges and mark the hunk as applied. Else, if the hunk is not marked applied, and its current candidate offset is within the range of occupied lines, defer handling of this hunk to the next iteration. If the hunk has no Cth candidate offset, mark the hunk as rejected. Increment C and iterate.
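The placement loop just described might look like this in outline. This is a hypothetical Python sketch; each hunk is assumed to carry a pre-sorted 'candidates' list, the 'length' of its original text in lines, and a 'state' field starting as None.

```python
def place_hunks(hunks):
    """Assign each hunk a target offset or mark it rejected."""
    occupied = []  # (start, length) ranges taken by applied hunks

    def overlaps(start, length):
        return any(start < s + l and s < start + length
                   for s, l in occupied)

    c = 0  # current index into the candidate lists ("C" in the text)
    while any(h["state"] is None for h in hunks):
        for hunk in hunks:
            if hunk["state"] is not None:
                continue  # already applied or rejected: ignore it
            if c >= len(hunk["candidates"]):
                hunk["state"] = "rejected"  # no Cth candidate offset
            elif not overlaps(hunk["candidates"][c], hunk["length"]):
                hunk["where"] = hunk["candidates"][c]
                occupied.append((hunk["where"], hunk["length"]))
                hunk["state"] = "applied"
            # else: range occupied by a better hunk, defer to next iteration
        c += 1
    return hunks
```

Because hunks were sorted best-first, any range found occupied here was claimed by a hunk that was a better match, which is exactly the preference rule in goal 3.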
After this step, all hunks are marked either applied or rejected. If any hunks were rejected and rejection is not allowed, raise an error.
Repeat the above process for the next patch target. Once all patch targets are processed as above, read each target file again from the top, copying lines not contained in the list of lines occupied by hunks verbatim. Replace any lines occupied by a hunk with lines of the hunk's modified text. (Note that the number of lines in the modified text of a hunk may differ from the number of lines in the original text of a hunk, so the patched file can have a different number of lines than the original file. All offsets are relative to the original file, however.) Write any rejected hunks for this target to a reject file. The patch is now applied.
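The final copy pass over one target could be sketched like this (simplified Python; it assumes each applied hunk has been reduced to a tuple of its assigned offset, original-text length, and modified lines, with all offsets relative to the original file as described above):

```python
def write_patched(target_lines, applied_hunks):
    """Copy a target, substituting each occupied range with modified text.

    applied_hunks: iterable of (offset, original_length, modified_lines).
    """
    out = []
    pos = 0
    for off, orig_len, modified in sorted(applied_hunks, key=lambda h: h[0]):
        out.extend(target_lines[pos:off])  # copy unoccupied lines verbatim
        out.extend(modified)               # replace the occupied range
        pos = off + orig_len               # skip the hunk's original lines
    out.extend(target_lines[pos:])         # copy the remainder verbatim
    return out
```

Note how the output can end up longer or shorter than the input, while all bookkeeping stays in terms of original-file offsets.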
Daniel Shahaf: (Lots of text; read it, sounds sensible, but I haven't tried to think "how else could it be done?")

> 1) Memory usage is bound by the longest line occurring in the patch file

'bounded'

> Let C represent the current index into the lists of candidate offsets. Set C = 1 (indexing the first candidate offset). Iterate over the list of hunks. [...] Increment C and iterate.

I.e., iterate all 1st candidates, then all 2nd candidates, etc. Why not iterate by hunks first? (I.e., first all candidates of 1st hunk, then all candidates of 2nd hunk, etc.)

If the first target applies cleanly and the second contains a rejected hunk, do we prompt the user before modifying the first target? (If not, we might partially apply a patch, only to some of the files it touches.)

> Write any rejected hunks for this target to a reject file.

In what format?

(Btw, it would be nice to use this code with non-svn patches, especially since it has interactive conflict resolution now.) Daniel

Stefan Sperling: sbutler agrees (and he's American, so he probably doesn't know English very well - but at least we'll get things wrong consistently when we follow his advice :)

> Why not iterate by hunks first? (I.e., first all candidates of 1st hunk, then all candidates of 2nd hunk, etc.)

We already know what the best candidate offset for each hunk is, so why iterate over the candidates without considering how placing the current hunk at the current candidate offset affects other hunks? The point of this step is to prevent hunks from occupying overlapping line ranges. Any 2 hunks read from the patch cannot have occupied overlapping space in the original file, otherwise they would be a single, larger, hunk.

> If the first target applies cleanly and the second contains a rejected hunk, do we prompt the user before modifying the first target?

No, we default to not changing anything at all if any hunk does not apply cleanly, for any target. Either all targets are modified, or none at all - unless rejects are allowed, in which case we produce reject files for any rejected hunks, for any target.

> In what format?

Rejected hunks are written out back to back, as they appear in the patch, in plain unidiff format.
Daniel Shahaf: Sorry, not sure I follow. (It's a little late where I am.) But, thinking for a moment: what are we arguing about? Isn't the typical case only 1-2 candidate offsets per hunk? (In which case, your scheme and my suggestion reduce to the same thing.)

> Either all targets are modified, or none at all - unless rejects are allowed, in which case we produce reject files for any rejected hunks, for any target.

+1 :-)

> Rejected hunks are written out back to back, as they appear in the patch, in plain unidiff format.

OK.

Stefan Sperling: That's probably the typical case, yes, but let's assume a non-typical case (a huge number of possible matches for each hunk) and see how the algorithm copes. I'm arguing that iterating over hunks first is missing the point, and that it does not amount to the same thing as I was proposing.
Recall that we do two sorting steps: The first sort sorts the candidate offset list for each hunk in turn. It makes sure that, for a given hunk, we prefer the 'best' candidate offset, which is close to the original offset (Julian pointed out that there may be better criteria than 'how much does the candidate offset differ from the original offset?'
, and the sorting could happen with such criteria in mind.) So we know that whichever offset comes first in the candidate list, it is the 'best' one for this hunk. But we don't know yet if it can be used without disturbing another 'even better' hunk which could also occupy all or part of the same range of lines. So the second sort sorts the hunk list, putting hunks which have the 'best' candidate offsets first (again, 'how much does the offset differ from the original line offset?' is one way of doing it; there may be more).

So you were asking why, in the following step, we loop over candidate offsets, considering each hunk in turn as we do so, rather than looping over hunks, considering each candidate offset as we do so. The answer is that we want to end up using the best possible offset for every hunk. Looking at hunks in isolation does not achieve this goal.
When we pick a candidate offset for a hunk, we want to make sure the resulting line range occupied by the hunk does not overlap with a range we assigned to a different hunk earlier. If the range is already occupied, we know that it is occupied by a 'better' hunk (because the hunk list is already sorted accordingly) and that ignoring the current candidate offset for the current hunk will give the best overall result. What you propose does not answer any interesting question.
Iterating over hunks first and looking at each candidate offset for the current hunk in turn does not give us any new information, because we already sorted the candidate offset list in the best way we can for the current hunk. Is it more clear now?

Daniel Shahaf: Yes, thanks. In this case, how about losing the 'GROUP BY hunk' from the sort? I.e.:

  BIGLIST = []
  for each hunk:
      for each candidate in hunk:
          compute a score for candidate   # based on line offset, endifs, whatever
          BIGLIST.append(candidate)
  sort BIGLIST by score
  for each candidate in BIGLIST:
      if candidate-hunk.state not in 'applied', 'rejected':
          # try to apply candidate-hunk at candidate-offset

I'm pretty sure it differs from your suggestion on at least one pathological input sample.

Stefan Sperling: What does 'try to apply' mean?

Daniel Shahaf: Maybe it just means "set hunk.state to 'applied'"; I forgot what other checks (apart from the applied/rejected check) had to be done before we decided "Yes, I decided to apply the hunk at candidate".
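Daniel's flat-sort variant could be made concrete roughly as follows. This is a hypothetical sketch: the thread leaves the scoring function and the 'try to apply' step open, so distance from the hunk's original offset stands in for the score here, and applying a candidate simply records the offset and marks the hunk, subject to the same overlap check as before.

```python
def apply_flat(hunks):
    """One flat candidate list across all hunks, sorted by score."""
    # Score every (hunk, candidate offset) pair; lower is better.
    biglist = [(abs(off - h["offset"]), off, h)
               for h in hunks for off in h["candidates"]]
    biglist.sort(key=lambda entry: entry[0])
    occupied = []
    for score, off, hunk in biglist:
        if hunk["state"] is not None:
            continue  # this hunk was already placed at a better candidate
        if not any(off < s + l and s < off + hunk["length"]
                   for s, l in occupied):
            occupied.append((off, hunk["length"]))
            hunk["state"] = "applied"
            hunk["where"] = off
    for hunk in hunks:  # hunks whose every candidate collided
        if hunk["state"] is None:
            hunk["state"] = "rejected"
    return hunks
```

The difference from the two-sort scheme is that a globally good candidate of a "worse" hunk can be tried before a mediocre candidate of a "better" hunk, which is why the two approaches can diverge on pathological inputs.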
Julian Foad: Hi Stefan. It's an interesting subject to think about: how best to apply a patch. First, a few specific thoughts on your doc, but don't get too involved in them until you read my concluding remarks :-)

Call the patch for each file a 'file-patch', as the plain word 'patch' is ambiguous.

> This line offset describes where the lines described by the hunk were located in the original, i.e. unpatched, file. This is the preferred location for applying the hunk.

Rather than preferring to apply it in the target file at the line number it had in the original file, I think you should aim to do better. One way would be by scaling that number by (length of target) / (length of original). Otherwise, if the target has been modified by inserting or by removing many lines evenly spread throughout, then the 'closeness' measurement would favour matches near the beginning of the file, where the line numbers have not changed much, over the end of the file, where line numbers may have changed by thousands. A better evolvement of that would be interpolating the line offset of the next still-to-be-placed hunk between the nearest preceding and following hunks that have already been matched with a high certainty.

> 3) Hunks matching at or near their original offset in the target file are always preferred over hunks matching at the same location but with a greater offset relative to their original offset.

Would I be right in guessing that 'matching at the same location but with a greater offset relative to their original offset' means 'matching further away'?
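Julian's scaling idea amounts to adjusting a hunk's expected offset before measuring closeness. A rough sketch (the function name and rounding choice are illustrative, not from any actual code):

```python
def expected_offset(original_offset, original_len, target_len):
    """Scale a hunk's original line offset by (length of target) /
    (length of original), so uniform growth or shrinkage of the
    target does not penalize matches near the end of the file."""
    if original_len == 0:
        return original_offset  # degenerate case: nothing to scale by
    return round(original_offset * target_len / original_len)
```

For example, a hunk recorded at line 9000 of a 10000-line original would be expected near line 10800 of a 12000-line target, rather than near line 9000.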
And see above for the comment about adjusting the original offset in proportion with known extremes of the target file (start and end, or nearest certainly-matching hunks).

Important: I think the criteria for quality of match should include how much text in the context matches, and how much text in the original side of the hunk body matches. A hunk with a large total amount of matching text should be preferred over one with a small amount, such as empty context lines or lines containing just 'endif'. I think the 'amount of matching' should be of considerably more importance than the line-number offset from the expected location.
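Julian's point that the amount of matching text should outweigh line-number distance could be folded into the candidate score, e.g. as follows (the weight is an arbitrary illustration, not a tuned value):

```python
def candidate_score(matched_chars, offset_distance, weight=1000):
    """Lower is better: a large amount of matching text dominates
    a small line-number distance from the expected offset."""
    return offset_distance - weight * matched_chars
```

With such a score, a candidate matching 50 characters of real code three lines away beats a candidate matching only a couple of blank context lines at the exact expected offset.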