Linux shell代写/C代写/C++代写/算法代写:这是一个C代写,C++代写,整合linus shell基本命令使用的任务
Task 2
Once IntListInsertInorder() is working correctly, modify useIntList.c so that it behaves as a sorting program: it reads a list of numbers from its standard input, and writes a sorted version of the numbers to its standard output. This is a simple change to the existing program, and can be accomplished by commenting a couple of lines.
Recompile and check that the new version of usel behaves correctly. Once you’re satisfied, run some tests to compare the time taken by usel with the time taken by the Unix sort command for the same task.
The first thing to do is to check that both commands are producing the same result (otherwise, it’s not useful to compare them). Try the following commands:
$ ./randl 1000 > nums
$ ./usel < nums > out1
$ sort -n < nums > out2
$ diff out1 out2
If the diff command gives no output, then the files have no difference (i.e. the observed output is he same as the expected output). If you try the above a number of times with different sets of numbers each time and get the same result, then you can be more confident that both commands are producing the same result. (However, you cannot be absolutely certain that they will always produce the same result for any input. Why not?)
Note that you need to use the -n option to sort because the default sorting order is lexcial, not numeric. If you don’t use -n, then e.g. 111 is treated as being less than 12. (See man sort for details.)
The next thing to do is to devise a set of test cases and collect some timing data. There is a command called time that will produce the timing data for you. Use it as follows:
$ ./randl 100000 > nums
$ time ./usel < nums > out1
$ time sort -n < nums > out2
As before, check that both commands are producing the same result. This time, however, you will also see output that contains some timing information (note that the output format may vary depending on which shell you’re using):
user time time spent executing the code of the command
system time time spent executing system operations, such as input/output
elapsed/real time wall-clock time between when the command starts and finishes
The elapsed/real time is affected by the load on the machine, and so is not reliable. The system time should be similar for both commands (same amount of input/output). The value that’s most useful is the user time, which represents the time that the program spent doing computation. Note that the time will vary if you run the same command multiple times. This is because the timing uses sampling and may sample the program at different points during the different executions. Collect results from a number of executions on the same output and take an average.
You will need to use large numbers of values (large argument to randl) to observe any appreciable difference. Of course, if you use very large numbers of values, the nums file may become too large to store in your directory. In this case, you could store nums under the /tmp directory:
$ ./randl 10000 > /tmp/nums
$ time ./usel < /tmp/nums > /tmp/out1
$ time sort -n < /tmp/nums > /tmp/out2
$ diff /tmp/out1 /tmp/out2
It would also be worth taking your /tmp/nums file and modifying it to check the timings for sorted and reverse sorted inputs:
$ sort -n /tmp/nums > alreadySortedWithDuplicates
$ sort -nr /tmp/nums > reverseSortedWithDuplicates
As well as lists produced by randl, it would be worth trying lists produced using the seq/sort combinations mentioned above. These will be different to the lists produced by randl in that they won’t have any duplicate values. For example, you could generate data files like:
$ seq 10000 > alreadySortedNoDuplicates
$ seq 10000 | sort -nr > reverseSortedNoDuplicates
$ seq 10000 | sort -R > randomOrderNoDuplicates
and use these in place of /tmp/nums to generate more timing data.
Once the data files get big enough, and once you’ve checked that both programs are producing the same output for a wide selection of cases, you can avoid generating and comparing large output files by running the timing commands as follows:
$ time ./usel < /tmp/nums > /dev/null
$ time sort -n < /tmp/nums > /dev/null
The above commands do all of the computation that we want to measure, but send their very large output to the Unix/Linux “data sink” (/dev/null), so it never takes up space on the file system.
If you do write results to files, don’t forget to remove them after you’ve finished the lab.
You should take note of the timing output and build a collection of test results for different sizes/types of input and determine roughly the point at which it becomes impractical to use usel as a sort program (rather than using the Unix sort command).
You’ll find that the timing data for relatively small lists (e.g. up to 10000) doesn’t show much difference between usel and sort, and shows time pretty close to zero. Try using list sizes such as 5000, 10000, 20000, 50000, 100000, and as high as you dare to go after that. Also, try varying the kinds of input that you supply. Consider the cases for the order of the input: random, already sorted, reverse sorted. Also, consider the proportion of duplicate values in the input, e.g. all distinct values, some duplicates. Note that ./randl will most likely include duplicates as soon as you specify a relatively large list size.
Put your results and some brief comments explaining the results, in a file called timing.txt. The file should contain a table of results, with major rows dealing with a particular size of data, and sub-rows dealing with the order of the input. You should have three columns: one to indicate how many runs of each program were used to compute the average time-cost, one to give the average time-cost for usel and the other to give the average time-cost for sort. The table should look something like the following:
Input
Size Initial
Order Has
Duplicates Number
of runs Avg Time
for usel Avg Time
for sort
5000 random no N T1sec T2sec
5000 sorted no N T1sec T2sec
5000 reverse no N T1sec T2sec
5000 random yes N T1sec T2sec
5000 sorted yes N T1sec T2sec
5000 reverse yes N T1sec T2sec
10000 random no N T1sec T2sec
10000 sorted no N T1sec T2sec
10000 reverse no N T1sec T2sec
10000 random yes N T1sec T2sec
10000 sorted yes N T1sec T2sec
10000 reverse yes N T1sec T2sec
etc. etc.
Note that, given the variation in timing output, it’s not worth considering more than two significant figures in your averages.
You should also write a short paragraph to explain any patterns that you notice in the timing results. Don’t just re-state the timing result in words; try to explain why they happened.
If you’re looking for a challenge, and know how to write scripts, write e.g. a Linux shell script to automate the testing for you, and maybe even produce the timing file for you.
Another interesting challenge would be to plot graphs using the R system, available as the Linux command R (a single upper-case letter). I’d suggest plotting a separate graph for each type (random, ascedning, descending) for a range of values (e.g. 5000, 10000, 20000, 50000, … as far as you can be bothered, given how long larger data files take to sort). Alternatively, produce your timing.txt as a tab-separated file and load it into a spreadsheet, from where you could also produce graphs.
If you’re feeling brave, try to find out how large input sort can deal with. However, you should try such testing on a machine where you are the sole user. Also, you shouldn’t try to store the data files; use commands like:
$ seq SomeVeryLargeNumber | time sort -n > /dev/null
$ seq SomeVeryLargeNumber | sort -nr | time sort -n > /dev/null
$ seq SomeVeryLargeNumber | sort -R | time sort -n > /dev/null