Some of the CLI tools I mentioned in my previous blog post, such as jq and awk, used to take a while to pick up due to their special query syntax and peculiar logic. I was curious how LLMs (Large Language Models) such as ChatGPT (GPT-3.5 and GPT-4) and Gemini (1.0 Pro and 1.0 Ultra) would measure up for tasks like generating and explaining scripts using these tools, so I put them to a little test.

NOTE: Take the observations and comparison below with a grain of salt: this test focuses on a specific use case, covers only a few tasks using a small set of data processing CLI tools, and uses the particular LLM models that were available to me at the time of writing.

General attributes of the tested LLM models

I used the web clients for the tests and comparison, but there is also a separate pricing model for the API that works on a pay-as-you-go basis, and reportedly has its own pros, cons, and limitations. Also, the specified context window and rate limits may change over time, even for the same model/version! The table below was last updated on 2024-03-02.

| | ChatGPT (GPT-3.5) | ChatGPT Plus (GPT-4) | Gemini (1.0 Pro) | Gemini Advanced (1.0 Ultra) |
| --- | --- | --- | --- | --- |
| Context window | 8K tokens | 32K tokens | 32K tokens | 32K tokens |
| Knowledge cutoff | 2022-01 | 2023-04 | 2024-03 | 2024-03 |
| Message rate limit | depends on load | 20-60 / 3 hours | 60 / minute | 60 / minute |
| Monthly cost | free | $20 + tax (no trial) | free | $20 + tax (2 month trial) |

  • Color codes in the original table: green = best, orange = medium, red = worst

Notes:

  • Large language models are trained on massive amounts of data to predict responses to the instructions given in input prompts. These models break down the input text into more manageable units called tokens, such as words or smaller fragments. The context window is essentially the model’s memory, limiting the number of tokens it can remember from our inputs and use in predicting what to output next. A larger context with more tokens allows language models to capture richer contextual information, leading to better predictions and more accurate responses.
  • A recent knowledge cutoff is important for LLMs to ensure they have access to more up-to-date information.
  • ChatGPT Plus has no trial, and its relatively low message rate limit can be a surprise and a bit of a turn-off, as this limitation isn't advertised much. To be fair, I only ran into this limit the first couple of times I tried it, when I wanted to see how it compared to GPT-3.5, but YMMV.

Test results for jq and awk code generation and explanation

| | ChatGPT (GPT-3.5) | ChatGPT Plus (GPT-4) | Gemini (1.0 Pro) | Gemini Advanced (1.0 Ultra) |
| --- | --- | --- | --- | --- |
| Generated code quality | ★★★☆☆ | ★★★★☆ | ★★☆☆☆ | ★★★☆☆ |
| Syntax error rate | ★★★☆☆ | ★★★★★ | ★★★☆☆ | ★★★☆☆ |
| Actual output accuracy | ★★★☆☆ | ★★★★☆ | ★★☆☆☆ | ★★★☆☆ |
| Predicted output accuracy | ★★★☆☆ | ★★★☆☆ | ★★☆☆☆ | ★★★★☆ |
| Explanation accuracy | ★★★★☆ | ★★★★☆ | ★★★☆☆ | ★★★★☆ |
| Response time | ★★★★☆ | ★★☆☆☆ | ★★☆☆☆ | ★★☆☆☆ |

  • Ranking: the more ★ the better the result

My key takeaways so far:

  • LLMs can help break down the initial barrier for CLI tools with a steeper learning curve, speed up the coding process, and give us fresh ideas. With increased complexity and niche topics, the results can be hit or miss. They will sometimes generate exactly the code we were looking for on the first try, but other times, with the same confidence, they will generate buggy code that fails with an error even after claiming that the bug was fixed. So I wouldn't blindly trust the code they return without fully understanding and testing it.
  • Since LLMs don't actually run the generated code, the predicted output may differ from the actual output. And since the prediction isn't always accurate, I have to double-check it anyway, so outside of this test I didn't find it very useful to include the predicted output in the result. However, a sample input is quite useful, and it's much easier for an LLM to generate than for a human.
  • It seems that the tested LLMs are better at explaining code than at generating it.
  • Splitting complex tasks into smaller ones, being more specific about what we want, and iterating a few times often helps if the generated code isn't good enough.
  • If we don't need all the details (explanations, examples, etc.) besides the generated code, we can greatly reduce the response time by asking the model to return only the code.

Test examples

The concrete test examples can be found below. I picked them based on real-world tasks for 2 CLI tools that each have their own complex DSL. It's a long read due to the verbose output of the 4 tested LLM models, plus my observations below each, multiplied by the 3 examples per tool, so buckle up.


jq

avg

Prompt:

Use jq to calculate the average of an array of numbers, skipping nulls. Include an example input and output too.

Certainly! Here’s how you can calculate the average of an array of numbers while skipping null values using jq:

Input:

[1, 2, null, 4, null, 6]

You can use the following jq command to calculate the average:

echo '[1, 2, null, 4, null, 6]' | jq -r '[.[] | select(. != null)] | add / length'

Output:

3

Explanation:

  • [.[] | select(. != null)]: This part of the command filters out null values from the input array.
  • add: This function computes the sum of the remaining numbers.
  • / length: This divides the sum by the count of the remaining numbers, giving the average. So, the output is 3, which is the average of the non-null numbers in the input array.

Observations:

  • The jq query works as expected.
  • Fairly good explanation.
  • Slightly inaccurate: the generated command actually returns 3.25 instead of the predicted 3, as the run below shows.
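
For reference, running the generated command as-is confirms the actual result:

echo '[1, 2, null, 4, null, 6]' | jq -r '[.[] | select(. != null)] | add / length'
3.25

That is (1 + 2 + 4 + 6) / 4, the average of the four non-null values.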

To calculate the average of an array of numbers while skipping null values using jq, you can follow this approach:

  1. Filter out null values using select(. != null).
  2. Sum the remaining numbers and divide by the count of those numbers to get the average.

Here’s a jq script for this task:

jq '[.[] | select(. != null)] | add / length'

Example Input and Output:

Input (input.json):

[1, 2, null, 4, 5]

Command:

jq '[.[] | select(. != null)] | add / length' input.json

Output:

3

Explanation:

  • [.[] | select(. != null)]: This creates a new array with all null values filtered out.
  • add / length: Sums the values of the filtered array and then divides by the length of this array to compute the average.

This jq command processes the input array to compute the average of its numbers, excluding any null values.


Observations:

  • The same jq query as in the GPT-3.5 solution.
  • A little better explanation for [.[] | select(. != null)].
  • Minor: both the GPT-3.5 and GPT-4 solutions could have used slightly simpler code: del(.. | nulls) or map(select(. != null)) instead of [.[] | select(. != null)] (see the sketch below).
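
To illustrate, both simpler variants return the same result on the GPT-4 sample input:

echo '[1, 2, null, 4, 5]' | jq 'del(.. | nulls) | add / length'
echo '[1, 2, null, 4, 5]' | jq 'map(select(. != null)) | add / length'

Both print 3, the average of 1, 2, 4 and 5.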

Here’s the jq command to calculate the average of an array of numbers, skipping nulls:

reduce .[] as $x (0; . + if $x != null then $x else 0 end) / length | if . > 0 then . else 0 end

Explanation:

  1. reduce .[] as $x: Reduces the array of numbers to a single value.
    • $x: refers to the current element in the loop.
  2. (0; . + if $x != null then $x else 0 end): This part accumulates the sum of non-null elements.
    • 0;: Initializes the accumulator to 0.
    • . + if $x != null then $x else 0 end: Adds the current element ($x) to the accumulator only if it’s not null. Otherwise, it adds 0.
  3. / length: Divides the accumulated sum by the length of the array to get the average.
  4. if . > 0 then . else 0 end: Handles cases where the entire array consists of nulls. It checks if the average is greater than 0 and returns the average if true, otherwise returns 0.

Example Input:

[10, null, 5, null, 20]

Example Output:

12.5

This command uses a reducer to accumulate the sum of non-null values and then calculates the average considering the actual number of elements (excluding nulls). If all elements are null, it returns 0 to avoid division by zero.


Observations:

  • The generated code handles nulls when calculating the sum, but fails to subtract the number of nulls from the input length before dividing. Also, I wouldn't use reduce for such a simple task as a sum (jq has a built-in add); a corrected sketch follows below.
  • Wrong output: the generated command returns 7 instead of the predicted 12.5, but it should actually return ~11.667.
  • It explains the different parts of the code well.
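
For comparison, a corrected sketch that drops reduce entirely and divides only by the count of non-null values:

echo '[10, null, 5, null, 20]' | jq '[.[] | select(. != null)] | add / length'
11.666666666666666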

Absolutely! Here’s how you can use jq to accomplish this:

Understanding jq

jq is a command-line tool designed to process and filter JSON data. It’s incredibly useful for extracting and manipulating the information you need from JSON structures.

Steps

  1. Filter out nulls:

    jq '.numbers[] | select(. != null)'
    

    This filters the numbers array, keeping only non-null numbers.

  2. Calculate the average:

    jq '(.numbers[] | select(. != null)) | add / length' 
    

    This part sums the filtered numbers and divides by their count.

Example

Input JSON (data.json):

{
  "numbers": [1, 2, null, 3, 4, null, 5]
}

Command:

jq '(.numbers[] | select(. != null)) | add / length' data.json 

Output:

3

Observations:

  • The generated code fails with an error: jq: error (at input.json:2): Cannot iterate over number (1). The fix is easy though: either wrap the (.numbers[] | select(. != null)) part inside [] to make it an array that add can use, or use map like so: .numbers | map(select(. != null)). Both fixes are shown below.
  • The output is correct.
  • The explanation is not very detailed.
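
Either fix yields a working command (a quick sketch against the same data.json):

jq '[.numbers[] | select(. != null)] | add / length' data.json
jq '.numbers | map(select(. != null)) | add / length' data.json

Both print 3, the average of 1, 2, 3, 4 and 5.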

median

Prompt:

Use jq to calculate the median of an array of numbers, skipping nulls. Include an example input and output too.

To calculate the median of an array of numbers while skipping null values using jq, you’ll first need to sort the non-null values and then find the middle value or the average of the two middle values depending on whether the number of elements is odd or even. Here’s how you can achieve that:

Input:

[3, 6, null, 2, 9, null, 1]

You can use the following jq command to calculate the median:

echo '[3, 6, null, 2, 9, null, 1]' | jq -n '[inputs | select(. != null) | tonumber] | sort | if length % 2 == 0 then (.[length/2-1] + .[length/2]) / 2 else .[length/2] end'

Output:

3

Explanation:

  • [inputs | select(. != null) | tonumber]: This filters out null values, converts the remaining values to numbers, and reads the input as an array.
  • sort: This sorts the array of non-null numbers.
  • if length % 2 == 0 then (.[length/2-1] + .[length/2]) / 2 else .[length/2] end: This conditional expression checks if the length of the array is even or odd. If it’s even, it calculates the average of the two middle values; otherwise, it selects the middle value.

So, the output is 3, which is the median of the non-null numbers in the input array.


Observations:

  • The generated code fails with an error: jq: error (at <stdin>:1): array ([3,6,null,2...) cannot be parsed as a number. But it's a small mistake: we just need to append [] after inputs to convert the array to a stream (see the fixed command below).
  • Correct explanation.
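
The fixed command with the appended [], everything else unchanged, which then returns the expected median:

echo '[3, 6, null, 2, 9, null, 1]' | jq -n '[inputs[] | select(. != null) | tonumber] | sort | if length % 2 == 0 then (.[length/2-1] + .[length/2]) / 2 else .[length/2] end'
3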

To calculate the median of an array of numbers while skipping null values with jq, you need to filter out the null values, sort the numbers, and then find the middle value or the average of the two middle values if the array has an even number of elements. Here’s how you can do it:

jq Script:

jq 'map(select(. != null)) | sort | if length == 0 then empty elif length % 2 == 1 then .[length / 2 | floor] else (.[length / 2 - 1] + .[length / 2]) / 2 end'

Example Input (input.json):

[3, 1, null, 2, 4]

Command:

jq 'map(select(. != null)) | sort | if length == 0 then empty elif length % 2 == 1 then .[length / 2 | floor] else (.[length / 2 - 1] + .[length / 2]) / 2 end' input.json

Output:

2

Explanation:

  • map(select(. != null)): Filters out null values from the array.
  • sort: Sorts the remaining numbers in ascending order.
  • if length == 0 then empty: Handles the case where the array is empty after removing null values.
  • elif length % 2 == 1 then .[length / 2 | floor]: If the array has an odd number of elements, it returns the middle element.
  • else (.[length / 2 - 1] + .[length / 2]) / 2 end: If the array has an even number of elements, it calculates the average of the two middle elements.

This jq command calculates the median of the provided numbers, excluding any null values.


Observations:

  • The generated code works: running it returns the correct value (2.5), as shown below. It uses a slightly more elegant null filtering and handles the empty array edge case (after removing the nulls).
  • Slightly inaccurate output: GPT-4 predicted 2 instead of 2.5.
  • Detailed explanation.
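
Running the command confirms the discrepancy between the predicted and the actual output:

jq 'map(select(. != null)) | sort | if length == 0 then empty elif length % 2 == 1 then .[length / 2 | floor] else (.[length / 2 - 1] + .[length / 2]) / 2 end' input.json
2.5

The sorted non-null values are [1, 2, 3, 4], so the median is (2 + 3) / 2 = 2.5.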

Here’s the jq command to calculate the median of an array of numbers, skipping nulls:

reduce .[] as $x ([]; . + if $x != null then $x else [] end) | sort | 
  if length % 2 == 1 then .[(length // 2)] else (.[(length // 2) - 1] + .[length // 2]) / 2 end

Explanation:

  1. reduce .[] as $x ([]; . + if $x != null then $x else [] end): Reduces the array to an array containing only non-null numbers.
    • [];: Initializes an empty array as the accumulator.
    • . + if $x != null then $x else [] end: Appends the current element ($x) to the accumulator only if it’s not null. Otherwise, it appends an empty array.
  2. sort: Sorts the filtered array in ascending order.
  3. if length % 2 == 1 then .[(length // 2)] else (.[(length // 2) - 1] + .[length // 2]) / 2 end: This conditional expression calculates the median based on the number of elements:
    • length % 2 == 1: Checks if the length is odd.
      • .[(length // 2)]: If odd, the median is the middle element (using integer division // to get the index).
    • else: If the length is even:
      • .[(length // 2) - 1] + .[length // 2]: Accesses the two middle elements.
      • / 2: Calculates the average of the two middle elements.

Example Input:

[10, null, 5, null, 20, 15]

Example Output:

15

This command filters out nulls, sorts the remaining numbers, and then calculates the median depending on the number of elements (odd or even) while handling edge cases like an empty array or all null values.


Observations:

  • The generated code fails with an error: jq: error (at input.json:0): array ([]) and number (10) cannot be added. It's easy to fix the reduce step by wrapping $x in [] to make it an array: reduce .[] as $x ([]; . + if $x != null then [$x] else [] end), but again, I wouldn't use reduce for this. It also uses the alternative operator // instead of the division operator /.
  • Wrong output: it should return 12.5, but it predicts 15 and returns 10 (with only the reduce fix applied). A fully fixed sketch follows below.
  • As I mentioned above, // is not the division operator, though Gemini states that it is. Otherwise it's an adequate explanation.
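
With the reduce fix applied, // replaced by /, and a floor added for the odd-length index, a corrected sketch returns the right median:

echo '[10, null, 5, null, 20, 15]' | jq 'reduce .[] as $x ([]; . + if $x != null then [$x] else [] end) | sort | if length % 2 == 1 then .[length / 2 | floor] else (.[length / 2 - 1] + .[length / 2]) / 2 end'
12.5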

Absolutely! Here’s how to calculate the median using jq while filtering out null values:

Explanation

  1. Filter out nulls: As before, we start by removing nulls:

    jq '.numbers[] | select(. != null)'
    
  2. Sort the array: The median requires a sorted array:

    jq '(.numbers[] | select(. != null)) | sort'
    
  3. Calculate the median:

    jq '
      (.numbers[] | select(. != null)) | sort | 
      (length / 2 | floor) as $mid |
      if length % 2 == 0 then 
        (.[$mid - 1] + .[$mid]) / 2 
      else 
        .[floor(length / 2)] 
      end'
    

    This part handles both cases:

    • Even number of elements: Calculate the average of the two middle values.
    • Odd number of elements: Take the middle value directly.

Example

Input JSON (data.json):

{
  "numbers": [1, 2, null, 3, 4, null, 5]
}

Command:

jq '
  (.numbers[] | select(. != null)) | sort | 
  (length / 2 | floor) as $mid |
  if length % 2 == 0 then 
    (.[$mid - 1] + .[$mid]) / 2 
  else 
    .[floor(length / 2)] 
  end' data.json

Output:

3

Observations:

  • The generated code fails with an error: jq: error: floor/1 is not defined at <top-level>, line 7:. It's interesting that Gemini used the correct floor/0 once but then also hallucinated floor/1, which doesn't exist. Fixing this bug is easy: just use floor/0 instead of floor/1, like it did a few lines earlier with length / 2 | floor. Then we hit the same bug as with the avg example, so the same fix applies to the .numbers[] | select(. != null) part. The fully fixed command is shown below.
  • The predicted output is correct.
  • The explanation could go into a bit more detail.
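
The fully fixed command (a sketch applying both fixes, against the same data.json):

jq '
  [.numbers[] | select(. != null)] | sort |
  (length / 2 | floor) as $mid |
  if length % 2 == 0 then
    (.[$mid - 1] + .[$mid]) / 2
  else
    .[$mid]
  end' data.json
3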

objs2csv

Prompt:

Use jq to convert an array of objects to a CSV without sorting the keys. Use the object keys as CSV headers. Include an example input and output too.

Here’s how you can convert an array of objects to CSV format without sorting the keys, using the object keys as CSV headers:

Let’s say you have an array of objects like this:

Input:

[
  {"name": "John", "age": 30, "city": "New York"},
  {"name": "Alice", "age": 25, "city": "Los Angeles"},
  {"name": "Bob", "age": 35, "city": "Chicago"}
]

You can use jq to achieve this:

echo '[{"name": "John", "age": 30, "city": "New York"}, {"name": "Alice", "age": 25, "city": "Los Angeles"}, {"name": "Bob", "age": 35, "city": "Chicago"}]' | jq -r '([.[0] | keys_unsorted] | join(",")), (.[] | [.[]]) | @csv'

Explanation:

  • [.[0] | keys_unsorted] | join(","): Extracts the keys of the first object in the array without sorting them and joins them with a comma to create the CSV header.
  • (.[] | [.[]]): Converts each object into an array of its values.
  • @csv: Converts the array of values into CSV format.

Output:

"name","age","city"
"John",30,"New York"
"Alice",25,"Los Angeles"
"Bob",35,"Chicago"

This produces a CSV with the object keys as headers and preserves the original order of keys.


Observations:

  • The generated code fails with an error: jq: error (at <stdin>:1): string ("") and array (["name","ag...) cannot be added. ChatGPT missed a [] after keys_unsorted, and it shouldn't have tried to join the header row separately and then also pass it to @csv, but rather keep it as an array and convert it with the same @csv filter used for the other rows: jq -r '(.[0] | keys_unsorted), (.[] | [.[]]) | @csv' (a sample run follows below).
  • It assumes that every object in the array has the same keys as the first one, from which it takes the header names.
  • Sufficient explanation.
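
The fixed command and its output:

echo '[{"name": "John", "age": 30, "city": "New York"}, {"name": "Alice", "age": 25, "city": "Los Angeles"}, {"name": "Bob", "age": 35, "city": "Chicago"}]' | jq -r '(.[0] | keys_unsorted), (.[] | [.[]]) | @csv'
"name","age","city"
"John",30,"New York"
"Alice",25,"Los Angeles"
"Bob",35,"Chicago"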

To convert an array of objects to CSV format using jq without sorting the keys and using the object keys as CSV headers, you will need to:

  1. Extract the keys from the first object in the array to use as headers.
  2. Map each object in the array to an array of its values, preserving the order of keys as found in the first object.
  3. Prepend the headers to the array of values.
  4. Convert the arrays to CSV lines.

Here’s an example jq script to achieve this:

jq -r '
  (map(keys_unsorted) | add | unique) as $headers |
  $headers, map([.[]])[] |
  @csv
' input.json

Example Input and Output:

Given an input input.json file like this:

[
  {"name": "John Doe", "age": 30, "city": "New York"},
  {"name": "Jane Smith", "age": 25, "city": "Los Angeles"},
  {"city": "Chicago", "age": 28, "name": "Mike Brown"}
]

This jq command would produce the following CSV output:

"name","age","city"
"John Doe",30,"New York"
"Jane Smith",25,"Los Angeles"
"Mike Brown",28,"Chicago"

Explanation:

  • map(keys_unsorted) | add | unique as $headers: This extracts all keys from all objects without sorting, combines them into a single array, removes duplicates, and assigns the result to the $headers variable. This assumes that all objects have the same keys. If some objects might be missing keys present in others, this approach still works because keys_unsorted from each object is combined with add and then made unique.
  • $headers, map([.[]])[] | @csv: Prepends the headers to the list of rows. Each object is then mapped to an array of its values in the order of $headers. The @csv filter converts arrays to CSV lines.
  • -r: Outputs raw strings, not JSON-encoded strings, to produce proper CSV format.

Note: This solution assumes that all objects have the same keys in the same order. If the objects have different keys or the key order varies significantly, the CSV output may not align as expected.


Observations:

  • The generated code doesn't fail, but it has even more limitations than the one generated by GPT-3.5 (to be fair, that one needed 2 fixes to run), and it sorts the headers (a side effect of the unique function) even though we explicitly asked it not to.
  • It also assumes that every object in the array has the same keys as the first one. Though we can see some clumsy attempts by GPT-4 to make the headers more dynamic by collecting the unique headers from all objects, not just the first one, it simply dumps the other objects' values in their original order, mixing up headers and rows unless every object has the very same keys in the very same order.
  • “Each object is then mapped to an array of its values in the order of $headers.” -> At least it has the right idea and explains well what the code should do, but that's not how the generated code actually works.
  • Despite my nitpicking, the code is a good starting point, and it's not that hard to fix: we can use (reduce .[] as $item ({}; . + $item) | keys_unsorted) instead of map(keys_unsorted) | add | unique to keep the original order of the keys while still removing any dupes, and filter the inner value iterator .[] with $headers[] in the map([.[]])[] expression to get the values in the same order as the headers. The final code is this: jq -r '(reduce .[] as $item ({}; . + $item) | keys_unsorted) as $headers | $headers, map([.[$headers[]]])[] | @csv' input.json. A sample run follows below.
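
The fixed command against the example input; note that Mike Brown's values now land under the right headers despite the different key order in the third object:

jq -r '(reduce .[] as $item ({}; . + $item) | keys_unsorted) as $headers | $headers, map([.[$headers[]]])[] | @csv' input.json
"name","age","city"
"John Doe",30,"New York"
"Jane Smith",25,"Los Angeles"
"Mike Brown",28,"Chicago"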

Here’s the improved jq command to convert an array of objects to a CSV without sorting the keys, using the object keys as headers:

.[] | to_entries | flatten | group_by(.key) | map({key, values: .[]}) | map({key, value: .values | join(",")}) | @csv

Explanation:

  1. .[] | to_entries | flatten: Iterates through each object, converts it to key-value pairs, and then flattens the resulting array into a single list of key-value pairs.
  2. group_by(.key): Groups the key-value pairs by their keys. This ensures that all values associated with a specific key are grouped together.
  3. map({key, values: .[]}): Creates a new object for each group. The “key” property remains the same, and a new “values” property is added that contains an array of all values associated with that key.
  4. map({key, value: .values | join(",")}): Iterates through each group and creates a new object. The “key” remains the same, and a new “value” property is added. This property is created by joining all the values associated with the key into a single string using commas (",") as separators.
  5. @csv: Converts the final array of objects into a CSV string with commas (",") as separators.

Example Input:

[
  { "name": "Alice", "age": 30, "city": "New York" },
  { "age": 25, "name": "Bob", "country": "Canada" }
]

Example Output:

name,age,city,country
Alice,30,New York,
Bob,25,,,Canada

Improvements:

  • This version uses flatten to ensure a single list of key-value pairs before grouping.
  • It explicitly joins the values with a comma (",") for clarity.
  • It addresses potential edge cases where some objects might not have all keys, resulting in empty values in the CSV. In these cases, empty strings ("") are added to maintain consistent structure.

This refined command provides a more robust and efficient solution for converting an array of objects to a CSV, preserving the original key order and handling potential variations in object structures.


Observations:

  • The generated code fails with an error: jq: error (at input.json:3): Cannot index array with string "key". And it's far from a working solution: it tries to use flatten on a stream instead of an array, it sorts the keys though we explicitly asked it not to, it really struggles with the mapping, and it doesn't even try to map the objects into the array of arrays that the @csv filter requires. A working alternative is shown below.
  • The generated output is almost correct; it only misses the double quotes that @csv produces.
  • It provides a detailed explanation of the different pieces, which is mostly but not fully accurate or complete, so it can be a bit misleading.
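
For a working alternative, the fixed GPT-4 command from the previous section handles this input too, including the keys that only some of the objects have (a sketch):

jq -r '(reduce .[] as $item ({}; . + $item) | keys_unsorted) as $headers | $headers, map([.[$headers[]]])[] | @csv' input.json
"name","age","city","country"
"Alice",30,"New York",
"Bob",25,,"Canada"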

Absolutely! Here’s how to convert an array of objects into CSV format using jq while preserving the original key order:

Steps

  1. Get the headers:

    jq -r '.[0] | keys_unsorted | @csv'
    
  2. Get the values (one row at a time):

    jq -r '.[] | [.[]] | @csv'
    

Example

Input JSON (data.json):

[
  {"color": "red", "size": "large", "price": 15},
  {"size": "small", "price": 8, "color": "blue"}
]

Command:

jq -r '.[0] | keys_unsorted | @csv' data.json > headers.csv
jq -r '.[] | [.[]] | @csv' data.json >> data.csv

Output (headers.csv):

color,size,price

Output (data.csv):

red,large,15
blue,small,8

Explanation

  • keys_unsorted: Ensures the original order of keys is used for the header row.
  • .[0]: Gets the first object in the array to generate the headers.
  • .[].[]: Transforms each object, preserving the existing key order.
  • @csv: Formats the output as CSV.

Important Note: For consistent results, it’s crucial that all objects in your input array share the same set of keys, even if some values are null.


Observations:

  • The generated code doesn't fail, but it produces 2 CSV files instead of one, it unnecessarily uses two separate jq calls and writes, and on top of that it makes the same mistakes as the GPT-3.5 solution. A single-call sketch follows below.
  • Not sure where it got .[].[] from in the explanation. Otherwise it’s fine.
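
The two calls can be merged into one that also fixes the key-order issue by indexing each object with the header keys (a sketch):

jq -r '(.[0] | keys_unsorted) as $headers | $headers, (.[] | [.[$headers[]]]) | @csv' data.json
"color","size","price"
"red","large",15
"blue","small",8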

awk

union

Prompt:

Use awk to get the union of 2 or more files without sorting the lines. Include an example input and output too.

If you want to get the union of two or more files without sorting the lines, you can keep track of the lines seen in each file and print them as encountered. Here’s how you can achieve this:

Let’s say you have two example files named file1.txt and file2.txt with the following contents:

file1.txt:

apple
banana
orange

file2.txt:

banana
grape
kiwi

You can use the following awk command to get the union of these two files without sorting the lines:

awk '!seen[$0]++' file1.txt file2.txt

Output:

apple
banana
orange
grape
kiwi

Explanation:

  • !seen[$0]++: This statement keeps track of each line encountered ($0 represents the current line). It evaluates to true only if the current line has not been seen before, thus removing duplicates.
  • file1.txt file2.txt: This specifies the input files to awk.

This command will output the union of the contents of file1.txt and file2.txt, preserving the original order of lines without sorting.


Observations:

  • The generated code is spot on; I was a bit surprised that ChatGPT knows this small awk trick.
  • Clear explanation.

To achieve the union of two or more files using awk without sorting the lines, you can use awk in a way that it reads through each file and prints lines that haven’t been printed before. This method ensures that lines are output in the order they are encountered across the files, maintaining the original order within each file.

Here’s how you can do it:

awk Command:

awk '!seen[$0]++' file1.txt file2.txt ...

This command uses an associative array seen where each line of text encountered is used as a key. The value associated with each key is incremented every time the line is encountered, but the line is only printed the first time (!seen[$0]++ evaluates to true).

Example:

Input:

file1.txt:

apple
banana
grape

file2.txt:

banana
orange
apple

file3.txt (optional, to demonstrate with more than 2 files):

kiwi
grape

Command:

awk '!seen[$0]++' file1.txt file2.txt file3.txt

Output:

apple
banana
grape
orange
kiwi

Explanation:

  • Lines are printed in the order they are first encountered across the input files.
  • The banana and apple lines from file2.txt are not printed again since they were already printed from file1.txt.
  • The grape line from file3.txt is not printed again since it was already printed from file1.txt.
  • The kiwi line is new and thus printed.

This method is efficient for finding the union of lines across multiple files, preserving the original order of lines without sorting.


Observations:

  • Same solution.
  • Comprehensive explanation.

Here’s the awk command to get the union of 2 or more files without sorting the lines:

FNR == NR { 
  seen[$0]++ 
  next 
} 
seen[$0] 
' file1 file2 ...

Explanation:

  1. FNR == NR { ...; next }: This block checks if the current record number (FNR) is equal to the record number within the current file (NR). This condition is only true when processing the first file.
  2. seen[$0]++: If the condition is true, it increments the count of the current line ($0) in the associative array seen. This effectively stores all unique lines from the first file.
  3. next: After storing the line from the first file, the next statement skips to the next line in the current file.
  4. seen[$0]: This checks if the current line ($0) from any subsequent file exists as a key in the seen array.
  5. print: If the line exists in the seen array (meaning it was encountered in the first file), it gets printed.

Example Input:

file1:

line1
line2
line3

file2:

line2
line4
line5

Output:

line1
line2
line3
line4
line5

Note:

  • This approach uses an associative array (seen) to keep track of unique lines encountered in the first file.
  • Lines encountered in subsequent files are only printed if they already exist in the seen array, effectively achieving the union operation without sorting.
  • This method ensures the order of lines is preserved as they appear in the original files.

Observations:

  • Gemini messed up the markdown code block here a bit, so the awk ' part is missing from the beginning of the rendered output.
  • The generated code returns file2, which is clearly wrong. It overcomplicates things.
  • It mixes up FNR and NR, but otherwise explains the different parts of the code well.

Here’s how you can use awk to achieve this:

Understanding the Approach

  1. Store seen lines: We’ll use an awk associative array to keep track of lines we’ve already encountered.
  2. Print unique lines: Iterate through each file, printing only lines that haven’t been seen before.

Example

File 1 (file1.txt):

line A
line B
line C

File 2 (file2.txt)

line C
line D
line A

Command:

awk '!seen[$0]++' file1.txt file2.txt

Output:

line A
line B
line C
line D

Explanation

  • !seen[$0]++: This condition is true the first time a line ($0) is encountered. The ++ increments the count in the seen array, making the condition false for subsequent occurrences of the same line.

Key Points

  • Order Preservation: The output order follows the appearance of lines in the input files.
  • Multiple Files: You can easily add more files to the awk command (e.g., awk '!seen[$0]++' file1.txt file2.txt file3.txt)

Observations:

  • Same awk trick that ChatGPT used.
  • Mostly satisfying explanation.

max

Prompt:

Use awk to get the max number from the input lines. Include an example input and output too.

You can use awk to find the maximum number from the input lines by keeping track of the maximum value encountered. Here’s how you can achieve this:

Let’s say you have an example file named numbers.txt with the following contents:

numbers.txt:

10
5
8
13
2

You can use the following awk command to get the maximum number from the input lines:

awk 'max < $1 { max = $1 } END { print max }' numbers.txt

Output:

13

Explanation:

  • max < $1 { max = $1 }: This statement checks if the current number ($1) is greater than the current maximum (max). If so, it updates the maximum value to the current number.
  • END { print max }: This block executes after processing all lines in the input file. It prints the maximum value found.

This command will output the maximum number (13) from the input lines in the file numbers.txt.


Observations:

  • The generated code works for positive numbers, but it doesn't return anything if there are only negative numbers, because the uninitialized max acts as 0 in the comparison. Though I could have been more specific and asked it to handle negative numbers too; a minimal fix is sketched below.
  • Good explanation.
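
A minimal fix for the negative-only case is to seed max from the first line, which is essentially what GPT-4 and Gemini Ultra produced for this task:

awk 'NR == 1 { max = $1 } $1 > max { max = $1 } END { print max }' numbers.txt
13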

To find the maximum number from a list of numbers using awk, you can process each line of input, compare it with the maximum value found so far, and update the maximum value if the current number is greater. Here’s a simple awk command to do this:

awk Command:

awk 'NR == 1 || $1 > max { max = $1 } END { print max }' file.txt

This command does the following:

  • NR == 1 || $1 > max { max = $1 }: For the first record (NR == 1) or if the current line’s first field ($1) is greater than the current max, update max to this value. $1 is used assuming each line contains a single number; if numbers are not in the first field, adjust $1 accordingly.
  • END { print max }: After processing all lines, print the maximum value found.

Example:

Input (file.txt):

5
3
10
6
2

Command:

awk 'NR == 1 || $1 > max { max = $1 } END { print max }' file.txt

Output:

10

Explanation:

  • The awk script starts with assuming the first number (5) as the maximum.
  • It then reads each subsequent number, updating the maximum value if a larger number is found.
  • Upon reaching the end of the file, it prints the maximum number found, which is 10 in this case.

This method efficiently computes the maximum value in a list of numbers contained in a file, using awk.


Observations:

  • The generated code works for both positive and negative numbers.
  • Slightly more verbose explanation.

Here’s the awk command to get the max number from the input lines:

BEGIN { max = "" }  # Initialize max to an empty string
{ if ($1 > max) max = $1 }
END { print max }
'

Explanation:

  1. BEGIN { max = "" }: This block runs before processing any lines. It initializes a variable named max with an empty string. This is important because awk treats empty strings as 0 in numeric comparisons.
  2. { if ($1 > max) max = $1 }: This block gets executed for each line in the input.
    • $1: Accesses the first field (column) of the current line.
    • if ($1 > max): Checks if the current number ($1) is greater than the current max.
    • max = $1: If the condition is true, it updates the max variable with the current number.
  3. END { print max }: This block runs after processing all lines. It prints the final value of max, which holds the largest number encountered.

Example Input:

10
5
20
12

Output:

20

This command uses a single variable max to track the maximum number encountered throughout the input. It compares each line’s first field with the current max and updates it if a larger number is found. Finally, it prints the final max value after processing all lines.


Observations:

  • Again, Gemini messed up the markdown code block a bit.
  • The generated code works as expected. It handles both positive and negative numbers.
  • Good explanation.

Absolutely! Here’s how to find the maximum number using awk:

Example

Input (numbers.txt)

5
12
3
8 

Command

awk 'NR==1 {max=$1} $1>max {max=$1} END {print max}' numbers.txt

Output

12

Explanation

  1. Initialization: NR==1 {max=$1} When processing the first line (NR==1), the first field ($1) is assigned as the initial max.
  2. Comparison and Update: $1>max {max=$1} For each subsequent line, if the first field ($1) is greater than the current max, update max.
  3. Final Output: END {print max} After processing all lines, print the final value of max.

Key Points:

  • This assumes numbers are in the first field of each line.
  • If you have multiple numbers per line, you could iterate through the fields in a loop.

Observations:

  • Similar to the GPT-4 generated code, but the first two conditions could have been merged with ||, as in the GPT-4 code.
  • Same good explanation.

stddev

Prompt:

Use awk to calculate the sample standard deviation of the input lines. Include an example input and output too.

Calculating the sample standard deviation using awk involves a multistep process. You first need to compute the mean, then compute the sum of squared differences from the mean, divide it by the count of elements minus one, and finally take the square root of the result. Here’s how you can achieve this:

Let’s say you have an example file named data.txt with the following contents:

data.txt:

10
5
8
13
2

You can use the following awk command to calculate the sample standard deviation:

awk '
{
    sum += $1
    sumSq += ($1)^2
    count++
}
END {
    mean = sum / count
    variance = (sumSq - sum^2 / count) / (count - 1)
    stddev = sqrt(variance)
    print stddev
}
' data.txt

Output:

4.4833

Explanation:

  • sum += $1: This statement accumulates the sum of all numbers.
  • sumSq += ($1)^2: This statement accumulates the sum of squares of all numbers.
  • count++: This statement increments the count of numbers.
  • END { ... }: This block executes after processing all lines in the input file.
    • mean = sum / count: This calculates the mean of the numbers.
    • variance = (sumSq - sum^2 / count) / (count - 1): This calculates the variance using the formula ((sum of squares) - (sum)^2 / N) / (N - 1).
    • stddev = sqrt(variance): This calculates the standard deviation by taking the square root of the variance.
    • print stddev: This prints the standard deviation.

This command will output the sample standard deviation (4.4833) of the input lines in the file data.txt.


Observations:

  • The code works: running it returns the expected value (4.27785).
  • Wrong predicted output: 4.4833 instead of 4.27785. I have no idea where it got that number from; the hand calculation below confirms the expected value.
  • Solid explanation.
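
For the record, the expected value is easy to verify by hand: sum = 38, sumSq = 362 and count = 5, so variance = (362 - 38^2 / 5) / 4 = (362 - 288.8) / 4 = 18.3, and stddev = sqrt(18.3) ≈ 4.27785.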

To calculate the sample standard deviation of a set of numbers using awk, you need to perform a two-pass analysis on the data. The first pass calculates the mean of the data, and the second pass uses the mean to calculate the variance. The standard deviation is then the square root of the variance.

Here’s how you can do it with awk:

awk Script:

awk '{
    sum += $1; 
    count++
} 
END {
    mean = sum / count;
    for (i in arr) {
        sumsq += ((arr[i] - mean) ^ 2);
    }
    variance = sumsq / (count - 1);
    stddev = sqrt(variance);
    print stddev;
}' file.txt

This script does not work as-is because standard awk does not support storing the entire dataset for a second pass within the END block without explicitly doing so. We need to adjust our approach to store values during the first pass:

awk Command:

awk '{arr[NR] = $1; sum += $1} END {mean = sum / NR; for (i=1; i<=NR; i++) {sumsq += (arr[i] - mean)^2;} variance = sumsq / (NR-1); print sqrt(variance)}' file.txt

Example:

Input (file.txt):

4
7
13
16

Command:

awk '{arr[NR] = $1; sum += $1} END {mean = sum / NR; for (i=1; i<=NR; i++) {sumsq += (arr[i] - mean)^2;} variance = sumsq / (NR-1); print sqrt(variance)}' file.txt

Output:

5.1235

Explanation:

  • In the first pass (within the main block {}), this script accumulates the sum of all numbers and stores each number in the array arr.
  • In the END block, it calculates the mean of the numbers, then iterates over arr to calculate the sum of squares (sumsq) of the differences between each number and the mean.
  • The variance is calculated by dividing sumsq by (NR-1), where NR is the number of records (subtracting 1 makes it a sample variance calculation).
  • Finally, it prints the square root of the variance, which is the sample standard deviation.

This approach allows for the calculation of the sample standard deviation of a list of numbers contained in a file using awk.


Observations:

  • The code uses a slightly different implementation, but it also returns the expected value (5.47723).
  • Wrong predicted output: 5.1235 instead of 5.47723. No idea where ChatGPT got that number from; see the quick check below.
  • Not sure why it included the same code block twice.
  • Good explanation.
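
A quick check of the expected value: the mean of 4, 7, 13 and 16 is 10, the squared deviations sum to 36 + 9 + 9 + 36 = 90, so the sample variance is 90 / 3 = 30 and the standard deviation is sqrt(30) ≈ 5.47723.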

Here’s the awk command to calculate the sample standard deviation of the input lines:

BEGIN { 
  count = 0 
  sum = 0 
  sumsq = 0 
} 
{ 
  count++ 
  sum += $1 
  sumsq += $1 * $1 
} 
END { 
  if (count == 0) { 
    print "Error: No data provided" 
  } else { 
    mean = sum / count 
    variance = (sumsq - count * mean * mean) / (count - 1) 
    stddev = sqrt(variance) 
    printf("%.2f\n", stddev) 
  } 
}

Explanation:

  1. Initialization:
    • BEGIN { ... }: This block runs before processing any lines.
    • count = 0: Initializes a variable count to keep track of the number of lines.
    • sum = 0: Initializes a variable sum to accumulate the sum of all values.
    • sumsq = 0: Initializes a variable sumsq to accumulate the sum of squares of all values.
  2. Processing each line:
    • { ... }: This block gets executed for each line in the input.
    • count++: Increments the count variable for each line.
    • sum += $1: Adds the current line’s value ($1) to the sum.
    • sumsq += $1 * $1: Squares the current value and adds it to sumsq.
  3. Calculation and output:
    • END { ... }: This block runs after processing all lines.
    • if (count == 0) { ... }: Checks if there were any lines (no data).
      • print "Error: No data provided": Prints an error message if no data was provided.
    • else { ... }: If there was data:
      • mean = sum / count: Calculates the sample mean.
      • variance = (sumsq - count * mean * mean) / (count - 1): Calculates the sample variance.
      • stddev = sqrt(variance): Calculates the sample standard deviation.
      • printf("%.2f\n", stddev): Prints the standard deviation with two decimal places.

Example Input:

10
5
20
12

Output:

5.29

This awk script calculates the sample standard deviation considering it’s an estimate of the population standard deviation and uses count - 1 in the variance calculation. It also includes error handling for cases with no data input.


Observations:

  • Again, the previously mentioned markdown code block issue.
  • The generated code works and returns the expected output, and it also rounds the result. It's not necessary to initialize the variables to 0 in awk, and it could have used the power operator (^2) instead of multiplying the variables by themselves; see the trimmed sketch below.
  • The predicted output 5.29 is wrong; the code actually returns the correct value, 6.24.
  • Detailed explanation.
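
For illustration, the same logic without the redundant initialization, using the power operator (a sketch that drops the empty-input check for brevity and assumes the example input is in numbers.txt):

awk '{ count++; sum += $1; sumsq += $1 ^ 2 } END { mean = sum / count; printf("%.2f\n", sqrt((sumsq - count * mean ^ 2) / (count - 1))) }' numbers.txt
6.24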

Sure, here’s how to calculate the sample standard deviation using awk:

Example

Input (numbers.txt)

5
12
3
8

Command

awk '
{
    sum += $1
    sq_sum += $1 ^ 2
    n++
}
END {
    mean = sum / n
    var = (sq_sum - n * mean ^ 2) / (n - 1)
    sd = sqrt(var)
    printf "Sample standard deviation: %.2f\n", sd
}
' numbers.txt

Output

Sample standard deviation: 3.92

Explanation:

  1. Accumulator variables: The awk script initializes three variables:
    • sum: Stores the sum of all numbers.
    • sq_sum: Stores the sum of squared deviations from the mean.
    • n: Counts the number of lines (number of elements).
  2. Processing lines: The awk script iterates over each line in the input file:
    • For each line, it adds the current number ($1) to both sum and sq_sum.
    • It increments the counter n.
  3. Calculate mean and variance: After processing all lines, the script calculates:
    • mean: The average of all numbers (sum / n).
    • var: The sample variance ((sq_sum - n * mean ^ 2) / (n - 1)).
  4. Calculate standard deviation: The script calculates the sample standard deviation (sd) as the square root of the variance (sqrt(var)).
  5. Print output: Finally, the script prints the sample standard deviation formatted to two decimal places.

Key Points:

  • This awk script assumes the numbers are in the first field of each line.
  • The sample standard deviation is calculated using the formula for a finite population (n - 1 in the denominator).

Observations:

  • The generated code and output are correct. Gemini Ultra didn’t make the same (small) mistakes as Gemini Pro.
  • Adequate explanation.