ptx

Create permuted indexes of text files

TLDR

Generate a permuted index where the first field of each line is an index reference

$ ptx [[-r|--references]] [path/to/file]

Generate a permuted index with automatically generated index references

$ ptx [[-A|--auto-reference]] [path/to/file]

Generate a permuted index with a fixed width

$ ptx [[-w|--width]] [width_in_columns] [path/to/file]

Generate a permuted index restricted to the keywords listed in a file

$ ptx [[-o|--only-file]] [path/to/filter] [path/to/file]

Generate a permuted index with SYSV-style behaviors

$ ptx [[-G|--traditional]] [path/to/file]
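
As a concrete sketch of the word-list example above, the commands below index a one-line file twice: once restricted to a listed word with -o, and once with that same word excluded via -i. The file names input.txt and words.txt and the sample text are made up for illustration, and GNU ptx is assumed to be on the PATH.

```shell
# Scratch directory so the example leaves no files behind.
tmp=$(mktemp -d)
printf 'alpha beta gamma\n' > "$tmp/input.txt"

# Word lists hold one word per line (hypothetical file name).
printf 'beta\n' > "$tmp/words.txt"

# Index only the listed word.
only=$(ptx -o "$tmp/words.txt" "$tmp/input.txt")
printf '%s\n' "$only"

# Index every word except the listed one.
rest=$(ptx -i "$tmp/words.txt" "$tmp/input.txt")
printf '%s\n' "$rest"

rm -r "$tmp"
```

The first command should print a single entry keyed on "beta"; the second should print entries for the remaining words.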

-b, --break-file=FILE
    Read word-break characters from FILE. Every character listed in FILE is treated as a word separator; any character not listed is a word constituent. If -W is also given, -W takes precedence and -b is ignored.

-f, --ignore-case
    Fold lower case to upper case for sorting. This ensures that words like 'Apple' and 'apple' are treated as the same for indexing and sorting purposes.

-g, --gap-size=NUMBER
    Set the minimum gap between output fields to NUMBER columns.

-i, --ignore-file=FILE
    Read a list of words to ignore from FILE, one word per line. Words in this list are never used as keywords in the index.

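-o, --only-file=FILE
    Read a list of words from FILE, one word per line, and index only those words. If an ignore file (-i) is also given, a word is indexed only when it appears in FILE and not in the ignore file.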

-r, --references
    Treat the first field of each input line (the leading run of non-blank characters) as a reference that identifies that line in the output.

-S, --sentence-regexp=REGEXP
    Use REGEXP to match the end of a line or the end of a sentence. The context shown around a keyword does not cross these boundaries. Note that the short form is upper-case -S.

-t, --typeset-mode
    Not implemented in GNU ptx. For roff output, use -O (--format=roff) instead.

-w, --width=NUMBER
    Set the output width to NUMBER columns. The default width is 72 columns.

-A, --auto-reference
    Output automatically generated references. Each reference consists of the file name and the line number joined by a colon, similar to `grep -n` output. The file name is empty when reading standard input.

-G, --traditional
    Behave more like System V ptx by disabling the GNU extensions. This changes, among other things, the default definition of a word and the placement of references.

-F, --flag-truncation=STRING
    Use STRING to flag line truncations in the output. The default is '/'.

-M, --macro-name=STRING
    Use STRING as the macro name in roff or TeX output instead of the default 'xx'.

-O, --format=roff
    Generate output as roff directives. Each index entry becomes a call to the macro .xx, which the user is expected to define.

-R, --right-side-refs
    Put references at the right of each output line instead of the left. References placed this way are not counted against the width set by -w.

-T, --format=tex
    Generate output as TeX directives. Each index entry becomes a call to the macro \xx.

-W, --word-regexp=REGEXP
    Use REGEXP to match each keyword. By default a word matches \w+; in traditional mode (-G) a word is any run of characters other than space, tab, and newline.

--help
    Display a help message and exit.

--version
    Output version information and exit.

DESCRIPTION

The ptx command generates a permuted index, also known as a Keyword-in-Context (KWIC) index, of the words found in a given file or standard input. It processes each line, identifies words, and then reorders them so that each word appears in a designated central position, flanked by its surrounding context (left and right).

This utility is useful for building document indexes, where every occurrence of a term can be looked up together with a snippet of the text around it. Each output entry shows the left context, the keyword, and the right context. When references are requested with -r or -A, they are printed at the left of each line by default. Options control the output width, the gap between fields, what counts as a word, which words are indexed, and case folding for sorting.
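
A minimal sketch of the behavior described above, assuming GNU ptx is installed. The file name input.txt and the sample sentence are invented for illustration. Each word of the input becomes one index entry, and the entries are sorted by keyword.

```shell
# Scratch directory so the example leaves no files behind.
tmp=$(mktemp -d)
printf 'alpha beta gamma\n' > "$tmp/input.txt"

# Build the permuted index: one output line per keyword occurrence.
index=$(ptx "$tmp/input.txt")
printf '%s\n' "$index"

# Count the entries; this input contains three words.
entries=$(printf '%s\n' "$index" | grep -c .)
echo "entries: $entries"

rm -r "$tmp"
```

Every line of the result repeats the same sentence, shifted so that a different word sits in the keyword column.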

CAVEATS

The default rules ptx uses to split text into words and contexts are not always obvious. If the default output does not meet your needs, review the `--word-regexp`, `--sentence-regexp`, and `--break-file` options. ptx reads its whole input into memory before building the index, so very large inputs need a corresponding amount of memory; typical indexing tasks run quickly.

INPUT HANDLING

When no FILE arguments are provided, or when FILE is `-`, ptx reads from standard input. This allows piping output from other commands directly into ptx for indexing, such as `cat mydocument.txt | ptx`.
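
The two ways of reading standard input can be sketched as follows, assuming GNU ptx is installed. The sample text is invented, and both invocations should produce the same index.

```shell
# Pipe text straight into ptx; no input file is named.
from_pipe=$(printf 'alpha beta gamma\n' | ptx)

# A lone "-" as the file operand also means standard input.
from_dash=$(printf 'alpha beta gamma\n' | ptx -)

printf '%s\n' "$from_pipe"
```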

OUTPUT STRUCTURE

By default, each line of ptx output shows the left context, the keyword (the indexed word), and the right context, aligned in columns and separated by a gap of white space. When references are enabled with -r or -A, the reference is printed at the left of the line. The layout can be adjusted with options such as --width and --gap-size, or switched to roff directives with --format=roff.
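
To see how references and alternative formats change the output, the sketch below runs the same made-up input through -A and -O. It assumes GNU ptx is installed; input.txt is a hypothetical file name. The commands run inside a subshell so that the automatic reference shows the short file name.

```shell
# Scratch directory so the example leaves no files behind.
tmp=$(mktemp -d)
printf 'alpha beta gamma\n' > "$tmp/input.txt"

# -A adds an automatic reference built from the file name and line number.
with_refs=$(cd "$tmp" && ptx -A input.txt)
printf '%s\n' "$with_refs"

# -O emits one roff macro call (.xx) per entry instead of aligned columns.
as_roff=$(cd "$tmp" && ptx -O input.txt)
printf '%s\n' "$as_roff"

rm -r "$tmp"
```

With -A, each entry carries the file name followed by a colon and a line number. With -O, each entry is a line beginning with .xx, ready to be processed once that macro has been defined.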

HISTORY

ptx is part of the GNU Core Utilities, a collection of essential command-line tools for Unix-like operating systems. Permuted (KWIC) indexing predates it by decades, with roots in bibliometrics and information retrieval, and a ptx utility was already part of System V Unix. The --traditional option makes GNU ptx behave more like that earlier version.
