ptx
Create permuted indexes of text files
TLDR
Generate a permuted index where the first field of each line is an index reference
Generate a permuted index with automatically generated index references
Generate a permuted index with a fixed width
Generate a permuted index with a list of filtered words
Generate a permuted index with SYSV-style behaviors
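The tasks above can be sketched as the following invocations. The file names `sample.txt` and `keywords.txt` are hypothetical and are created in a scratch directory only for illustration; the mapping of "filtered words" to `--only-file` is an assumption about what that entry means.

```shell
# Work in a scratch directory so relative file names stay short.
cd "$(mktemp -d)"

# Hypothetical input: the first field of each line (ch1, ch2) can act as a reference.
printf 'ch1 permuted indexes are useful\nch2 indexes help readers\n' > sample.txt
# Hypothetical keyword list, one word per line.
printf 'indexes\nreaders\n' > keywords.txt

# First field of each line is an index reference
ptx --references sample.txt

# Automatically generated references (file name and line number)
ptx --auto-reference sample.txt

# Fixed output width, in columns
ptx --width=60 sample.txt

# Index only the words listed in a file
ptx --only-file=keywords.txt sample.txt

# SYSV-style behavior
ptx --traditional sample.txt
```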
SYNOPSIS
ptx [OPTION]... [FILE]...
PARAMETERS
-b, --break-file=FILE
Read the word-break characters from FILE. Every character listed in FILE separates words in the input; any character not listed is treated as part of a word.
-f, --ignore-case
Fold lower case to upper case for sorting. This ensures that words like 'Apple' and 'apple' are treated as the same for indexing and sorting purposes.
-g, --gap-size=NUMBER
Set the gap between output fields to NUMBER columns.
-i, --ignore-file=FILE
Read the list of words to ignore from FILE, one word per line. Words on this list are never used as keywords.
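-o, --only-file=FILE
Read the list of allowed keywords from FILE, one word per line. Only words on this list are indexed. A word listed in both the only file and the ignore file is ignored.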
-r, --references
Treat the first field of each input line as the reference for that line. The reference identifies the line in the output and is not itself indexed.
-S, --sentence-regexp=REGEXP
Use REGEXP to match the end of a line or sentence. The context shown for a keyword never crosses this boundary.
-t, --typeset-mode
Not implemented in GNU ptx. For typesetter output, use --format=roff or --format=tex.
-w, --width=NUMBER
Set the maximum output width to NUMBER columns. The default width is 72 columns. References placed at the right with -R are not counted in the width.
-A, --auto-reference
Output automatically generated references. Each reference is the file name and line number separated by a colon, similar to the `grep` command's output; the file name is empty when reading standard input.
-G, --traditional
Behave more like System V ptx by disabling the GNU extensions. This changes defaults such as how words and context boundaries are recognized.
-M, --macro-name=STRING
Use STRING as the macro name in roff or TeX output instead of the default 'xx'.
-O, --format=roff
Generate output as roff directives, suitable for processing with nroff or troff.
-R, --right-side-refs
Put references at the right of each output line instead of the left. References placed this way are not counted in the width set by -w.
-T, --format=tex
Generate output as TeX directives.
-W, --word-regexp=REGEXP
Use REGEXP to match each keyword. By default, with GNU extensions enabled, a word is a sequence of letters; with --traditional, a word is any sequence of non-blank characters.
--help
Display a help message and exit.
--version
Output version information and exit.
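As an illustration of the word-list options, the sketch below filters common words out of the index with --ignore-file. The stop-word file and its contents are hypothetical, chosen only for this example.

```shell
workdir=$(mktemp -d)

# Hypothetical stop-word list, one word per line.
printf 'the\na\nof\n' > "$workdir/stopwords.txt"

# Index a single line from standard input, skipping the listed words.
filtered=$(printf 'the quick brown fox\n' \
  | ptx --ignore-file="$workdir/stopwords.txt" --width=60)

# One output line remains for each keyword that was not ignored.
printf '%s\n' "$filtered"
```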
DESCRIPTION
The ptx command generates a permuted index, also known as a Keyword-in-Context (KWIC) index, of the words found in a given file or standard input. It processes each line, identifies words, and then reorders them so that each word appears in a designated central position, flanked by its surrounding context (left and right).
This utility is particularly useful for creating indexes for documents, where you can quickly find all occurrences of a particular term along with a snippet of the text where it appears. Each output line shows the left context, the keyword itself, and the right context. When references are requested with -A or -r, the reference (e.g., file name and line number) appears at the start of each line, or at the end with -R. Various options allow customization of the output width, context boundaries, word definition, and case folding for sorting.
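A minimal sketch of the behavior described above, assuming GNU ptx with default options: a single input line yields one output line per keyword, sorted by keyword, with the rest of the line arranged around it as context.

```shell
# Index one line read from standard input.
index=$(printf 'the quick brown fox\n' | ptx)

# Print the permuted index: one line each for brown, fox, quick, and the.
printf '%s\n' "$index"
```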
CAVEATS
The default behavior of ptx regarding word definition and context can be surprising. With GNU extensions enabled and no references in use, context is bounded by sentence endings rather than line endings, so a context may span several input lines. Users should review the `--word-regexp`, `--break-file`, and `--sentence-regexp` options if the default output does not meet their requirements. ptx reads its entire input into memory before sorting, so very large files need a corresponding amount of memory.
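As a sketch of how the word definition changes the result, the example below uses --word-regexp to index only runs of digits, which the default letter-based definition would skip. The input text is made up for illustration, and the pattern assumes the GNU regex syntax ptx uses, where `+` is a repetition operator.

```shell
# Treat only sequences of digits as keywords.
numbers_only=$(printf 'call 911 or 112 now\n' | ptx --word-regexp='[0-9]+')

# Expect one output line per number found in the input.
printf '%s\n' "$numbers_only"
```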
INPUT HANDLING
When no FILE arguments are provided, ptx reads from standard input. This allows piping content from other commands directly into ptx for indexing, such as `cat mydocument.txt | ptx`.
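The two input paths can be compared directly. In this sketch the file `doc.txt` is a hypothetical sample; without reference options the file name never appears in the output, so both invocations should produce the same index.

```shell
workdir=$(mktemp -d)
printf 'permuted indexes are useful\n' > "$workdir/doc.txt"

# Read the input as a FILE argument.
from_file=$(ptx "$workdir/doc.txt")

# Read the same input from standard input.
from_stdin=$(ptx < "$workdir/doc.txt")

# Report whether the two indexes match.
if [ "$from_file" = "$from_stdin" ]; then
  echo 'same index from file and standard input'
fi
```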
OUTPUT STRUCTURE
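By default, each line of ptx output is laid out in columns separated by a gap of blanks: the left context, the keyword (the indexed word), and the right context. When references are enabled with -A or -r, the reference (e.g., file name and line number) is placed at the start of the line, or at the end with --right-side-refs. The layout can be adjusted with --width and --gap-size, and --format=roff or --format=tex produce typesetter directives instead of plain columns.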
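To see where the reference field lands, the following sketch runs the same hypothetical file `doc.txt` with references on the left (the default) and on the right. A relative file name is used so that the reference stays short.

```shell
workdir=$(mktemp -d)
printf 'permuted indexes are useful\n' > "$workdir/doc.txt"

# References at the start of each line (default placement).
left_refs=$(cd "$workdir" && ptx --auto-reference doc.txt)

# References at the end of each line, with a narrower gap between fields.
right_refs=$(cd "$workdir" && ptx --auto-reference --right-side-refs --gap-size=2 doc.txt)

printf '%s\n\n%s\n' "$left_refs" "$right_refs"
```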
HISTORY
ptx is part of the GNU Core Utilities, a collection of essential command-line tools for Unix-like operating systems. The concept of permuted indexing dates back decades (e.g., in bibliometrics and information retrieval), and a ptx command existed on Unix systems before the GNU version; the --traditional option is provided for compatibility with System V ptx. The GNU implementation was written by François Pinard and is maintained as part of the GNU project.