csplit
Split a file into sections
TLDR
Split a file at lines 5 and 23
Split a file every 5 lines (this will fail if the total number of lines is not divisible by 5)
Split a file every 5 lines, ignoring exact-division error
Split a file at line 5 and use a custom prefix for the output files
Split a file at a line matching a regex
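The tasks listed above can be sketched as commands against a throwaway sample. This is a minimal illustration assuming GNU csplit; the directory csplit_tldr_demo, the file numbers.txt, and the prefixes every5_, part_, and re_ are hypothetical names chosen for the example.

```shell
# Hypothetical scratch directory holding a 32-line sample file.
demo=csplit_tldr_demo
rm -rf "$demo" && mkdir "$demo"
seq 1 32 > "$demo/numbers.txt"

(
  cd "$demo" || exit 1

  # Split at lines 5 and 23: xx00 = lines 1-4, xx01 = 5-22, xx02 = 23-32.
  csplit -s numbers.txt 5 23

  # Split at line 5, then every 5 lines. -k keeps the pieces already
  # written if csplit runs out of lines and stops with an error.
  csplit -s -k -f every5_ numbers.txt 5 '{*}' 2>/dev/null

  # Split at line 5 with a custom prefix: part_00 and part_01.
  csplit -s -f part_ numbers.txt 5

  # Split before the first line matching a regex: re_01 starts at "10".
  csplit -s -f re_ numbers.txt '/^10$/'
)
```

Quoting '{*}' and the /regex/ argument keeps the shell from interpreting them before csplit sees them.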
SYNOPSIS
csplit [OPTION]... FILE PATTERN|LINE...
PARAMETERS
-s, --silent, --quiet
Do not print the byte count of each output file.
-k, --keep-files
Do not remove output files even if an error occurs.
-f PREFIX, --prefix=PREFIX
Use PREFIX instead of 'xx' for output files (e.g., PREFIX00).
-b FORMAT, --suffix-format=FORMAT
Use the sprintf-style FORMAT instead of %02d for the numeric suffix, e.g. %03d.txt. The format receives the sequence number of the output file, not a timestamp.
-n DIGITS, --digits=DIGITS
Use DIGITS instead of 2 for suffix length (e.g., 000 with -n 3).
-z, --elide-empty-files
Do not create empty output files. Useful when patterns are close together.
--suppress-matched
Suppress the lines that match the pattern. They are not written to any output file. GNU csplit has no short form for this option.
{*}
Repeat the previous pattern/line specification as many times as possible.
{NUMBER}
Repeat the previous pattern/line specification NUMBER additional times, so it is applied NUMBER + 1 times in total.
--help
Display a help message and exit.
--version
Output version information and exit.
FILE
The input file to be split. Use '-' for standard input.
PATTERN
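A basic regular expression, written as /REGEXP/, that defines a splitting point. The current output file ends just before the matching line, which becomes the first line of the next file. The form %REGEXP% skips ahead to the matching line without writing the skipped lines to any file.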
LINE
An absolute line number to define splitting points. A new file starts at this line number.
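A short sketch of how the repeat counts {NUMBER} and {*} behave with a pattern, assuming GNU csplit. The directory csplit_repeat_demo, the file notes.txt, and the prefixes all_ and two_ are hypothetical names used only for this illustration.

```shell
# Hypothetical sample: an intro line followed by three "## " sections.
demo=csplit_repeat_demo
rm -rf "$demo" && mkdir "$demo"
printf '%s\n' 'intro' '## A' 'a1' '## B' 'b1' 'b2' '## C' 'c1' > "$demo/notes.txt"

(
  cd "$demo" || exit 1

  # {*}: split before every line starting with "## ".
  # all_00 = intro, all_01 = section A, all_02 = section B, all_03 = section C
  csplit -s -f all_ notes.txt '/^## /' '{*}'

  # {1}: apply the pattern once, then one additional time.
  # two_00 = intro, two_01 = section A, two_02 = everything from "## B" on
  csplit -s -f two_ notes.txt '/^## /' '{1}'
)
```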
DESCRIPTION
The csplit command, part of GNU Coreutils, splits a file into multiple smaller pieces based on content or specific line numbers. Unlike the simpler split command, which divides files into pieces of a fixed size (a byte count or a number of lines), csplit lets users define splitting points using regular expressions (patterns) or absolute line numbers.
When a pattern is matched or a line number is reached, csplit starts a new output file segment. By default, output files are named xx00, xx01, and so on, where 'xx' is the default prefix and the suffix is a zero-padded sequence number. The prefix and the number of digits can be customized. csplit is particularly useful for processing structured log files, breaking large configuration files into manageable sections, or extracting specific data blocks from text documents.
CAVEATS
By default, the line matching a PATTERN becomes the first line of the next output file segment, not the last line of the previous one. Use --suppress-matched to exclude it entirely. When splitting by LINE number, the specified line likewise becomes the first line of the new segment. Be mindful of empty files, for example when the first line of the input matches the pattern; --elide-empty-files can help. Pattern matching uses basic regular expressions, and GNU csplit has no option to switch to extended syntax, so extended constructs must be rewritten in basic form.
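The empty-file caveat can be reproduced with a small sketch, assuming GNU csplit. The directory csplit_empty_demo, the file sections.txt, and the prefixes plain_ and tidy_ are hypothetical names for this illustration.

```shell
# Hypothetical sample whose very first line matches the pattern.
demo=csplit_empty_demo
rm -rf "$demo" && mkdir "$demo"
printf '%s\n' '## A' 'a1' '## B' 'b1' > "$demo/sections.txt"

(
  cd "$demo" || exit 1

  # Nothing precedes the first match, so plain_00 is created empty;
  # the sections land in plain_01 and plain_02.
  csplit -s -f plain_ sections.txt '/^## /' '{*}'

  # With -z the empty piece is dropped and numbering still starts at 00:
  # tidy_00 = section A, tidy_01 = section B.
  csplit -s -z -f tidy_ sections.txt '/^## /' '{*}'
)
```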
SPLITTING LOGIC
Understanding how csplit defines new files is crucial:
1. When using a LINE number: A new file is created before the specified line number, making that line the first line of the new file.
2. When using a PATTERN: A new file is created before the line matching the pattern, so the matching line becomes the first line of the new file and the previous file ends with the line before it. If --suppress-matched is used, the matching line is discarded instead.
This distinction is key for precise file segmentation.
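One way to check where boundary lines end up is to split a tiny file three ways and inspect the pieces. This sketch assumes GNU csplit 8.22 or later (for --suppress-matched); the directory csplit_logic_demo, the file parts.txt, and the prefixes line_, pat_, and drop_ are hypothetical names.

```shell
# Hypothetical sample with "---" separator lines at lines 3 and 5.
demo=csplit_logic_demo
rm -rf "$demo" && mkdir "$demo"
printf '%s\n' 'alpha' 'beta' '---' 'gamma' '---' 'delta' > "$demo/parts.txt"

(
  cd "$demo" || exit 1

  # LINE: line 3 becomes the first line of line_01.
  csplit -s -f line_ parts.txt 3

  # PATTERN: the first matching line becomes the first line of pat_01.
  csplit -s -f pat_ parts.txt '/^---$/'

  # --suppress-matched: separator lines are written to no file.
  # drop_00 = alpha, beta; drop_01 = gamma; drop_02 = delta
  csplit -s --suppress-matched -f drop_ parts.txt '/^---$/' '{*}'

  head -n 1 line_01 pat_01   # both pieces start with the separator line
)
```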
OUTPUT FILE NAMING
Output files are generated sequentially, starting from 00 (or 000 etc., depending on -n). The default prefix is 'xx', resulting in names like xx00, xx01, xx02, etc. You can change the prefix with -f (e.g., -f my_chunk_ would yield my_chunk_00). The -b option replaces the default %02d suffix with another sprintf-style format, for instance %03d.txt to add a file extension; it formats the sequence number and does not produce timestamps.
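The naming options can be tried out with a short sketch, assuming GNU csplit. The directory csplit_naming_demo, the file items.txt, and the prefixes chunk_ and piece_ are hypothetical names for this illustration.

```shell
# Hypothetical 6-line sample, split at lines 3 and 5 into three pieces.
demo=csplit_naming_demo
rm -rf "$demo" && mkdir "$demo"
seq 1 6 > "$demo/items.txt"

(
  cd "$demo" || exit 1

  # Custom prefix and three-digit suffix: chunk_000, chunk_001, chunk_002
  csplit -s -f chunk_ -n 3 items.txt 3 5

  # sprintf-style suffix with an extension: piece_000.txt ... piece_002.txt
  csplit -s -f piece_ -b '%03d.txt' items.txt 3 5

  ls
)
```

When -b is given it determines the whole suffix, so the padding width goes in the format itself rather than in -n.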
HISTORY
csplit is a standard utility found in GNU Coreutils, which is a fundamental package of free software for Unix-like operating systems. Its development has focused on providing a context-aware file splitting mechanism, complementing the more basic functionalities of commands like split. It has been a part of typical Linux distributions for a long time, evolving as a robust tool for text processing.