LinuxCommandLibrary

csplit

Split a file into sections

TLDR

Split a file at lines 5 and 23

$ csplit [path/to/file] 5 23

Split a file every 5 lines (this will fail if the total number of lines is not divisible by 5)
$ csplit [path/to/file] 5 {*}

Split a file every 5 lines, ignoring exact-division error
$ csplit [[-k|--keep-files]] [path/to/file] 5 {*}

Split a file at line 5 and use a custom prefix for the output files
$ csplit [path/to/file] 5 [[-f|--prefix]] [prefix]

Split a file at a line matching a regex
$ csplit [path/to/file] /[regex]/

SYNOPSIS

csplit [OPTION]... FILE PATTERN|LINE...

PARAMETERS

-s, --quiet, --silent
    Do not print the byte count of each output file.

-k, --keep-files
    Do not remove output files even if an error occurs.

-f PREFIX, --prefix=PREFIX
    Use PREFIX instead of 'xx' for output files (e.g., PREFIX00).

-b FORMAT, --suffix-format=FORMAT
    Use the printf-style FORMAT instead of %02d for the numeric suffix (e.g., -b '%03d.txt' yields xx000.txt).

-n DIGITS, --digits=DIGITS
    Use DIGITS instead of 2 for suffix length (e.g., 000 with -n 3).

-z, --elide-empty-files
    Do not create empty output files. Useful when patterns are close together.

--suppress-matched
    Suppress the lines that match PATTERN. They are not written to any output file. This is a GNU long option with no short form.

{*}
    Repeat the previous pattern/line specification as many times as possible.

{NUMBER}
    Repeat the previous pattern/line specification NUMBER times.

--help
    Display a help message and exit.

--version
    Output version information and exit.

FILE
    The input file to be split. Use '-' for standard input.

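PATTERN
    A basic regular expression that defines a splitting point. /REGEXP/[OFFSET] copies the input up to, but not including, the matching line, so the matching line starts the next output file. %REGEXP%[OFFSET] skips that part of the input instead of writing it.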

LINE
    An absolute line number to define splitting points. A new file starts at this line number.
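
The repeat arguments and -z can be sketched as follows. This is a minimal illustration assuming GNU csplit; config.ini and the two_ prefix are hypothetical names chosen for the example.

```shell
# Work in a throwaway directory so the output pieces do not clutter anything.
cd "$(mktemp -d)"

# Hypothetical input: three bracketed section headers, each followed by one line.
printf '%s\n' '[a]' 1 '[b]' 2 '[c]' 3 > config.ini

# {*} repeats the pattern until the input is exhausted.
# The first line matches, which would leave an empty first piece; -z elides it.
csplit -s -z config.ini '/^\[/' '{*}'
ls xx*                # xx00 xx01 xx02, one piece per section

# {1} repeats the pattern once more (two applications in total);
# whatever remains afterwards goes into the final piece.
csplit -s -z -f two_ config.ini '/^\[/' '{1}'
ls two_*              # two_00 holds [a], two_01 holds [b] and [c]
```

Quoting '{*}' and the pattern keeps the shell from interpreting the braces, brackets, and backslash.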

DESCRIPTION

The csplit command, part of GNU Coreutils, splits a file into multiple smaller pieces based on its content or on specific line numbers. Unlike the simpler split command, which divides files by fixed byte or line counts, csplit lets users define splitting points using regular expressions (patterns) or absolute line numbers.

When a pattern is matched or a specified line number is reached, csplit starts a new output file. By default, output files are named xx00, xx01, and so on, where 'xx' is the default prefix and the numeric suffix is zero-padded to two digits. Both the prefix and the number of digits can be customized. csplit is particularly useful for processing structured log files, breaking large configuration files into manageable sections, or extracting specific data blocks from text documents.

CAVEATS

By default, the line matching a PATTERN becomes the first line of the next output file; the preceding file ends just before it. Use --suppress-matched to drop the matching line entirely. When splitting by LINE number, the specified line likewise becomes the first line of the new segment. Empty output files can occur, for example when a pattern matches the first line of the input; --elide-empty-files removes them. Patterns are basic regular expressions. When extended regex is required, grep -nE can report the matching line numbers, which can then be passed to csplit as LINE arguments.
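
A short sketch of how --suppress-matched and -z interact, assuming a GNU csplit recent enough to support --suppress-matched. The file records.txt and its "---" separator lines are made up for the example.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Hypothetical input: records separated by "---" lines,
# including two separators back to back.
printf '%s\n' one --- two --- --- three > records.txt

# Drop the separator lines and skip the empty piece that the
# adjacent separators would otherwise produce.
csplit -s -z --suppress-matched records.txt '/^---$/' '{*}'

cat xx00 xx01 xx02    # expected to print: one, two, three (one per line)
```

Without -z, the adjacent separators would leave an empty file in the sequence.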

SPLITTING LOGIC

Understanding how csplit defines new files is crucial:

1. When using a LINE number: A new file is created before the specified line number, making that line the first line of the new file.
2. When using a PATTERN: A new file is created before the line matching the pattern, making the matching line the first line of the new file. If --suppress-matched is used, the matching line is discarded instead.

In both cases the split falls immediately before the named line, which is key for precise file segmentation.
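
A minimal sketch of where the split falls for a LINE number and for a PATTERN, assuming GNU csplit. The file sample.txt and the pat prefix are hypothetical.

```shell
# Work in a throwaway directory so the xx* output files do not clutter anything.
cd "$(mktemp -d)"

# Hypothetical sample input: six lines, with a marker on line 4.
printf '%s\n' alpha beta gamma '## marker' delta epsilon > sample.txt

# Split at line 3: xx00 gets lines 1-2, xx01 starts with line 3.
csplit -s sample.txt 3
head -n 1 xx01        # gamma

# Split at the first line matching the regex: the matching line
# opens the second piece rather than closing the first.
csplit -s -f pat sample.txt '/^## /'
tail -n 1 pat00       # gamma
head -n 1 pat01       # ## marker
```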

OUTPUT FILE NAMING

Output files are numbered sequentially, starting from 00 (or 000 etc., depending on -n). The default prefix is 'xx', resulting in names like xx00, xx01, xx02, etc. You can change the prefix with -f (e.g., -f my_chunk_ would yield my_chunk_00). The -b option takes a printf-style format for the suffix, so -b '%03d.log' would yield xx000.log.
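
The naming options can be tried out with a small sketch like the one below, assuming GNU csplit. The input numbers.txt and the prefixes part_ and chunk- are hypothetical.

```shell
# Work in a throwaway directory.
cd "$(mktemp -d)"
seq 1 9 > numbers.txt     # hypothetical nine-line input

# Custom prefix and three-digit suffix: part_000, part_001, part_002
csplit -s -f part_ -n 3 numbers.txt 4 7
ls part_*

# printf-style suffix with an extension: chunk-00.txt, chunk-01.txt, chunk-02.txt
csplit -s -f chunk- -b '%02d.txt' numbers.txt 4 7
ls chunk-*
```

The format given to -b needs exactly one integer conversion such as %d; any literal text around it, like the .txt extension here, is copied into the file name.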

HISTORY

csplit is a standard utility found in GNU Coreutils, which is a fundamental package of free software for Unix-like operating systems. Its development has focused on providing a context-aware file splitting mechanism, complementing the more basic functionalities of commands like split. It has been a part of typical Linux distributions for a long time, evolving as a robust tool for text processing.

SEE ALSO

split(1), grep(1), sed(1), awk(1), cut(1)
