pylcg

- Unnamed repository; edit this file 'description' to name the repository.
git clone git://git.acid.vegas/-c.git
Log | Files | Refs | Archive | README | LICENSE

README.md (7528B)

      1 # PyLCG
      2 > Ultra-fast Linear Congruential Generator for IP Sharding
      3 
      4 PyLCG is a high-performance Python implementation of a memory-efficient IP address sharding system using Linear Congruential Generators (LCG) for deterministic random number generation. This tool enables distributed scanning & network reconnaissance by efficiently dividing IP ranges across multiple machines while maintaining pseudo-random ordering.
      5 
      6 ###### A GoLang version of this library is also available [here](https://github.com/acidvegas/golcg)
      7 
      8 ## Features
      9 
     10 - Memory-efficient IP range processing
     11 - Deterministic pseudo-random IP generation
     12 - High-performance LCG implementation
     13 - Support for sharding across multiple machines
     14 - Zero dependencies beyond Python standard library
     15 - Simple command-line interface and library usage
     16 
     17 ## Installation
     18 
     19 ```bash
     20 pip install pylcg
     21 ```
     22 
     23 ## Usage
     24 
     25 ### Command Line
     26 
     27 ```bash
     28 pylcg 192.168.0.0/16 --shard-num 1 --total-shards 4 --seed 12345
     29 
     30 # Resume from previous state
     31 pylcg 192.168.0.0/16 --shard-num 1 --total-shards 4 --seed 12345 --state 987654321
     32 
     33 # Pipe to dig for PTR record lookups
     34 pylcg 192.168.0.0/16 --seed 12345 | while read ip; do
     35     echo -n "$ip -> "
     36     dig +short -x $ip
     37 done
     38 
     39 # One-liner for PTR lookups
     40 pylcg 198.150.0.0/16 | xargs -I {} dig +short -x {}
     41 
     42 # Parallel PTR lookups
     43 pylcg 198.150.0.0/16 | parallel "dig +short -x {} | sed 's/^/{} -> /'"
     44 ```
     45 
     46 ### As a Library
     47 
     48 ```python
     49 from pylcg import ip_stream
     50 
     51 # Basic usage
     52 for ip in ip_stream('192.168.0.0/16', shard_num=1, total_shards=4, seed=12345):
     53     print(ip)
     54 
     55 # Resume from previous state
     56 for ip in ip_stream('192.168.0.0/16', shard_num=1, total_shards=4, seed=12345, state=987654321):
     57     print(ip)
     58 ```
     59 
     60 ## State Management & Resume Capability
     61 
     62 PyLCG automatically saves its state every 1000 IPs processed to enable resume functionality in case of interruption. The state is saved to a temporary file in your system's temp directory (usually `/tmp` on Unix systems or `%TEMP%` on Windows).
     63 
     64 The state file follows the naming pattern:
     65 ```
     66 pylcg_[seed]_[cidr]_[shard]_[total].state
     67 ```
     68 
     69 For example:
     70 ```
     71 pylcg_12345_192.168.0.0_16_1_4.state
     72 ```
     73 
     74 The state is saved in memory-mapped temporary storage to minimize disk I/O and improve performance. To resume from a previous state:
     75 
     76 1. Locate your state file in the temp directory
     77 2. Read the state value from the file
     78 3. Use the same parameters (CIDR, seed, shard settings) with the `--state` parameter
     79 
     80 Example of resuming:
     81 ```bash
     82 # Read the last state
     83 state=$(cat /tmp/pylcg_12345_192.168.0.0_16_1_4.state)
     84 
     85 # Resume processing
     86 pylcg 192.168.0.0/16 --shard-num 1 --total-shards 4 --seed 12345 --state $state
     87 ```
     88 
     89 Note: When using the `--state` parameter, you must provide the same `--seed` that was used in the original run.
     90 
     91 ## How It Works
     92 
     93 ### IP Address Integer Representation
     94 
     95 Every IPv4 address is fundamentally a 32-bit number. For example, the IP address "192.168.1.1" can be broken down into its octets (192, 168, 1, 1) and converted to a single integer:
     96 ```
     97 192.168.1.1 = (192 × 256³) + (168 × 256²) + (1 × 256¹) + (1 × 256⁰)
     98              = 3232235777
     99 ```
    100 
    101 This integer representation allows us to treat IP ranges as simple number sequences. A CIDR block like "192.168.0.0/16" becomes a continuous range of integers:
    102 - Start: 192.168.0.0   → 3232235520
    103 - End:   192.168.255.255 → 3232301055
    104 
    105 By working with these integer representations, we can perform efficient mathematical operations on IP addresses without the overhead of string manipulation or complex data structures. This is where the Linear Congruential Generator comes into play.
    106 
    107 ### Linear Congruential Generator
    108 
    109 PyLCG uses an optimized LCG implementation with three carefully chosen parameters that work together to generate high-quality pseudo-random sequences:
    110 
    111 | Name       | Variable | Value        |
    112 |------------|----------|--------------|
    113 | Multiplier | `a`      | `1664525`    |
    114 | Increment  | `c`      | `1013904223` |
    115 | Modulus    | `m`      | `2^32`       |
    116 
    117 ###### Modulus
    118 The modulus value of `2^32` serves as both a mathematical and performance optimization choice. It perfectly matches the CPU's word size, allowing for extremely efficient modulo operations through simple bitwise AND operations. This choice means that all calculations stay within the natural bounds of CPU arithmetic while still providing a large enough period for even the biggest IP ranges we might encounter.
    119 
    120 ###### Multiplier
    121 The multiplier value of `1664525` was originally discovered through extensive mathematical analysis for the Numerical Recipes library. It satisfies the Hull-Dobell theorem's strict requirements for maximum period length in power-of-2 modulus LCGs, being both relatively prime to the modulus and one more than a multiple of 4. This specific value also performs exceptionally well in spectral tests, ensuring good distribution properties across the entire range while being small enough to avoid intermediate overflow in 32-bit arithmetic.
    122 
    123 ###### Increment
    124 The increment value of `1013904223` is a carefully selected prime number that completes our parameter trio. When combined with our chosen multiplier and modulus, it ensures optimal bit mixing throughout the sequence and helps eliminate common LCG issues like short cycles or poor distribution. This specific value was selected after extensive testing showed it produced excellent statistical properties and passed rigorous spectral tests for dimensional distribution.
    125 
    126 ### Applying LCG to IP Addresses
    127 
    128 Once we have our IP addresses as integers, the LCG is used to generate a pseudo-random sequence that permutes through all possible values in our IP range:
    129 
    130 1. For a given IP range *(start_ip, end_ip)*, we calculate the range size: `range_size = end_ip - start_ip + 1`
    131 
    132 2. The LCG generates a sequence using the formula: `X_{n+1} = (a * X_n + c) mod m`
    133 
    134 3. To map this sequence back to valid IPs in our range:
    135    - Generate the next LCG value
    136    - Take modulo of the value with range_size to get an offset: `offset = lcg_value % range_size`
    137    - Add this offset to start_ip: `ip = start_ip + offset`
    138 
    139 This process ensures that:
    140 - Every IP in the range is visited exactly once
    141 - The sequence appears random but is deterministic
    142 - We maintain constant memory usage regardless of range size
    143 - The same seed always produces the same sequence
    144 
    145 ### Sharding Algorithm
    146 
    147 The sharding system employs an interleaved approach that ensures even distribution of work across multiple machines while maintaining randomness. Each shard operates independently using a deterministic sequence derived from the base seed plus the shard index. The system distributes IPs across shards using modulo arithmetic, ensuring that each IP is assigned to exactly one shard. This approach prevents sequential scanning patterns while guaranteeing complete coverage of the IP range. The result is a system that can efficiently parallelize work across any number of machines while maintaining the pseudo-random ordering that's crucial for network scanning applications.
    148 
    149 ## Contributing
    150 
    151 ### Performance Optimization
    152 
    153 We welcome contributions that improve PyLCG's performance. When submitting optimizations:
    154 
    155 1. Run the included benchmark suite:
    156 ```bash
    157 python3 unit_test.py
    158 ```
    159 
    160 ---
    161 
    162 ###### Mirrors: [acid.vegas](https://git.acid.vegas/pylcg) • [SuperNETs](https://git.supernets.org/acidvegas/pylcg) • [GitHub](https://github.com/acidvegas/pylcg) • [GitLab](https://gitlab.com/acidvegas/pylcg) • [Codeberg](https://codeberg.org/acidvegas/pylcg)