WordPress Infinite Web Crawler (LWP::Simple)

This is a bot I made using the Perl module LWP::Simple. It endlessly browses a WordPress website via the command line 😀 Why? Why not!

Also, because it’s a one-liner, I tried to make the code as terse as possible: no scoped variables, no subroutines, and no proper use of constructs.

Special thanks go out to PRBRENAN for helping me push this to the limit by teaching me new ways to format Perl expressions.

NOTE: I am calling it a WordPress bot because that is the only platform I tested it on. There are subtle differences in how other CMSs & websites handle URIs 😉

VERSION 2

UPDATE (2021-05-28): Here I am at 1 AM refactoring away. I came up with this 178-character version (42 characters fewer than version 1). I have gone pure Arcanum mode! I LOVE PERL <3

V2: ONE-LINER

Perhaps I will do a breakdown later but for now I need to slumber… 😴

yes | perl -M'LWP::Simple' -plse '$u=$_=@{[keys %{{${\get($u?$u:$U)}=~m`(?<=")$U/[^"]+`g}}]}[0];($u=$U) && redo if !("@{[head $u]}"=~m~t/html~)' -- -U='https://geekalicious.blog'
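
For anyone who wants the logic without the golf, here is a rough, readable sketch of what V2 does. It is not a character-for-character expansion, and a few things are my own assumptions: the names extract_links and crawl are made up for illustration, the crawl stops after a fixed number of steps instead of running forever, the base URL is escaped with \Q...\E, and the next link is picked with rand rather than by leaning on hash key order the way the one-liner does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get head);

# Return the unique same-site links that appear inside double quotes in $html.
# (Hypothetical helper; the one-liner does this inline with m`...`g.)
sub extract_links {
    my ($html, $base) = @_;
    my %seen;
    return grep { !$seen{$_}++ } $html =~ m{(?<=")\Q$base\E/[^"]+}g;
}

# Follow random same-site links, printing every HTML page that is visited.
# (Hypothetical helper; $max_steps is an assumption so the sketch terminates.)
sub crawl {
    my ($base, $max_steps) = @_;
    my $url = $base;
    for (1 .. $max_steps) {
        my @links = extract_links(get($url) // '', $base);
        $url = @links ? $links[rand @links] : $base;

        # head() in list context returns the content type first,
        # or an empty list when the request fails.
        my ($type) = head($url);
        if (defined $type && $type =~ m{text/html}) {
            print "$url\n";
        } else {
            $url = $base;    # not an HTML page: start over from the base URL
        }
    }
}

# Only touches the network when a base URL is given on the command line.
crawl($ARGV[0], $ARGV[1] // 10) if @ARGV;
```

Saved as, say, crawl.pl (a hypothetical file name), it would be run as: perl crawl.pl 'https://geekalicious.blog' 5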

VERSION 1

Total character count is 220 characters.

V1: ONE-LINER

yes | perl -M'LWP::Simple' -plse '${\get ($_ = $#u > 1 ? splice @u, sleep rand $#u/4, 1 : $U)} =~ s`"($U.*?)"`$m = $1; "@{[head $&]}" =~ m~t/html~ && push @u, $m`ge; @u = () if $#u > 15' -- -U='https://geekalicious.blog'

NOTE: You can replace the URL parameter with any WordPress site you like! (-U='<WEBSITE>').

V1: PARAMETER BREAKDOWN

yes | perl -M'LWP::Simple' -plse '<CODE>' -- -U='https://geekalicious.blog'

# ######################
# PARAMETER BREAKDOWN
# ######################

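# -p = implicit loop over the input lines, with a print
# 	   of $_ at the end of every pass
# -s = switch mode on: arguments after the '--' are parsed
# 	   as switches (this program is looking for -U="<URL>",
# 	   which lands in the variable $U)
# -l = chomps $/ (the input record separator)
# 	   and auto adds a new line on the print
# -e = the code to run comes next on the command line

# -M'LWP::Simple' loads the LWP::Simple package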
# install on debian with: 
# sudo cpan -i LWP::Simple

# This program needs infinite STDIN to make
# the loop go, achieved with: 'yes |'
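
To make the switches less magical, here is a hand-written sketch of roughly what -p, -l and -s set up around the code. It is an approximation of the documented behaviour, not Perl's literal expansion. The name run_pls, the placeholder body, and the in-memory filehandle standing in for three lines of 'yes' output are all my own assumptions, and only the -U switch is handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: behaves roughly like
#   perl -plse 'CODE' -- -U=VALUE
# reading from $in and printing to $out.
sub run_pls {
    my ($in, $out, @args) = @_;

    # -s: peel leading -U=VALUE switches off the argument list into $U
    my $U;
    while (@args && $args[0] =~ /^-U=(.*)$/s) {
        $U = $1;
        shift @args;
    }

    local $\ = $/;    # -l: print now appends a newline automatically
    local $_;

    while (<$in>) {                    # -p: one pass per input line
        chomp;                         # -l: strip the trailing newline
        $_ = "line $.: U=$U";          # placeholder for CODE
    } continue {
        print {$out} $_ or die "-p destination: $!\n";    # -p: print $_
    }
}

# Three "y" lines stand in for the endless output of `yes`.
open my $yes, '<', \"y\ny\ny\n" or die "cannot open in-memory input: $!";
run_pls($yes, \*STDOUT, '-U=https://example.test');
```

The input lines themselves are thrown away. As in the one-liners, 'yes' is only there to keep the loop turning.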

V1: CODE BREAKDOWN

# #################
# CODE BREAKDOWN
# #################

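# fetch a URL spliced out of the array '@u' (or $U while '@u' is nearly empty)
# sleep returns the whole seconds it slept, which doubles as the random splice index
# The divide by 4 is to prevent the 'long sleep'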
${\get 
	
	($_ = $#u > 1 ? splice @u, sleep rand $#u/4, 1 : $U)

# Feed the HTML to a regex find & replace. We misuse this construct
# to get a terse loop, plus a bonus filter on our URLs:
} =~ s`"($U.*?)"`$m = $1;

		# head function to verify text/html then push to array
		"@{[head $&]}" =~ m~t/html~ && push @u, $m
`ge; 

# prevent array from getting too big aka avoid the 'long sleep'
@u = () if $#u > 15;

# note: it's slow due to the long sleeps, but it works!
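
The two idioms doing the heavy lifting above can be tried offline. This is a minimal sketch, assuming two made-up stand-ins (fake_get and fake_head) in place of the real LWP::Simple calls, and checking a single canned header list instead of doing a HEAD request per link.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for LWP::Simple's get and head, so nothing hits the network.
sub fake_get  { return '<a href="https://example.test/a">A</a> <a href="https://example.test/b">B</a>' }
sub fake_head { return ('text/html; charset=UTF-8', 1234) }

my $U = 'https://example.test';
my @u;

# "@{[ LIST ]}" interpolates a list into a string, joined by spaces,
# so one regex can look for "t/html" anywhere in the returned values.
my $headers = "@{[ fake_head() ]}";

# ${\ EXPR } gives s/// something it is allowed to modify (a throwaway copy).
# With /e the replacement is code that runs once per match: a loop in disguise.
# $1 is copied into $m first because the inner m~t/html~ is itself a
# successful match and resets the capture variables.
${\ fake_get() } =~ s`"($U.*?)"`my $m = $1; $headers =~ m~t/html~ && push @u, $m`ge;

print "$_\n" for @u;
```

The substituted text is never used. Only the side effect of the push matters.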

V1: EXAMPLE OUTPUT
