Scraping Search Engine Result Pages with Laravel Scavenger 3.x
Posted Feb. 8, 2020 by reliq
Web scraping is nothing new, and we often find ourselves needing to scrape a few pages. Enter Laravel Scavenger: a package that lets you scrape and save Search Engine Result Pages (SERP) for later analysis and processing. Let's look at how to get up and running with it. For this example we will scrape SERP from Bing.
Prerequisites
- Working Laravel 6+ Application
Step 1 - Install Scavenger
Scavenger can be installed via Composer as follows:
composer require reliqarts/laravel-scavenger
Then publish the configuration file with the following command:
php artisan vendor:publish --provider="ReliqArts\Scavenger\ServiceProvider" --tag="scavenger-config"
Step 2 - Create Target Model
We must create an Eloquent model to serve as our scraped entity. We will create a BingResult model with the following migration and class:
/database/migrations/2017_01_01_000000_create_bing_results_table.php
<?php

use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;

class CreateBingResultsTable extends Migration
{
    /**
     * Run the migrations.
     *
     * @return void
     */
    public function up()
    {
        Schema::create('bing_results', function (Blueprint $table) {
            $table->increments('id');
            // Assumed addition: the target's markup maps a 'title' attribute,
            // so the table needs a column to hold it.
            $table->text('title')->nullable();
            $table->text('link');
            $table->text('description');
            $table->integer('position')->nullable();
            $table->timestamps();
        });
    }

    /**
     * Reverse the migrations.
     *
     * @return void
     */
    public function down()
    {
        Schema::dropIfExists('bing_results');
    }
}
/app/BingResult.php
<?php
namespace App;
use Illuminate\Database\Eloquent\Model;
class BingResult extends Model
{
}
Step 3 - Configure Scavenger
In your /config/scavenger.php file we need to create a target. We will use the following for our setup. See: Target Breakdown for a full list of available options.
<?php

$bing = [
    'example' => false,
    'serp' => true,
    // Fully qualified, since the config file has no "use" statement for the model
    'model' => \App\BingResult::class,
    'source' => 'https://bing.com',
    'search' => [
        'keywords' => ['dog'],
        'form' => [
            'selector' => 'form#sb_form',
            'keyword_input_name' => 'q',
        ],
    ],
    'pager' => [
        'selector' => '.sb_pagN',
    ],
    'pages' => 5,
    'markup' => [
        '__result' => '.b_algo',
        'title' => 'h2 a',
        'description' => '.b_caption p',
        'link' => '__link',
        'position' => '__position',
    ],
];
This configuration tells Scavenger to go to bing.com, enter "dog" as a search term, and scrape the results from the first 5 pages. In the markup section of the config we tell Scavenger how each result item should be transformed into the attributes of our BingResult class. Again, these config keys are explained in detail in the Target Breakdown.
Note: __result, __link and __position are special markup keys which refer to the result item, its link, and its position in the result page list respectively. For example, the item with __position = 5 appeared 5th in the Bing results list.
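To make the markup mapping more concrete, here is a minimal sketch in plain PHP of how a mapping like the one above could be applied to a result page. This is an illustration only and not Scavenger's actual implementation: the sample HTML is made up, the CSS selectors are hand-translated to XPath, and the link is assumed to be the href of the heading anchor.

```php
<?php

// Hypothetical, simplified stand-in for a Bing result page.
$html = <<<'HTML'
<ol id="b_results">
  <li class="b_algo">
    <h2><a href="https://example.com/dogs">All About Dogs</a></h2>
    <div class="b_caption"><p>Facts about dogs.</p></div>
  </li>
  <li class="b_algo">
    <h2><a href="https://example.org/puppies">Puppy Care</a></h2>
    <div class="b_caption"><p>Caring for a puppy.</p></div>
  </li>
</ol>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath equivalents of the CSS class selectors '.b_algo' and '.b_caption'.
$resultQuery = '//*[contains(concat(" ", normalize-space(@class), " "), " b_algo ")]';
$captionQuery = './/*[contains(concat(" ", normalize-space(@class), " "), " b_caption ")]//p';

$rows = [];
$position = 0;

// '__result' => '.b_algo': each match is one result item.
foreach ($xpath->query($resultQuery) as $result) {
    // 'title' => 'h2 a'
    $anchor = $xpath->query('.//h2/a', $result)->item(0);
    // 'description' => '.b_caption p'
    $caption = $xpath->query($captionQuery, $result)->item(0);

    $rows[] = [
        'title' => $anchor ? trim($anchor->textContent) : null,
        'description' => $caption ? trim($caption->textContent) : null,
        // '__link': assumed here to be the href of the result's heading anchor
        'link' => $anchor ? $anchor->getAttribute('href') : null,
        // '__position': 1-based rank of the item in the result list
        'position' => ++$position,
    ];
}

print_r($rows);
```

Each entry in $rows has the same attribute names as the BingResult model, which is the shape the markup section describes.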
With this target in place our complete configuration file looks something like this:
/config/scavenger.php
<?php

$bing = [
    'example' => false,
    'serp' => true,
    'model' => \App\BingResult::class,
    'source' => 'https://bing.com',
    'search' => [
        'keywords' => ['dog'],
        'form' => [
            'selector' => 'form#sb_form',
            'keyword_input_name' => 'q',
        ],
    ],
    'pager' => [
        'selector' => '.sb_pagN',
    ],
    'pages' => 5,
    'markup' => [
        '__result' => '.b_algo',
        'title' => 'h2 a',
        'description' => '.b_caption p',
        'link' => '__link',
        'position' => '__position',
    ],
];

return [
    // debug mode?
    'debug' => false,

    // whether log file should be written
    'log' => true,

    // How much detail is expected in output, 1 being the lowest, 3 being highest.
    'verbosity' => 1,

    // Set the database config
    'database' => [
        // Scraps table
        'scraps_table' => env('SCAVENGER_SCRAPS_TABLE', 'scavenger_scraps'),
    ],

    // Daemon config - used to build daemon user
    'daemon' => [
        // Model to use for Daemon identification and login
        'model' => \App\User::class,

        // Model property to check for daemon ID
        'id_prop' => 'email',

        // Daemon ID
        'id' => '[email protected]',

        // Any additional information required to create a user:
        // NB. this is only used when creating a daemon user, there is no "safe" way
        // to change the daemon's password once he has been created.
        'info' => [
            'name' => 'Scavenger Daemon',
            'password' => 'pass',
        ],
    ],

    // guzzle settings
    'guzzle_settings' => [
        'timeout' => 60,
    ],

    // hashing algorithm to use
    'hash_algorithm' => 'sha512',

    // storage
    'storage' => [
        'log_dir' => env('SCAVENGER_LOG_DIR', 'scavenger'),
    ],

    // different model entities and mapping information
    'targets' => [
        'bing' => $bing,
    ],
];
Step 4 - Execution
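With this in place, it is time to hop over to artisan and begin scraping our SERP. Run your migrations first so that the bing_results table exists, then start the scrape with Scavenger's seek command:
php artisan migrate
php artisan scavenger:seek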
Step 5 - Results
When the run finishes, Scavenger prints a comprehensive summary, and our SERP have been inserted into the database.
And that's it!
We have 25 dog-based links scraped from Bing on which we may perform any analysis or actions we desire.
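As one example of such analysis, the sketch below counts how often each host appears in the results and records its best (lowest) position. It is plain PHP working on hypothetical sample rows; the summarizeByHost helper is made up for illustration and is not part of Scavenger. In a Laravel app the rows could instead come from something like \App\BingResult::all(['link', 'position'])->toArray().

```php
<?php

// Hypothetical sample rows, shaped like the link/position columns of bing_results.
$results = [
    ['link' => 'https://example.com/dogs', 'position' => 1],
    ['link' => 'https://example.org/puppies', 'position' => 2],
    ['link' => 'https://example.com/breeds', 'position' => 3],
];

/**
 * Group results by host, counting appearances and tracking the best position.
 * (Illustrative helper, not part of Scavenger.)
 */
function summarizeByHost(array $results): array
{
    $summary = [];

    foreach ($results as $row) {
        $host = parse_url($row['link'], PHP_URL_HOST);
        if (!is_string($host)) {
            continue; // skip rows whose link has no parsable host
        }

        if (!isset($summary[$host])) {
            $summary[$host] = ['count' => 0, 'best_position' => $row['position']];
        }

        $summary[$host]['count']++;
        $summary[$host]['best_position'] = min($summary[$host]['best_position'], $row['position']);
    }

    // Most frequently appearing hosts first.
    uasort($summary, function (array $a, array $b): int {
        return $b['count'] <=> $a['count'];
    });

    return $summary;
}

print_r(summarizeByHost($results));
```

Working on plain arrays keeps the helper independent of the database, so the same function can be fed from Eloquent, a CSV export, or a test fixture.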