file_get_contents returns 403 forbidden

I know it's quite an old thread but thought of sharing some ideas.

If you get no content when requesting a webpage, the site most likely does not want scripts to fetch it. So how does it tell a script from a human? Usually by the User-Agent header in the HTTP request sent to the server. By default, file_get_contents sends no User-Agent header at all unless the user_agent ini setting is configured.

So, to make the website treat the script like a browser, you must set the User-Agent header on the request. Many web servers will allow the request if the User-Agent header is set to a value used by a common web browser.

Some user agents used by common browsers:

  • Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36

  • Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0

  • etc...


// Build a stream context that sends a browser-like User-Agent header.
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

// The URL needs a scheme; without it PHP looks for a local file named "www.google.com".
echo file_get_contents("https://www.google.com", false, $context);

This code fakes the user agent and sends the request to https://www.google.com.
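Some servers look at more than the User-Agent, so it may help to send a few other headers a browser would normally send. Below is a minimal sketch of that idea; the helper name build_browser_context, the extra header values, and the 10-second timeout are assumptions for illustration, not something a particular site requires.

```php
<?php
// Hypothetical helper: build a stream context that sends a few
// browser-like request headers instead of only User-Agent.
function build_browser_context(string $userAgent)
{
    // Example header values; adjust them to what the target site expects.
    $headers = [
        "User-Agent: $userAgent",
        "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language: en-US,en;q=0.5",
    ];

    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            // The http wrapper expects header lines separated by CRLF.
            'header'  => implode("\r\n", $headers) . "\r\n",
            'timeout' => 10, // seconds; an arbitrary choice
        ],
    ]);
}
```

The returned context can be passed as the third argument of file_get_contents, exactly like $context above.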

References:

  • stream_context_create

Cheers!


This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.

The server may be blocking scripted requests to prevent scraping, or blocking your IP address if you have made too many requests.

You should probably talk to the administrator of the remote server.
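Before contacting anyone, it can be useful to confirm which status code the server actually returns. The following is a rough sketch, assuming the http stream wrapper is enabled; the helper names last_status_code and fetch_with_status are hypothetical. It relies on the ignore_errors context option so that a 403 response is still readable instead of producing a bare false.

```php
<?php
// Hypothetical helper: return the status code of the last status line
// found in a list of raw response header lines, or null if there is none.
function last_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 403 Forbidden"; redirects add several.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: fetch a URL and return [status code, body].
function fetch_with_status(string $url, string $userAgent): array
{
    $context = stream_context_create([
        'http' => [
            'header'        => "User-Agent: $userAgent\r\n",
            'ignore_errors' => true, // keep the response even on 4xx/5xx
        ],
    ]);

    $fp = @fopen($url, 'r', false, $context);
    if ($fp === false) {
        return [null, null]; // network-level failure (DNS, TLS, ...)
    }

    // For http streams, wrapper_data holds the raw response header lines.
    $headers = stream_get_meta_data($fp)['wrapper_data'] ?? [];
    $body = stream_get_contents($fp);
    fclose($fp);

    return [last_status_code($headers), $body];
}
```

A call such as fetch_with_status("http://www.example.com/", "My-Application/2.5") returns the status code and body as a pair, which shows whether the server answers 403 for every request or only for some user agents.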


Add this after you include simple_html_dom.php:

ini_set('user_agent', 'My-Application/2.5');

Alternatively, you can change the parser class like this, from line 35 on, so that it fetches pages with cURL:

function curl_get_contents($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

function file_get_html()
{
  $dom = new simple_html_dom;
  $args = func_get_args();
  $dom->load(call_user_func_array('curl_get_contents', $args), true);
  return $dom;
}

Have you tried another site?


It seems that the remote server has some type of blocking. It may be based on the user agent. If that is the case, you can try using cURL to simulate a web browser's user agent like this:

// $id comes from your own script.
$url = "http://www.example.com/viewProperty.html?id=" . $id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Send a browser-like User-Agent header.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
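To check whether changing the user agent actually helps, the same request can report the HTTP status next to the body. This is only a sketch, assuming the cURL extension is loaded; the function name fetch_as_browser, the array keys it returns, and the 10-second timeout are made up for illustration.

```php
<?php
// Hypothetical helper: GET a URL with a browser-like User-Agent and
// report the HTTP status alongside the body.
function fetch_as_browser(string $url, string $userAgent): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_TIMEOUT        => 10,    // seconds; an arbitrary choice
    ]);

    $body = curl_exec($ch);
    // The status is 0 when no HTTP response was received at all.
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error = ($body === false) ? curl_error($ch) : null;

    // The handle is released when $ch goes out of scope.
    return ['status' => $status, 'body' => $body, 'error' => $error];
}
```

If the returned status is still 403 with a browser user agent, the block is probably based on something else, such as your IP address.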