9 Tips & tricks to secure your WordPress site

It’s always a big concern when we talk about security. Hosting your website on a global network not only gives you global reach, good business opportunities, and great results, but it also exposes you to real risks: data loss, damage to your brand’s reputation, and a lot more.

When it comes to website security, you need to implement a few basics to make sure you stay on the safe side.

If you run a WordPress-based website, you can follow the tips and tricks below to secure it from attackers.

  • Use an updated version of WordPress
  • Always keep your plugins and themes up to date
  • Use fewer plugins
  • Turn off/disable XML-RPC if it is not being used.
  • Use SSL for your website
  • Always use a good password policy
  • Use Jetpack for security and monitoring of your WordPress site.
  • Keep away from unverified publishers of plugins and themes, or any other code you put on your site.
  • Last but not least, always host your website on a secure web host.

Use an updated version of WordPress

It’s always suggested that you should not keep an out-of-date version of WordPress, as it will open up loopholes for attackers to get inside your site.

WordPress is one of the most widely used frameworks for building websites, so it has a huge user base, which also makes it a big target for attackers.

A single loophole in WordPress can compromise millions of websites. So, to keep it secure and smooth, the WordPress community keeps developing security patches for WordPress and its plugins.

Sitting on an older version of WordPress makes you an easy target for attackers and also keeps you away from the latest features and functionality.

Always keep your plugins and themes up to date.

As discussed earlier, keeping your website’s code up to date keeps you safer from attackers and also ensures smooth performance and up-to-date functionality.

Plugins play a big role in making your website vulnerable, which is why we suggest keeping them up to date and always staying on the latest version.

The latest versions of WordPress can update your plugins automatically, and I suggest enabling that. If you have modified a plugin and do not want to update it, then you should keep a close eye on your site and its security.

Use fewer plugins

The more ready-made code you use, the more vulnerable you become. It is always suggested to use a limited number of plugins, as this shrinks your attack surface.

If you have installed more than 20 plugins, your site is not only less secure but also slower and less SEO- and user-friendly. Keep the radius small.

Turn off XML-RPC in WordPress

If you do not use XML-RPC, and no plugin or theme on your site requires it, you should turn it off.

We’ve come a long way since WordPress was first launched. Back in the day, the XML-RPC feature was extremely useful. In a time of slow internet speeds and constant lags, it was difficult to write content online in real time like we do now. XML-RPC enabled users to write their content offline, say in Microsoft Word, and then publish it all in one go. But you may not know that you should disable XML-RPC on your WordPress website.

Today, with faster internet speeds, the XML-RPC function has become redundant for most users. It still exists because the WordPress mobile app and some plugins like Jetpack utilize it.

If you don’t use any of these plugins, mobile apps, or remote connections, it’s best to disable it. Why? Every additional element on your site gives hackers one more opportunity to try to break in. Disabling the feature makes your site more secure.
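If you prefer not to add another plugin for this, a minimal sketch of the usual filter-based approach is below. It assumes you can add a small snippet to your (child) theme’s functions.php or drop it into a must-use plugin under wp-content/mu-plugins; note that this disables the XML-RPC methods attackers typically abuse, while blocking xmlrpc.php at the web-server level goes even further.

<?php
// Minimal sketch: disable XML-RPC in WordPress.
// Place in your child theme's functions.php or a must-use plugin.

// Reject the XML-RPC methods that require authentication.
add_filter( 'xmlrpc_enabled', '__return_false' );

// Optionally drop the X-Pingback header and the pingback method too.
add_filter( 'wp_headers', function ( $headers ) {
    unset( $headers['X-Pingback'] );
    return $headers;
} );
add_filter( 'xmlrpc_methods', function ( $methods ) {
    unset( $methods['pingback.ping'] );
    return $methods;
} );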

Use SSL on your website

It’s always good to use encryption, and SSL gives you exactly that. It keeps your website’s users safe and protects their data from man-in-the-middle attacks.

Using SSL not only adds an extra layer of security to your website but also helps your search listings, since search engines favor sites served over HTTPS. So keep an SSL certificate assigned to your website.
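Most hosts and many plugins can force HTTPS for you, but as a minimal application-level sketch (assuming a typical PHP setup where $_SERVER['HTTPS'] is populated on TLS requests), a redirect looks like this:

<?php
// Minimal sketch: redirect plain-HTTP visitors to the HTTPS version of the page.
// Assumes $_SERVER['HTTPS'] is set by the server when TLS is already in use.
if ( empty( $_SERVER['HTTPS'] ) || $_SERVER['HTTPS'] === 'off' ) {
    $target = 'https://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    header( 'Location: ' . $target, true, 301 ); // permanent redirect
    exit;
}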

Always use a good password policy

A good password policy keeps accounts safe from intruders. You should always make sure that you and your customers follow one.

Each user should update their password quarterly and use a combination of lowercase and uppercase letters, digits, and special characters. Maintain a minimum length of 8 characters, and do not allow easy strings such as the user’s own name.
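As a minimal sketch of the policy described above (the exact rules are up to you), a PHP check might look like this; the function name and rules are illustrative, not part of any library:

<?php
// Minimal sketch of the password policy described above.
function password_meets_policy( string $password, string $username ): bool {
    if ( strlen( $password ) < 8 ) {
        return false; // minimum length of 8 characters
    }
    if ( ! preg_match( '/[a-z]/', $password )          // lowercase letter
      || ! preg_match( '/[A-Z]/', $password )          // uppercase letter
      || ! preg_match( '/[0-9]/', $password )          // digit
      || ! preg_match( '/[^a-zA-Z0-9]/', $password ) ) // special character
    {
        return false;
    }
    // Reject passwords containing the user's own name.
    if ( $username !== '' && stripos( $password, $username ) !== false ) {
        return false;
    }
    return true;
}

// Example: password_meets_policy( 'S3cure!Pass', 'vikash' ) returns true.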

Use Jetpack for security and monitoring of your WordPress site.

Spam is one of the big concerns nowadays. People run bots to drop backlinks to their own websites, among other purposes. Filtering comments one by one and marking them as spam is a hectic task. You can use Jetpack to protect your website from different types of attacks, including spam, and it can also help you monitor the visit count of each post and page.

Jetpack also offers multiple blog-management features. You can go for the premium version to get additional security and management features for your website.

Keep away from unverified publishers of plugins and themes, or any other code you put on your site.

Unverified plugin and theme publishers pose a great risk to your website. To keep it safe, stay away from them, or use their code at your own risk. It is always suggested that you either go through the entire code before putting it on your website or do not use it at all.

I have seen many instances where malicious code comes bundled with something you genuinely need. For example, a plugin that adds social media login can also carry code that shares your credentials with another site. This can do great damage to your website and your brand’s reputation, and it can even lead to legal action.

It’s always better to stay away from such unverified publishers. Use a plugin only if it comes from the official WordPress plugin directory, if you know the plugin’s developer completely, if you had it developed by a company or freelancer you hired, or if you have gone through the entire code yourself and found it worth putting on your website.

Always host your website on a secure web host

One of the best ways to keep your website secure is to host it on a secure web hosting platform. A good web host works to reduce attacks on all of its hosted websites and provides multiple mechanisms to keep genuine traffic flowing.

We suggest CLOUDPOKO, one of the fastest-growing web hosting platforms, which provides secure web hosting solutions. Installing and using WordPress on the platform is easy and user-friendly.

The hosting provider also adds an extra layer of security to WordPress websites.

CONCLUSION

Whether you are a website developer or simply someone hosting a website on the global network, it is essential to secure it, not only to protect your data but also to protect your website and your brand’s reputation.

If you require any consultation, you can get in touch with us here.

Reduce server response times (TTFB)

When we talk about website performance, one of the ways to improve the performance is by reducing the server response time, also known as Time To First Byte (TTFB).

Server response time is the time taken to reply to a request initiated by a user. When a user requests a page, the server has to perform multiple operations, which may include mathematical and logical work, before sending the output back. The time these operations take before the response is sent is the server response time.

For example, if a user is looking to check his transaction history for the last year, the server will have to generate a list of all the transactions done within the given time period from the database and then push it into the page. This can take time and can lead to a greater response time.

How to reduce server response time?

To reduce the server response time, one needs to follow the basic steps as mentioned below:

  • Start by identifying the core task that your server needs to perform before providing the first output.
  • Identify the task taking the longest to complete, then try to optimize it or, if possible, use a different approach. For example, if you are loading blog posts, instead of loading all posts at once you can show the latest 5 at page load and make AJAX calls to load the rest as the user scrolls (see the sketch after this list).
  • Use a good server, with more processing capacity and more process threads, to host your website. One good web hosting provider is CLOUDPOKO.
  • If you are using a dedicated server or your own server to host your website, you should upgrade the hardware to meet the processing requirements.
  • Make as few database calls as possible; this reduces how long the server stays busy. As your customer base grows, too many database calls per request will make responses lag for your customers.
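As a rough sketch of the “load 5 now, fetch the rest later” idea from the list above, the page could call a small endpoint like this via AJAX with an increasing offset as the user scrolls. The table and column names (posts, id, title, created_at) and the credentials are placeholders, not a real schema:

<?php
// Minimal sketch: return one small batch of posts as JSON for AJAX loading.
// Requires the mysqlnd driver for get_result(), which most hosts provide.
$mysqli = new mysqli( 'hostname', 'user', 'password', 'database' );

$offset = isset( $_GET['offset'] ) ? max( 0, (int) $_GET['offset'] ) : 0;
$limit  = 5; // a small batch per request keeps the response time low

$stmt = $mysqli->prepare(
    'SELECT id, title FROM posts ORDER BY created_at DESC LIMIT ? OFFSET ?'
);
$stmt->bind_param( 'ii', $limit, $offset );
$stmt->execute();
$rows = $stmt->get_result()->fetch_all( MYSQLI_ASSOC );

header( 'Content-Type: application/json' );
echo json_encode( $rows );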

To check your server’s response time, you can visit – https://developers.google.com/speed/pagespeed/insights/

Here you’ll be able to check the speed of all your website’s pages using Google’s Lighthouse report. By analyzing the report, you’ll be able to find the changes required to enhance the performance of your website.

Must Read
How to enhance website performance
How to improve website speed

If you find this article useful, do share it on your social media feeds, it will be a great support and will help to boost our morale to bring similar posts for everyone.

How to improve Website Page speed?

What is Page Speed?

Page speed is often confused with “website speed”, but they are not the same thing. Page speed is specific to a page, not the entire website. It can be understood as the “page load time” (the time it takes to fully display the content on a specific page) or “time to first byte” (how long it takes for your browser to receive the first byte of information from the web server).

You can evaluate your page speed with Google’s PageSpeed Insights. PageSpeed Insights Speed Score incorporates data from CrUX (Chrome User Experience Report) and reports on two important speed metrics: First Contentful Paint (FCP) and DOMContentLoaded (DCL).

SEO Best Practices

Google has indicated that site speed (and as a result, page speed) is one of the signals used by its algorithm to rank pages. Research has also shown that Google may specifically measure time to first byte when it considers page speed. In addition, a slow page speed means that search engines can crawl fewer pages using their allocated crawl budget, which could negatively affect your indexation.

Page speed is also important to user experience. Pages with a longer load time tend to have higher bounce rates and lower average time on page. Longer load times have also been shown to negatively affect conversions.

Here are some of the many ways to increase your page speed:

Enable compression

Use Gzip, a software application for file compression, to reduce the size of your CSS, HTML, and JavaScript files that are larger than 150 bytes.

Do not use gzip on image files. Instead, compress these in a program like Photoshop where you can retain control over the quality of the image. See “Optimize images” below.
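How you enable compression depends on your server (Apache’s mod_deflate or nginx’s gzip module are the usual routes), but as a minimal PHP-level sketch, assuming the zlib extension is available, a script can compress its own output like this:

<?php
// Minimal sketch: gzip-compress a PHP page's output when the browser supports it.
// Prefer web-server-level compression (mod_deflate / nginx gzip) where you can.
if ( ! ob_start( 'ob_gzhandler' ) ) {
    ob_start(); // browser did not accept gzip, or zlib is unavailable
}
?>
<!DOCTYPE html>
<html>
  <body>
    <p>This HTML (and any CSS or JavaScript the script echoes) is sent
       gzip-compressed whenever the browser advertises support for it.</p>
  </body>
</html>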

Minify CSS, JavaScript, and HTML

By optimizing your code (including removing spaces, commas, and other unnecessary characters), you can dramatically increase your page speed. Also remove code comments, formatting, and unused code. Google recommends using CSSNano and UglifyJS.

Reduce redirects

Each time a page redirects to another page, your visitor faces additional time waiting for the HTTP request-response cycle to complete. For example, if your mobile redirect pattern looks like this: “example.com -> www.example.com -> m.example.com -> m.example.com/home,” each of those two additional redirects makes your page load slower.

Remove render-blocking JavaScript

Browsers have to build a DOM tree by parsing HTML before they can render a page. If your browser encounters a script during this process, it has to stop and execute it before it can continue. 

Google suggests avoiding and minimizing the use of blocking JavaScript.

Leverage browser caching

Browsers cache a lot of information (stylesheets, images, JavaScript files, and more) so that when a visitor comes back to your site, the browser doesn’t have to reload the entire page. Use a tool like YSlow to see if you already have an expiration date set for your cache. Then set your “expires” header for how long you want that information to be cached. In many cases, unless your site design changes frequently, a year is a reasonable time period. Google has more information about leveraging caching here.
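For assets served through PHP, a minimal sketch of sending caching headers looks like this (static files are usually better handled with “expires” rules in the web-server configuration):

<?php
// Minimal sketch: tell browsers to cache this response for one year.
$maxAge = 365 * 24 * 60 * 60; // one year, as suggested above

header( 'Cache-Control: public, max-age=' . $maxAge );
header( 'Expires: ' . gmdate( 'D, d M Y H:i:s', time() + $maxAge ) . ' GMT' );

// ... output the rarely changing resource here ...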

Improve server response time

Your server response time is affected by the amount of traffic you receive, the resources each page uses, the software your server uses, and the hosting solution you use. To improve your server response time, look for performance bottlenecks like slow database queries, slow routing, or a lack of adequate memory and fix them. The optimal server response time is under 200ms. Learn more about optimizing your time to first byte.

Use a content distribution network

Content distribution networks (CDNs), also called content delivery networks, are networks of servers that are used to distribute the load of delivering content. Essentially, copies of your site are stored at multiple, geographically diverse data centers so that users have faster and more reliable access to your site.

Optimize images

Be sure that your images are no larger than they need to be, that they are in the right file format (PNGs are generally better for graphics with fewer than 16 colors while JPEGs are generally better for photographs) and that they are compressed for the web.

Use CSS sprites to create a template for images that you use frequently on your sites like buttons and icons. CSS sprites combine your images into one large image that loads all at once (which means fewer HTTP requests) and then display only the sections that you want to show. This means that you are saving load time by not making users wait for multiple images to load.

Website Security – PHP: Implementing Security To Your Website

Security is one of the major concerns today, and when it comes to coding it is worth rethinking the best possible ways to secure a website.

We have jotted down some basic web security concepts (in PHP) that can be used to keep your code from being misused and protect you from some basic attacks.

Use of Nonce

A nonce is used to check whether the user is sending a request from a valid location, meaning a web page that was served by the genuine server.

How it works

As the name suggests, a nonce is a number that is used only once, for a single occasion.

When a user sends the initial request to a website, the server generates a unique session for the user, which is used to identify that user on every subsequent request.

But it is hard to verify whether the web page submitting a request actually came from the same website. For example, suppose I have a form on my website with the following code:

<form method="post" action="login.php">
<input type="text" name="username" />
<input type="password" name="password" />
<input type="submit" name="submit"/>
</form>

It simply means that the data under the keys “username” and “password” will be sent to the page named “login.php”.

The same form can be recreated in an automation tool on some other machine and used to send requests to the website, which can lead to a security breach.

To verify that the form was submitted from the website itself, a nonce is used.

<?php
session_start();

// Generate a random token and store it in the user's session.
// (For production use, a stronger source such as bin2hex(random_bytes(16))
// is preferable to md5(rand()).)
$_SESSION['nonce'] = md5(rand(1111, 99999));
$nonce = $_SESSION['nonce'];
?>
<form method="post" action="login.php">
<input type="hidden" name="nonce" value="<?php echo $nonce; ?>" />
<input type="text" name="username" />
<input type="password" name="password" />
<input type="submit" name="submit" />
</form>

In the code above, we generate a random number, hash it with md5(), and store it in the $nonce variable.

The same value is stored in the session too. When someone opens the page, a nonce is generated and stored in that user’s session; when the form is submitted, the nonce is sent back to the server and compared against the original. If it matches, the request came from a valid source; if not, it did not.

<?php
session_start();

// Validate the nonce that came back with the form submission.
if (isset($_POST['username']) && isset($_POST['password']) && isset($_POST['nonce'])) {
    if ($_SESSION['nonce'] == $_POST['nonce']) {
        echo "Submitted from a valid source";
    } else {
        echo "Submitted from an invalid source";
    }
}

// Generate a fresh nonce for the next form render.
$_SESSION['nonce'] = md5(rand(1111, 99999));
$nonce = $_SESSION['nonce'];
?>
<form method="post" action="login.php">
<input type="hidden" name="nonce" value="<?php echo $nonce; ?>" />
<input type="text" name="username" />
<input type="password" name="password" />
<input type="submit" name="submit" />
</form>

Securing files from being required/included from outside your website

It is common practice to write generic code once and require or include it in the files where it is needed.

This is a great way to implement the Don’t Repeat Yourself (DRY) principle, but a security issue can come into the picture here too.

Look at the following codes:

connection.php

<?php
 $connection = mysqli_connect('hostname', 'user','password','database');
?>

save.php

<?php
 require('connection.php');
 //some mysql transaction code goes here
?>

In the above two files, connection.php and save.php, you can see that the file only needs to be passed to require() and it will be included.

The file connection.php can potentially be included from elsewhere and used to expose the connection details, for example by another script on the same server, or remotely if the server is misconfigured (e.g. with allow_url_include enabled).
For example, someone could use the full URL of the file to try to require it in their own code, as below:

hackerFile.php

<?php
 require('https://abc.com/connection.php');
 print_r($connection);
?>

On a misconfigured server, code like this could expose all the connection information.

To secure it, we can define a constant that acts as a token and check it to make sure the file is being included from a valid location. For example:

connection.php

<?php
if(!defined('uniquenamevariable')){
 die('Nothing Found');
}
 $connection = mysqli_connect('hostname', 'user','password','database');
?>
save.php

<?php
define('uniquenamevariable', true);
require('connection.php');
//some mysql transaction code goes here
?>

So if anyone tries to require the file from outside, they will not know the unique constant name defined in the PHP code, and the include will stop before revealing anything.

There are many other security techniques; stay tuned to our blog to learn more...

Common types of websites: choose the one that suits you best

There are lots of options for creating websites; here we share some common types to give you helpful ideas. These include blogs, corporate or business sites, e-commerce, portfolio or photography sites, crowdfunding, news/magazine portals, social media, educational websites, portals, entertainment, directory listings, quiz websites, non-profit or religious websites, niche affiliate marketing websites, school or college websites, and wikis or community forums.

We break down some of them here.

1. Blogs or Personal Website

Are you an avid writer? Do you have ideas and thoughts you want to share with others? Are you looking for a platform to do this?

A blog is the perfect space for you. Typically managed by an individual or a small group, a blog can cover any topic, whether travel tips, financial advice, or movie reviews. Blogs are often written in an informal or conversational style, and paid or professional blogging is a good way to earn money online.

You can learn how to start your first blog today here.

2. Business Website

Are you a startup company? Thinking about where to start? First, get your business online. An online presence is important nowadays for every business. Many businesses still don’t have a website, and their impression on potential clients suffers because of it. An online presence gives you global reach and exposure.

A business website is not for selling anything directly; you use it to provide information about your ventures and to let your clients or customers know how they can get in touch with you.

A business or corporate website doesn’t cost much. You can build one easily and quickly with a CMS like WordPress, without any coding knowledge.

3. e-Commerce Website

This is the most popular and innovative way of earning these days. You can receive payments and manage inventory, shipping, taxes, and users all under the same roof.

You can merge your business website and blog with your e-commerce website. This helps with marketing too, as you can write blog posts to promote your products.

Click here to learn how to build an e-commerce website without any coding knowledge.


4. Portfolio Website

A portfolio website is similar to your physical portfolio, but here you can add design and interactive touches to make it more impressive.

A portfolio is generally used to showcase and promote your previous work. It can serve as a CV, making a great impression on the companies you approach. Whether you are a student or a working professional, you may need a platform to showcase your work, projects, or services and inspire others.

5. Brochure Website

A brochure website is your online business card. It is quite similar to a portfolio website, but where a portfolio showcases the projects you have done across your entire career, a brochure website shows only selected projects, designed for your clients as well as personal projects.

You can add back-links from the brochure website to your portfolio website. A brochure website may have only 5-6 pages. The information on the site focuses entirely on the business (not the customers) and is typically limited to these pages:

  • About Us (company history, values, mission, team, etc.)
  • Contact Us (phone number, email address, and contact form)
  • How it Works (for businesses with processes or systems)
  • Pricing (If pricing isn’t straightforward)
  • Portfolio (samples or external links)

6. Niche Affiliate Marketing Websites

If you have a lot of contacts and are always sharing new products and services with them, this is a great option for earning from home.

Start your affiliate marketing website and earn by sharing products with your contacts.

You can also kick-start your earnings by selling hosting, domains, or other online services simply by joining affiliate programs from good companies like CLOUDPOKO.


7. Portal Website

A portal website is used internally by businesses, schools, or institutions. It involves a login process and workflow automation in one place. Portals are quite complex to design, so they need an expert.

M/s VIKASH TECH provides experienced professionals with several years in this field who can help you design and develop such tools. These tools can automate your office work and help your business grow.

8. Educational Website

As the name suggests, educational websites are designed to provide educational information to learners. They contain lots of information and may include blogs, portfolios, or portals for educating students.

You can also start your own online classes and become an educator to share your knowledge.


And a lot more...

We hope you like this blog and have picked up some ideas about the different types of websites. If you are still unsure what you want to build, share it with us; our expert team will suggest the best solution for your business growth. We don’t charge for consultancy, so feel free to call us with any IT-related queries. We will be happy to serve you and share our knowledge. Click here to contact us.

Subscribe to our newsletter, so you will not miss our posts, news, or any offers from M/s VIKASH TECH.

How to install WordPress in CentOS Web Panel (CWP)

Using CWP hosting? Looking to set up your own blog? Need to install WordPress in your account?

If you are using CWP hosting and want to install WordPress in your account, it is quite simple. You can install WordPress by following the steps below.

To install WordPress from CWP:

  1. Log in to your CWP user account
  2. On the dashboard look for the WordPress icon (Addons section) and click on it
  3. Configure the options:
    – choose the protocol you want to use (https if you want to use SSL), and whether you want to access your site with www or not
    – choose the domain (you can have multiple domains on the same CWP account) on which you want to install WordPress
    – enter the desired directory – leave the field empty if you don’t want to install WP in a directory
    – enter the database name – CWP will fill this field automatically – you can leave it as it is
    – enter database username (this IS NOT the WordPress admin username)
    – database password – enter a strong password.
  4. Click the Install button, wait a few seconds, and open the site where you installed WordPress. Now you just have to enter some WP settings (language, admin username, password, email, etc.)
  5. Your WordPress installation is live now

If you are looking for better support, you can connect with us here

SEO / SEM / SMM

SEO, SEM and SMM are three mainstream channels to advertise your website.
SEO stands for Search Engine Optimization, SEM stands for Search Engine Marketing, and SMM, which is the newest of the three, stands for Social Media Marketing.

Knowing what they stand for is meaningless until you know what they can do for you.

No one outside Google knows exactly how it ranks pages. Moreover, its algorithm is constantly changing. Even though Google provides guides to help SEO experts, the results are still ambiguous.

It may sound hard, but to gain a better position in the online world, SEO is essential. Web pages that appear on the first page of results are widely perceived as belonging to industry leaders and superior brands.

SEO– search engine optimization; following some of the Google rules in order to increase the chances of the Google search engine listing your site near the beginning of a list for particular keyword searches.

SEM– search engine marketing; in addition to doing SEO to get to the “top” a website owner can now buy advertising or pay to have an ad at the top of a search engine results list; these can be things like pay per click or Yahoo or Bing ads.

SMO– social media optimization; basically, when using social media (Facebook, Twitter, etc.) you are making your profile more visible, so your social network activity and published content can be found more easily by people searching for resources and information that match your content.

SMM– social media marketing; in addition to the SMO, you guessed it, you pay to have an ad on the social media. This is how the Facebook ads show up on your page.

Web Application

Our web application development and custom software development services include everything from a simple content web site application to the most complex web-based internet applications, electronic business applications, and social network services.

We provide custom web application development services, including website design and development, software consulting, application integration, and application maintenance services. With our experienced web application developers, you will have no limitations and you will be able to save employee time and effort while you save money.

Our developers hold expertise in the latest web-based technologies, which helps us build easy-to-use, convenient applications to manage your company’s documentation, processes, and workflows.

Convert your business idea into an elegant custom web application using the combination of our technical expertise and business domain knowledge.

Here’s what a web application flow looks like:

  • User triggers a request to the web server over the Internet, either through a web browser or the application’s user interface
  • Web server forwards this request to the appropriate web application server
  • Web application server performs the requested task – such as querying the database or processing the data – then generates the results of the requested data
  • Web application server sends results to the web server with the requested information or processed data
  • Web server responds back to the client with the requested information that then appears on the user’s display

Increased Internet usage among companies and individuals has influenced the way businesses are run.

Web applications have many different uses, and with those uses, comes many potential benefits. Some common benefits of Web apps include:

  • Allowing multiple users access to the same version of an application.
  • Web apps don’t need to be installed.
  • Web apps can be accessed through various platforms such as a desktop, laptop, or mobile.
  • Can be accessed through multiple browsers.

We help turn your innovative ideas into results that exceed your expectations, as we are focused on working with you to meet your business goals.

Website Development

We enable website functionality as per the client’s requirement. We mainly deal with the non-design aspect of building websites, which includes coding and writing markup.

Our team holds expertise ranging from client-side to server-side development. We ensure optimized development so that your tool works faster and without hassles.

The purpose of a website can be to turn visitors into potential clients, to collaborate with a team, or to provide some other useful functionality. We turn your ideas into code.

How does this process work?
If you are planning to get yourself an online platform for your needs, we can help you design it. First, we will schedule a meeting and understand your requirements. Once you tell us all your requirements and the picture becomes clear to us, we prepare a quotation for your needs. The quotation includes:

  • Details of understanding of your project
  • Details of workflow
  • Details of database architecture
  • Details of manpower required
  • Details of technologies involved
  • Details of hardware / software needs
  • Details of time estimation
  • Details of cost estimation

After you are satisfied with the quotation, we move forward with the SRS development; otherwise, we revise the quotation until it reaches mutual satisfaction.

In the Software Requirement Specification (SRS) development phase, we produce another document containing the detailed requirement specification, which helps put your ideas on paper and move the project forward.

The development, quality assurance, and implementation phases follow, as per the SRS and the quotation.

We ensure industry-standard development, which includes responsive web design, an optimized coding structure, and on-time delivery for all kinds of projects.

You can get in touch with us in case of any requirement here

Why Search Engine Optimization (SEO)?

Before we start talking about WHY SEO, let’s first understand what SEO is.

What is SEO ?

Many times people don’t understand what SEO is, which costs them leads in the online market. So let us first start with what SEO exactly means. SEO stands for Search Engine Optimization, a technique for telling search engines about the content of your website so that when someone searches for related content on the internet, the search engine can show them your content.

To make this clearer, let us look at a situation and then understand what SEO actually does. Suppose a visitor comes to your website and goes through it. The human brain is smart enough to understand what sort of content is written there; the visitor can tell what you are promoting, whether it’s a book, a movie, a blog, and so on. For a search engine, however, the content on your website is just text, and it will not automatically understand what you are promoting. To make a search engine understand, you have to describe your content in a language it understands; put simply, you have to teach the search engine to read your website properly. This method of teaching a search engine to read a website is called Search Engine Optimization.

Why SEO ?

WHY IS SEO SO IMPORTANT?

Both of the questions above have a short and interesting answer. In today’s competitive world, everyone is racing to promote their work, service, or product. Most of the time, it is hard for an end user to find a proper result for their queries about a product or service on the net. To make this easier for the end user, search engines have built sets of algorithms that first read a website’s content, verify whether it is relevant to the end user, and then show it to them.

To get search engines to showcase your content, product, or service on their first page, SEO is considered one of the basic necessities of today’s internet world.

If you are selling a chair at a lower price than others but your website is not read properly by search engines, you can lose out simply for lacking SEO. You may not get proper leads from the internet, because the search engine will not understand your content and so will never show your products to end users.

SEO is not only about search engines: good SEO practices also improve the user experience and usability of a website. Users trust search engines, and having a presence in the top positions for the keywords they search for increases trust in your website.

How to do SEO ?

SEO work falls into three major phases, namely:

  • Technical phase
  • On-site phase
  • Off-site phase

All three phases involve different sets of techniques to make a website readable by search engines. We will deal with each of them one by one.

  1. Technical Phase – As its name suggests, the technical phase deals with the technical side of SEO: writing code for SEO, including meta tags, and structuring data using schema.org (see the sketch after this list).
  2. On-Site Phase – The on-site phase deals with the content written on the website, which should be kept fresh. If you are copying text from another website and think it will work, think again: search engines prefer fresh content, and you cannot earn engagement or visits with copied content. The on-site part covers how your content is written and how engaging it is. You can post articles, videos, images, etc. to increase engagement time on your website, which is considered SEO best practice.
  3. Off-Site Phase – The off-site phase is about promotion. If you promote your website with advertisements, they drive traffic to your site, increasing the hit count and engagement time; search engines notice this and improve your listing. Also, the more back-links you have, the more you get noticed by search engines.
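As a rough illustration of the technical phase, a PHP template might emit meta tags and schema.org structured data like this. The title, description, and URL values are placeholders; in a real site they would come from your CMS or database:

<?php
// Minimal sketch: meta tags plus schema.org JSON-LD from a PHP template.
$title       = 'Comfortable Office Chair';
$description = 'An ergonomic office chair at an affordable price.';
$url         = 'https://example.com/products/office-chair';
?>
<head>
  <title><?php echo htmlspecialchars( $title ); ?></title>
  <meta name="description" content="<?php echo htmlspecialchars( $description ); ?>" />
  <link rel="canonical" href="<?php echo htmlspecialchars( $url ); ?>" />

  <!-- structured data so search engines understand what the page is about -->
  <script type="application/ld+json">
  <?php echo json_encode( array(
      '@context'    => 'https://schema.org',
      '@type'       => 'Product',
      'name'        => $title,
      'description' => $description,
      'url'         => $url,
  ) ); ?>
  </script>
</head>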

CONCLUSION

You should always invest in SEO for your online presence. You can find many sites that will grade your SEO based on your content and configuration, but if you want SEO done properly for your website, hire an expert so that your site gets more hits and engagement. Your content is the most important part of SEO, so make it as precise as you can.

In case you require any help on SEO, feel free to get in touch with our team, write to us or call us here

Why NO to WordPress based website?

Although WordPress is free to use and simple for people with basic or no knowledge of software development, it is still not preferred by many companies for building their forums or blogs. There can be multiple reasons for this. Today we will look at why some say NO to WordPress websites.

A few of the major reasons why a website may be better built from scratch rather than on the WordPress framework are mentioned below.

SLOW IN SPEED

Generally, a simple website requires few functionalities and little code. But developers find it easy to build with WordPress because it offers many themes to choose from and many plugins that add functionality. This increases the amount of code on the website, raising processing requirements and slowing down the website’s overall performance. It may also lead to a lower ranking in Search Engine Optimization (SEO).

LACK OF FEATURES

WordPress plugins are designed to do a fixed set of things; if you want customized functionality, that becomes a very tough task. If you develop a website from scratch, you can build the functionality exactly to your requirements. In WordPress, if you need 5 functionalities that are interconnected, you will end up adding 5 separate plugins, which not only increases the processing load on the server but also slows down the website. And if you need changes later, it gets hectic, because the plugins generally come from different vendors, and connecting and changing their code can be a big problem.

SUPPORT ISSUE

If you have developed a WordPress website yourself and are unable to troubleshoot an issue or make a change, you will not get any support unless you have a paid subscription, and you might even have to hire a WordPress developer or search the net for solutions for a long time.

LACKS DESIGN FLEXIBILITY

WordPress provides very good-looking themes for building a website, but these themes are difficult to modify; if you are developing the site yourself, you will often run into issues changing the design. If you purchase a theme from a vendor, you can expect some changes from them, but only to a limited extent.

VULNERABLE

WordPress is often considered vulnerable because some plugins are written with loopholes, and anyone who uses such a plugin makes their website vulnerable; it’s almost impossible to review every vendor’s code before going live. There can also be cases where you install an unknown plugin and make your website and its data insecure, causing losses in multiple ways.

PROPRIETARY

Even though you have invested your team and labor in developing the website, you will not hold all the rights, since you don’t fully own the code. You also remain exposed to WordPress vulnerabilities and its very frequent updates. If you want full copyright over your code, it is preferable not to use the WordPress framework for development.

DO YOUR RESEARCH BEFORE YOU START DEVELOPING YOUR WEBSITE.

Our team can help you with this research. Get in touch with us today! Connect with us here

URL Encoding – Look before you hit any URL

It has become a common trend to send users a tiny URL in messages or emails for different purposes. Once you click that URL, you are taken to the long/full URL. The full URL can be read once you understand the concept of URL encoding.

A URL is composed of a limited set of characters belonging to the US-ASCII character set. These characters include digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").

ASCII control characters (e.g. backspace, vertical tab, horizontal tab, line feed, etc.), unsafe characters like the space, ", <, >, {, and }, and any character outside the ASCII charset are not allowed to be placed directly within URLs.

Moreover, there are some characters that have special meaning within URLs. These characters are called reserved characters. Some examples of reserved characters are ?, /, #, : etc. Any data transmitted as part of the URL, whether in query string or path segment, must not contain these characters.

One of the most frequent URL-encoded characters you’re likely to encounter is the space. The ASCII value of the space character in decimal is 32, which converts to 20 in hex. We then precede the hexadecimal representation with a percent sign (%), which gives us the URL-encoded value: %20.

The following table uses rules defined in RFC 3986 for URL encoding.

Decimal | Character | URL Encoding (UTF-8)
0 | NUL (null character) | %00
1 | SOH (start of header) | %01
2 | STX (start of text) | %02
3 | ETX (end of text) | %03
4 | EOT (end of transmission) | %04
5 | ENQ (enquiry) | %05
6 | ACK (acknowledge) | %06
7 | BEL (bell) | %07
8 | BS (backspace) | %08
9 | HT (horizontal tab) | %09
10 | LF (line feed) | %0A
11 | VT (vertical tab) | %0B
12 | FF (form feed) | %0C
13 | CR (carriage return) | %0D
14 | SO (shift out) | %0E
15 | SI (shift in) | %0F
16 | DLE (data link escape) | %10
17 | DC1 (device control 1) | %11
18 | DC2 (device control 2) | %12
19 | DC3 (device control 3) | %13
20 | DC4 (device control 4) | %14
21 | NAK (negative acknowledge) | %15
22 | SYN (synchronize) | %16
23 | ETB (end transmission block) | %17
24 | CAN (cancel) | %18
25 | EM (end of medium) | %19
26 | SUB (substitute) | %1A
27 | ESC (escape) | %1B
28 | FS (file separator) | %1C
29 | GS (group separator) | %1D
30 | RS (record separator) | %1E
31 | US (unit separator) | %1F
32 | space | %20
33 | ! | %21
34 | " | %22
35 | # | %23
36 | $ | %24
37 | % | %25
38 | & | %26
39 | ' | %27
40 | ( | %28
41 | ) | %29
42 | * | %2A
43 | + | %2B
44 | , | %2C
45 | - | %2D
46 | . | %2E
47 | / | %2F
48 | 0 | %30
49 | 1 | %31
50 | 2 | %32
51 | 3 | %33
52 | 4 | %34
53 | 5 | %35
54 | 6 | %36
55 | 7 | %37
56 | 8 | %38
57 | 9 | %39
58 | : | %3A
59 | ; | %3B
60 | < | %3C
61 | = | %3D
62 | > | %3E
63 | ? | %3F
64 | @ | %40
65 | A | %41
66 | B | %42
67 | C | %43
68 | D | %44
69 | E | %45
70 | F | %46
71 | G | %47
72 | H | %48
73 | I | %49
74 | J | %4A
75 | K | %4B
76 | L | %4C
77 | M | %4D
78 | N | %4E
79 | O | %4F
80 | P | %50
81 | Q | %51
82 | R | %52
83 | S | %53
84 | T | %54
85 | U | %55
86 | V | %56
87 | W | %57
88 | X | %58
89 | Y | %59
90 | Z | %5A
91 | [ | %5B
92 | \ | %5C
93 | ] | %5D
94 | ^ | %5E
95 | _ | %5F
96 | ` | %60
97 | a | %61
98 | b | %62
99 | c | %63
100 | d | %64
101 | e | %65
102 | f | %66
103 | g | %67
104 | h | %68
105 | i | %69
106 | j | %6A
107 | k | %6B
108 | l | %6C
109 | m | %6D
110 | n | %6E
111 | o | %6F
112 | p | %70
113 | q | %71
114 | r | %72
115 | s | %73
116 | t | %74
117 | u | %75
118 | v | %76
119 | w | %77
120 | x | %78
121 | y | %79
122 | z | %7A
123 | { | %7B
124 | vertical bar (|) | %7C
125 | } | %7D
126 | ~ | %7E
127 | DEL (delete) | %7F

Now if you keep a note of all the above, you can interpret what the URL says. 

In general, you see a URL in the following format:

https://vikashtech.com/abcd.php?q=hello%20world
In the above URL:
https indicates that the connection is encrypted with SSL/TLS and that the certificate confirms you are talking to the right server,
vikashtech.com is the domain name of the website,
abcd.php is the page that you are requesting,
? means you are passing some values to the URL,
q is the key (the index through which the programming language will receive the value), and
hello%20world is the value of the key q, where %20 encodes a space.
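PHP exposes this encoding directly; here is a minimal sketch using the standard functions (rawurlencode() follows the RFC 3986 rules shown in the table above):

<?php
// Minimal sketch: URL-encoding and decoding in PHP.
$value = 'hello world';

echo rawurlencode( $value );          // hello%20world (space becomes %20)
echo "\n";
echo urlencode( $value );             // hello+world (form encoding uses "+")
echo "\n";
echo rawurldecode( 'hello%20world' ); // hello world
echo "\n";

// http_build_query() encodes every key and value when building a query string.
echo 'https://vikashtech.com/abcd.php?' . http_build_query( array( 'q' => $value ) );
// https://vikashtech.com/abcd.php?q=hello+world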

If you receive emails from unknown sources offering some sort of benefit after filling in a form or downloading some software, please make sure the URL is correct and trustworthy before you actually click it.

Such emails are generally sent to pre-targeted groups of people. If you or your organization is receiving them, please make people aware of it and get your network and email filtered, as this can lead to a very big disaster. For any kind of network, email, or organizational IT setup, feel free to get in touch with our team. We will make sure your network and your organization stay clear of security problems.

Click here to connect with us today!

A guide to preventing web scraping

Essentially, hindering scraping means that you need to make it difficult for scripts and machines to get the wanted data from your website, while not making it difficult for real users and search engines.

Unfortunately, this is hard, and you will need to make trade-offs between preventing scraping and degrading the accessibility for real users and search engines.

In order to hinder scraping (also known as web scraping, screen scraping, web data mining, web harvesting, or web data extraction), it helps to know how these scrapers work and what prevents them from working well, and that is what this guide is about.

Generally, these scraper programs are written in order to extract specific information from your site, such as articles, search results, product details, or in some cases, artist and album information. Usually, people scrape websites for specific data, in order to reuse it on their own site (and make money out of your content !) or to build alternative frontends for your site (such as mobile apps), or even just for private research or analysis purposes.

Essentially, there are various types of scraper, and each works differently:

  • Spiders, such as Google’s bot or website copiers like HTTrack, which visit your website and recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
  • Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the desired data, usually using a shell script. These are the simplest kind of scraper, and also the most fragile kind (Don’t ever try to parse HTML with regex !). These are thus the easiest kind of scraper to break and screw with.
  • HTML scrapers and parsers, such as ones based on Jsoup, Scrapy, and many others. Similar to shell-script regex-based ones, these work by extracting data from your pages based on patterns in your HTML, usually ignoring everything else. So, for example: If your website has a search feature, such a scraper might submit an HTTP request for a search, and then get all the result links and their titles from the results page HTML, sometimes hundreds of times for hundreds of different searches, in order to specifically get only search result links and their titles. These are the most common.
  • Screenscrapers, based on eg. Selenium or PhantomJS, which actually open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
    • Getting the HTML from the browser after your page has been loaded and JavaScript has run and then using an HTML parser to extract the desired data or text. These are the most common, and so many of the methods for breaking HTML parsers/scrapers also work here.
    • Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
    Browser-based screen scrapers are harder to deal with, as they run scripts, render HTML, and can behave like a real human browsing your site.
  • Web scraping services such as ScrapingHub or Kimono. In fact, there are people whose job is to figure out how to scrape your site and pull out the content for others to use. These sometimes use large networks of proxies and ever-changing IP addresses to get around limits and blocks, so they are especially problematic. Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
  • Embedding your website in other site’s pages with frames, and embedding your site in mobile apps. While not technically scraping, this is also a problem, as mobile apps (Android and iOS) can embed your website, and even inject custom CSS and JavaScript, thus completely changing the appearance of your site, and only showing the desired information, such as the article content itself or the list of search results, and hiding things like headers, footers, or ads.
  • Human copy – and – paste: People will copy and paste your content in order to use it elsewhere. Unfortunately, there’s not much you can do about this.

There is a lot of overlap between these different kinds of scraper, and many scrapers will behave similarly, even though they use different technologies and methods to get your content.

This collection of tips is mostly my own ideas, various difficulties I’ve encountered while writing scrapers, and bits of information and ideas from around the web.

How to prevent scraping

Some general methods to detect and deter scrapers:

Monitor your logs & traffic patterns; limit access if you see unusual activity:

Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.

Specifically, some ideas:

  • Rate limiting: Only allow users (and scrapers) to perform a limited number of actions in a certain time – for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers and make them ineffective. You could also show a captcha if actions are completed faster than a real user could manage. (A minimal rate-limiting sketch follows this list.)
  • Detect unusual activity: If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
  • Don’t just monitor & rate limit by IP address – use other indicators too: If you do block or rate limit, don’t just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users/scrapers include:
    • How fast users fill out forms, and where on a button they click;
    • You can gather a lot of information with JavaScript, such as screen size/resolution, timezone, installed fonts, etc; you can use this to identify users.
    • Http headers and their orders, especially User-Agent.
    As an example, if you get many requests from a single IP address, all using the same User-Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks the button in the same way and at regular intervals, it’s probably a screen scraper. You can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won’t inconvenience real users on that IP address, eg. in case of a shared internet connection. You can also take this further: if you can identify similar requests even when they come from different IP addresses, that is indicative of distributed scraping (a scraper using a botnet or a network of proxies), and you can block those too. Again, be careful not to inadvertently block real users. This approach can be effective against screen scrapers which run JavaScript, as you can get a lot of information from them.
  • Instead of temporarily blocking access, use a Captcha: The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.
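Here is the minimal rate-limiting sketch promised above. It assumes the APCu extension is available for shared in-memory counters; in production you would more likely rate limit at the web server, a WAF, or a store like Redis, and combine the IP with the other signals discussed above:

<?php
// Minimal sketch: per-IP rate limiting with APCu (assumed to be installed).
$ip     = $_SERVER['REMOTE_ADDR'];
$key    = 'rate:' . $ip;
$limit  = 5; // allowed requests ...
$window = 1; // ... per this many seconds

apcu_add( $key, 0, $window );   // create the counter with a TTL if it is new
$count = apcu_inc( $key );      // then increment it for this request

if ( $count !== false && $count > $limit ) {
    http_response_code( 429 );  // Too Many Requests
    exit( 'Sorry, something went wrong. Please try again later.' );
}

// ... otherwise continue handling the request (e.g. run the search) ...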

Require registration & login

Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers but is also a good deterrent for real users.

  • If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping and ban it. Things like rate-limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.

In order to avoid scripts creating many accounts, you should:

  • Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
  • Require a captcha to be solved during registration/account creation, again to prevent scripts from creating accounts.

Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.

Block access from cloud hosting and scraping service IP addresses

Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or Google App Engine, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services. You can also block access from IP addresses used by scraping services.

Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.

Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.

Make your error message nondescript if you do block

If you do block/limit access, you should ensure that you don’t tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:

  • Too many requests from your IP address, please try again later.
  • Error, User-Agent header not present!

Instead, show a friendly error message that doesn’t tell the scraper what caused it. Something like this is much better:

  • Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.

This is also a lot more user-friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don’t block and thus cause legitimate users to contact you.

Use Captchas if you suspect that your website is being accessed by a scraper.

Captchas (“Completely Automated Public Turing test to tell Computers and Humans Apart”) are very effective at stopping scrapers. Unfortunately, they are also very effective at irritating users.

As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn’t a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.

Things to be aware of when using Captchas:

  • Don’t roll your own, use something like Google’s reCaptcha : It’s a lot easier than implementing a captcha yourself, it’s more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it’s also a lot harder for a scripter to solve than a simple image served from your site
  • Don’t include the solution to the captcha in the HTML markup: I’ve actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don’t do something like this. Again, use a service like reCaptcha, and you won’t have this kind of problem (if you use it properly).
  • Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.

Serve your text content as an image

You can render the text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.

However, this is bad for screen readers, search engines, performance, and pretty much everything else. It’s also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it’s also easy to circumvent with some OCR, so don’t do it.

You can do something similar with CSS sprites, but that suffers from the same problems.

Don’t expose your complete dataset:

If feasible, don’t provide a way for a script or bot to get your entire dataset. As an example: you have a news site with lots of individual articles. You could make those articles accessible only via the on-site search, so that, without a list of all the articles and their URLs anywhere, they can only be reached through the search feature. This means that a script wanting to get all the articles off your site would have to search for every possible phrase which might appear in them in order to find them all, which is time-consuming, horribly inefficient, and will hopefully make the scraper give up.

This will be ineffective if:

  • The bot/script does not want/need the full dataset anyway.
  • Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
  • There are other ways to eventually find all the articles, such as by writing a script to follow links within articles that lead to other articles.
  • Searching for something like “and” or “the” can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
  • You need search engines to find your content.

Don’t expose your APIs, endpoints, and similar things:

Make sure you don’t expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described in the obfuscation tip further below.

To deter HTML parsers and scrapers:

Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screen scrapers too.

Frequently change your HTML

Scrapers that process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.

If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.

  • You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially but will break after a week. Make sure to change the length of your ids/classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too. (A sketch of this rotation follows this list.)
  • If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add/remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server-side HTML processing, this should not be too hard.
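
As an illustration of the first point, here is a sketch of how the class name could be derived from a server-side secret and the current week, so it rotates automatically and varies in length. The secret and the helper name are invented for this example:

<?php
// Derive the "article content" class name from a secret and the current ISO
// week, so it changes every week and its length varies too.
function rotatingClass(string $base, string $secret = 'change-me'): string {
    $week   = date('o-W');                      // e.g. "2024-15"
    $digest = hash('sha256', $base . $secret . $week);
    $length = 8 + (hexdec($digest[0]) % 8);     // vary the length: 8-15 chars
    return 'c' . substr($digest, 0, $length);   // class names must not start with a digit
}

$articleClass = rotatingClass('article-content');
?>
<div class="<?php echo $articleClass; ?>">
    <!-- article text here -->
</div>

The same function has to be used when generating your stylesheet and scripts, otherwise your own styling and JavaScript break along with the scrapers.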

Things to be aware of:

  • It will be tedious and difficult to implement, maintain, and debug.
  • You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
  • Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.

Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.

Change your HTML based on the user’s location

This is sort of similar to the previous tip. If you serve different HTML based on your user’s location/country (determined by IP address), this may break scrapers that are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it’s actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
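
A rough sketch of this, assuming some IP-to-country lookup is available (the PECL geoip extension is used here, but any GeoIP service would do; the template names are placeholders):

<?php
// Pick between two structurally different (but visually identical) markup
// variants based on the visitor's country, so a scraper built against one
// variant breaks for users in another region.
$country = function_exists('geoip_country_code_by_name')
    ? geoip_country_code_by_name($_SERVER['REMOTE_ADDR'])
    : 'US'; // fallback if no lookup is available

if (in_array($country, ['US', 'CA'], true)) {
    include 'templates/article-variant-a.php';  // placeholder template names
} else {
    include 'templates/article-variant-b.php';
}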

Frequently change your HTML, and actively screw with the scrapers by doing so!

An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:

<div class="search-result">
  <h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class"search-result-link" href="/stories/stack-overflow-has-become-the-most-popular">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)

As you may have guessed, this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here’s how the search results page could be changed:

<div class="the-real-search-result">
  <h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
  <p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
  <a class"the-real-search-result-link" href="/stories/stack-overflow-has-become-the-most-popular">Read more</a>
</div>

<div class="search-result" style="display:none">
  <h3 class="search-result-title">Visit example.com now, for all the latest Stack Overflow related news !</h3>
  <p class="search-result-excerpt">EXAMPLE.COM IS SO AWESOME, VISIT NOW! (Real users of your site will never see this, only the scrapers will.)</p>
  <a class"search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)

This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they’re hidden with CSS.

Screw with the scraper: Insert fake, invisible honeypot data into your page

Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:

<div class="search-result" style="display:none">
  <h3 class="search-result-title">This search result is here to prevent scraping</h3>
  <p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
  Note that clicking the link below will block access to this site for 24 hours.</p>
  <a class"search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)

A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won’t visit the link. A genuine and desirable spider such as Google’s will not visit the link either, because you disallowed /scrapertrap/ in your robots.txt (don’t forget this!).

You can make your scrapertrap.php do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.
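
A minimal sketch of what that scrapertrap.php might look like, assuming a simple flat-file blocklist (the file name and the 24-hour ban are arbitrary choices; a database or cache would work just as well):

<?php
// scrapertrap.php -- honeypot endpoint. It appends the visiting IP to a
// blocklist file; your other pages then check this file and block or captcha
// those IPs.
$ip   = $_SERVER['REMOTE_ADDR'];
$line = sprintf("%s %d\n", $ip, time() + 24 * 3600); // banned until: now + 24h
file_put_contents(__DIR__ . '/banned-ips.txt', $line, FILE_APPEND | LOCK_EX);

// Don't reveal what happened; return the same nondescript error as elsewhere.
http_response_code(403);
echo 'Sorry, something went wrong.';

Your regular pages would then read banned-ips.txt (or whatever store you use) and apply the block or captcha, and, as noted below, /scrapertrap/ must be disallowed in robots.txt.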

  • Don’t forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don’t fall into it.
  • You can / should combine this with the previous tip of changing your HTML frequently.
  • Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to consider replacing the inline CSS used for hiding with an ID attribute and external CSS, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially but breaks after a while. This also applies to the previous tip.
  • Beware that malicious people can post something like [img]http://yoursite.com/scrapertrap/scrapertrap.php[img] on a forum (or elsewhere), and thus DoS legitimate users: their browsers will hit your honeypot URL when they visit that forum, and they’ll be blocked. So the previous tip of changing the URL is doubly important, and you could also check the Referer.

Serve fake and useless data if you detect a scraper

If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make the fake data indistinguishable from real data, so that scrapers don’t know they’re being screwed with.

As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper collects. If your fake articles are indistinguishable from the real thing, you’ll make it hard for scrapers to get what they want, namely the actual, real articles.
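
A rough sketch of that idea, assuming you already have some detection in place (the isSuspectedScraper() helper and the fake-article directory are placeholders for whatever you actually use):

<?php
// If the request has already been flagged as a scraper, serve a pre-generated
// fake article instead of blocking.
function isSuspectedScraper(): bool {
    // e.g. check the banned-ips list, rate counters, missing headers, ...
    return false;
}

if (isSuspectedScraper()) {
    $fakeArticles = glob(__DIR__ . '/fake-articles/*.html'); // pre-generated junk
    if ($fakeArticles) {
        readfile($fakeArticles[array_rand($fakeArticles)]);
        exit; // scraper gets plausible-looking but worthless content
    }
}
// ...otherwise render the real article as usual.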

Don’t accept requests if the User-Agent is empty/missing

Often, lazily written scrapers will not send a User-Agent header with their request, whereas all browsers, as well as search engine spiders, will.

If you get a request where the User-Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)

It’s trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
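
A minimal sketch of such a check in PHP (here it simply blocks with the generic message from earlier; you could show a captcha or serve fake data instead):

<?php
// Treat a missing/empty User-Agent as suspicious.
$userAgent = trim($_SERVER['HTTP_USER_AGENT'] ?? '');
if ($userAgent === '') {
    http_response_code(403);
    exit('Sorry, something went wrong. Contact helpdesk@example.com if the problem persists.');
}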

Don’t accept requests if the User-Agent is a common scraper one; blacklist ones used by scrapers

In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:

  • “Mozilla” (just that, nothing else; I’ve seen a few scrapers using exactly that, and a real browser will never send only that)
  • “Java/1.7.0_43” (by default, Java’s HttpURLConnection sends something like this)
  • “BIZCO EasyScraping Studio 2.0”
  • “wget”, “curl”, “libcurl”, .. (Wget and cURL are sometimes used for basic scraping)

If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
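
A sketch of such a blacklist check, using case-insensitive substring matching. The list below is only illustrative; note that a string like plain “Mozilla” would need an exact match rather than a substring match, since every real browser’s User-Agent also contains it:

<?php
// Hand-maintained blacklist of User-Agent fragments (illustrative only).
$uaBlacklist = ['wget', 'curl', 'libcurl', 'java/', 'python-requests'];

$userAgent = strtolower($_SERVER['HTTP_USER_AGENT'] ?? '');
foreach ($uaBlacklist as $fragment) {
    if ($userAgent !== '' && strpos($userAgent, $fragment) !== false) {
        http_response_code(403);
        exit('Sorry, something went wrong.');
    }
}
// A UA of exactly "mozilla" would need a separate equality check here.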

Check the Referer header

Adding on to the previous item, you can also check the [Referer](https://en.wikipedia.org/wiki/HTTP_referer) header (yes, it’s Referer, not Referrer), as lazily written scrapers may not send it, or may always send the same thing (sometimes “google.com”). As an example, if the user comes to an article page from an on-site search results page, check that the Referer header is present and points to that search results page.

Beware that:

  • Real browsers don’t always send it either;
  • It’s trivial to spoof.

Again, as an additional measure against poorly written scrapers it may be worth implementing.

If it doesn’t request assets (CSS, images), it’s not a real browser.

A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won’t, as they are only interested in the actual pages and their content.

You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.

Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.

Use and require cookies; use them to track user and scraper actions.

You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers; however, it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing a captcha on a per-user instead of a per-IP basis.

For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it’s probably a scraper.
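
A sketch of that search example, using a signed cookie so scrapers can’t simply forge new identifiers. The cookie name, the threshold, the HMAC secret, and the use of APCu for the counter are all arbitrary choices for this illustration; any shared store (database, Redis, ..) would work:

<?php
$secret = 'change-me';

// Give each searcher a signed identifier cookie.
if (!isset($_COOKIE['visitor_id'])) {
    $id  = bin2hex(random_bytes(16));
    $sig = hash_hmac('sha256', $id, $secret);
    setcookie('visitor_id', $id . '.' . $sig, time() + 86400, '/');
}

// Verify the cookie's signature so scrapers can't just invent new identifiers.
function visitorId(string $secret): ?string {
    if (!isset($_COOKIE['visitor_id'])) {
        return null;
    }
    [$id, $sig] = array_pad(explode('.', $_COOKIE['visitor_id'], 2), 2, '');
    return hash_equals(hash_hmac('sha256', $id, $secret), $sig) ? $id : null;
}

// On each result-page view, increment a per-visitor counter and flag heavy use.
$id = visitorId($secret);
if ($id !== null && function_exists('apcu_fetch')) {
    $key   = 'views_' . $id;
    $views = (int) apcu_fetch($key) + 1;
    apcu_store($key, $views, 3600); // counter expires after an hour
    if ($views > 50) {
        // Opening this many results is probably a scraper: captcha or block here.
    }
}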

Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled if your site only works with cookies.

Note that if you use JavaScript to set and retrieve the cookie, you’ll block scrapers which don’t run JavaScript, since they can’t retrieve and send the cookie with their request.

Use JavaScript + Ajax to load your content

You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.

Be aware of:

  • Using JavaScript to load the actual content will degrade user experience and performance
  • Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages but may be for other things, such as article pages.
  • A programmer writing a scraper who knows what they’re doing can discover the endpoints where the content is loaded from and use them.

Obfuscate your markup, network requests from scripts, and everything else.

If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64, or more complex with multiple layers of obfuscation, bit-shifting, and maybe even encryption), and then decode and display it on the client, after fetching it via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.

  • If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg. by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML. (A sketch of this follows the list.)
  • You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using an HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
  • You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
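
To make the first point concrete, here is a sketch of such an Ajax endpoint: it refuses requests that don’t carry a token issued when the page was rendered, and returns its payload base64-encoded for the page’s JavaScript to decode with atob(). The token name and session storage are example choices:

<?php
// data-endpoint.php -- only answers requests carrying the token the page got.
session_start();

// The page that embeds the JavaScript would have done:
//   $_SESSION['ajax_token'] = bin2hex(random_bytes(16));
// and printed that token into its inline script.
$given = $_GET['token'] ?? '';
if (!isset($_SESSION['ajax_token']) || !hash_equals($_SESSION['ajax_token'], $given)) {
    http_response_code(403);
    exit('Sorry, something went wrong.');
}

$payload = json_encode([
    'title' => 'Example article title',
    'body'  => 'Example article body...',
]);

header('Content-Type: text/plain');
echo base64_encode($payload);  // client does: JSON.parse(atob(responseText))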

There are several disadvantages to doing something like this, though:

  • It will be tedious and difficult to implement, maintain, and debug.
  • It will be ineffective against scrapers and screen scrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don’t run JavaScript though)
  • It will make your site nonfunctional for real users if they have JavaScript disabled.
  • Performance and page-load times will suffer.

Non-Technical:

Your hosting provider may provide bot and scraper protection:

For example, Cloudflare provides anti-bot and anti-scraping protection, which you just need to enable, and so does AWS. There is also mod_evasive, an Apache module which lets you implement rate-limiting easily.

Tell people not to scrape, and some will respect it

You should tell people not to scrape your site, eg. in your Terms of Service or conditions of use. Some people will actually respect that, and will not scrape data from your website without permission.

Find a lawyer

They know how to deal with copyright infringement, and can send a cease-and-desist letter. The DMCA is also helpful in this regard.

This is the approach Stack Overflow and Stack Exchange use.

Make your data available, provide an API:

This may seem counterproductive, but you could make your data easily available and require attribution and a link back to your site. Maybe even charge $$$ for it..

Again, Stack Exchange provides an API, but with attribution required.

Miscellaneous:

  • Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, so you will need to find compromises.
  • Don’t forget your mobile site and apps: If you have a mobile version of your site, beware that scrapers can also scrape that. If you have a mobile app, that can be screen scraped too, and network traffic can be inspected to figure out the REST endpoints it uses.
  • If you serve a special version of your site for specific browsers, eg. a cut-down version for older versions of Internet Explorer, don’t forget that scrapers can scrape that, too.
  • Use these tips in combination, pick what works best for you.
  • Scrapers can scrape other scrapers: If there is one website that shows content scraped from your website, other scrapers can scrape from that scraper’s website.

What’s the most effective way?

In my experience of writing scrapers and of helping people write them on Stack Overflow, the most effective methods are:

  • Changing the HTML markup frequently
  • Honeypots and fake data
  • Using obfuscated JavaScript, AJAX, and Cookies
  • Rate limiting, combined with scraper detection and subsequent blocking.