Content Marketing: How scrapers impact your content strategy
One issue that probably receives less attention than it deserves is content scraping. This is a particular problem with easily digested material such as blog posts, whitepapers and articles.
Less-than-scrupulous website owners will go to your site, scrape your content and repost your work on their own websites.
This hurts your content marketing strategy in two major ways: one, it dilutes your brand awareness because some people will find your content on someone else’s website; and two, the duplicate content confuses search engines and negatively affects your SEO.
To find out more about content scraping, and learn some tricks to combat the practice, I spoke with Rami Essaid, co-founder and CEO of Distil, a company that protects websites against unauthorized scraping.
As you might guess, this topic is near and dear to Rami’s heart, and he provides insight into how it happens and what you can do proactively to protect your content.
MarketingSherpa: Tell me why content marketers should be aware of, and concerned about, content scraping.
Rami Essaid: Marketing has shifted toward content marketing as the medium to drive traffic to websites. It’s so powerful because it provides valuable information to the end user, and allows marketers to brand within the content while sending out the company’s message.
By having that content diluted and copied around the world, you are not able to capitalize on one hundred percent of the market reading your content.
Any time you put something out there and it gets copied, scraped and duplicated, people are consuming it all around the world, but they are not consuming it from you, and you lose the effectiveness of all the hard work that you put into that content marketing.
MS: How does content scraping happen? What is the typical process?
RE: What happens is that these scrapers use botnets that are constantly looking for ways to drive traffic to their own site. They are looking for different sources to capture new content, and when they realize that you are publishing content, they set their bots to watch your site for updated content.
When you have new content, those bots will automatically go to your site, download the data, remove all mention of you, and then upload it to the plagiarism site.
Harvesting information from around the Web is automated for them. All they have to do is identify potential sources and then have their bots constantly watch them for new content. It sounds sophisticated, but it is pretty simple to set up, unfortunately.
MS: What are some of the consequences of content scraping?
RE: People are taking your same content and publishing it elsewhere, and there are a couple of things that happen with that.
First of all, your traffic gets siphoned away from you. Some people are going to ultimately see that article somewhere else and not on your site, but more importantly, especially from a marketer’s perspective, Google doesn’t like this.
MS: I guess it’s safe to assume that plagiarism sites are also using black hat SEO tactics to improve their rankings on the SERP (search engine results page).
RE: Absolutely. These guys have perfected this art. There are only two things that they do, and they do them very, very well. They steal content and repost it, and then optimize the SEO. We have seen instance after instance where, sometimes within minutes of scraping an article, they are outranking the original article.
They are not only taking your content, but they are outranking you for it at the same time.
MS: What are some ways to combat this problem?
RE: The biggest problem with this is that nobody knows what to do about it.
The backers of SOPA were people who were interested in protecting their brand or protecting their content. They backed SOPA because they didn’t know what else to do.
There are some simple things that content producers can do by themselves to stop at least the basic scrapers from being able to take their content.
The original intent of the DMCA was that if somebody stole any copyrighted material, you would have the right, as the original owner, to file a complaint. Under U.S. law, you file those complaints with the search engine and with the hosting provider, and they either have to provide a reply or they have to take that content down.
But the problem is that each search engine has its own DMCA form, and each hosting provider has its own DMCA form. So, it is a very time-consuming process to go through for each one of your articles. It’s easier to stop it from happening than to get the copied articles taken down once they have actually been stolen.
You have to work with your technology team, and get them looking for abusive IPs visiting your website, IPs that are constantly going above and beyond what your typical user behavior looks like.
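As a rough illustration of the kind of review Essaid describes, the sketch below counts requests per IP address in an access log and flags any address above a threshold. It assumes each log line begins with the client IP (as in the common access log format); the function name `flag_abusive_ips`, the threshold and the sample traffic are all hypothetical, and a real cutoff would have to come from your own typical user behavior.

```python
from collections import Counter


def flag_abusive_ips(log_lines, max_requests=100):
    """Return {ip: hits} for every IP that made more than max_requests requests.

    Assumption: the client IP is the first whitespace-separated field
    of each log line, as in the common access log format.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return {ip: hits for ip, hits in counts.items() if hits > max_requests}


# Fabricated sample traffic using documentation-only IP ranges.
sample_log = (
    ['203.0.113.7 - - [01/Jan/2013:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 512'] * 5
    + ['198.51.100.4 - - [01/Jan/2013:10:00:05 +0000] "GET / HTTP/1.1" 200 1024']
)

# Hypothetical threshold of 3 requests, chosen only to fit the tiny sample.
print(flag_abusive_ips(sample_log, max_requests=3))
```

A plain count like this only catches the noisiest visitors; it says nothing about scrapers that spread requests across many addresses.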
Another big thing is to have as many links back to your own site as possible to make the automated stripping of your name and brand a lot harder to do. If you write an article with no mention of you except at the bottom, you are making life a lot easier for scrapers to take your content.
MS: Are there other tactics content marketers can take to combat scraping?
RE: It is really about keeping an eye on your content. One thing that we recommend is to either set up Google Alerts or use a service like Copyscape or CopyGator to watch for when your content gets duplicated, so that you have better insight into whether this is a problem for you.
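For readers curious what "duplicated" can mean in practice, here is a minimal sketch of one common way to compare two texts: break each into overlapping runs of words (shingles) and measure how many runs they share. This is not how Google Alerts, Copyscape or CopyGator work internally, which the interview does not describe; the function names and the shingle size of five words are assumptions made for illustration.

```python
def shingles(text, size=5):
    """Return the set of overlapping word runs of length `size` in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap_score(original, candidate, size=5):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(original, size), shingles(candidate, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical texts for illustration only.
article = "content scraping hurts your brand and your search rankings over time"
reposted = "content scraping hurts your brand and your search rankings over time"
unrelated = "the weather is nice today and the park is open late"

print(overlap_score(article, reposted))
print(overlap_score(article, unrelated))
```

A score near 1.0 suggests a near-verbatim copy, while a score near 0.0 suggests unrelated text; where to draw the line in between is a judgment call.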
We also recommend updating robots.txt to make sure that it is stringent and does not admit every type of bot.
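One way to read "stringent" is a robots.txt that names the crawlers you want and turns everyone else away. The sketch below checks such a file with Python's standard `urllib.robotparser` module. The rules, the allowed crawler names, the `ExampleScraper` agent and the example.com URL are assumptions for illustration, not recommendations from the interview. Keep in mind that robots.txt is advisory: it only affects bots that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow two named crawlers, disallow all others.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/blog/post"  # placeholder URL
print(is_allowed("Googlebot", page))
print(is_allowed("ExampleScraper", page))
```

Running a check like this before deploying a change helps confirm that the rules say what you intended.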
Beyond that, you can also set your technology to prevent certain bots from accessing the actual code via a little bit of PHP or .NET work.
We recommend looking at a list of bad user agents and preventing those guys from being able to access your site. That will take a good chunk of the scrapers out of the game for you. The more sophisticated scrapers know how to get around that easily, but the more barriers to entry that you can put up, the better.
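Essaid mentions PHP or .NET; the same idea is sketched here in Python instead, as a small WSGI wrapper that refuses requests whose User-Agent header contains a blocked fragment. The fragments in `BLOCKED_AGENT_FRAGMENTS` are made-up placeholders rather than a real list of bad user agents, and `hello_app` stands in for your actual site.

```python
# Hypothetical fragments; a real list would come from a maintained
# source of known bad user agents.
BLOCKED_AGENT_FRAGMENTS = ("examplescraper", "badbot")


def is_blocked(user_agent, blocked=BLOCKED_AGENT_FRAGMENTS):
    """Case-insensitive check for any blocked fragment in the User-Agent."""
    ua = (user_agent or "").lower()
    return any(fragment in ua for fragment in blocked)


def block_bad_agents(app, blocked=BLOCKED_AGENT_FRAGMENTS):
    """Wrap a WSGI app so that blocked user agents receive a 403 response."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", ""), blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, reader"]


protected_app = block_bad_agents(hello_app)
```

As Essaid notes, a scraper can change its User-Agent string, so a filter like this is only one barrier among several.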
MS: Obviously no marketer wants to have their content scraped and duplicated, but are some marketers at more risk than others?
RE: If you are a marketer and you can tie a direct dollar value to the content that you are writing or producing, then you are better able to understand how much of an impact this has on you.
MS: Going back to how content duplication negatively affects SEO, are search engines looking into this issue?
RE: Google, in the past couple of years, has been increasingly aggressive in trying to mitigate this.
The problem Google faces is that they are scrapers themselves, and all they can do is look at content once it has been put out there on the Web. They can’t look at every single piece of content in real time, and they don’t have the capability of determining who wrote what first.
As soon as you put something out there, usually the site scrapers have already reposted your content and they did it faster than Google can attribute it to you. Therefore, Google has no idea who is the original content maker and who is the copier, and so they are not able to determine what to do and whom to punish.
They have to take a very broad approach and punish everybody because at the end of the day, they don’t care as much about the publisher. They care about the person who is consuming that content, and for them, they don’t want those consumers having to deal with duplicate content.
MS: Final question – what are some of the content types that get scraped and duplicated?
RE: From a marketing perspective, the types of content that relate to SEO: blog posts and articles.
We also feel that user-generated content falls into that same category: any site where your end users are creating the content, whether they are uploading their own information or going back and forth in a discussion.
(Image credit: michaelmolenda)