Generating better insights by controlling data quality

Fraud detection in market research to improve quality of insights



Data Quality is a fundamentally important aspect of Market Research. As digital platforms scale, the power of research can enable many business, political, social, and financial decisions. 

However, as with any system that scales, digital fraud can potentially undermine that same research.

As we get into the last months of a very interesting year for research technology, our vantage point in the industry has given us a few insights into the general future trends.

We’ll focus on trends and aspects of data quality, which is the topic we’re closest to:

  • More click farms – Unfortunately, this will be a thing. The cost to create a new internet identity is zero. There is no marginal cost for fraud and only “revenue” potential for fraudsters. This is our industry’s single biggest problem.

It is likely no longer a secret that much of the fraud in our industry comes from click farms rather than bots, although the two terms are often used interchangeably. I personally think it’s important to distinguish between the two.

  • Demand/Supply, Prices, and Fraud – What compounds Point 1 is that supply has been harder to get, and the average prices have been going up. That means more efforts to add supply will come through and with that, even more fraud.
  • Email / Deterministic ‘currency’ is coming

Whether we ‘blame’ GDPR or Google’s continued project of winding down cookies, first-party data is now a necessity, and opting in to surveys via email will be one of the few (if not “the”) truly deterministic ways of identifying individuals. Additionally, as companies service advertising research and research technology, emails will become the “true keys” for any transaction.

This allows a huge potential for the industry to grow our TAM (total addressable market), so this trend is good. However, this does mean that we will have to deal with new forms of manipulation of that digital currency.

Managing fraud vs. Fighting Fraud

I recently heard a great story about selling lemonade at a fair. When you are selling items at a stall, you want to be left with exactly one glass of lemonade, t-shirt, or whatever you are selling at the end of the day.

If we have none left over, we will never know how much more we could have sold. And if we have too many left over, we know we brought too many to start off with.

There is a similar philosophy when working with high volumes of traffic and managing fraud. As counterintuitive as this sounds, seeing a certain (low single-digit) percentage of fraud likely means that the balance between traffic, conversion, and data quality is healthy. Of course, we often hear terrible stories about a 25% to 30% rejection rate, which is unacceptable. But at the other end, it likely cannot be 0% either. And, of course, that data would need to be cleaned programmatically via a text or other tool before delivery to the client.

The key metric here is the balance between false positives (good respondents wrongly rejected) and false negatives (fraudulent respondents wrongly admitted). That equilibrium is the main question in this conversation. Some level of incoming fraudulent traffic is always going to be part of sourcing – this is not to say that it “should be allowed” into surveys. But the point is that seeing almost no fraud is an indicator of too much heavy-handedness.

There are several ways to be heavy-handed about finding fraud. Especially in a world where behaviors are changing, a heavy-handed approach would remove significant swathes of good traffic and good respondents. For example, in the banking industry, every single fraud tracking algorithm from 2019 had to be revised or recreated entirely in 2020 because online behavior changed tremendously. Using older technologies and algorithms could undermine the data and introduce (non) participation bias at a more significant scale.

A recent example with which most of us would be familiar is the Rapid test vs. PCR COVID-19 tests.

Rapid Tests use nasal swabbing (which is quite uncomfortable, by the way) to quickly analyze the existence of specific proteins on the surface of the virus. On the other hand, PCR tests look for the genetic material of the virus. The Rapid Tests, as the name suggests, take 15 mins or so. The PCR tests need much longer.

Applying a similar analogy to sample and research, an in-depth, 2-factor authentication that also includes address verification would likely cut down fraud significantly, but it also adds friction to the process and would cause a huge loss of available traffic. I likely would not go to my favorite news sites if I had to input my name and address every single time.

On the other hand, having a ‘rapid test’ of simply the digital fingerprint check would err on the other side.

In short, we are living a two-legged existence – on the supply side, traffic, conversions, and revenue are the lifeline in terms of the business model. And, on the buyer’s side, delivering high-quality insights based on good data collection is the key.

Finding the balance between the two will help us create a healthy industry and ecosystem.

Trust in Research

In business and the economy in general, trust is the most important factor that underlies our professional relationships. We work with people we trust, and usually “trust” and “like” go together.

We see this in Research Technology / Market Research / Sample as well. We do not want respondents who are not engaged or not representative or relevant to the specific survey. 

That’s the simplistic definition of Data Quality and trust in our industry.

Of course, there exists a gray area between False Positives and False Negatives in every industry. The most pertinent current example would be that of the COVID PCR test vs. the Rapid test, with the former’s False Positive rate being much lower. The trade-off is that it takes 2 days to get the result back vs. 15 minutes.

When trying to separate good-quality from poor-quality respondents, one of the most important and complex aspects of the job is minimizing False Positives. This matters because False Positives undermine the very trust that the process set out to build in the first place.

In fact, we could argue that False Positives are the worst outcomes of fraud management because:

a.     Most businesses and democratic societies operate under “innocent until proven guilty” i.e., we intuitively err on the side of trusting people to be good. False positives hurt the innocent.

b.    False Positives further undermine the trust in the system – i.e., if we are catching the wrong respondents as fraudsters, we might be letting the actual fraudsters go through as false negatives.

However, the problem is subtle, and in the below diagram, we see the representation of the issue.

On a broad level, it’s easy to define good respondents vs. bad ones. The left side is good, and the right side is bad. However, in the specific cases, as represented by the cut-off plane, the gray areas become far harder to manage – indeed, there is representationally a direct overlap between good and bad respondents.

The question is where you want to draw the line. The knee-jerk reaction of buyers would be to draw the line closer to the blue curve. While that is a direct challenge to the supply side in terms of revenue and completions, it also raises the question of whether buyers would end up rejecting many respondents whose opinions should make up the research.

But on the supply side, that would mean giving up more and more real respondents, and there might be a desire to draw the plane farther away from the blue curve.
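To make the trade-off concrete, here is a minimal sketch. It assumes every respondent gets a fraud score from 0 to 100 and that we have a small labeled sample to check against. The scores, the labels, and the function name `rates_at_cutoff` are all made up for illustration and do not come from any production system.

```python
from typing import List, Tuple

# Hypothetical labeled sample: (fraud_score, is_actually_fraud).
# Scores and labels are invented purely for illustration.
SAMPLE: List[Tuple[int, bool]] = [
    (5, False), (12, False), (20, False), (33, False), (41, False),
    (48, True), (55, False), (62, True), (70, True), (88, True),
]


def rates_at_cutoff(sample: List[Tuple[int, bool]], cutoff: int) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) when every
    respondent scoring at or above `cutoff` is rejected."""
    good = [score for score, is_fraud in sample if not is_fraud]
    bad = [score for score, is_fraud in sample if is_fraud]
    false_positives = sum(1 for score in good if score >= cutoff)  # good, but rejected
    false_negatives = sum(1 for score in bad if score < cutoff)    # fraud, but admitted
    return false_positives / len(good), false_negatives / len(bad)


if __name__ == "__main__":
    for cutoff in (30, 50, 60, 90):
        fpr, fnr = rates_at_cutoff(SAMPLE, cutoff)
        print(f"cutoff={cutoff}  good rejected={fpr:.0%}  fraud admitted={fnr:.0%}")
```

With this toy sample, the strictest cutoff (30) admits no fraud but rejects half of the good respondents, while the most lenient one (90) rejects no good respondents and admits every fraudulent one. Where to settle in between is the judgment call discussed above.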

In MR, we have some liberty to experiment because we do not work in a hospital or in other industries where these decisions are a matter of life or death. Nevertheless, when real dollars and people’s time and expectations are in the mix, the costs of False Positives tend to add up.

So, where should the balance be? That is the million (or should I say 2) dollar question.

Layering

As we get into a busy season for the industry, we’d like to share some principles that our industry can use to maintain a high quality of respondents and responses. Many of these are commonplace (and necessary), but the nuances and details are where we can make a difference.

  • The Basics: 

While the standard security tools are not enough to protect against advanced fraudsters, they still need to be used as a baseline for all surveys. These include a simple digital fingerprint check to ensure that the respondent isn’t attempting a survey more than once. This is especially important with the increase in exchanges and multi-sourcing, which means the same respondent can be sent to the same survey through two different supply sources.

In addition to the fingerprint, other respondent identifiers that can be used to protect against duplicate respondents are Panelist IDs, Cookie IDs, IP, and other client-side identifiers. 
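As a rough illustration of the baseline check, the sketch below flags a respondent who reaches the same survey twice, even when the two entries arrive through different supply sources. The field names (`survey_id`, `fingerprint`, `panelist_id`, `cookie_id`, `source`) are hypothetical. IP is left out on purpose here, since many legitimate respondents can share one address.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Identifier fields checked for duplicates. These names are
# hypothetical; real integrations will expose their own keys.
ID_FIELDS = ("fingerprint", "panelist_id", "cookie_id")


def find_duplicates(entries: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return entries whose (survey, identifier) pair was already seen,
    regardless of which supply source delivered them."""
    seen: Set[Tuple[str, str, str]] = set()
    duplicates: List[Dict[str, str]] = []
    for entry in entries:
        keys = [
            (entry["survey_id"], field, entry[field])
            for field in ID_FIELDS
            if entry.get(field)  # skip identifiers that are missing or empty
        ]
        if any(key in seen for key in keys):
            duplicates.append(entry)
        seen.update(keys)
    return duplicates


# Illustrative data: the same fingerprint reaches survey S1 via two suppliers.
ENTRIES = [
    {"survey_id": "S1", "source": "supplier_a", "fingerprint": "fp-123", "panelist_id": "p-1"},
    {"survey_id": "S1", "source": "supplier_b", "fingerprint": "fp-123", "panelist_id": "p-9"},
    {"survey_id": "S2", "source": "supplier_b", "fingerprint": "fp-123", "panelist_id": "p-9"},
]

if __name__ == "__main__":
    for dup in find_duplicates(ENTRIES):
        print("duplicate:", dup["survey_id"], dup["source"])
```

Only the second entry is flagged. The third uses the same fingerprint but on a different survey, so it is treated as a fresh attempt.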

  • Next Level Fraud Monitoring:

Duplicate fingerprints, machine-side information, and known bad actors are a great start, but fraudsters continue improving their techniques to appear legitimate to get around security features. This means that the Market Research industry needs further investment to stay ahead of the curve. Some of the more recent features we released include:

  • Subnet Tracking:

Subnet tracking looks at IP addresses that are changed slightly, within the same network range, to make a respondent seem unique. For example, completing a survey on 100.100.100.1, then attempting the same survey on 100.100.100.2. Research Defender monitors traffic patterns within surveys and accounts to catch this type of obfuscation before it impacts the survey result.
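A minimal sketch of the idea, using Python’s standard `ipaddress` module: group attempts on the same survey by their enclosing network and surface any network with more than one distinct address. The /24 prefix length, the threshold, and the function names are assumptions chosen for illustration, not a description of how Research Defender implements this.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_subnet(
    attempts: Iterable[Tuple[str, str]], prefix_len: int = 24
) -> Dict[Tuple[str, str], List[str]]:
    """Group (survey_id, ip) attempts by survey and enclosing IPv4 subnet."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for survey_id, ip in attempts:
        # strict=False accepts a host address and returns its network.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[(survey_id, str(network))].append(ip)
    return dict(groups)


def suspicious_subnets(
    attempts: Iterable[Tuple[str, str]], prefix_len: int = 24, max_per_subnet: int = 1
) -> Dict[Tuple[str, str], List[str]]:
    """Return subnets with more distinct addresses on one survey than allowed."""
    grouped = group_by_subnet(attempts, prefix_len)
    return {key: ips for key, ips in grouped.items() if len(set(ips)) > max_per_subnet}


# Illustrative attempts, reusing the addresses from the example above.
ATTEMPTS = [
    ("S1", "100.100.100.1"),
    ("S1", "100.100.100.2"),
    ("S1", "203.0.113.7"),
]

if __name__ == "__main__":
    print(suspicious_subnets(ATTEMPTS))
```

Offices, universities, and mobile carriers legitimately put many people on one subnet, so a hit like this is better treated as one signal among several than as a verdict on its own.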

  • Emulator Tracking:

Fraudsters can use specific tools or machines to run multiple operating systems on a single machine to appear unique. This can include different OSs (Mac and PC) as well as different versions of the same operating system (Windows 7 and Windows 10).

  • VPN Usage:

There has been a steady increase in the use of VPNs. This increase reflects the general population of the internet, so it’s important to note a VPN doesn’t necessarily indicate fraud. However, Research Defender is using Machine Learning to determine which VPN services are linked to increases in fraud and determine what other flags, in combination with the VPN usage, are more correlated to fraudulent responses, bot activity, click farms, etc.
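The machine learning approach itself is not reproduced here. As a much simpler baseline for the same question, the sketch below computes the share of confirmed-fraud sessions per VPN service from historical, labeled sessions. The service names, the data, and the function name `fraud_rate_by_vpn` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Each record: (vpn_service or None, was_later_confirmed_fraud).
Session = Tuple[Optional[str], bool]


def fraud_rate_by_vpn(sessions: Iterable[Session]) -> Dict[str, float]:
    """Return the share of confirmed-fraud sessions per VPN service.
    Sessions without a VPN are reported under the key 'no_vpn'."""
    totals: Dict[str, int] = defaultdict(int)
    frauds: Dict[str, int] = defaultdict(int)
    for service, is_fraud in sessions:
        key = service or "no_vpn"
        totals[key] += 1
        if is_fraud:
            frauds[key] += 1
    return {key: frauds[key] / totals[key] for key in totals}


# Invented history, for illustration only.
HISTORY = [
    ("vpn_a", True), ("vpn_a", True), ("vpn_a", False), ("vpn_a", True),
    ("vpn_b", False), ("vpn_b", False),
    (None, False), (None, True), (None, False), (None, False),
]

if __name__ == "__main__":
    for service, rate in fraud_rate_by_vpn(HISTORY).items():
        print(f"{service}: {rate:.0%}")
```

In this invented history, one VPN service is associated with far more fraud than non-VPN traffic and the other with none, which is the point made above: VPN usage alone says little, and the specific service and accompanying flags matter.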

  • The External Network:

Research Defender has its own proprietary fraud prevention tool to prevent bot attacks, click farms and nefarious actors. It also leans on the “wisdom of the crowds” by leveraging several of the leading services in the fraud prevention ecosystem. These services track respondents associated with fraudulent activity in the Market Research space, as well as Ad Tech, Retail, eCommerce, and more. The nefarious actors in the Market Research industry often spend their time in other corners of the internet as well, and these services allow Research Defender to spot these respondents before it’s too late.

Research Defender helps research buyers, suppliers and exchanges fight against fraud and improve data quality.

In our experience dealing with fraud and poor data quality, click farms, i.e., systematic human bad actors, are the primary drivers of fraud in our industry. An entire cottage industry targets research for financial gain and provides fraudulent, quick, and irrelevant answers to surveys.

In many cases, these click farms have become fully aware of the API calls that we use to identify them. To counter us in an ever-progressing “arms race” these click farms are starting to respond directly to our security features. We first saw this happening in Q3-2021. From a security perspective, this was frustrating because it meant that we needed to rethink and rearchitect how we countered this new form of fraud and direct attack on our APIs.

Their manipulations fell into 2 specific categories:

  • Stopping network traffic

In the first case, the network traffic is “stopped” when our APIs / software are in the process of determining whether or not an individual is a fraudulent actor. In other words, we were “actively prevented” by respondents or their scripts from executing our software because they didn’t want to be identified.

  •  API Response Manipulation

Even more concerning has been the specific manipulation of the API response itself. We have several scores which identify various fraudulent actors on our side – i.e. in the data that we see. However, the same data set received by the client is manipulated to show a good respondent score.

This aspect is more concerning because respondents fully understand exactly what we are up to down to the specific JSON (API) response level.

While these combined efforts amount to only around 0.31% (< 1%), at our scale those numbers add up to a large absolute number, which meant we had to respond with an escalation of our own.

The very specific nature of these manipulations, and the speed at which they were executed, on the order of milliseconds, clearly showed that this form of fraud was happening at scale.

Our counter measures have fallen into 2 kinds addressing the 2 issues above, respectively:

  • Server to Server Calls

This form of API call bypasses any fraudulent activity altogether and directly communicates with the clients’ servers. We still interact with the respondent, but only with a hashed/blind token and a locked communication channel.

  • Hashed/ Encrypted values

This option provides a corresponding hash value based on a secret key to ensure that the client can validate the hash value sent and be aware of a potential respondent manipulation of the API.
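One common way to build such a check is an HMAC over the response payload. The sketch below assumes HMAC-SHA256, a canonical JSON encoding, and a secret shared out of band. The payload fields, the key, and the function names are illustrative and are not the actual Research Defender API.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret, agreed out of band between the fraud
# provider and the client. Never ship a real key in source code.
SECRET_KEY = b"example-shared-secret"


def sign_payload(payload: dict, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_payload(payload: dict, signature: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign_payload(payload, key), signature)


if __name__ == "__main__":
    original = {"respondent_token": "abc123", "fraud_score": 87}
    signature = sign_payload(original)

    # A respondent-side script rewrites the score before the client sees it.
    tampered = {**original, "fraud_score": 3}

    print(verify_payload(original, signature))   # True
    print(verify_payload(tampered, signature))   # False
```

Because the respondent’s browser never holds the secret key, a rewritten score no longer matches its signature, and the client can treat the mismatch itself as a fraud signal.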

While neither of these is a new form of offering, I believe we will be the first ones to explicitly mandate 100% of our API communication via this form.

An arms race may seem to have only downsides in terms of increasing sophistication and technology investment. But we take pride in the fact that we are at least not fighting our battles with bows and arrows.

Respondents Galore

When I first moved to the US, I was very surprised to hear that restaurants and businesses within the category were really hard to get into and be successful in. The margins are low, rents are high, and the overhead is steep.

This was surprising to me because, in India, owning and operating a restaurant is considered to be a “safe bet”. In fact, a common theme is that many celebrities own restaurants, either through their own brand name or through a subsidiary, because of the stable promise of good ongoing income. After all, 1.3 billion people will be eating a lot of food.

The reason for Indian restaurants being successful is simple. There’s enough foot traffic in most first and second-tier cities that you’ll keep getting customers unless you literally are poisoning people.

In the research world, especially since 2021, we have all become acutely aware that high quality respondents are hard to come by. However, this industry does not have the same luxury in numbers as the Indian restaurant industry.

Additionally, the experiences we create do not welcome them as good repeat customers.

While our industry does see a lot of traffic (better defined as people, human beings) come through, the return rates are very poor. For example, when were you last in a panel and took a survey?

Personally, I attribute this to low payouts and poor experience.

On the survey side, the best numbers we’ve seen from entry to survey completion is around 20%. It’s usually lower but let’s generously use that number.

The first-time survey taker is analogous to the Indian restaurant goer. Many of them are there simply because they happened to stop by to get a bite, while visiting a different neighborhood and driving by. In the survey world, somewhere in the demand/supply economics of the internet, a respondent has been invited either through a game, a panel, or a direct invitation. However, they do not necessarily want to participate again, given the completion rates and experience are generally poor.

In other words, 80% of the time, we’re turning people away from the restaurant.

Taking any other internet experience as an example, e.g., shopping on Amazon, we get what we need most of the time. Certainly, far more often than 20%. If we didn’t get what we were looking for 80% of the time, we would simply not use Amazon again. Sadly, our survey experience is doing exactly that.

Is there any surprise then that we don’t have a good supply of respondents? Of course, as analogies go, I don’t expect this to stand up to perfect scrutiny – but the point is – let’s work to provide a better experience. And, let’s not poison anybody. Always sage advice.

World mythology has many interesting lessons for the business context. We often work with complex technology, discerning between legitimate respondents and fraudulent ones. As in legends and stories, many characters are complicated. Achilles was righteous but arrogant. Hector was virtuous but fought for the ‘wrong side’. We find these examples all over world literature.

In research too, actors are complicated – many respondents take surveys to share their opinions. Many of them participate to earn points. And, a good chunk of them may have been invited to do something else, but find themselves taking a survey instead – are they now committed to efficiently completing the survey?

One could very well argue that these are the best respondents/sources of data because their participation was not pre-determined and hence is objective. But one could also argue that research does need a ‘state of mind’ for participation, and not having that might inherently bias the data.

The technology, specifically fraud detection, is complex because we are really attempting to read intent – i.e., attempting to read another human’s mind, which will remain the last frontier in technology.

All that said, there are some basic themes that we can and do extract in this fight against fraud and to maintain data integrity.

  • First, we need to discern between “good and evil” respondents – in many cases, the distinction is clear. If the individual is not able to answer the question in the appropriate language of the survey, it is a case of fraud or at the very least a case of a mistaken invitation. If someone is entering gibberish or clearly unable to answer the question but continues to do so in an irrelevant manner, that too is fraud.

In many other cases, it is not a case of fraud per se. If a respondent lands in a 40-minute survey after a series of invitations and gives short answers, that’s not really ‘fraud’.

  • Second, we need good definitions. One issue today is that we define a fraudulent respondent as any individual who is rejected at the survey level.

While that is not an unfair definition, it forces us to look at the outcome, not the input. This is analogous to staring at the revenue numbers all day but never taking the initiative to do any outreach.

Taken to the extreme, that argument basically means we have no control over it and are purely reactive.

  • We also must accept that there are going to be grey areas on all sides – are VPNs bad, necessarily? Are we going to throw out every individual from an organization’s ISP?

We cannot categorically agree or disagree with any of the above points; it really all depends. The best way for a supplier or research agency to address this is to be consistent – pick a rule or definition, start there, and refine it as time goes along.

  • Finally, and most importantly, we need to fix a few things at ‘the top’ or ‘upstream’.

There are several ways to address these factors of the story and see better results at the end client level.

Duplicates, garbage responses, professional survey takers, etc. are all table stakes and should be easily cleaned. We need to double down and take these recommendations seriously. The stories we hear from prospective clients frankly rattle us – 30 to 40% throw-out rates, slow fill rates even with a huge pool of respondents, etc.

Many machine learning tools for fraud detection and anomaly detection are also available off the shelf via Google Cloud, AWS, etc.

Rest assured that we can keep improving and taking this to a higher level than it is. 

Privacy, Security and Fraud Detection

Over the past few years, user privacy and data management have become important aspects of the digital landscape.

Several legal and technical updates have come our way since 2016. Things are now reaching a crescendo, and the music will stop next year and there will be some equilibrium, as far as that word can be used in technology, that is.

We all know the legal requirements – GDPR, CCPA, Vermont, potential federal laws, or God forbid state-by-state laws.

The technical updates have largely come through as well – Firefox’s announcement and default of blocking 3rd party data, Safari’s default of the same, and of course, the elephant in the room, Google Chrome’s announcement last year. Apple’s decision to require app-by-app permission also puts more power into the hands of its users.

Before we get into the implications and predictions of these changes for market research, let us get some definitions out of the way:

  • 1st party data – Data that a company directly has access to via a website/app.
  • 2nd party data – Data that is bought from a company that has 1st party data.
  • 3rd party data – Aggregated data without having a direct relationship with 1st party or respondents.

Why is this important now?

2022/3 will mark an important date when all the major players will block 3rd party integrations by default. Ultimately, what Apple, Google, and Firefox are doing is shifting the onus (some may say liability) for privacy management from the browser/phone to the sites – i.e., the websites or apps to “own” the data.

My disclaimer here is that this is a rapidly changing environment currently, which will likely mean other updates. And, there are legal considerations around whether this allows some of those companies mentioned to gain a monopolistic ability to track users and how courts worldwide will look at these changes.

From a Research Technology standpoint, let us consider what these technical changes mean to us.

First, the bad news:

  • Big companies make the rules. Everyone else needs to follow them.

As Apple and Google have decided to develop an alternative form of web interaction altogether, companies still dependent on 3rd party data will have to change, and change fast. Otherwise, they will not be able to interact with customers meaningfully.

  • Google has also stated that these rules apply to companies trying to get around 3rd party data management with digital fingerprinting or other unified IDs or techniques.

Good news:

  • The internet is (mostly) equitable. Unless your company’s name is Google or Apple or Microsoft, we likely all have the same rules to play with. So likely, your competition or partner won’t have any different set of rules than anyone else.
  • Surveys, by definition, are fairly involved. As we work with survey takers across the world, the additional threshold of getting their consent (which we mostly are already doing) won’t have a dramatically larger impact.

As long as research is respectful of the usage of the data, we are in good shape.

Predictions for Research Technology

  • Engagement

Companies with native integration and direct access (i.e., 1st party access) will do well, specifically apps on devices, or a rich panel asset with engaged respondents. The new system is designed for their success.

  • Traffic Management     

Exchanges, aggregators, and marketplaces that see huge volumes will now have a legitimate reason to build out a rich first-party asset for themselves. In fact, this will be a blessing allowing marketplaces to directly claim 1st party ownership. Ultimately, demand matters and this category already carries the most scalable form of demand.

  • Rise of 2nd party data

2nd party data is the Ringo Starr of the digital landscape. With the scale limitations of John (1st party data), and the death/decline of George (3rd party data), 2nd party data could finally see a rise to prominence. 2nd party data is not easy/ natural to scale though, so it likely will only benefit the big players in one-on-one deals.

  • Data Quality

In the short term, these changes will help improve Data Quality. Opt-in typically yields a more engaged population. However, in the long term, there will always be an incentive for fraud because of the returns available in our ecosystem. We need to continue making the right technical investments to overcome this.

Looking Forward

The ResTech/Market Research Industry has a variety of stakeholders. These range from end-clients (Brands) and research agencies to sample vendors, exchanges, and fraud detection providers.

Poor Data Quality and Fraud are challenges we must learn to live with – we can’t stop them from happening. Fraud is as old as human society itself. Digital Fraud is simply just the new version. We need to be aware, manage, and control the poor outcomes.

The primary way to address fraud is to, first and foremost, be aware of it. Awareness breeds action. Even without a technical solution, awareness will allow a researcher to budget a certain number of overfills to account for data cleaning. Similarly, brands can invest – even if it is something as rudimentary as man-hours – in data cleaning.
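Budgeting overfills is simple arithmetic, sketched below under the assumption that a roughly constant share of completes will be thrown out during cleaning. The function name and the example rates are illustrative.

```python
import math


def completes_to_field(target_completes: int, expected_throw_out_rate: float) -> int:
    """Completes to collect so that `target_completes` remain after cleaning."""
    if not 0 <= expected_throw_out_rate < 1:
        raise ValueError("expected_throw_out_rate must be in [0, 1)")
    # Divide by the share of completes expected to survive cleaning, then round up.
    return math.ceil(target_completes / (1 - expected_throw_out_rate))


if __name__ == "__main__":
    print(completes_to_field(1000, 0.05))  # 1053
    print(completes_to_field(1000, 0.30))  # 1429
```

At a 5% throw-out rate, a study needing 1,000 clean completes should field about 1,053. At 30%, that grows to about 1,429, which is the cost of poor data quality expressed in fieldwork.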

Of course, today’s technology does allow us to go much farther. Every stakeholder above can and should invest in technological solutions to limit and clean out the bad actors causing poor-quality data.

The responsibility for Data Quality is ours, collectively. And the first step is to be aware.

Author

  • Vignesh Krishnan

    Vignesh is the Founder and CEO of Research Defender. He holds a bachelor’s degree in engineering and a master’s degree in business administration and technology management, which reflects his passion for the intersection of business and technology.