In recent years, “web crawling” has emerged as a buzzword in compliance and merchant underwriting. Through training events, conferences, and conversations with industry professionals, we’ve noticed a recurring theme: while many reference web crawling as part of their Know Your Business (KYB) processes, the term is often misunderstood. In this post, we’ll clarify what web crawling really means, explore its role in KYB, and show how it fits into broader compliance strategies.
What is web crawling and how does it apply to KYB?
Web crawling is the automated process of collecting publicly available data from websites to identify specific information or patterns. Search engines, for instance, use crawlers to index websites, while data aggregators rely on them to gather information for analysis. In compliance, web crawling is an essential tool for monitoring online content, detecting risks, and verifying claims.
However, a common misconception is that web crawling is a one-click solution. Some believe a bot can scan the entire internet and instantly solve compliance challenges. In reality, web crawling is just one part of a much larger process. While it is effective for large-scale data collection, algorithms are required to structure and analyze the gathered data, transforming raw information into actionable insights. And humans are still required to go where machines cannot: intuition, experience, and perception are essential when interpreting complex patterns, the intent behind marketing practices, or the connections between business entities.
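To make that division of labour concrete, here is a minimal Python sketch of the first two steps: collecting a page and structuring a simple signal from it. The keyword list, user agent, and flagging logic are illustrative placeholders rather than a production pipeline, and the human review it feeds into is deliberately left out of the code.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical risk keywords -- a real programme would use far richer rules.
RISK_KEYWORDS = {"casino", "prescription", "cbd", "escort"}

def crawl_and_flag(url: str) -> dict:
    """Fetch a page, extract its visible text, and note any keyword hits."""
    response = requests.get(url, timeout=10,
                            headers={"User-Agent": "compliance-crawler-demo"})
    response.raise_for_status()
    text = BeautifulSoup(response.text, "html.parser").get_text(separator=" ").lower()
    hits = sorted(word for word in RISK_KEYWORDS if word in text)
    # Raw hits are not findings: a person (or a downstream model) still has to
    # judge context, licensing, and intent before anything is escalated.
    return {"url": url, "keyword_hits": hits, "needs_review": bool(hits)}
```

Even a toy example like this shows where the machine stops: the crawler gathers and structures, but the `needs_review` flag exists precisely because a human has to make the final call.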
Web crawling in merchant underwriting
In KYB and merchant underwriting, web crawling supports several critical tasks:
- Assessing compliance: Crawlers can identify potential violations of card scheme rules (e.g., Mastercard BRAM or Visa VIRP).
- Detecting risks: Subtle indicators, like hidden connections between websites or unverified business models, often require a combination of automated and manual analysis.
- Verifying merchant claims: If a merchant claims to sell clothing but their website hosts gambling content, crawlers can help to uncover the inconsistency.
At G2RS, our approach to web crawling is different from purely automated solutions. We combine automation with expert review to ensure risks aren't missed. An additional layer of intelligent algorithms analyzes the crawled data, automatically identifying potential issues and flagging them for further human review.
This is where underwriters show their strength. Verifying flagged issues is a key part of the underwriter’s role, where they weed out false positives that simple algorithms, or even sophisticated AI analysis, cannot properly understand. While we take on much of the heavy lifting, we encourage clients to actively engage with the results, reviewing flagged issues or verifying annotations, to build a robust compliance framework.
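As a hypothetical illustration of that workflow (the data model and status values are assumptions, not an actual schema), a flagged finding might carry an explicit review status so that nothing is confirmed or dismissed without an underwriter's verdict:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Flag:
    """One automatically raised issue awaiting underwriter review."""
    merchant_id: str
    url: str
    reason: str                       # e.g. "keyword hit: prescription"
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "pending"           # pending | confirmed | false_positive

def review(flag: Flag, is_real_risk: bool, notes: str = "") -> Flag:
    """Record the underwriter's verdict; the algorithm never closes its own flags."""
    flag.status = "confirmed" if is_real_risk else "false_positive"
    if notes:
        flag.reason += f" | reviewer note: {notes}"
    return flag
```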
Understanding website content monitoring
Persistent Merchant Monitoring (PMM) is a critical part of ongoing compliance for businesses operating in the payment industry. As a specialized form of web crawling, it focuses on specific compliance needs. Unlike general web crawling, which collects and indexes a wide range of website data, content monitoring keeps watch over compliance-critical elements. These include:
- Product listings
- User-generated content
- Terms and conditions
This vigilance helps identify risks like non-compliance, counterfeit goods, or misleading claims, which can arise as businesses evolve or expand.
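A simplified sketch of what watching those compliance-critical elements can look like in practice is shown below; the CSS selectors are hypothetical, since every merchant site structures product listings, reviews, and terms pages differently:

```python
from bs4 import BeautifulSoup

# Site-specific selectors are placeholders; a real monitor is tuned per merchant.
SECTIONS = {
    "product_listings": "div.product",
    "user_content": "div.review, div.comment",
    "terms_links": "a[href*='terms']",
}

def extract_monitored_sections(html: str) -> dict[str, list[str]]:
    """Pull only the compliance-critical parts of a page for monitoring."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        name: [el.get_text(" ", strip=True) for el in soup.select(selector)]
        for name, selector in SECTIONS.items()
    }
```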
Monitoring strategies typically combine automated tools, such as web crawlers tailored to scan for compliance-relevant updates, with human expertise to interpret more complex risks. Key areas of focus include reviewing advertising practices, merchant information, and audience demographics. Certain merchant category codes (MCCs), such as dating services or marketplaces, require extra attention due to their higher compliance risks.
The need for nuanced analysis and continuous monitoring
Effective monitoring of merchant profiles requires a nuanced approach, as risks can evolve and emerge over time. Web crawling is not a one-time activity; rather, it needs to be an ongoing process to ensure merchants remain compliant as new risks surface. Persistent monitoring tools provide continuous oversight by regularly scanning merchant websites. This helps detect any changes or emerging risks that may not have been apparent during the initial review.
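One common way to implement that kind of ongoing oversight is to fingerprint monitored pages and compare snapshots between crawls. The sketch below is a bare-bones illustration under that assumption; a real monitor would also strip out dynamic noise (dates, session tokens) before hashing and would diff the specific sections that matter rather than the whole page:

```python
import hashlib
import requests

def content_fingerprint(url: str) -> str:
    """Hash the page body so successive crawls can be compared cheaply."""
    body = requests.get(url, timeout=10).text
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def has_changed(url: str, previous_fingerprint: str | None) -> bool:
    """True if the page differs from the last monitored snapshot."""
    # On the first crawl there is nothing to compare against yet.
    if previous_fingerprint is None:
        return False
    return content_fingerprint(url) != previous_fingerprint
```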
Limitations of basic web crawling for compliance
While web crawling is powerful, it has limitations in the KYB and merchant underwriting context:
- Lack of context: Crawlers can't interpret nuanced issues. For example, a crawler might flag a licensed pharmacy based on keywords like 'prescription drugs' or 'opioids'. A keyword match alone doesn't account for whether the pharmacy is properly licensed to sell such medications in its jurisdiction. Additional context, such as verifying the pharmacy license and compliance with local regulations, is needed to properly assess whether the entity poses a compliance risk.
- Inability to interpret intent: Crawling itself is primarily a data-gathering tool, collecting raw information from websites. However, keywords alone can’t provide full clarity. For instance, identifying whether content reflects fraudulent practices requires analysis beyond the crawler’s capabilities. Human expertise is essential to review and interpret the data in context, ensuring that intent and compliance risks are accurately assessed.
- Technical barriers: Many high-risk industries implement measures to block crawlers, such as DDoS protections or restricted member-only areas. This is particularly common in the adult entertainment, crypto, and gambling industries, where compliance risks are often hidden behind logins. On top of that, many sites limit request frequency, so crawlers must be optimized to avoid being blocked, throttled, or challenged with a CAPTCHA; working around these defences can require techniques such as rotating IPs or CAPTCHA-solving services.
- Website structure and data quality: Changes to a site's design or structure can break crawlers, necessitating frequent updates to scraping scripts. As a result, data retrieved from websites may be inconsistent or inaccurate, requiring validation and cleaning for reliable use.
- Scalability and reliability: As scraping operations grow, ensuring scalability and resource management becomes critical to maintain efficiency. Proper storage and organization are essential for managing large volumes of scraped data and ensuring efficient access. In addition, automated scraping needs proper monitoring to ensure consistent performance and manage failures without manual intervention (a minimal throttling-and-retry sketch follows this list).
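The sketch below illustrates the kind of basic throttling, retry, and failure-logging logic referenced in the last two points. The delays, retry counts, and user agent string are placeholder values, and it deliberately omits the IP rotation and CAPTCHA handling mentioned earlier:

```python
import logging
import time
import requests

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")

def polite_get(url: str, delay: float = 2.0, retries: int = 3) -> str | None:
    """Fetch a URL with a fixed pause, exponential backoff, and failure logging."""
    for attempt in range(1, retries + 1):
        time.sleep(delay)  # respect the target site's request limits
        try:
            response = requests.get(url, timeout=10,
                                    headers={"User-Agent": "compliance-crawler-demo"})
            if response.status_code == 429:      # throttled: back off and retry
                time.sleep(delay * 2 ** attempt)
                continue
            response.raise_for_status()
            return response.text
        except requests.RequestException as exc:
            log.warning("attempt %d failed for %s: %s", attempt, url, exc)
    log.error("giving up on %s after %d attempts", url, retries)
    return None
```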
Going beyond the crawl in KYB and compliance
Web crawling is a cornerstone of effective KYB and merchant underwriting, but it's not a standalone solution. Combining crawling with human expertise, automation, and continuous monitoring ensures a comprehensive compliance strategy. At Web Shield, our solutions like Persistent Merchant Monitoring integrate these elements to streamline KYB processes and reduce compliance risks.
It’s essential to understand that while we provide a powerful tool to help uncover risks, web crawling is just part of the synergy between underwriting technology and compliance departments. Expertise on both sides is needed to critically assess these findings. This collaborative effort ensures a more robust compliance framework.
To see how G2RS can enhance your compliance efforts, contact us and experience the difference for yourself.