Security Considerations

When you use ActiCrawl for web scraping, you are responsible for securing the API keys, scraped data, and endpoints involved. This guide covers the main security considerations and best practices for each.

API Key Security

Protecting Your API Keys

Your API key is the primary authentication method for accessing ActiCrawl services. Treat it like a password:

  • Never commit API keys to version control (git, SVN, etc.)
  • Use environment variables to store sensitive credentials
  • Rotate keys regularly
  • Restrict key permissions to the minimum required access
bash
# Good: Using environment variables
export ACTICRAWL_API_KEY="your-api-key-here"
curl -H "Authorization: Bearer $ACTICRAWL_API_KEY" ...

# Bad: Hardcoding keys
curl -H "Authorization: Bearer sk_live_abc123..." ...
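
The same idea carries over to application code: read the key from the environment once at startup and fail fast if it is missing. The following is a minimal sketch; `loadApiKey` and `authHeaders` are hypothetical helper names, not part of any ActiCrawl SDK.

```javascript
// Hypothetical helper: read the API key from the environment and fail fast.
function loadApiKey(env = process.env) {
  const key = env.ACTICRAWL_API_KEY;
  if (typeof key !== 'string' || key.trim() === '') {
    // The error names the variable but never echoes a key value.
    throw new Error('ACTICRAWL_API_KEY is not set');
  }
  return key.trim();
}

// Hypothetical helper: build the Authorization header from the loaded key.
function authHeaders(env = process.env) {
  return { Authorization: `Bearer ${loadApiKey(env)}` };
}
```

Failing at startup keeps a misconfigured deployment from sending unauthenticated requests later.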

Environment-Specific Keys

Use different API keys for different environments:

javascript
// config.js
const config = {
  development: {
    apiKey: process.env.ACTICRAWL_DEV_KEY
  },
  production: {
    apiKey: process.env.ACTICRAWL_PROD_KEY
  }
};
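
One way to pick the right entry from such a config object is sketched below. It assumes `NODE_ENV` names the environment; `selectConfig` is a hypothetical helper introduced for illustration.

```javascript
// Assumption: NODE_ENV selects the environment. Unknown names are rejected
// so that a typo never silently falls back to another environment's key.
function selectConfig(config, envName = process.env.NODE_ENV || 'development') {
  if (!Object.prototype.hasOwnProperty.call(config, envName)) {
    throw new Error(`Unknown environment: ${envName}`);
  }
  const { apiKey } = config[envName];
  if (!apiKey) {
    throw new Error(`No API key configured for ${envName}`);
  }
  return { environment: envName, apiKey };
}
```

Rejecting unknown environment names, rather than defaulting to production, is a deliberate choice in this sketch.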

Data Protection

Handling Sensitive Data

When scraping websites that contain sensitive information:

  1. Minimize data collection - Only collect what you need
  2. Encrypt data in transit - All ActiCrawl API calls use HTTPS
  3. Secure data storage - Encrypt sensitive data at rest
  4. Implement access controls - Limit who can access scraped data
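
As an illustration of encrypting data at rest, the sketch below uses AES-256-GCM from Node's built-in `crypto` module. The function names and the storage layout (nonce, auth tag, and ciphertext in one base64 string) are assumptions made for this example. Key storage and rotation are out of scope here.

```javascript
const crypto = require('crypto');

// Encrypt a UTF-8 string with AES-256-GCM. `key` must be a 32-byte Buffer.
function encryptRecord(plaintext, key) {
  const iv = crypto.randomBytes(12); // fresh nonce for every record
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16 bytes by default
  // Assumed layout: 12-byte nonce, 16-byte tag, then the ciphertext.
  return Buffer.concat([iv, tag, ciphertext]).toString('base64');
}

// Decrypt a value produced by encryptRecord; throws if the data was modified.
function decryptRecord(encoded, key) {
  const raw = Buffer.from(encoded, 'base64');
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

GCM is used here because it detects tampering as well as hiding the content. Decryption fails if the stored value has been altered.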

GDPR and Privacy Compliance

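When scraped records contain personal data, remove or pseudonymize identifying fields before storing them:

javascript
// Example: pseudonymizing personal data before storage
// hashEmail is an application-defined helper (not shown here).
function anonymizeData(scrapedData) {
  return {
    ...scrapedData,
    email: scrapedData.email ? hashEmail(scrapedData.email) : null,
    phone: null, // Remove phone numbers
    // Keep only the first initial; tolerate records without a name
    name: scrapedData.name ? scrapedData.name.charAt(0) + '***' : null
  };
}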
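
The `hashEmail` helper used above is not defined in this guide. One possible sketch is a keyed hash (HMAC-SHA256) over the normalized address. The `EMAIL_HASH_SECRET` environment variable is an assumption made for this example.

```javascript
const crypto = require('crypto');

// Hypothetical helper: keyed hash of a normalized email address.
// The secret (assumed to come from EMAIL_HASH_SECRET) must be kept out of
// source control, like any other credential.
function hashEmail(email, secret = process.env.EMAIL_HASH_SECRET) {
  if (!secret) {
    throw new Error('EMAIL_HASH_SECRET is not set');
  }
  if (typeof email !== 'string' || email.trim() === '') {
    return null;
  }
  // Normalize so that "A@Example.com " and "a@example.com" hash identically.
  const normalized = email.trim().toLowerCase();
  return crypto.createHmac('sha256', secret).update(normalized).digest('hex');
}
```

A keyed hash is generally treated as pseudonymization rather than full anonymization, because anyone holding the secret can recompute the mapping.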

Network Security

IP Whitelisting

For production environments, consider implementing IP whitelisting:

javascript
// Example middleware for IP restriction
const allowedIPs = ['203.0.113.0', '203.0.113.1'];

function ipWhitelist(req, res, next) {
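  // Behind a proxy or load balancer, req.ip is the real client address only
  // when Express's "trust proxy" setting is configured for your deployment.
  const clientIP = req.ip;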
  if (allowedIPs.includes(clientIP)) {
    next();
  } else {
    res.status(403).json({ error: 'Access denied' });
  }
}
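
An exact-match list becomes hard to maintain once whole address ranges must be allowed. The sketch below matches CIDR ranges with `net.BlockList` from Node's standard library (available in Node.js 15 and later). Despite its name, the class is a general address matcher and serves here as an allowlist. The addresses are placeholders from documentation ranges, and `isAllowedIp` is a hypothetical helper.

```javascript
const net = require('net');

// Placeholder ranges; replace with the addresses your deployment trusts.
const allowed = new net.BlockList();
allowed.addSubnet('203.0.113.0', 28); // 203.0.113.0 through 203.0.113.15
allowed.addAddress('198.51.100.7');   // a single address

function isAllowedIp(ip) {
  // Express may report IPv4 clients as IPv4-mapped IPv6 ("::ffff:a.b.c.d").
  const normalized = ip.startsWith('::ffff:') ? ip.slice(7) : ip;
  // This sketch handles IPv4 only; anything else is denied.
  if (!net.isIPv4(normalized)) {
    return false;
  }
  return allowed.check(normalized, 'ipv4');
}
```

Inside the middleware, `allowedIPs.includes(clientIP)` could then be replaced by `isAllowedIp(clientIP)`.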

Rate Limiting

Implement rate limiting to prevent abuse:

javascript
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // limit each IP to 100 requests per windowMs
  message: 'Too many requests from this IP'
});

// Apply the limiter to every route under /api/
app.use('/api/', limiter);
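
To make the `windowMs` and `max` options concrete, the sketch below implements a fixed-window counter, one common way to enforce such a limit. It is an illustration only and makes no claim about how express-rate-limit works internally. `createFixedWindowLimiter` is a hypothetical name.

```javascript
// Fixed-window counter keyed by a client identifier (for example an IP).
// State lives in process memory, so it is neither shared across instances
// nor cleaned up; a production setup would typically use a shared store.
function createFixedWindowLimiter({ windowMs, max }) {
  const hits = new Map(); // key -> { count, windowStart }

  return function isAllowed(key, now = Date.now()) {
    const entry = hits.get(key);
    // Start a new window on first sight or once the old window has expired.
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(key, { count: 1, windowStart: now });
      return true;
    }
    entry.count += 1;
    return entry.count <= max;
  };
}
```

With `windowMs: 1000` and `max: 2`, the first two calls for a key within one second are allowed and the third is rejected until the window resets.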

Authentication & Authorization

Secure Token Storage

Never store authentication tokens in:

  • Local storage (XSS vulnerable)
  • Session storage (XSS vulnerable)
  • Cookies without the HttpOnly flag

javascript
// Secure cookie example
res.cookie('auth_token', token, {
  httpOnly: true,
  secure: true, // HTTPS only
  sameSite: 'strict',
  maxAge: 3600000 // 1 hour
});

Token Validation

Always validate tokens on the server side:

javascript
async function validateRequest(req, res, next) {
  const token = req.headers.authorization?.split(' ')[1];

  if (!token) {
    return res.status(401).json({ error: 'No token provided' });
  }

  try {
    const isValid = await verifyToken(token);
    if (isValid) {
      next();
    } else {
      res.status(401).json({ error: 'Invalid token' });
    }
  } catch (error) {
    res.status(500).json({ error: 'Token validation failed' });
  }
}
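
The `verifyToken` function called above is left to the application. As one possible sketch, the code below signs and verifies a simple HMAC-based token with Node's `crypto` module. The token format, the `TOKEN_SECRET` variable, and the `exp` field are assumptions made for this example. Many applications would use an established format such as JWT through a maintained library instead.

```javascript
const crypto = require('crypto');

// Assumed token format: "<base64url JSON payload>.<base64url HMAC-SHA256>".
function signToken(payload, secret) {
  const body = Buffer.from(JSON.stringify(payload)).toString('base64url');
  const signature = crypto.createHmac('sha256', secret).update(body).digest('base64url');
  return `${body}.${signature}`;
}

// Returns true only for a correctly signed, unexpired token.
// It is synchronous, but `await verifyToken(token)` works unchanged.
function verifyToken(token, secret = process.env.TOKEN_SECRET) {
  const parts = String(token).split('.');
  if (!secret || parts.length !== 2) {
    return false;
  }
  const [body, signature] = parts;
  const expected = crypto.createHmac('sha256', secret).update(body).digest();
  const given = Buffer.from(signature, 'base64url');
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  if (given.length !== expected.length) {
    return false;
  }
  if (!crypto.timingSafeEqual(given, expected)) {
    return false;
  }
  // The payload is parsed only after the signature has been verified.
  const payload = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
  return typeof payload.exp === 'number' && payload.exp > Date.now();
}
```

Comparing signatures with `crypto.timingSafeEqual` avoids leaking information through comparison timing.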

Input Validation

Sanitizing URLs

Always validate and sanitize URLs before scraping:

javascript
const { URL } = require('url');

function validateUrl(urlString) {
  let url;
  try {
    url = new URL(urlString);
  } catch (error) {
    // Only parsing failures are reported as "Invalid URL"
    throw new Error('Invalid URL');
  }

  // Only allow HTTP(S) protocols
  if (!['http:', 'https:'].includes(url.protocol)) {
    throw new Error('Invalid protocol');
  }

  // Basic SSRF guard: block obvious loopback names. This list is not
  // exhaustive; private ranges and DNS-resolved addresses need checks too.
  const blockedHosts = ['localhost', '127.0.0.1', '0.0.0.0', '[::1]'];
  if (blockedHosts.includes(url.hostname)) {
    throw new Error('Blocked host');
  }

  return url.toString();
}
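
A hostname blocklist does not catch private address ranges or public hostnames that resolve to internal addresses. The sketch below extends the check by resolving the hostname and rejecting internal IPv4 ranges. `isPrivateIPv4` and `assertPublicHost` are hypothetical helpers, the range list is not exhaustive, and IPv6 is rejected outright to keep the example short.

```javascript
const net = require('net');
const dns = require('dns').promises;

// True for IPv4 addresses in unspecified, loopback, private, or link-local ranges.
function isPrivateIPv4(ip) {
  const [a, b] = ip.split('.').map(Number);
  return (
    a === 0 ||                           // 0.0.0.0/8
    a === 10 ||                          // 10.0.0.0/8
    a === 127 ||                         // loopback
    (a === 169 && b === 254) ||          // link-local, e.g. cloud metadata
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168)             // 192.168.0.0/16
  );
}

// Resolve the hostname and throw if any resulting address is internal.
async function assertPublicHost(hostname) {
  const literalFamily = net.isIP(hostname); // 0 when not an IP literal
  const addresses = literalFamily
    ? [{ address: hostname, family: literalFamily }]
    : await dns.lookup(hostname, { all: true });
  for (const { address, family } of addresses) {
    if (family !== 4 || isPrivateIPv4(address)) {
      throw new Error('Blocked host');
    }
  }
}
```

A check of this kind can still be bypassed if DNS answers change between validation and the actual request, so it complements network-level egress controls rather than replacing them.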

Request Validation

Validate all incoming requests:

javascript
const Joi = require('joi');

const scrapeSchema = Joi.object({
  url: Joi.string().uri().required(),
  format: Joi.string().valid('markdown', 'json', 'html').required(),
  waitFor: Joi.number().min(0).max(30000).optional(),
  screenshot: Joi.boolean().optional()
});

function validateScrapeRequest(req, res, next) {
  const { error } = scrapeSchema.validate(req.body);
  if (error) {
    return res.status(400).json({ 
      error: error.details[0].message 
    });
  }
  next();
}

Error Handling

Secure Error Messages

Never expose sensitive information in error messages:

javascript
// Bad: Exposing internal details
catch (error) {
  res.status(500).json({ 
    error: error.stack,
    database: error.sqlMessage
  });
}

// Good: Generic error messages
catch (error) {
  console.error('Scraping error:', error); // Log internally
  res.status(500).json({ 
    error: 'An error occurred processing your request',
    reference: generateErrorId()
  });
}
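
The `generateErrorId` helper in the example above is not defined in this guide. A minimal sketch, assuming Node.js 14.17 or later for `crypto.randomUUID`, could pair it with a hypothetical `toClientError` function that logs the details internally and returns only the generic body:

```javascript
const crypto = require('crypto');

// Hypothetical helper: an opaque ID that links a client response to a log entry.
function generateErrorId() {
  return crypto.randomUUID();
}

// Log full details internally; return only a generic message and the reference.
function toClientError(error, log = console.error) {
  const reference = generateErrorId();
  log('Scraping error', { reference, message: error.message, stack: error.stack });
  return {
    error: 'An error occurred processing your request',
    reference
  };
}
```

Support staff can then look up the reference in the internal logs without the client ever seeing stack traces or database messages.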

Monitoring & Logging

Security Event Logging

Log security-relevant events:

javascript
const winston = require('winston');

const securityLogger = winston.createLogger({
  level: 'info',
  format: winston.format.json(),
  transports: [
    new winston.transports.File({ 
      filename: 'security.log',
      maxsize: 5242880, // 5MB
      maxFiles: 5
    })
  ]
});

// Log security events
securityLogger.info('API_KEY_USED', {
  timestamp: new Date(),
  apiKey: maskApiKey(apiKey),
  ip: req.ip,
  userAgent: req.headers['user-agent']
});
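
The `maskApiKey` helper used in the log call is not defined in this guide. One possible sketch keeps only the last four characters, which is usually enough to tell keys apart in logs without exposing them. The function body and the choice of four visible characters are assumptions.

```javascript
// Hypothetical helper: hide all but the last `visible` characters of a key.
// Short or missing keys are masked entirely so nothing useful leaks.
function maskApiKey(apiKey, visible = 4) {
  if (typeof apiKey !== 'string' || apiKey.length <= visible * 2) {
    return '****';
  }
  return '****' + apiKey.slice(-visible);
}
```

For example, `maskApiKey('sk_live_abc123def456')` returns `'****f456'`.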

Anomaly Detection

Monitor for suspicious patterns:

javascript
async function detectAnomalies(userId) {
  const recentRequests = await getRecentRequests(userId, '1h');

  if (recentRequests.length > 1000) {
    await flagAccount(userId, 'HIGH_VOLUME');
  }

  const uniqueIPs = [...new Set(recentRequests.map(r => r.ip))];
  if (uniqueIPs.length > 10) {
    await flagAccount(userId, 'MULTIPLE_IPS');
  }
}
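
The checks above depend on `getRecentRequests` and `flagAccount`, which are application-specific. To make the thresholds easy to unit test, the decision logic can be separated into a pure function, sketched below. `findAnomalies` and its option names are hypothetical.

```javascript
// Pure decision logic: returns the list of flags for a set of recent requests.
// Each request is assumed to be an object with at least an `ip` property.
function findAnomalies(recentRequests, { maxRequests = 1000, maxUniqueIps = 10 } = {}) {
  const flags = [];
  if (recentRequests.length > maxRequests) {
    flags.push('HIGH_VOLUME');
  }
  const uniqueIps = new Set(recentRequests.map((r) => r.ip));
  if (uniqueIps.size > maxUniqueIps) {
    flags.push('MULTIPLE_IPS');
  }
  return flags;
}
```

The caller would then fetch the requests, call `findAnomalies`, and pass each returned flag to its own account-flagging routine.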

Best Practices Checklist

Development

  • [ ] Use environment variables for sensitive data
  • [ ] Enable HTTPS for all communications
  • [ ] Implement proper error handling
  • [ ] Validate and sanitize all inputs
  • [ ] Use secure coding practices

Production

  • [ ] Run regular security audits
  • [ ] Implement rate limiting
  • [ ] Monitor for suspicious activity
  • [ ] Keep dependencies updated
  • [ ] Have an incident response plan

Compliance

  • [ ] Follow robots.txt rules
  • [ ] Respect website terms of service
  • [ ] Implement data retention policies
  • [ ] Ensure GDPR compliance
  • [ ] Document security measures

Security Headers

Always include security headers in your responses:

javascript
app.use((req, res, next) => {
  res.setHeader('X-Content-Type-Options', 'nosniff');
  res.setHeader('X-Frame-Options', 'DENY');
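  // X-XSS-Protection is deprecated; '0' disables the legacy browser filter,
  // which could itself introduce vulnerabilities. Rely on CSP instead.
  res.setHeader('X-XSS-Protection', '0');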
  res.setHeader('Strict-Transport-Security', 
    'max-age=31536000; includeSubDomains');
  res.setHeader('Content-Security-Policy', 
    "default-src 'self'");
  next();
});

Conclusion

Security is an ongoing process, not a one-time implementation. Stay updated with the latest security best practices, regularly audit your implementation, and always prioritize the protection of your users' data.

For security concerns or to report vulnerabilities, please contact our security team at security@acticrawl.com.