Security Considerations
When you use ActiCrawl for web scraping, you are responsible for protecting your credentials, the data you collect, and the services you build around the API. This guide covers the main security considerations and best practices.
API Key Security
Protecting Your API Keys
Your API key is the primary authentication method for accessing ActiCrawl services. Treat it like a password:
- Never commit API keys to version control (git, SVN, etc.)
- Use environment variables to store sensitive credentials
- Rotate keys regularly
- Restrict key permissions to the minimum required access
# Good: Using environment variables
export ACTICRAWL_API_KEY="your-api-key-here"
curl -H "Authorization: Bearer $ACTICRAWL_API_KEY" ...
# Bad: Hardcoding keys
curl -H "Authorization: Bearer sk_live_abc123..." ...
Environment-Specific Keys
Use different API keys for different environments:
// config.js
const config = {
development: {
apiKey: process.env.ACTICRAWL_DEV_KEY
},
production: {
apiKey: process.env.ACTICRAWL_PROD_KEY
}
};
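One way to use per-environment keys is to select the key at startup and fail fast when it is missing. The following is a minimal sketch under stated assumptions: `loadApiKey`, the `KEY_VARS` map, and the use of `NODE_ENV` to name the environment are hypothetical conventions, not part of the ActiCrawl API.

```javascript
// Hypothetical helper: map each environment name to the variable holding its key.
const KEY_VARS = {
  development: 'ACTICRAWL_DEV_KEY',
  production: 'ACTICRAWL_PROD_KEY',
};

// `env` defaults to process.env but can be replaced in tests.
function loadApiKey(env = process.env) {
  const envName = env.NODE_ENV || 'development';
  const varName = KEY_VARS[envName];
  if (!varName) {
    throw new Error(`No API key variable configured for environment "${envName}"`);
  }
  const apiKey = env[varName];
  if (!apiKey) {
    // Name the missing variable, never the key value itself.
    throw new Error(`Missing required environment variable ${varName}`);
  }
  return apiKey;
}
```

Failing at startup makes a missing or misnamed variable visible immediately, rather than surfacing later as an authentication error.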
Data Protection
Handling Sensitive Data
When scraping websites that contain sensitive information:
- Minimize data collection - Only collect what you need
- Encrypt data in transit - All ActiCrawl API calls use HTTPS
- Secure data storage - Encrypt sensitive data at rest
- Implement access controls - Limit who can access scraped data
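As an illustration of encrypting scraped data at rest, the sketch below uses AES-256-GCM from Node's built-in `crypto` module. The function names and the storage layout (nonce, auth tag, and ciphertext packed into one base64 string) are assumptions made for this example. Key management is out of scope here; the 32-byte key is assumed to come from a secrets manager or similar.

```javascript
const crypto = require('crypto');

// Encrypts a UTF-8 string with AES-256-GCM. `key` must be a 32-byte Buffer.
function encryptRecord(plaintext, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce for every record
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte authentication tag
  // Assumed layout: nonce (12 bytes) | tag (16 bytes) | ciphertext.
  return Buffer.concat([iv, tag, ciphertext]).toString('base64');
}

function decryptRecord(encoded, key) {
  const data = Buffer.from(encoded, 'base64');
  const iv = data.subarray(0, 12);
  const tag = data.subarray(12, 28);
  const ciphertext = data.subarray(28);
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the ciphertext or tag has been modified.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

An authenticated mode such as GCM is used so that tampering with stored records is detected on read instead of producing silently corrupted data.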
GDPR and Privacy Compliance
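Anonymize or remove personal data before you store it:
// Example: Anonymizing personal data
// hashEmail is an application-provided helper
function anonymizeData(scrapedData) {
  return {
    ...scrapedData,
    email: scrapedData.email ? hashEmail(scrapedData.email) : null,
    phone: null, // Remove phone numbers
    // Keep only the first character; guard against a missing name
    name: scrapedData.name ? scrapedData.name.substring(0, 1) + '***' : null
  };
}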
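The `hashEmail` helper used above is not defined in this guide. One possible implementation, sketched here as an assumption, is a keyed hash (HMAC-SHA256) over the normalized address. The `EMAIL_HASH_SECRET` variable name is hypothetical. Note that a keyed hash is pseudonymization rather than full anonymization: anyone holding the secret can re-link a known email to its hash, so the output should still be handled as personal data.

```javascript
const crypto = require('crypto');

// Hypothetical implementation of hashEmail.
// `secret` is an application-held key, assumed to come from the environment.
function hashEmail(email, secret = process.env.EMAIL_HASH_SECRET) {
  if (!secret) {
    throw new Error('A secret key is required to hash email addresses');
  }
  if (typeof email !== 'string' || email.trim() === '') {
    return null; // nothing to hash
  }
  // Normalize so that "A@Example.com " and "a@example.com" map to the same hash.
  const normalized = email.trim().toLowerCase();
  return crypto.createHmac('sha256', secret).update(normalized).digest('hex');
}
```

A plain unsalted hash would be easier to reverse by hashing lists of known addresses, which is why the sketch uses a secret key.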
Network Security
IP Whitelisting
For production environments, consider implementing IP whitelisting:
// Example middleware for IP restriction
// Note: behind a reverse proxy, req.ip is the real client address only when
// Express's 'trust proxy' setting is configured
const allowedIPs = ['203.0.113.0', '203.0.113.1'];
function ipWhitelist(req, res, next) {
  const clientIP = req.ip;
  if (allowedIPs.includes(clientIP)) {
    next();
  } else {
    res.status(403).json({ error: 'Access denied' });
  }
}
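The exact-match list above does not cover CIDR ranges or IPv4 clients reported in IPv4-mapped IPv6 form (for example `::ffff:203.0.113.1`, which a dual-stack server may produce). The sketch below shows one way to handle both with Node's built-in `net.BlockList`, used here as an allow list. The addresses come from the documentation ranges, and `normalizeIp`, `isAllowedIp`, and `ipAllowList` are hypothetical names.

```javascript
const net = require('net');

// net.BlockList only answers "is this address in the set?", so it works as an allow list too.
const allowed = new net.BlockList();
allowed.addAddress('203.0.113.10');    // single address
allowed.addSubnet('198.51.100.0', 24); // CIDR range 198.51.100.0/24

// Strip the IPv4-mapped IPv6 prefix so "::ffff:203.0.113.10" is treated as IPv4.
function normalizeIp(ip) {
  const value = String(ip || '');
  return value.startsWith('::ffff:') && net.isIPv4(value.slice(7)) ? value.slice(7) : value;
}

function isAllowedIp(ip) {
  const address = normalizeIp(ip);
  const family = net.isIP(address); // 4, 6, or 0 when the value is not an IP address
  if (family === 0) return false;
  return allowed.check(address, family === 6 ? 'ipv6' : 'ipv4');
}

// Express-style middleware built on the check above.
function ipAllowList(req, res, next) {
  if (isAllowedIp(req.ip)) return next();
  res.status(403).json({ error: 'Access denied' });
}
```

Values that are not valid IP addresses are rejected rather than compared as strings, so a malformed header cannot match by accident.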
Rate Limiting
Implement rate limiting to prevent abuse:
const express = require('express');
const rateLimit = require('express-rate-limit');
const app = express();
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // limit each IP to 100 requests per windowMs
  message: 'Too many requests from this IP'
});
// Apply the limiter to every route under /api/
app.use('/api/', limiter);
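To make the mechanism concrete, here is a minimal sketch of a fixed-window counter, the general idea behind per-IP limits like the one above. It is an illustration only and not a description of how `express-rate-limit` works internally. `createFixedWindowLimiter` and its options are hypothetical names, state lives in process memory, and old entries are never evicted.

```javascript
// Minimal in-memory fixed-window counter.
// `now` is injectable so the logic can be exercised without waiting for real time to pass.
function createFixedWindowLimiter({ windowMs, max, now = Date.now }) {
  const hits = new Map(); // key -> { count, windowStart }

  return function isAllowed(key) {
    const t = now();
    const entry = hits.get(key);
    if (!entry || t - entry.windowStart >= windowMs) {
      // First request from this key, or the previous window has expired.
      hits.set(key, { count: 1, windowStart: t });
      return true;
    }
    entry.count += 1;
    return entry.count <= max;
  };
}
```

For a deployment with several server processes, the counters would need shared storage; a per-process map only limits traffic seen by that process.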
Authentication & Authorization
Secure Token Storage
Never store authentication tokens in:
- Local storage (XSS vulnerable)
- Session storage (XSS vulnerable)
- Cookies without HttpOnly flag
// Secure cookie example
res.cookie('auth_token', token, {
httpOnly: true,
secure: true, // HTTPS only
sameSite: 'strict',
maxAge: 3600000 // 1 hour
});
Token Validation
Always validate tokens on the server side:
async function validateRequest(req, res, next) {
const token = req.headers.authorization?.split(' ')[1];
if (!token) {
return res.status(401).json({ error: 'No token provided' });
}
try {
const isValid = await verifyToken(token);
if (isValid) {
next();
} else {
res.status(401).json({ error: 'Invalid token' });
}
} catch (error) {
res.status(500).json({ error: 'Token validation failed' });
}
}
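The `verifyToken` function called above is left to the application. As a rough sketch of what server-side verification can involve, the example below checks an HMAC-signed token with a constant-time comparison. The token format (`<payload>.<hex signature>`), the `signToken` helper, and the extra `secret` parameter are all assumptions for illustration. In practice a maintained token library is usually the better choice.

```javascript
const crypto = require('crypto');

// Hypothetical token format: "<payload>.<hex HMAC-SHA256 of payload>".
function signToken(payload, secret) {
  const signature = crypto.createHmac('sha256', secret).update(payload).digest('hex');
  return `${payload}.${signature}`;
}

function verifyToken(token, secret) {
  const separator = typeof token === 'string' ? token.lastIndexOf('.') : -1;
  if (separator <= 0) return false; // missing payload or missing signature
  const payload = token.slice(0, separator);
  const provided = Buffer.from(token.slice(separator + 1), 'hex');
  const expected = crypto.createHmac('sha256', secret).update(payload).digest();
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return provided.length === expected.length && crypto.timingSafeEqual(provided, expected);
}
```

The constant-time comparison avoids leaking, through response timing, how many leading bytes of a forged signature were correct.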
Input Validation
Sanitizing URLs
Always validate and sanitize URLs before scraping:
const { URL } = require('url');
function validateUrl(urlString) {
  let url;
  // Parse first, so the checks below can report their own specific errors
  try {
    url = new URL(urlString);
  } catch (error) {
    throw new Error('Invalid URL');
  }
  // Only allow HTTP(S) protocols
  if (!['http:', 'https:'].includes(url.protocol)) {
    throw new Error('Invalid protocol');
  }
  // Basic SSRF guard: block obvious loopback hosts. This list is not exhaustive;
  // private ranges and hostnames that resolve to internal addresses need a separate check.
  const blockedHosts = ['localhost', '127.0.0.1', '0.0.0.0', '[::1]'];
  if (blockedHosts.includes(url.hostname)) {
    throw new Error('Blocked host');
  }
  return url.toString();
}
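The host list above catches only literal loopback names. A hostname can still resolve to a private or link-local address, including cloud metadata endpoints. The sketch below is one possible way to check resolved addresses against internal ranges with Node's built-in `net.BlockList` and `dns` modules. The range list is an assumption and is not exhaustive, and `isPrivateAddress` and `assertPublicHost` are hypothetical names.

```javascript
const net = require('net');
const dns = require('dns').promises;

// Ranges a scraper should not reach: loopback, private, and link-local.
const privateRanges = new net.BlockList();
privateRanges.addSubnet('0.0.0.0', 8);
privateRanges.addSubnet('10.0.0.0', 8);
privateRanges.addSubnet('127.0.0.0', 8);
privateRanges.addSubnet('169.254.0.0', 16); // includes 169.254.169.254 (cloud metadata)
privateRanges.addSubnet('172.16.0.0', 12);
privateRanges.addSubnet('192.168.0.0', 16);
privateRanges.addAddress('::1', 'ipv6');
privateRanges.addSubnet('fc00::', 7, 'ipv6');  // unique local addresses
privateRanges.addSubnet('fe80::', 10, 'ipv6'); // link-local addresses

function isPrivateAddress(ip) {
  const family = net.isIP(ip);
  if (family === 0) return true; // not an IP address: treat as unsafe
  return privateRanges.check(ip, family === 6 ? 'ipv6' : 'ipv4');
}

// Resolves the hostname and rejects it if any resolved address is internal.
async function assertPublicHost(hostname) {
  const addresses = await dns.lookup(hostname, { all: true });
  if (addresses.some(({ address }) => isPrivateAddress(address))) {
    throw new Error('Blocked host');
  }
}
```

A check like this narrows the gap but does not close it: DNS answers can change between the check and the actual request, so the fetch itself should ideally connect to the address that was validated.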
Request Validation
Validate all incoming requests:
const Joi = require('joi');
const scrapeSchema = Joi.object({
url: Joi.string().uri().required(),
format: Joi.string().valid('markdown', 'json', 'html').required(),
waitFor: Joi.number().min(0).max(30000).optional(),
screenshot: Joi.boolean().optional()
});
function validateScrapeRequest(req, res, next) {
const { error } = scrapeSchema.validate(req.body);
if (error) {
return res.status(400).json({
error: error.details[0].message
});
}
next();
}
Error Handling
Secure Error Messages
Never expose sensitive information in error messages:
// Bad: Exposing internal details
try {
  // ... scraping logic ...
} catch (error) {
  res.status(500).json({
    error: error.stack,
    database: error.sqlMessage
  });
}
// Good: Generic error messages
try {
  // ... scraping logic ...
} catch (error) {
  console.error('Scraping error:', error); // Log internally
  res.status(500).json({
    error: 'An error occurred processing your request',
    reference: generateErrorId()
  });
}
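The `generateErrorId()` call in the example above is not defined in this guide. A simple option, shown here as an assumption, is a random UUID from Node's built-in `crypto` module, logged alongside the full error so support staff can find the details from the reference alone. `toClientError` is a hypothetical helper name.

```javascript
const crypto = require('crypto');

// Hypothetical implementation of generateErrorId: a random reference
// that carries no internal details.
function generateErrorId() {
  return crypto.randomUUID();
}

// Logs the full error under a reference and returns only the safe response body.
function toClientError(error, log = console.error) {
  const reference = generateErrorId();
  log('Scraping error:', { reference, message: error.message, stack: error.stack });
  return { error: 'An error occurred processing your request', reference };
}
```

A handler could then respond with `res.status(500).json(toClientError(error))`, keeping stack traces and database messages in the server log only.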
Monitoring & Logging
Security Event Logging
Log security-relevant events:
const winston = require('winston');
const securityLogger = winston.createLogger({
level: 'info',
format: winston.format.json(),
transports: [
new winston.transports.File({
filename: 'security.log',
maxsize: 5242880, // 5MB
maxFiles: 5
})
]
});
// Log security events
securityLogger.info('API_KEY_USED', {
timestamp: new Date(),
apiKey: maskApiKey(apiKey),
ip: req.ip,
userAgent: req.headers['user-agent']
});
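The `maskApiKey` helper in the logging call above is assumed rather than provided. A minimal sketch of one possible implementation keeps a short prefix and suffix for correlation and hides the rest. The number of visible characters is an arbitrary choice for this example.

```javascript
// Hypothetical implementation of maskApiKey: keep a short prefix and suffix
// so log entries can be correlated, and hide everything else.
function maskApiKey(apiKey, visible = 4) {
  if (typeof apiKey !== 'string' || apiKey.length === 0) return '[missing]';
  // Short keys are masked entirely so the visible parts never cover most of the key.
  if (apiKey.length <= visible * 4) return '*'.repeat(apiKey.length);
  const hidden = '*'.repeat(apiKey.length - visible * 2);
  return apiKey.slice(0, visible) + hidden + apiKey.slice(-visible);
}
```

Masking before the value reaches the logger means the full key never lands in `security.log`, even if the log file is later shared or leaked.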
Anomaly Detection
Monitor for suspicious patterns:
async function detectAnomalies(userId) {
const recentRequests = await getRecentRequests(userId, '1h');
if (recentRequests.length > 1000) {
await flagAccount(userId, 'HIGH_VOLUME');
}
const uniqueIPs = [...new Set(recentRequests.map(r => r.ip))];
if (uniqueIPs.length > 10) {
await flagAccount(userId, 'MULTIPLE_IPS');
}
}
Best Practices Checklist
Development
- [ ] Use environment variables for sensitive data
- [ ] Enable HTTPS for all communications
- [ ] Implement proper error handling
- [ ] Validate and sanitize all inputs
- [ ] Use secure coding practices
Production
- [ ] Run regular security audits
- [ ] Implement rate limiting
- [ ] Monitor for suspicious activity
- [ ] Keep dependencies updated
- [ ] Have an incident response plan
Compliance
- [ ] Follow robots.txt rules
- [ ] Respect website terms of service
- [ ] Implement data retention policies
- [ ] Ensure GDPR compliance
- [ ] Document security measures
Security Headers
Always include security headers in your responses:
app.use((req, res, next) => {
  res.setHeader('X-Content-Type-Options', 'nosniff');
  res.setHeader('X-Frame-Options', 'DENY');
  // The legacy browser XSS filter is deprecated; disable it and rely on CSP instead
  res.setHeader('X-XSS-Protection', '0');
  res.setHeader('Strict-Transport-Security',
    'max-age=31536000; includeSubDomains');
  res.setHeader('Content-Security-Policy',
    "default-src 'self'");
  next();
});
Conclusion
Security is an ongoing process, not a one-time implementation. Keep up with current security best practices, audit your implementation regularly, and prioritize the protection of your users' data.
For security concerns or to report vulnerabilities, please contact our security team at security@acticrawl.com.