计算词频,即分词后计算文章的总词数和每个词的出现次数,词数较多可取TOPk
//$tf = 词出现次数 / 总词数
计算IDF,语料可使用百度/Google结果数:
//$idf = log( 总文档数 / 包含词的文档数, 2);
$idf = log( $total_document_count / $documents_with_term, 2);
计算TF-IDF,值越大分类能力越强:
$tfidf = $tf * $idf
发布时间:December 21, 2014 // 分类:PHP // No Comments
计算词频,即分词后计算文章的总词数和每个词的出现次数,词数较多可取TOPk
//$tf = 词出现次数 / 总词数
计算IDF,语料可使用百度/Google结果数:
//$idf = log( 总文档数 / 包含词的文档数, 2);
$idf = log( $total_document_count / $documents_with_term, 2);
计算TF-IDF,值越大分类能力越强:
$tfidf = $tf * $idf
发布时间:December 21, 2014 // 分类:PHP // No Comments
根据贝叶斯推断及其互联网应用(二):过滤垃圾邮件实现:
首先收集垃圾邮件和正常邮件,分词后计算每个词分别出现的频率,比如计算垃圾邮件库每个词的频率:
//分词略过
foreach ($words as $word) {
$key = base64_encode($word);
if (isset($spamwords[$key])) {
$spamwords[$key]++;
} else {
$spamwords[$key] = 1;
}
}
单个词判断垃圾邮件概率:
//先验概率为50%
$ps = 0.5;
$ph = 0.5;
//在正常邮件中的出现频率,比如4000封正常邮件2封包含这个词。
$pwh = 0.0005;
//在垃圾邮件中的出现频率
$pws = 0.05;
//垃圾邮件概率
$psw = $pws * $ps / ($pws * $ps + $pwh * $ph);
echo $psw;
多个词计算联合概率:
//根据上面计算的多个词的概率集合
$psws = array(0.2, 0.3, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.8);
$numerator = 1;
$denominator1 = 1;
$denominator2 = 1;
foreach ($psws as $value) {
$numerator *= $value;
$denominator1 *= $value;
$denominator2 *= 1 - $value;
}
echo $numerator / ($denominator1 + $denominator2);
发布时间:December 21, 2014 // 分类:PHP // No Comments
起始IP方式匹配:
<?php
$ips = array('192.168.1.1-192.168.1.254','192.168.0.1-192.168.0.254','192.168.3.1-192.168.3.254','192.168.4.1-192.168.4.254');
foreach ($ips as $ip) {
list($start, $end) = explode('-', $ip);
//去除32位下负数
$start = sprintf("%u", ip2long($start));
$end = sprintf("%u",ip2long($end));
$newips[] = array($start, $end);
}
function my_sort($a, $b) {
if ($a[0] == $b[0])
return 0;
return ($a[0] < $b[0]) ? -1 : 1;
}
//从大到小排序,IP段较多时可以使用2分查找或多分。
usort($newips, 'my_sort');
$matchip = '192.168.1.22';
$matchip = sprintf("%u",ip2long($matchip));
foreach ($newips as $value) {
if ($matchip > $value[0] && $matchip < $value[1])
echo "匹配IP段:". long2ip($value[0]) . '-' . long2ip($value[1]) . "\n";
}
CIDR方式匹配:
<?php
function cidr_match($ip, $range)
{
list ($subnet, $bits) = explode('/', $range);
$ip = ip2long($ip);
$subnet = ip2long($subnet);
$mask = -1 << (32 - $bits);
$subnet &= $mask;
return ($ip & $mask) == $subnet;
}
var_dump(cidr_match('192.168.1.22', '192.168.1.0/24'));
CIDR获取IP段起始IP地址:
<?php
function cidrToRange($cidr) {
$range = array();
$cidr = explode('/', $cidr);
$range[0] = long2ip((ip2long($cidr[0])) & ((-1 << (32 - (int)$cidr[1]))));
$range[1] = long2ip((ip2long($cidr[0])) + pow(2, (32 - (int)$cidr[1])) - 1);
return $range;
}
var_dump(cidrToRange("192.168.1.1/24"));
参考:
http://stackoverflow.com/questions/4931721/getting-list-ips-from-cidr-notation-in-php
http://stackoverflow.com/questions/594112/matching-an-ip-to-a-cidr-mask-in-php5
发布时间:December 13, 2014 // 分类: // No Comments
一些隐藏referer的方法,非完全兼容。
<meta name="referrer" content="never">
<a href="https://www.haiyun.me" rel=noreferrer</a>
<meta http-equiv="refresh" content="5; url=https://www.haiyun.me/" />
https to http
JS:
<script>
function open_without_referrer(link){
document.body.appendChild(document.createElement('iframe')).src='javascript:"<script>top.location.replace(\''+link+'\')<\/script>"';
}
open_without_referrer("https://www.haiyun.me/");
</script>
完整解决方案:
https://github.com/knu/noreferrer
参考:
http://www.cnblogs.com/rubylouvre/p/3541411.html
http://zhongfox.github.io/blog/javascript/2013/08/16/remove-referer-using-js/
https://blog.gslin.org/archives/2014/12/05/5404/%E8%A6%81%E7%80%8F%E8%A6%BD%E5%99%A8%E4%B8%8D%E8%A6%81%E9%80%81%E5%87%BA-referrer-%E7%9A%84-referrer-policy/
发布时间:December 11, 2014 // 分类:PHP // No Comments
方法1,hash后取模:
<?php
function get_hash($str, $num=10){
$crc = crc32($str);
$nu = $crc % $num;
return $nu;
}
测试公平性:
<?php
function genstr($num) {
return substr(str_shuffle('abcdefghijklmnupqrstuvwxyz'), 0, $num);
}
for ($i=1; $i<10000000; $i++) {
$str = genstr(5);
$crc = crc32($str);
$nu = get_hash($str);;
if (isset($count[$nu])) {
$count[$nu]++;
} else {
$count[$nu]=1;
}
}
print_r($count);
方法2,动态增添后最小影响之前的hash结果可使用consistent hashing:
https://github.com/RJ/ketama
https://github.com/pda/flexihash