Blog posts

2023

Quant Note 1

less than 1 minute read

Published:

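Multi-factor quantitative stock selection

Traditional steps:

  • Design factors by hand: price-to-earnings ratio (PE), earnings per share (EPS), book value per share (BVPS), price-to-book ratio (PB), net profit per share (NPR), and so on.

  • Scoring method: compute a score directly as a weighted sum of the factors. Regression method: fit the weights from data, similar to machine learning.

  • Stock selection: pick stocks by ranking their scores.

  • Backtest: compute the excess return of the selection on historical data (the return above the market index is called alpha; the corresponding market return is called beta).

    The first three steps together make up a strategy.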
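
The scoring, selection, and backtest steps can be sketched in a few lines of Python. This is only an illustration: the tickers, factor values, weights, and returns below are made up, and min-max scaling is just one assumed way to put factors on a common scale before weighting.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_stocks(factors, weights):
    """Weighted sum of scaled factors for every ticker."""
    tickers = sorted(factors)
    scores = {t: 0.0 for t in tickers}
    for name, w in weights.items():
        scaled = min_max_scale([factors[t][name] for t in tickers])
        for t, s in zip(tickers, scaled):
            scores[t] += w * s
    return scores


def select_top(scores, k):
    """Pick the k tickers with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def excess_return(selected, returns, benchmark_return):
    """Equal-weight portfolio return minus the benchmark return."""
    portfolio = sum(returns[t] for t in selected) / len(selected)
    return portfolio - benchmark_return


# Hypothetical factor table and weights (low PE/PB and high EPS score higher).
factors = {
    "AAA": {"pe": 8.0, "pb": 1.0, "eps": 2.0},
    "BBB": {"pe": 30.0, "pb": 5.0, "eps": 0.5},
    "CCC": {"pe": 15.0, "pb": 2.0, "eps": 1.0},
}
weights = {"pe": -0.4, "pb": -0.2, "eps": 0.4}

# Hypothetical returns over the backtest period.
returns = {"AAA": 0.10, "BBB": -0.02, "CCC": 0.04}
benchmark_return = 0.03

scores = score_stocks(factors, weights)
selected = select_top(scores, k=2)
alpha = excess_return(selected, returns, benchmark_return)
print(selected, round(alpha, 4))
```

A real backtest would rebalance over many periods and account for trading costs; the single-period version here only shows how the pieces connect.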

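AI quant steps:

  • Hand-pick or design a large number of factors, or use a factor library such as Alpha158 or Alpha360.

  • Build a model, for example LightGBM (LGBM) or an MLP.

  • Build the dataset: the factors above for all stocks over a period of time are X, and the return (or a similar target) is Y. Split it by time into training, validation, and test sets.

  • Train the model on the training and validation sets (in effect, this learns the weights of the traditional factors automatically).

  • Use the output of the trained model as the score for stock selection.

  • Backtest: compute the excess return of the selection on historical data (the test set split off earlier).

    The first five steps together make up a strategy.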
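
Splitting by time rather than at random matters because the test set must lie strictly after the training data. A minimal sketch follows, assuming each sample is a dict with a `date` field; the cut-off dates and sample rows are hypothetical.

```python
from datetime import date


def split_by_time(rows, train_end, valid_end):
    """Split samples chronologically: train <= train_end < valid <= valid_end < test."""
    train = [r for r in rows if r["date"] <= train_end]
    valid = [r for r in rows if train_end < r["date"] <= valid_end]
    test = [r for r in rows if r["date"] > valid_end]
    return train, valid, test


# Hypothetical samples: one row per (stock, day) with factors X and return y.
rows = [
    {"date": date(2020, 6, 1), "ticker": "AAA", "x": [0.1, 0.2], "y": 0.01},
    {"date": date(2020, 12, 1), "ticker": "BBB", "x": [0.3, 0.1], "y": -0.02},
    {"date": date(2021, 6, 1), "ticker": "AAA", "x": [0.2, 0.2], "y": 0.03},
    {"date": date(2022, 3, 1), "ticker": "CCC", "x": [0.5, 0.4], "y": 0.00},
    {"date": date(2022, 9, 1), "ticker": "BBB", "x": [0.1, 0.6], "y": 0.02},
]

train, valid, test = split_by_time(
    rows, train_end=date(2020, 12, 31), valid_end=date(2021, 12, 31)
)
print(len(train), len(valid), len(test))
```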

2022

Multi-Interest Modeling

3 minute read

Published:

Awesome Multi-Interest Modeling Paper List:

  • 2019 MIMN NTM based KDD

    Based on memory: every time an interaction record is generated, the multi-head memory is written with attention (the write is split into an erase step and an add step) and read with attention, and the parameters are trained and updated. After going online, the long-term memory is used as the user representation.

    Disadvantages: It cannot handle very long sequences, because a fixed number of memory slots is used, and a large amount of noise is stored when the sequence is too long (as pointed out by SIM)

  • 2019 HPMN Hierarchical RNN SIGIR

    Use a hierarchical RNN to generate multi-scope memory, and read the memory in the NTM way.

    Disadvantages: memory is not aimed at multiple interests, it is an abstraction of different granularities of sequences, and there is a risk of information redundancy.

  • 2019 SIM search-based method

    First retrieve all interacted items similar to the target item from the long sequence, based on a category index (or item similarity), to form a subsequence, and then treat it as a short-sequence problem.

    Disadvantages: The update of the user representation and the target item are clearly not decoupled. There is no notion of a standalone user representation Ur, so Ur cannot be updated incrementally, and the performance is not as good as the decoupled methods.

  • 2019 MIND capsule network CIKM

    Use capsule network to decouple user representation and target item

    Disadvantages: New sequences cannot be fused, and new sequences need to recalculate all representations on the entire sequence

  • 2020 ComiRec capsule network+self attention KDD

    Use a capsule network or a self-attention model to decouple user representation and target item. Compared with MIND, it also explores the balance between exploitation and exploration.

    Disadvantages: New sequences cannot be fused, and new sequences need to recalculate all representations on the entire sequence

  • 2022 LimaRec Linear Self Attention arxiv

    The linear self-attention model can be used to decouple user representation and target item. At the same time, new sequences can also be integrated into user representation to realize incremental update Ur. On the premise of achieving the same purpose as MIMN and HPMN, a more expressive self-attention model is used.

    method    incremental Ur update    long seq    multi interest       decouple    retrain
    MIMN      yes                      yes         yes (memory)         yes         full
    HPMN      yes                      yes         yes (multi scope)    yes         full
    MIND      no                       yes         yes (dr)             yes         full
    SIM       no                       yes         yes (sa)             no          full
    ComiRec   no                       yes         yes (sa, dr)         yes         full
    LimaRec   yes                      yes         yes (sa)             yes         full
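
To make "incremental update of Ur" with linear self-attention more concrete, here is a generic linear-attention sketch. It is not taken from the LimaRec paper: the feature map `elu(x) + 1`, the class name, and the toy vectors are all assumptions. The point is only that the running sums `S` and `z` can absorb a new interaction in constant time, and that several interest queries can then read from the same state, giving the same result as recomputing over the whole sequence.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied element-wise (an assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionState:
    """Running sums S = sum(phi(k) v^T) and z = sum(phi(k)) over the sequence."""

    def __init__(self, dim_k, dim_v):
        self.S = [[0.0] * dim_v for _ in range(dim_k)]
        self.z = [0.0] * dim_k

    def update(self, key, value):
        """Fold one new interaction into the state without touching old items."""
        for i, ki in enumerate(phi(key)):
            self.z[i] += ki
            for j, vj in enumerate(value):
                self.S[i][j] += ki * vj

    def read(self, query):
        """Attention output for one query; call only after at least one update."""
        fq = phi(query)
        denom = sum(q * z for q, z in zip(fq, self.z))
        dim_v = len(self.S[0])
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(dim_v)
        ]


def batch_read(query, keys, values):
    """Reference: recompute the same attention over the full sequence."""
    fq = phi(query)
    weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


# Toy sequence of item keys/values and two hypothetical interest queries.
keys = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
interest_queries = [[1.0, 0.0], [0.0, 1.0]]

state = LinearAttentionState(dim_k=2, dim_v=2)
for k, v in zip(keys[:2], values[:2]):
    state.update(k, v)          # history seen so far
state.update(keys[2], values[2])  # a new interaction arrives later

incremental = [state.read(q) for q in interest_queries]
print(incremental)
```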

Main Idea:

1 The premise is sequence modeling, and the main purpose is to reduce the inference time and storage overhead of sequence models.

2 The core is to drop the target-aware attention that models such as DIEN compute over the sequence, so that the item representation and the user representation are computed separately.

3 Multiple interests are introduced because, under this design, a single vector is too weak as a user representation. Users naturally have multiple interests, so representing them with multiple vectors is more intuitive.

Limitations:

Past work on multi-interest modeling has not solved adaptive updating of the interests (incremental update of model parameters is not considered). Simply setting a large number of interests K from the start does not work either: most previous work shows that model performance drops when K is very large. (Retraining the model on the entire sequence also takes a while.)

Electron & Cordova: Two Nodejs Based App Development Kit

1 minute read

Published:

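Building a desktop app with Electron

Reference

First check that Node.js is installed: node -v

Then create the app:

$ mkdir my-electron-app && cd my-electron-app
$ npm init

This creates a directory, enters it, and initializes it (much like git init) so that it becomes the working directory of the Electron project. Note that the main field in package.json should be main.js.

Next, install the dependency inside this directory. The whole app ships with its dependencies, which is why Electron apps are relatively large.

$ npm install --save-dev electron

If the install hangs at this point, it is usually a network problem; a China CDN mirror can be used instead:

$ ELECTRON_MIRROR="https://npmmirror.com/mirrors/electron/" npm install --save-dev electron

main.js manages the lifecycle of the whole application: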

// main.js

// Modules to control application life and create native browser window
const { app, BrowserWindow } = require('electron')

const createWindow = () => {
  // Create the browser window.
  const mainWindow = new BrowserWindow({
    width: 800,
    height: 600,
  })

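  // Load index.html
  mainWindow.loadFile('index.html')
  // A URL works too
  //mainWindow.loadURL('http://xxxx')
}

// This method is called when Electron has finished initializing
// and is ready to create browser windows.
// Some APIs can only be used after the ready event has fired.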
app.whenReady().then(() => {
  createWindow()
})

Build and package the app (packaging on a Mac produces a macOS build; packaging on Windows produces a Windows build):

$ npm install --save-dev @electron-forge/cli
$ npx electron-forge import

✔ Checking your system
✔ Initializing Git Repository
✔ Writing modified package.json file
✔ Installing dependencies
✔ Writing modified package.json file
✔ Fixing .gitignore

We have ATTEMPTED to convert your app to be in a format that electron-forge understands.

Thanks for using "electron-forge"!!!
$ npm run make

> my-electron-app@1.0.0 make /my-electron-app
> electron-forge make

✔ Checking your system
✔ Resolving Forge Config
We need to package your application before we can make it
✔ Preparing to Package Application for arch: x64
✔ Preparing native dependencies
✔ Packaging Application
Making for the following targets: zip
✔ Making for target: zip - On platform: darwin - For arch: x64

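The packaged app ends up in the out folder.

During development you can preview the app with a start script. First add a start command under the scripts field in package.json: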

{
  "scripts": {
    "start": "electron ."
  }
}

Then run the app with npm start.

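Building a mobile app with Cordova

References: 1 Basic workflow 2 iOS platform

First install Cordova:

$ sudo npm install -g cordova

Then create the project and its folder in any directory:

$ cordova create hello com.example.hello HelloWorld

The three arguments are the folder name (hello), the project bundle ID (com.example.hello), and the project name (HelloWorld).

Enter the project directory, for example cd hello.

Choose the platform to build for:

$ cordova platform add ios

Install the required SDKs beforehand. cordova requirements shows which parts of the environment are missing; see the iOS environment guide for details.

Build steps for the iOS platform (note that a team must be set):

  1. Create a build.json file in the project root to hold the team information. (Other build settings can go here too, but the team is the only one missing, so it is the only one configured. The file does not have to live in the root; a full path can be given instead.) The developmentTeam value shows up in Xcode. Any value is enough for the Cordova build itself, but in Xcode it must be one of the values offered in the drop-down menu. JSON does not allow comments, so keep the file free of them:
{
  "ios": {
    "debug": {
      "developmentTeam": "YOURTEAMID"
    },
    "release": {
      "developmentTeam": "YOURTEAMID"
    }
  }
}
  2. Then build (this actually converts the project into iOS code):
$ cordova build --buildConfig build.json
  3. Go to platforms/ios and open the Xcode project file (xcworkspace). Many errors can show up here: for an untrusted device, reconnect the USB cable and trust the computer on the phone; for a failed run, restarting the phone may help; if the General tab reports that the Bundle Identifier is not unique, enter a new one, for example by replacing the middle part with your name.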

Avalanche: A Continual Learning Python Module

less than 1 minute read

Published:

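Quickly running continual (incremental) learning experiments with Avalanche

Reference

Avalanche is mainly divided into five modules. Benchmarks contains various continual learning datasets, mostly from the CV domain. The Models module only holds basic models; continual learning is largely orthogonal to the model, so it is not the focus. The focus is the Training module, which contains the continual learning strategies. A typical workflow looks like this: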

import torch
from torch.nn import CrossEntropyLoss
from torch.optim import SGD

from avalanche.benchmarks.classic import PermutedMNIST
from avalanche.models import SimpleMLP
from avalanche.training import EWC


# Config
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# model
model = SimpleMLP(num_classes=10)

# CL Benchmark Creation
perm_mnist = PermutedMNIST(n_experiences=5)  # number of tasks; for the permuted dataset this means 5 different random pixel permutations
train_stream = perm_mnist.train_stream
test_stream = perm_mnist.test_stream

# Prepare for training & testing
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = CrossEntropyLoss()

# Continual learning strategy
cl_strategy = EWC(
    model, optimizer, criterion, ewc_lambda=1, train_mb_size=32, train_epochs=2,
    eval_mb_size=32, device=device)  # most arguments are shared across strategies, but each strategy also has its own, e.g. ewc_lambda for EWC

# train and test loop over the stream of experiences
results = []
for train_exp in train_stream:  # each iteration runs one task
    cl_strategy.train(train_exp)  # train on the task
    results.append(cl_strategy.eval(test_stream))  # evaluate on the test stream
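
To give a feel for what the strategy-specific `ewc_lambda` controls, here is a small standalone sketch of the standard EWC penalty, `lambda / 2 * sum(F_i * (theta_i - theta_old_i)^2)`. It follows the usual textbook form; the exact scaling inside Avalanche may differ, and the parameter and Fisher values below are made up.

```python
def ewc_penalty(params, old_params, fisher, ewc_lambda):
    """Quadratic penalty that discourages moving parameters that were
    important (high Fisher information) for previously learned tasks."""
    return 0.5 * ewc_lambda * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )


old_params = [0.0, 2.0]   # parameters after the previous task (hypothetical)
fisher = [2.0, 5.0]       # importance of each parameter (hypothetical)

# Moving the less important parameter by 1 costs less than
# moving the more important one by the same amount.
cheap_move = ewc_penalty([1.0, 2.0], old_params, fisher, ewc_lambda=1.0)
costly_move = ewc_penalty([0.0, 3.0], old_params, fisher, ewc_lambda=1.0)
print(cheap_move, costly_move)
```

A larger `ewc_lambda` scales the penalty up, trading plasticity on the new task for less forgetting of old ones.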

Pyflink Kafka Intro

1 minute read

Published:

Processing streaming data with PyFlink and Kafka

Requirements

kafka 3.2.0

flink 1.15

jdk 11

flink-sql-connector-kafka-1.15.0.jar

Installation and configuration

Set the Java version (this has to be done in every terminal):

$ export JAVA_HOME=/Users/wangzhikai/jdk-11.0.15.jdk/Contents/Home
$ java -version
output->
java version "11.0.15" 2022-04-19 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.15+8-LTS-149)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.15+8-LTS-149, mixed mode)

Install Kafka:

$ tar -xzf kafka_2.13-3.2.0.tgz
$ cd kafka_2.13-3.2.0

Start ZooKeeper (bundled with Kafka):

# Start the ZooKeeper service
# Note: Soon, ZooKeeper will no longer be required by Apache Kafka.
$ bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka:

# Start the Kafka broker service
$ bin/kafka-server-start.sh config/server.properties

Create a topic (quickstart-events in this example) and start a producer:

$ /Users/wangzhikai/kafka_2.12-3.2.0/bin/kafka-topics.sh --create --topic quickstart-events --bootstrap-server localhost:9092
$ /Users/wangzhikai/kafka_2.12-3.2.0/bin/kafka-topics.sh --describe --topic quickstart-events --bootstrap-server localhost:9092
$ /Users/wangzhikai/kafka_2.12-3.2.0/bin/kafka-console-producer.sh --topic quickstart-events --bootstrap-server localhost:9092

Optional: start a consumer to check that Kafka is running properly:

$ /Users/wangzhikai/kafka_2.12-3.2.0/bin/kafka-console-consumer.sh --topic quickstart-events --from-beginning --bootstrap-server localhost:9092

Install Flink: same as Kafka, just unpack the archive.

Start Flink:

$ ~/flink-1.15.0/bin/start-cluster.sh 

Stop Flink:

$ ~/flink-1.15.0/bin/stop-cluster.sh 

Job:

from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()
# the sql connector for kafka is used here as it's a fat jar and could avoid dependency issues
env.add_jars("file:///Users/wangzhikai/flink-sql-connector-kafka-1.15.0.jar")

deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW_NAMED(
                             ["a","b"], [Types.STRING(), Types.STRING()])).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='quickstart-events',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds = env.add_source(kafka_consumer)

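Transform ds (each element is a Row, so append to field b and rebuild the Row):

from pyflink.common import Row
ds = ds.map(lambda r: Row(r[0], r[1] + "d"))

(See the list of available operators for reference.)

Write messages to Kafka. The field names ("a", "b", and so on) must match those defined in JsonRowDeserializationSchema; otherwise the output is Row(None, None).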

# In kafka-console-producer, type messages after the > prompt in the following format:
> {"a":1,"b":"dfajdslkfj"}
> {"a":5,"b":"gajgsjd"}
> {"a":2,"b":"dsfjalj"}
> ...

View the printed results in a Jupyter notebook:

with ds.execute_and_collect() as results:
    for result in results:
        print(result)
        
<Row('1', 'dfajdslkfjd')>
<Row('5', 'gajgsjdd')>
<Row('2', 'dsfjaljd')>
...
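
Why mismatched field names produce `Row(None, None)` can be illustrated without Flink. The sketch below is plain Python and only mimics the observable behavior; it is not how `JsonRowDeserializationSchema` is implemented, and the helper name `deserialize_row` is made up. Fields are looked up by the names in the schema, missing fields become `None`, and values are rendered as strings because both columns were declared as `Types.STRING()`.

```python
import json


def deserialize_row(message, field_names):
    """Look up each schema field in the JSON message; missing fields become None."""
    record = json.loads(message)
    return tuple(
        None if record.get(name) is None else str(record[name])
        for name in field_names
    )


schema_fields = ["a", "b"]

matching = deserialize_row('{"a":1,"b":"dfajdslkfj"}', schema_fields)
mismatched = deserialize_row('{"x":1,"y":"dfajdslkfj"}', schema_fields)
print(matching)
print(mismatched)
```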

Flask Static Web

less than 1 minute read

Published:

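Serving static web pages with Flask, with support for JS, HTML, CSS, images, PDFs, and file downloads (repost)

See the example in the cn-t**p-c**d-web directory on the Tencent machine.

Original author: ugmp1343. Tags: Web server, Flask, Python development.

Most of my recent projects are built with dynamic HTML5 and need to be tested for layout on phones, so I needed a local web server that a phone could reach. On my poor connection the 100+ MB XAMPP download never finished. Then it occurred to me that Flask, which I had learned earlier, ships with a development web server that works well as a simple static web server.

First install Python: download it from the official website and click through the installer.

The Python installed on the latest macOS Sierra does not include pip, so install it manually with sudo easy_install pip. Then install Flask with sudo pip install Flask. Since this is only a simple web server, I am not using a virtualenv for now.

Create the project directory as follows:

WebServer
├── static
└── WebServer.py

The static directory holds the static HTML and resource files, and WebServer.py is the file that starts the server. The code is as follows: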

from flask import Flask

app = Flask(__name__)

@app.route('/<path:path>')
def hello_world(path):
    return app.send_static_file(path)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
 

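host='0.0.0.0' makes Flask listen on all network interfaces, so other devices can reach it, and the port argument sets the port to 5000. Put the static files you want to serve into the static directory, cd to the directory containing WebServer.py on the command line, and run python WebServer.py to start the server. Then enter the following in a browser:

http://<IP address>:<port>/<path to static file>

For example, http://192.168.1.104:5000/web/index.html is then reachable from within the LAN. Copying files into static every time is tedious, though, so you can use ln to create a symbolic link to the web project folder inside static: ln -s <project folder> <static directory>. Once the link exists, just start the server and enter the address in a browser to see the result.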

Recommendation System Tips

1 minute read

Published:

Some details of the recommendation system

When are AUC and NDCG used?

Recommender system evaluation usually relies on one of two groups of metrics: AUC and Logloss (used when the data has negative samples), or Recall, HitRate, and NDCG (used when it has none, so negatives must be randomly sampled). The two groups rarely appear together. Which group to use depends on whether the dataset contains negative samples; it has nothing to do with whether the model is sequential, or whether it is a ranking model or a recall (retrieval) model.

Datasets with negative samples usually correspond to click-through rate (CTR) prediction. In that setting the user is passively exposed to items and chooses some of them to click. The distribution of negative samples can then be treated as independent of user preference (the model does tend to push items the user likes, but that raises causality and bias issues, which are set aside here). Datasets without negative samples are typically converted from explicit feedback such as rating or review data, and negatives have to be randomly sampled.
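
The two evaluation settings can be contrasted with a small sketch. The labels, scores, and item IDs are made up, and the functions are simplified textbook versions (pairwise AUC; HitRate@k and NDCG@k for a single held-out positive ranked against sampled negatives), not the implementation of any particular library.

```python
import math
import random


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sample_negatives(all_items, interacted, n, rng):
    """Randomly draw n items the user has not interacted with."""
    candidates = [i for i in all_items if i not in interacted]
    return rng.sample(candidates, n)


def hit_and_ndcg_at_k(pos_score, neg_scores, k):
    """HitRate@k and NDCG@k for one positive among sampled negatives."""
    rank = sum(1 for s in neg_scores if s > pos_score)  # 0-based rank
    if rank < k:
        return 1.0, 1.0 / math.log2(rank + 2)
    return 0.0, 0.0


# Setting 1: logged exposures with real negatives (CTR-style data).
ctr_auc = auc(labels=[1, 0, 1, 0], scores=[0.9, 0.8, 0.4, 0.1])

# Setting 2: only positives are known, so negatives are sampled.
rng = random.Random(0)
negatives = sample_negatives(range(100), interacted={3, 7, 42}, n=3, rng=rng)
hit, ndcg = hit_and_ndcg_at_k(pos_score=0.7, neg_scores=[0.9, 0.2, 0.1], k=2)
print(ctr_auc, negatives, hit, round(ndcg, 4))
```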

bigdata

less than 1 minute read

Published:

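Installation

Download the virtual machine: https://pan.baidu.com/s/1KH1pWB01E4NCzqFuFdHJsQ?pwd=3aq7 (extraction code: 3aq7)

Installation guide: recommended environment tutorial

Some pitfalls:

  • If Jupyter is restarted while a Spark job in a Jupyter notebook is only half done, rerunning the Spark job fails with: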

    22/05/10 17:48:15 WARN hive.metastore: Failed to connect to the MetaStore Server...
    22/05/10 17:48:16 WARN hive.metastore: Failed to connect to the MetaStore Server...
    22/05/10 17:48:17 WARN metadata.Hive: Failed to access metastore. This class should not accessed in runtime.
    org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    

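    Delete all the lck files (cached lock files) in metastore_db under the Spark root directory on the master node, then restart Spark and the Hive metastore.

  • In the jps output, main is the hbase-shell process, RunJar is the Hive process, and SparkSubmit is the Spark application process. If it has a parent process (bash or Jupyter notebook), use ps -ef to find the parent's PID; the parent has to be killed before the child.
  • Let PySpark programs run to completion whenever possible. To start a fresh Spark application afterwards, restart the notebook in Jupyter (not the Jupyter server).
  • The Spark web UI is on port 4040, but it redirects to the YARN address on port 8088. Check the IP address after the redirect; it may still be the virtual machine's IP.
  • Spark is sometimes missing jars that Hive has, so commands that work in the Hive shell fail in Spark SQL with a class-not-found error, for example when the hcatalog JSON parsing jar is missing. spark.sparkContext.addPyFile("path/to/jar") can add the jar temporarily, but it has to be added when the Spark environment is created, so restart Jupyter and create a new Spark application instance.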

Coding Tips and Computer Knowledge

1 minute read

Published:

Introduction

Here is a list of several useful coding tips.

Jupyter Notebook

.jupyter/jupyter_notebook_config.py and .jupyter/jupyter_notebook_config.json are two configuration files; the latter has higher priority.

jupyter notebook password

This command stores a hashed password in the .json file.

To log in with a password, run the command above. To use neither a token nor a password, clear the password in the .json file and add c.NotebookApp.token = "" to the .py file.

SSH

SSH keys provide security. Think of an SSH server as a row of lockers, one of which we can claim. At first we log in with a password, but for better security we can generate a key pair: a key (the private key) and a lock (the public key). We put the lock on the locker (by adding the public key to authorized_keys on the server) and keep the key locally in our .ssh directory. The config file offers another way to log in over SSH without a password: it can point to any id_rsa file (private key) to use for authentication.

Network

The gateway here is the virtual switch (交换机), whose IP is usually set to 192.168.xxx.2. Take NAT mode for a virtual machine in VMware as an example: the vmnet8 virtual network adapter (虚拟网卡) on the host (宿主) gets 192.168.xxx.1, the gateway gets 192.168.xxx.2, and virtual machines get 192.168.xxx.3-254. The virtual machine can then reach the internet directly, and it can be reached from outside through port forwarding on a host port.
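
The address layout can be checked with Python's standard `ipaddress` module. The subnet `192.168.56.0/24` below is only a stand-in for the `192.168.xxx` network in the text; substitute whatever subnet VMware assigned.

```python
import ipaddress

# Hypothetical NAT subnet standing in for 192.168.xxx.0/24.
subnet = ipaddress.ip_network("192.168.56.0/24")
usable = list(subnet.hosts())   # .1 through .254

host_adapter = usable[0]        # vmnet8 adapter on the host: .1
gateway = usable[1]             # NAT gateway: .2
vm_range = usable[2:]           # addresses left for virtual machines: .3-.254

print(host_adapter, gateway, vm_range[0], vm_range[-1], len(vm_range))
```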

Math Tools for Research

less than 1 minute read

Published:

Introduction

Here is a list of several useful math tools for research.

Dirichlet distribution

Sample a categorical distribution over K categories.

import numpy as np
from scipy.stats import dirichlet
np.set_printoptions(precision=2)

def stats(scale_factor, G0=[.2, .2, .6], N=10000):
    samples = dirichlet(alpha = scale_factor * np.array(G0)).rvs(N)
    print("                          alpha:", scale_factor)
    print("              element-wise mean:", samples.mean(axis=0))
    print("element-wise standard deviation:", samples.std(axis=0))
    print()
    
for scale in [0.1, 1, 10, 100, 1000]:
    stats(scale)

                          alpha: 0.1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.38  0.38  0.47]

                          alpha: 1
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.28  0.28  0.35]

                          alpha: 10
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.12  0.12  0.15]

                          alpha: 100
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.04  0.04  0.05]

                          alpha: 1000
              element-wise mean: [ 0.2  0.2  0.6]
element-wise standard deviation: [ 0.01  0.01  0.02]
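
The sampled statistics above agree with the closed-form moments of the Dirichlet distribution: with `a0 = sum(alpha)`, the mean of component i is `alpha_i / a0` and its variance is `alpha_i * (a0 - alpha_i) / (a0**2 * (a0 + 1))`. The short check below uses only the standard library and reuses the same base measure `G0 = [.2, .2, .6]`; the function name is made up for this sketch.

```python
import math


def dirichlet_mean_std(alpha):
    """Closed-form element-wise mean and standard deviation of Dirichlet(alpha)."""
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    std = [math.sqrt(a * (a0 - a) / (a0 ** 2 * (a0 + 1))) for a in alpha]
    return mean, std


G0 = [0.2, 0.2, 0.6]
closed_form = {}
for scale in [0.1, 1, 10, 100, 1000]:
    mean, std = dirichlet_mean_std([scale * g for g in G0])
    closed_form[scale] = [round(s, 2) for s in std]
    print(scale, [round(m, 2) for m in mean], closed_form[scale])
```

The mean stays at G0 for every scale, while the spread shrinks as the scale factor grows, which is the pattern in the sampled output.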

A Paper Search Engine

less than 1 minute read

Published:

Introduction

Doing research means reading a lot of papers, and the papers worth reading need two properties:

  • They are related to our research topic.

  • They are published at well-known conferences.

However, no search engine satisfies both. Google Scholar only covers the first property. ACM Digital Library and dblp only cover the second. And arXiv is not even a search engine. So I wanted to build a web-based tool that searches papers only in the conferences I am interested in.

How To Use

Open the web-based tool link and type in a keyword to search for papers. The top 100 related papers from the last 3 years are listed by relevance. To customize the year and conference, just add them to the keywords. It currently supports {wsdm, sigir, kdd, recsys, iclr, icml, nips}.
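
One way such a keyword box could separate years and conferences from the actual search terms is sketched below. This is purely hypothetical and is not taken from the tool's source code: the function name, the four-digit-year rule, and the returned structure are all assumptions; only the list of supported conferences comes from the text above.

```python
SUPPORTED_CONFERENCES = {"wsdm", "sigir", "kdd", "recsys", "iclr", "icml", "nips"}


def parse_query(query):
    """Split a free-text query into keywords, conferences, and years."""
    keywords, conferences, years = [], [], []
    for token in query.lower().split():
        if token in SUPPORTED_CONFERENCES:
            conferences.append(token)
        elif token.isdigit() and len(token) == 4:
            years.append(int(token))
        else:
            keywords.append(token)
    return {"keywords": keywords, "conferences": conferences, "years": years}


parsed = parse_query("sequential recommendation KDD 2021")
print(parsed)
```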

Contribution

The source code can be found on GitHub.

How to set up github pages based on Jekyll

1 minute read

Published:

How to update the homepage:

Update index.md, then publish the whole repository with GitHub Desktop on the MacBook Pro. The root directory is ~/myblog and the repository name is cloudcatcher888.github.io. Note that the repository name for GitHub Pages should be lowercase and of the form <username>.github.io.

Before uploading: run bundle exec jekyll serve --livereload; livereload enables an instant preview of the site.

Upload: commit first (a summary is required), then push to origin.

Tips: pictures in a post need to use ![]({{site.url}}/images/xxx.jpg)

TODO: paper link.

How to create a post:

You’ll find this post in your _posts directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run jekyll serve, which launches a web server and auto-regenerates your site when a file is updated.

Jekyll requires blog post files to be named according to the following format:

YEAR-MONTH-DAY-title.MARKUP

Where YEAR is a four-digit number, MONTH and DAY are both two-digit numbers, and MARKUP is the file extension representing the format used in the file. After that, include the necessary front matter. Take a look at the source for this post to get an idea about how it works.
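
A quick way to check a file name against this format before committing is a small validator. The sketch below is not part of Jekyll; the regular expression and the function name are assumptions that simply encode the rule stated above (four-digit year, two-digit month and day, a title, and an extension).

```python
import re
from datetime import date

# YEAR-MONTH-DAY-title.MARKUP, e.g. 2022-08-16-multi-interest.md
POST_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.([A-Za-z0-9]+)$")


def parse_post_filename(name):
    """Return date, title, and markup for a valid post name, else None."""
    match = POST_NAME.match(name)
    if not match:
        return None
    year, month, day, title, markup = match.groups()
    try:
        published = date(int(year), int(month), int(day))
    except ValueError:  # e.g. month 13
        return None
    return {"date": published, "title": title, "markup": markup}


good = parse_post_filename("2022-08-16-multi-interest.md")
bad_digits = parse_post_filename("2022-8-16-multi-interest.md")
bad_month = parse_post_filename("2022-13-01-multi-interest.md")
print(good, bad_digits, bad_month)
```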

Jekyll also offers powerful support for code snippets:

def print_hi(name)
  puts "Hi, #{name}"
end
print_hi('Tom')
#=> prints 'Hi, Tom' to STDOUT.

Check out the Jekyll docs for more info on how to get the most out of Jekyll. File all bugs/feature requests at Jekyll’s GitHub repo. If you have questions, you can ask them on Jekyll Talk.