Neo4J Graph Database Anti-Fraud Analysis in Practice (III) - Identifying Criminal Groups

Vincent Yuan · updated on Jul 1, 2023 · 3 min read

Tech Knowledge Graph Python

1 Introduction

In the Neo4J Graph Database Anti-Fraud Analysis Practice (II) - Preparing Data section, I introduced how to import data into the Neo4J analysis platform and performed some simple descriptive analysis of customer and transaction information. Next, we need to look for fraudsters selectively, and their identity information (such as phone numbers, ID numbers, etc.) is intertwined with many other accounts, and therefore clues can be found through these shared information.

2 Shared Identity Information

In the previous section, the relationship between clients and entities has been defined, such as a client having an email, phone number, SSN, etc. If multiple clients use the same information, these contacts can ensure that they are connected together, which greatly helps us deconstruct complex relationships. However, in order to run specific graph algorithms, these relationships need to be processed, because different graph structures are suitable for different graph algorithms, some algorithms are suitable for homogeneous graphs, that is, all node types in the graph are the same, only one type of relationship. Some algorithms are suitable for heterogeneous graphs, which contain multiple types of nodes and relationships.

The example provided by Neo4J uses the Weakly Connected Components algorithm to find connected nodes in an undirected graph, which is suitable for a homogeneous graph, so a new relationship - SHARED_IDENTIFIERS needs to be created to calculate the number of shared entity information among each client, such as if the email, phone number, and SSN of clients A and B are the same, then the attribute count of SHARED_IDENTIFIERS is 3.

// Create the SHARED_IDENTIFIERS relationship
MATCH (c1:Client)-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]->(info)
<-[:HAS_EMAIL|:HAS_PHONE|:HAS_SSN]-(c2:Client)
WHERE c1.id<>c2.id
WITH c1, c2, count(*) as cnt
MERGE (c1) - [:SHARED_IDENTIFIERS {count: cnt}] - (c2);

The WITH keyword can chain the query statements together, making it easy for variables in the previous step to be reused in the next step. The MERGE keyword is very versatile. In summary, it can make a pattern exist in the graph. If the pattern does not exist, it creates the pattern. It is used here to create a relationship. Please note that the example has a small amount of data and only has 300,000 nodes. If the number of nodes is large, it is recommended to use the APOC method provided by Neo4J to run in batches.

3 Create the Graph

Before running any algorithms, you must first create the graph. You can use the SHARED_IDENTIFIERS relationship created earlier to build the graph and map it to memory. Therefore, Neo4J recommends estimating memory before building the graph or running the algorithm to ensure that the computing resources can meet the requirements:

CALL gds.graph.create.cypher.estimate(
'MATCH (c:Client) RETURN id(c) AS id',
'MATCH (c1:Client)-[r:SHARED_IDENTIFIERS]-(c2:Client)
WHERE c1.id<>c2.id
RETURN id(c1) AS source,id(c2) AS target,r.count AS weight')
YIELD requiredMemory,nodeCount,relationshipCount;

The output is:

requiredMemory	nodeCount	relationshipCount
"8804 KiB"	2433	1517

After ensuring that sufficient memory is available, you can formally create the graph:

CALL gds.graph.create('WCC', 'Client',
	{
    	SHARED_IDENTIFIERS:{
        	type: 'SHARED_IDENTIFIERS',
        	properties: {
            	count: {
                	property: 'count'
                }
            }
        }
	}
) YIELD graphName,nodeCount,relationshipCount,createMillis;

If it runs normally, you can use the CALL gds.graph.list(); command to view the graph created.

4 Execute the WCC Algorithm for Clustering

Similarly, before running the algorithm, it is recommended to estimate memory resources. But I won't repeat it here. Run WCC using the following command. The SET instruction can assign a value to a new property, which is to label the cluster to which the client belongs. You can see that the code has excluded the situation where there is only one client in the cluster, because those people have no relationship with others.

CALL gds.wcc.stream('WCC')
YIELD componentId,nodeId
WITH componentId AS cluster,gds.util.asNode(nodeId) AS client
WITH cluster, collect(client.id) AS clients --collect merges a sequence
WITH *,size(clients) AS clusterSize
WHERE clusterSize>1
UNWIND clients AS client --UNWIND expands a sequence
MATCH(c:Client)
WHERE c.id=client
SET c.firstPartyFraudGroup=cluster;

The WCC algorithm helps to identify the population that needs attention, and subsequent algorithms can continue to be calculated based on the clustering results.

5 Conclusion

This section completes the reintegration of the population relationship and identifies groups with similar characteristics, which is convenient for subsequent analysis. At the same time, the commonly used technical points are introduced, such as WITH, MERGE, SET, APOC, memory estimation, creating a graph, running graph algorithms, etc. In the next section, we will explain how to score dangerous populations and effectively identify fraudsters.